Preface



Throughout history, humankind has sought to enhance three main capacities: the physical, the metaphysical and the intellectual.

From the physical viewpoint, humans invented and developed all kinds of tools (levers, wheels, cams, pistons, etc.), eventually arriving at the sophisticated machines that exist today.

Regarding the metaphysical aspect, the early celebration of magical-animistic rituals led to attempts, either real or literary, to create life ex nihilo, that is, life from inert substance. The most recent approaches involve the cryopreservation of deceased people so that they can be returned to life in the future; the generation of life in the laboratory, by means of cells, tissues, organs, systems or individuals created from previously frozen stem cells, is also currently being pursued.

The third aspect, the intellectual one, is the most interesting here. There have been many contributions, from devices that increased calculating ability, such as the abacus, to later theoretical proposals for solving problems, such as the Ars Magna by Ramon Llull. The first known written reference to Artificial Intelligence is The Iliad, where Homer describes the visit of the goddess Thetis, mother of Achilles, to the workshop of Hephaestus, god of smiths: "At once he was helped along by female servants made of gold, who moved to him. They look like living servant girls, possessing minds, hearts with intelligence, vocal chords, and strength."

However, the first reference to Artificial Intelligence as it is currently understood can be found in the proposal made by J. McCarthy to the Rockefeller Foundation in 1956; this proposal requested funds to support a month-long meeting of twelve researchers, the Dartmouth Summer Research Project, in order to establish the basis of what McCarthy named Artificial Intelligence.

Although the precursors of Artificial Intelligence (S. Ramon y Cajal, N. Wiener, D. Hebb, C. Shannon and J. McCulloch, among many others) come from multiple scientific disciplines, the true driving forces (A. Turing, J. von Neumann, M. Minsky, T. Godell, ...) emerged in the second third of the twentieth century with the appearance of certain tools, computers, capable of handling fairly complex problems. Other scientists, such as J. Hopfield and J. Holland, proposed in the last third of the century biology-inspired approaches that enabled the treatment of complex real-world problems, including those that might require a certain adaptive ability.

This long and productive history of Artificial Intelligence called for an encyclopaedia that would give expression to the current state of this multidisciplinary topic, in which researchers from multiple fields, such as neuroscience, computing science, cognitive sciences, exact sciences and different engineering areas, converge.

This work intends to provide a wide and well balanced coverage of all the points of interest that currently 
exist in the field of Artificial Intelligence, from the most theoretical fundamentals to the most recent industrial 
applications. 

Many researchers were contacted, and calls for contributions were announced in several forums of the scientific field covered here.

All the proposals were carefully reviewed by the editors in order to balance the contributions as far as possible, with the intention of achieving a suitably wide-ranging document that exemplifies this field.
A first selection was made after all the proposals had been received, and the selected proposals were then sent to three external expert reviewers for a double-blind peer review. As a result of this strict and complex process, and before final acceptance, a high number of contributions (approximately 80%) were rejected or required modification.

The editors believe that the effort of the last two years has been worthwhile. With the invaluable help of the many people mentioned in the acknowledgements, they have managed to get this complete encyclopaedia off the ground. The numbers speak for themselves: 233 published articles, written by 442 authors from 38 different countries and revised by 238 scientific reviewers. The diverse and comprehensive coverage of the disciplines directly related to Artificial Intelligence should also contribute to a better understanding of the research in this important field of study. The contributions compiled in this work are intended to have a considerable impact on the expansion and development of the body of knowledge in this wide field, so that it becomes an important reference source for researchers and system developers in the area. The editors hope that the encyclopaedia will be an effective aid to understanding the concepts, problems, trends, challenges and opportunities of this field of study, and that it will be useful to research colleagues, teaching staff and students alike. The editors will be happy if this work inspires its readers to contribute new advances and discoveries in this fantastic area of work, advances that might in turn improve the quality of life in different aspects of society: production processes, health care, or any other area where a system or product developed with Artificial Intelligence techniques and procedures might be used.
INTRODUCTION

With the increasing demand for multimedia information retrieval, such as image and video retrieval from the Web, there is a need to find ways to train a classifier when the training dataset consists of a small number of labelled samples and a large number of unlabelled ones. Traditional supervised or unsupervised learning methods are not suited to solving such problems, particularly when the data lie in a high-dimensional space. In recent years, many methods have been proposed that can be broadly divided into two groups: semi-supervised learning and active learning (AL). Because the Support Vector Machine (SVM) has been recognized as an efficient tool for dealing with high-dimensional problems, a number of researchers have proposed algorithms for Active Learning with SVM (ALSVM) since the turn of the century. Considering their rapid development, we review in this chapter the state of the art of ALSVM for solving classification problems.



BACKGROUND 

The general framework of AL is described in Figure 1. The name active learning comes from the fact that the learner can improve the classifier by actively choosing the "optimal" data from the potential query set Q and, after obtaining its label, adding it to the current labeled training set L. The key point of AL is its sample selection criterion.

In the past, AL was mainly used together with neural networks and other learning algorithms. Statistical AL is one classical method, in which the sample that minimizes either the variance (D. A. Cohn, Ghahramani, & Jordan, 1996), the bias (D. A. Cohn, 1997) or the generalisation error (Roy & McCallum, 2001) is submitted to the oracle. Although these methods have a strong theoretical foundation, two common problems limit their application: one is how to estimate the posterior distribution of the samples, and the other is their prohibitively high computational cost. To deal with these two problems, a series of version space based AL methods have been proposed. They assume that the target function can be perfectly expressed by one hypothesis in the version space, and they choose the sample that can reduce the volume of the version space. Examples are query by committee (Freund, Seung, Shamir, & Tishby, 1997) and SG AL (D. Cohn, Atlas, & Ladner, 1994). However, the complexity of the version space made them intractable until version space based ALSVMs emerged.

The success of SVM in the 1990s prompted researchers to combine AL with SVM to deal with semi-supervised learning problems, producing, for example, distance-based (Tong & Koller, 2001), RETIN (Gosselin & Cord, 2004) and multi-view based (Cheng & Wang, 2007) ALSVMs. In the following sections, we summarize existing well-known ALSVMs under the framework of version space theory, and then briefly describe some mixed strategies. Lastly, we discuss the research trends for ALSVM and give conclusions for the chapter.



VERSION SPACE BASED ACTIVE 
LEARNING WITH SVM 

The idea of almost all existing heuristic ALSVMs is 
explicitly or implicitly to find the sample which can 
reduce the volume of the version space. In this section, 
we first introduce their theoretical foundation and then 
review some typical ALSVMs. 
Figure 1. Framework of active learning

Initialize step: A classifier h is trained on the initial labeled training set L.

Step 1: The learner evaluates each data point x in the potential query set Q (a subset of, or the whole, unlabeled data set U) and queries the oracle for the sample x* which has the lowest EvalFun(x, L, h, H), obtaining its label y*.

Step 2: The learner updates the classifier h with the enlarged training set {L + (x*, y*)}.

Step 3: Repeat steps 1 and 2 until training stops.

Where

EvalFun(x, L, h, H): the function for evaluating a potential query x (the lowest value is the best here)

L: the current labeled training set

H: the hypothesis space
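The loop in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `active_learning`, `train`, `eval_fun` and `oracle` are hypothetical, the hypothesis space H is left implicit in `train`, the stopping rule is a fixed query budget, and the toy learner is a one-dimensional threshold classifier rather than an SVM.

```python
def active_learning(labeled, pool, oracle, train, eval_fun, max_queries):
    """Pool-based active learning loop following the steps of Figure 1."""
    labeled = list(labeled)
    pool = list(pool)
    h = train(labeled)                      # initialize step
    for _ in range(max_queries):            # step 3: assumed stopping rule (query budget)
        if not pool:
            break
        # step 1: query the sample with the lowest evaluation value
        x_star = min(pool, key=lambda x: eval_fun(x, labeled, h))
        pool.remove(x_star)
        labeled.append((x_star, oracle(x_star)))
        h = train(labeled)                  # step 2: retrain on the enlarged set
    return h, labeled


# Toy setting (assumption): 1-D data, classifier = threshold between the classes.
def train_threshold(labeled):
    positives = [x for x, y in labeled if y == 1]
    negatives = [x for x, y in labeled if y == -1]
    return (max(negatives) + min(positives)) / 2.0


def distance_to_boundary(x, labeled, h):
    return abs(x - h)                       # lowest value = closest to the boundary


def oracle(x):
    return 1 if x >= 0.62 else -1           # hidden labelling rule, known only to the oracle


pool = [i / 100 for i in range(1, 100)]
initial = [(0.0, -1), (1.0, 1)]
threshold, labeled = active_learning(
    initial, pool, oracle, train_threshold, distance_to_boundary, max_queries=8
)
print(threshold, len(labeled))
```

With only eight queries the learned threshold lands close to the hidden boundary, because each query is spent where the current classifier is least certain.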



Version Space Theory 

Based on the Probably Approximately Correct (PAC) learning model, the goal of machine learning is to find a consistent classifier which has the lowest generalization error bound. The Gibbs generalization error bound (McAllester, 1998) is defined as

$$ \varepsilon_{\mathrm{Gibbs}}(m, P_H, z, \delta) = \frac{1}{m}\left( \ln\frac{e\,m^{2}}{P_H(V(z))} + \ln\frac{1}{\delta} \right) $$

where P_H denotes a prior distribution over hypothesis space H, V(z) denotes the version space of the training set z, m is the number of samples in z and δ is a constant in [0, 1]. It follows that the generalization error bound of the consistent classifiers is controlled by the volume of the version space if the distribution of the version space is uniform. This provides a theoretical justification for version space based ALSVMs.

Query by Committee with SVM 

This algorithm was proposed by (Freund et al., 1997). In it, 2k classifiers are randomly sampled, and the sample on which these classifiers disagree most, which approximately halves the version space, is queried to the oracle. However, the complex structure of the version space makes random sampling within it difficult.

(Warmuth, Rätsch, Mathieson, Liao, & Lemmen, 2003) successfully applied the billiard algorithm to randomly sample the classifiers in the SVM version space, and the experiments showed that its performance was comparable to that of the standard distance-based ALSVM (SD-ALSVM), which will be introduced later. Its deficiency is that the process is time-consuming.
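The committee idea can be illustrated with a small sketch. It is not the billiard algorithm: as a loudly simplified stand-in, it draws random linear classifiers and keeps only those consistent with the labeled data (rejection sampling), which is feasible only in very low dimensions. All function names and the toy data are hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sample_committee(labeled, dim, size, rng, max_tries=100000):
    """Rejection-sample `size` linear classifiers from the version space."""
    committee = []
    tries = 0
    while len(committee) < size and tries < max_tries:
        tries += 1
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # keep w only if it classifies every labeled sample correctly
        if all(y * dot(w, x) > 0 for x, y in labeled):
            committee.append(w)
    return committee


def disagreement(x, committee):
    """Size of the minority vote: largest when the committee splits evenly."""
    positive = sum(1 for w in committee if dot(w, x) > 0)
    return min(positive, len(committee) - positive)


def query_by_committee(pool, committee):
    return max(pool, key=lambda x: disagreement(x, committee))


rng = random.Random(0)
labeled = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]
pool = [(2.0, 0.1), (0.0, 1.0), (3.0, -0.2)]
committee = sample_committee(labeled, dim=2, size=20, rng=rng)   # 2k classifiers, k = 10
query = query_by_committee(pool, committee)
print(query)
```

The point on which the committee is split most evenly is the one whose label would discard roughly half of the consistent classifiers, whichever way the oracle answers.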

Standard Distance Based Active 
Learning with SVM 

For SVM, the version space can be defined as:

$$ V = \{\, w \in W \mid \|w\| = 1,\; y_i \,(w \cdot \Phi(x_i)) > 0,\; i = 1, \ldots, m \,\} $$

where Φ(·) denotes the function which maps the original input space X into a high-dimensional space Φ(X), and W denotes the parameter space. SVM has two properties which make it tractable with AL. The first is its duality property: each point w in V corresponds to one hyperplane in Φ(X) which divides Φ(X) into two parts, and vice versa. The other property is that the SVM solution w* is the center of the version space when the version space is symmetric, or near to its center when it is asymmetric.

Based on the above two properties, (Tong & Koller, 2001) inferred a lemma that the sample nearest to the decision boundary makes the expected size of the version space decrease fastest. Thus the sample nearest to the decision boundary is queried to the oracle (Figure 2). This is the so-called SD-ALSVM, which requires little additional computation for selecting the queried sample and performs well in real applications.

Figure 2. Illustration of standard distance-based ALSVM
Figure 2a. The projection of the parameter space around the Version Space
Figure 2b. In the induced feature space
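The selection rule of SD-ALSVM reduces to picking the pool sample with the smallest absolute decision value. The sketch below assumes an already trained SVM given by hypothetical support vectors, multipliers and bias (the values are made up for illustration) and a linear kernel; any kernel function with the same signature could be substituted.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def decision_value(x, support, bias, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b for a trained SVM."""
    return sum(alpha * y * kernel(sv, x) for sv, y, alpha in support) + bias


def nearest_to_boundary(pool, support, bias, kernel=linear_kernel):
    """SD-ALSVM query: the pool sample with the smallest |f(x)|."""
    return min(pool, key=lambda x: abs(decision_value(x, support, bias, kernel)))


# Hypothetical trained model: (support vector, label, alpha) triples and a bias.
support = [((1.0, 1.0), 1, 0.5), ((-1.0, -1.0), -1, 0.5)]
bias = 0.0
pool = [(2.0, 1.0), (0.3, -0.2), (-1.0, -2.0), (0.5, 0.5)]
selected = nearest_to_boundary(pool, support, bias)
print(selected)
```

Only one pass over the pool with the current decision function is needed, which is why the additional cost of this criterion is low.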

Batch Running Mode Distance Based 
Active Learning with SVM 

When utilizing batch query, (Tong & Koller, 2001) simply selected the multiple samples nearest to the decision boundary. However, adding a batch of such samples cannot ensure the largest reduction of the size of the version space, as the example in Figure 3 shows. Although every sample alone can nearly halve the version space, the three samples together still reduce its size by only about 1/2, instead of 7/8. This can be ascribed to the small angles between their induced hyperplanes.

To overcome this problem, (Brinker, 2003) proposed a new selection strategy that incorporates a diversity measure considering the angles between the induced hyperplanes. Let the labeled set be L and the pool query set be Q in the current round; then, based on the diversity criterion, the next sample to be added, x_new, should be

$$ x_{\mathrm{new}} = \arg\min_{x_i \in Q} \; \max_{x_j \in L} \frac{|k(x_j, x_i)|}{\sqrt{k(x_j, x_j)\, k(x_i, x_i)}} $$

where |k(x_j, x_i)| / sqrt(k(x_j, x_j) k(x_i, x_i)) denotes the cosine value of the angle between the two hyperplanes induced by x_i and x_j; the criterion is thus known as the angle diversity criterion. It can be observed that the reduced volume of the version space in Figure 4 is larger than that in Figure 3.

Figure 3. One example of simple batch querying with "a", "b" and "c" samples with pure SD-ALSVM

Figure 4. One example of batch querying with "a", "b" and "c" samples by incorporating diversity into SD-ALSVM
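The angle diversity criterion can be sketched as a greedy batch builder: each round picks the pool sample whose largest cosine to the reference set is smallest, then adds it to the reference set. This is an illustrative sketch only; the function names are hypothetical, a linear kernel is assumed, and the distance-to-boundary term that a full method would combine with the diversity term is deliberately left out.

```python
import math


def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hyperplane_cosine(xi, xj, kernel=linear_kernel):
    """|k(xi, xj)| / sqrt(k(xi, xi) * k(xj, xj)); inputs must be non-zero."""
    return abs(kernel(xi, xj)) / math.sqrt(kernel(xi, xi) * kernel(xj, xj))


def select_diverse_batch(pool, reference, batch_size, kernel=linear_kernel):
    """Greedy min-max selection; `reference` must be non-empty."""
    reference = list(reference)
    candidates = list(pool)
    batch = []
    while candidates and len(batch) < batch_size:
        best = min(
            candidates,
            key=lambda xi: max(hyperplane_cosine(xi, xj, kernel) for xj in reference),
        )
        batch.append(best)
        candidates.remove(best)
        reference.append(best)      # later picks must also differ from this one
    return batch


reference = [(1.0, 0.0)]
pool = [(1.0, 0.1), (0.0, 1.0), (1.0, 1.0)]
batch = select_diverse_batch(pool, reference, batch_size=2)
print(batch)
```

In this toy run the nearly parallel sample (1.0, 0.1) is never chosen, because its induced hyperplane makes a very small angle with one already in the reference set.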

RETIN Active Learning 

Let (I_j), j ∈ [1..n], be the samples in a potential query set Q, and r(i, k) be the function that, at iteration i, codes the position k in the relevance ranking according to the distance to the current decision boundary. A sequence can then be obtained as follows:

$$ \underbrace{I_{r(i,1)} > I_{r(i,2)} > \cdots}_{\text{most relevant}} > \underbrace{I_{r(i,s(i))} > \cdots > I_{r(i,s(i)+m-1)}}_{\text{queried data}} > \underbrace{\cdots > I_{r(i,n)}}_{\text{least relevant}} $$

In SD-ALSVM, s(i) is such that I_{r(i,s(i))}, ..., I_{r(i,s(i)+m-1)} are the m closest samples to the SVM boundary. This strategy implicitly relies on a strong assumption: an accurate estimation of the SVM boundary. However, the decision boundary is usually unstable during the initial iterations. (Gosselin & Cord, 2004) noticed that, even if the decision boundary may change a lot during the earlier iterations, the ranking function r() is quite stable. Thus they proposed a balanced selection criterion that is independent of the frontier and in which an adaptive method was designed to tune s during the feedback iterations. It was expressed by

$$ s(i+1) = s(i) + h\big(r_{rel}(i),\, r_{irr}(i)\big) $$

where h(x, y) = k × (x − y) characterizes the system dynamics (k is a positive constant), and r_rel(i) and r_irr(i) denote the numbers of relevant and irrelevant samples in the queried set at the ith iteration. In this way, the numbers of relevant and irrelevant samples in the queried set will be roughly equal.
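The adaptive update of s can be traced with a small simulation. The sketch makes several simplifying assumptions that are not part of the original method: the ranking is held fixed across iterations (in practice it is recomputed after each retraining), positions are 0-based, the window start is rounded and clamped to the ranking, and k = 0.5 is an arbitrary choice. All names are hypothetical.

```python
def update_start(s, n_relevant, n_irrelevant, k=0.5):
    """s(i+1) = s(i) + h(r_rel, r_irr) with h(x, y) = k * (x - y)."""
    return s + k * (n_relevant - n_irrelevant)


def query_window(ranking, s, m):
    """The m ranked samples starting at position s (0-based, clamped: assumption)."""
    start = max(0, min(int(round(s)), len(ranking) - m))
    return ranking[start:start + m]


# Toy ranking: item ids sorted from most to least relevant; ids below 11 are
# truly relevant (known only to the simulated user).
ranking = list(range(20))


def is_relevant(item):
    return item < 11


s, m = 8.0, 4
for _ in range(3):
    window = query_window(ranking, s, m)
    n_rel = sum(1 for item in window if is_relevant(item))
    s = update_start(s, n_rel, m - n_rel)

final_window = query_window(ranking, s, m)
print(s, final_window)
```

When the queried window contains more relevant than irrelevant samples, s moves toward the less relevant end of the ranking, and vice versa, so the window settles where the two counts balance.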

Mean Version Space Criterion 

(He, Li, Zhang, Tong, & Zhang, 2004) proposed a selection criterion that minimizes the mean version space, which is defined as

$$ C_{MVS}(x_k) = \mathrm{Vol}\big(V_i^{+}(x_k)\big)\, P(y_k = 1 \mid x_k) + \mathrm{Vol}\big(V_i^{-}(x_k)\big)\, P(y_k = -1 \mid x_k) $$

where Vol(V_i^+(x_k)) (respectively Vol(V_i^-(x_k))) denotes the volume of the version space after adding the unlabelled sample x_k, labelled positive (respectively negative), into the ith round training set. The mean version space includes both the volume of the version space and the posterior probabilities. Thus they considered the criterion to be better than SD-ALSVM. However, the computation required by this method is time-consuming.
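A worked example may help to read the mean version space criterion. The sketch below only evaluates the formula; the volumes and posterior probabilities are made-up numbers, since estimating them is the expensive part of the method and is outside the scope of this illustration. It also assumes a binary problem, so that P(y = -1 | x) = 1 - P(y = 1 | x).

```python
def mean_version_space(vol_pos, vol_neg, p_pos):
    """C_MVS = Vol(V+) * P(y=1|x) + Vol(V-) * P(y=-1|x), binary labels assumed."""
    return vol_pos * p_pos + vol_neg * (1.0 - p_pos)


# Hypothetical candidates: (volume if labelled +1, volume if labelled -1, P(y=1|x)).
candidates = {
    "a": (0.5, 0.5, 0.5),   # halves the version space either way
    "b": (0.9, 0.1, 0.9),   # large cut only for the unlikely label
    "c": (0.3, 0.7, 0.8),   # large cut for the likely label
}

scores = {name: mean_version_space(*values) for name, values in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Candidate "c" wins here because the label that is most probable is also the one that shrinks the version space most, which a pure distance criterion would not take into account.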

Multi-View Based Active Learning 

Unlike algorithms that are based on one whole feature set, multi-view methods are based on multiple feature subsets. Several classifiers are first trained on different feature subsets. The samples on which the classifiers disagree most then form the contention set, from which the queried samples are selected. (I. Muslea, Minton, & Knoblock, 2000) first applied this idea in AL, and (Cheng & Wang, 2007) implemented it with ALSVM to produce a Co-SVM algorithm, which was reported to perform better than SD-ALSVM.

Multiple classifiers can find rare samples because they observe the samples from different views. This property is very useful for finding the diverse parts belonging to the same category. However, multi-view based methods demand that each relevant classifier can classify the samples well and that all feature sets are uncorrelated. It is difficult to ensure this condition in real applications.
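The contention set can be illustrated with two views. This sketch is not the Co-SVM algorithm itself: the two per-view classifiers are hypothetical fixed thresholds standing in for classifiers trained on each feature subset, and each sample is simply a pair (feature of view 1, feature of view 2).

```python
def contention_set(pool, view_classifiers):
    """Samples on which the per-view classifiers do not all agree."""
    contested = []
    for sample in pool:
        predictions = {classify(sample) for classify in view_classifiers}
        if len(predictions) > 1:        # at least two views disagree
            contested.append(sample)
    return contested


# Hypothetical classifiers, one per view (e.g. a colour and a texture feature).
def classify_view1(sample):
    return 1 if sample[0] > 0.5 else -1


def classify_view2(sample):
    return 1 if sample[1] > 0.5 else -1


pool = [(0.9, 0.8), (0.2, 0.7), (0.6, 0.1), (0.1, 0.2)]
contested = contention_set(pool, [classify_view1, classify_view2])
print(contested)
```

Queries are then drawn from `contested`; how they are ranked within that set (for example by distance to each view's boundary) is a further design choice not shown here.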



MIXED ACTIVE LEARNING 

Instead of the single AL strategies of the former sections, we discuss two mixed AL modes in this section: one combines different selection criteria, and the other incorporates semi-supervised learning into AL.

Hybrid Active Learning 

Rather than developing a new AL algorithm that works well in all situations, some researchers have argued that combining different methods, which are usually complementary, is a better way, since each method has its own advantages and disadvantages. The intuitive structure of the hybrid strategy is the parallel mode. The key point here is how to set the weights of the different AL methods.

The simplest way is to set fixed weights according to experience, and this is what most existing methods do. The Most Relevant/Irrelevant strategies (L. Zhang, Lin, & Zhang, 2001) can help to stabilize the decision boundary, but have low learning rates, while standard distance-based methods have high learning rates, but have unstable frontiers during the initial feedback rounds. Considering this, (Xu, Xu, Yu, & Tresp, 2003) combined these two strategies to achieve better performance than with either strategy alone. As stated before, the diversity and distance-based strategies are also complementary, and (Brinker, 2003), (Ferecatu, Crucianu, & Boujemaa, 2004) and (Dagli, Rajaram, & Huang, 2006) combined the angle, inner product and entropy diversity strategies, respectively, with the standard distance-based one.

However, a fixed-weight strategy cannot fit all datasets and all learning iterations well, so the weights should be set dynamically. In (Baram, El-Yaniv, & Luz, 2004), all the weights were initialized with the same value and were modified in later iterations by using the EXP4 algorithm. The resulting AL algorithm was empirically shown to consistently perform almost as well as, and sometimes to outperform, the best algorithm in the ensemble.
Semi-Supervised Active Learning

1. Active Learning with Transductive SVM 

In the first stages of SD-ALSVM, the few labeled data available may lead to a great deviation of the current solution from the true solution, whereas if unlabeled samples are also considered, the solution may be closer to the true one. (Wang, Chan, & Zhang, 2003) showed that the closer the current solution is to the true one, the more the size of the version space is reduced. They incorporated Transductive SVM (TSVM) to produce more accurate intermediate solutions. However, several studies (T. Zhang & Oles, 2000) questioned whether TSVM benefits much from unlabeled data, in theory or in practice. (Hoi & Lyu, 2005) instead applied semi-supervised learning techniques based on Gaussian fields and harmonic functions, and the improvements were reported to be significant.

2. Incorporating EM into Active Learning 

(McCallum & Nigam, 1998) combined Expectation Maximization (EM) with the query by committee strategy. (Ion Muslea, Minton, & Knoblock, 2002) integrated the multi-view AL algorithm with EM to obtain the Co-EMT algorithm, which can work well in situations where the views are incompatible and correlated.



FUTURE TRENDS 

How to Start the Active Learning 

AL can be regarded as the problem of searching for the target function in the version space, so a good initial classifier is important. When the objective category is diverse, the initial classifier becomes even more important, for a bad one may result in convergence to a locally optimal solution, i.e., some parts of the objective category may not be correctly covered by the final classifier. Two-stage (Cord, Gosselin, & Philipp-Foliguet, 2007), long-term learning (Yin, Bhanu, Chang, & Dong, 2005) and pre-cluster (Engelbrecht & Brits, 2002) strategies are promising.



Feature-Based Active Learning 

In AL, the feedback from the oracle can also help to identify the important features, and (Raghavan, Madani, & Jones, 2006) showed that doing so can improve the performance of the final classifier significantly. In (Su, Li, & Zhang, 2001), Principal Components Analysis was used to identify important features. To our knowledge, there are few reports addressing this issue.

The Scaling of Active Learning 

The scaling of AL to very large databases has not been extensively studied yet, although it is an important issue for many real applications. Some approaches have been proposed for indexing the database (Lai, Goh, & Chang, 2004) and for overcoming the concept complexities that accompany the scaling of the dataset (Panda, Goh, & Chang, 2006).



CONCLUSION 

In this chapter, we have summarized the techniques of ALSVM, which has been an area of active research since 2000. We first focused on heuristic ALSVM approaches within the framework of version space minimization. We then introduced mixed methods, which can compensate for the deficiencies of single ones. Finally, we outlined future research trends: techniques for selecting the initial labeled training set, feature-based AL, and the scaling of AL to very large databases.
KEY TERMS

Heuristic Active Learning: The set of active learning algorithms in which the sample selection criterion is based on some heuristic objective function. For example, version space based active learning selects the sample which can reduce the size of the version space.

Hypothesis Space: The set of all hypotheses 
in which the objective hypothesis is assumed to be 
found. 

Semi-Supervised Learning: The set of learning 
algorithms in which both labelled and unlabelled data 
in the training dataset are directly used to train the 
classifier. 

Statistical Active Learning: The set of active learning algorithms in which the sample selection criterion is based on some statistical objective function, such as minimization of the generalisation error, bias or variance. Statistical active learning is usually statistically optimal.

Supervised Learning: The set of learning algorithms in which the samples in the training dataset are all labelled.

Unsupervised Learning: The set of learning algorithms in which the samples in the training dataset are all unlabelled.

Version Space: The subset of the hypothesis space 
which is consistent with the training set. 
INTRODUCTION

This chapter spans topics from such important areas 
as Artificial Intelligence, Computational Geometry 
and Biometric Technologies. The primary focus is on 
the proposed Adaptive Computation Paradigm and 
its applications to surface modeling and biometric 
processing. 

The availability of much more affordable storage and high-resolution image capturing devices has contributed significantly over the past few years to the accumulation of very large datasets (such as GIS maps, biometric samples and videos). On the other hand, it has also created significant challenges, driven by the higher than ever volume and complexity of the data, that can no longer be resolved by acquiring more memory or faster processors, or by optimizing existing algorithms. These developments justify the need for radically new concepts for massive data storage, processing and visualization. To address this need, the current chapter presents an original methodology based on the paradigm of Adaptive Geometric Computing. The methodology enables storing complex data in a compact form, providing efficient access to it, preserving a high level of detail and visualizing dynamic changes in a smooth and continuous manner.

The first part of the chapter discusses adaptive algorithms in real-time visualization, specifically in GIS (Geographic Information Systems) applications. Data structures such as Real-time Optimally Adapting Mesh (ROAM) and Progressive Mesh (PM) are briefly surveyed. The Adaptive Spatial Memory (ASM) method, developed by R. Apu and M. Gavrilova, is then introduced. This method allows fast and efficient visualization of complex data sets representing terrains, landscapes and Digital Elevation Models (DEM). Its advantages are briefly discussed.

The second part of the chapter presents an application of the adaptive computation paradigm and evolutionary computing to missile simulation. As a result, patterns of complex behavior can be developed and analyzed.

The final part of the chapter marries the concept of adaptive computation with topology-based techniques and discusses their application to the challenging area of biometric computing.



BACKGROUND 

For a long time, researchers have been pressed with the question of how to model real-world objects (such as terrain, facial structure or particle systems) realistically while preserving rendering efficiency and space. As a solution, grid, mesh, TIN, Delaunay triangulation-based and other methods for model representation were developed over the last two decades. Most of these are static methods, not suitable for rendering dynamic scenes or preserving a higher level of detail.

In 1997, the first methods for dynamic model representation were developed: Real-time Optimally Adapting Mesh (ROAM) (Duchaineau et al., 1997; Lindstrom and Koller, 1996) and Progressive Mesh (PM) (Hoppe, 1997). Various methods have been proposed to reduce a fine mesh to an optimized representation, so that the optimized mesh contains fewer primitives and yields maximum detail. However, this approach had two major limitations. Firstly, the optimization is very expensive (several minutes to optimize one medium-sized mesh). Secondly, the generated non-uniform mesh is still static. As a result, it yields poor quality when only a small part of the mesh is being observed. Thus, even with further improvements, these methods were not capable of dealing with large amounts of complex data or significantly varied levels of detail. They were soon replaced by a different computational model for rendering geometric meshes (Li Sheng et al., 2003; Shafae and Pajarola, 2003). The model employs a continuous refinement criterion based on an error metric to optimally adapt to a more accurate representation. Therefore, given a mesh representation and a small change in the viewpoint, the optimized mesh for the next viewpoint can be computed by refining the existing mesh.



ADAPTIVE GEOMETRIC COMPUTING 

This chapter presents the Adaptive Multi-Resolution Technique for real-time terrain visualization, which optimizes the mesh dynamically for smooth and continuous visualization at a very high efficiency (frame rate) (Apu and Gavrilova, 2005, 2007). Our method is characterized by an efficient representation of the massive underlying terrain, uses efficient transitions between detail levels, and achieves frame rate constancy, ensuring visual continuity. At the core of the method is adaptive processing: a formalized hierarchical representation that exploits the subsequent refinement principle. This allows us full control over the complexity of the feature space. An error metric is assigned by a higher-level process in which objects (or features) are initially classified into different labels. Thus, this adaptive method is highly useful for feature space representation. In 2006, Gavrilova and Apu showed that such methods can act as a powerful tool not only for terrain rendering, but also for motion planning and adaptive simulations (Apu and Gavrilova, 2006). They introduced the Adaptive Spatial Memory (ASM) model, which uses an adaptive approach in a real-time online algorithm for multi-agent collaborative motion planning. They demonstrated that the powerful notion of adaptive computation can be applied to the perception and understanding of space. An extension of this method to 3D motion planning, part of collaborative research with Prof. I. Kolingerova's group, has been reported to be significantly more efficient than conventional methods (Broz et al., 2007).

We now turn to evolutionary computing. We demonstrate the power of adaptive computation by developing an adaptive computational model and applying it to missile simulation (Apu and Gavrilova, 2006). The adaptive algorithms described above have the property that spatial memory units can form, refine and collapse to simulate learning, adapting and responding to stimuli. The result is a complex multi-agent learning algorithm that clearly demonstrates organic behaviors, such as a sense of territory, trails and tracks, observed in flocks or herds of wild animals and in insects. This motivates exploring the mechanism in application to swarm behavior modeling.

Swarm Intelligence (SI) is the property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge (Bonabeau, 1999). Swarm intelligence provides a basis for exploring the collective (distributed) behavior of a group of agents without centralized control or the provision of a global model. Agents in such a system have limited perception (or intelligence) and cannot individually carry out complex tasks. According to Bonabeau, by regulating the behavior of the agents in the swarm, one can demonstrate emergent behavior and intelligence as a collective phenomenon. Although the swarming phenomenon is largely observed in biological organisms, such as an ant colony or a flock of birds, it has recently been used to simulate complex dynamic systems focused on accomplishing a well-defined objective (Kennedy, 2001; Raupp and Thalmann, 2001).



Figure 1. Split and merge operations in ASM model
Let us now investigate the application of the adaptive computational paradigm and the swarm intelligence concept to missile behavior simulation (Apu and Gavrilova, 2006). First of all, let us note that complex strategic behavior can be observed by means of a task-oriented artificial evolutionary process in which the behaviors of individual missiles are described with surprising simplicity. Secondly, the global effectiveness and behavior of the missile swarm is relatively unaffected by the disruption or destruction of individual units. From a strategic point of view, this adaptive behavior is a strongly desired property in military applications, which motivates our interest in applying it to missile simulation. Note that this problem was chosen because it presents a complex challenge for which an optimum solution is very hard to obtain using traditional methods. The dynamic and competitive relationship between missiles and turrets makes it extremely difficult to model using a deterministic approach. It should also be noted that the problem has an easy evaluation metric that allows fitness values to be determined precisely.

Now, let us summarize the idea of evolutionary
optimization by applying a genetic algorithm to evolve
the missile genotype. We are particularly interested in
observing the evolution of complex 3D formations and
tactical strategies that the swarm learns in order to
maximize its effectiveness during an attack simulation
run. The simulation is based on attack, evasion and
defense. While the missiles set a strategy to strike the
target, the battleship prepares to shoot down as many
missiles as possible (Figure 2 illustrates the basic
missile maneuvers).

Figure 2. Basic maneuvers for a missile using the
Gene String

Each attempt to destroy the target is called an attack
simulation run. Its effectiveness equals the number of
missiles hitting the target, so the outcome of the
simulation is easily quantifiable. On the other hand, the
interaction between missiles and the battleship is
complex and nontrivial. As a result, war strategies may
emerge in which a local penalty (e.g., sacrificing a
missile) can optimize global efficiency (e.g., a
deception strategy). The simplest form of information
known to each missile is its position and orientation
and the location of the target. This information is
augmented with information about the missile's
neighborhood and environment, which influences its
navigation pattern. For the actual missile behavior
simulation, we use a strategy based on a modified
version of the Boids flocking technique.
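
To make the flocking idea concrete, the following is a minimal
sketch of one update step of the classical Boids rules (cohesion,
alignment, separation) in 2D, with an extra target-seeking term
added as an assumption to mimic "location of the target". It is
not the modified technique of Apu and Gavrilova (2006), whose
details are not given here; the function name boids_step, all
weights and the neighborhood radius are hypothetical.

```python
import math

def boids_step(positions, velocities, target, radius=5.0, max_speed=1.0,
               w_cohesion=0.01, w_alignment=0.05, w_separation=0.1,
               w_target=0.02):
    """Advance every agent by one time step; returns (positions, velocities).

    Each agent only uses local information: its own state, the states of
    neighbors closer than `radius`, and the target location.
    """
    new_velocities = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        neighbors = [j for j, q in enumerate(positions)
                     if j != i and math.dist(p, q) < radius]
        steer = [0.0, 0.0]
        if neighbors:
            n = len(neighbors)
            for k in range(2):
                center = sum(positions[j][k] for j in neighbors) / n
                mean_v = sum(velocities[j][k] for j in neighbors) / n
                away = sum(p[k] - positions[j][k] for j in neighbors)
                steer[k] += w_cohesion * (center - p[k])    # move toward neighbors
                steer[k] += w_alignment * (mean_v - v[k])   # match their heading
                steer[k] += w_separation * away             # avoid crowding
        for k in range(2):
            # Assumed extension: a pull toward the known target location.
            steer[k] += w_target * (target[k] - p[k])
        nv = [v[k] + steer[k] for k in range(2)]
        speed = math.hypot(*nv)
        if speed > max_speed:                               # clamp the speed
            nv = [c * max_speed / speed for c in nv]
        new_velocities.append(nv)
    new_positions = [[p[k] + nv[k] for k in range(2)]
                     for p, nv in zip(positions, new_velocities)]
    return new_positions, new_velocities
```

Calling boids_step repeatedly on a list of positions and
velocities moves the group toward the target while the three
local rules keep it loosely together; no agent has a global
model of the swarm.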

We have just outlined the set of actions necessary
to reach the target or interact with the environment.
This is the basic building block of missile navigation.
The gene string is another important part, reflecting
the complexity with which such courses of action
can be chosen. It contains a unique combination of
maneuvers (such as attack, evasion, etc.) that evolve
to create complex combined intelligence. We describe
the fitness of the missile gene in terms of collective
performance. After investigating various possibilities,
we developed and used a two-dimensional adaptive
fitness function to evolve the missile strains in one
evolutionary system. Details on this approach can be
found in (Apu and Gavrilova, 2006).
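
The evolution of a gene string can be illustrated with a very
small genetic algorithm. The sketch below is an assumption-laden
toy: the gene string is encoded as a list of maneuver labels, and
toy_fitness is a placeholder that stands in for an attack
simulation run. It is not the two-dimensional adaptive fitness
function of the original work, and the names evolve, toy_fitness
and MANEUVERS are hypothetical.

```python
import random

# Maneuver labels are taken from the text; the list encoding is assumed.
MANEUVERS = ["attack", "evade", "retreat"]

def toy_fitness(genes):
    """Placeholder for an attack simulation run (NOT the authors' fitness).

    Rewards every 'evade' immediately followed by 'attack', simply so that
    the search has a pattern to discover.
    """
    return sum(1 for a, b in zip(genes, genes[1:])
               if a == "evade" and b == "attack")

def evolve(fitness, length=12, pop_size=30, generations=40,
           mutation_rate=0.05, seed=0):
    """Evolve gene strings with truncation selection, one-point crossover
    and per-gene mutation; returns the fittest string found."""
    rng = random.Random(seed)
    pop = [[rng.choice(MANEUVERS) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the better half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.choice(MANEUVERS) if rng.random() < mutation_rate else g
                     for g in child]              # occasional random mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

In a real system the call to fitness would run the simulation
and count the missiles that hit the target; because the better
half of each generation is kept unchanged, the best fitness
never decreases from one generation to the next.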

After extensive experimentation, we found many
interesting characteristics, such as geometric attack
formations and organic behaviors observed among
swarms, in addition to the highly anticipated strategies
such as simultaneous attack, deception and retreat
(see Figure 3). We also examined adaptability by
randomizing the simulation coordinates, distance,
initial formation, attack rate and other parameters
of the missiles, and measured the mean and variance
of the fitness function. The results show that many
of the genotypes that evolved are highly adaptive to
the environment.
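
The adaptability check described above can be sketched as
follows. The function robustness and the scenario fields are
hypothetical, the parameter ranges are invented for
illustration, and simulate stands for whatever routine runs one
attack simulation and returns its fitness.

```python
import random
import statistics

def robustness(simulate, genotype, trials=50, seed=0):
    """Evaluate one genotype over randomized scenarios and return the
    (mean, variance) of the resulting fitness values.

    `simulate(genotype, scenario)` is an assumed callable returning a number;
    the ranges below are illustrative, not taken from the source.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        scenario = {
            "distance": rng.uniform(500.0, 2000.0),     # distance to target
            "attack_rate": rng.uniform(0.5, 2.0),       # launches per time unit
            "start": (rng.uniform(-100.0, 100.0),       # initial coordinates
                      rng.uniform(-100.0, 100.0)),
        }
        scores.append(simulate(genotype, scenario))
    return statistics.mean(scores), statistics.pvariance(scores)
```

A genotype with a high mean and a low variance across such
randomized runs would be considered well adapted to changes in
the environment.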

We have just reviewed the application of the adaptive
computational paradigm to swarm intelligence and
briefly described an efficient tactical swarm simulation
method (Apu and Gavrilova, 2006). The results clearly
demonstrate that the swarm is able to develop complex
strategies through the evolutionary process of genotype
mutation. This contribution, among other works on
adaptive computational intelligence, will be profiled in
detail in an upcoming book in the Springer-Verlag
book series on Computational Intelligence (Gavrilova,
2007).

As stated in the introduction, adaptive computation
is based on a variable-complexity level-of-detail
paradigm, where a physical phenomenon can be simulated
by a continuous process of local adaptation of spatial
complexity. As presented by M. Gavrilova in a plenary
lecture at the 3IA Eurographics Conference, France, in
2006, the adaptive paradigm is a powerful
computational model that can also be applied to the vast
area of biometric research. This section therefore
reviews methods and techniques based on adaptive
geometric methods in application to biometric problems.
It emphasizes the advantages that an intelligent approach
to geometric computing brings to the area of complex
biometric data processing (Gavrilova, 2007).

In information technology, biometrics refers to the
study of physical and behavioral characteristics for
the purpose of person identification (Yanushkevich,
Gavrilova, Wang and Srihari, 2007). In recent years, the
area of biometrics has witnessed tremendous growth,
partly as a result of a pressing need for increased
security, and partly as a response to new technological
advances that are literally changing the way we live.
The availability of much more affordable storage and of
high-resolution biometric image capturing devices
has contributed to the accumulation of very large
datasets of biometric data. In the earlier sections, we
studied the background of adaptive mesh generation. Let
us now look at the background research in
topology-based data structures and its application to
biometric research. This information is highly relevant
to the goals of modeling and visualizing complex
biometric data. At the same time as adaptive methodology
was developing in GIS, interest in topology-based data
structures, such as Voronoi diagrams and Delaunay
triangulations, grew significantly. Some preliminary
results on the utilization of these topology-based data
structures in biometrics began to appear. For instance,
research on image processing using Voronoi diagrams was
presented in (Liang and Asano, 2004; Asano, 2006),
studies utilizing Voronoi diagrams for fingerprint
synthesis were conducted by (Bebis et al., 1999; Capelli
et al., 2002), and various surveys of methods for
modeling human faces using triangular meshes appeared in
(Wen and Huang, 2004; Li and Jain, 2005; Wayman
et al., 2005). Some interesting results were recently
obtained in the BTLab, University of Calgary, through
the development of topology-based feature extraction
algorithms for fingerprint matching (Wang et al.,
2006, 2007; an illustration is found in Figure 4), 3D
facial expression modeling (Luo et al., 2006) and iris
synthesis (Wecker et al., 2005). A comprehensive review
of topology-based approaches in biometric modeling
and synthesis can be found in a recent book chapter on
the subject (Gavrilova, 2007).

In this chapter, we propose to manage the challenges
arising from large volumes of complex biometric data
through the innovative utilization of the adaptive
paradigm. We suggest a combination of topology-based and
hierarchy-based methodologies to store and search
biometric data, as well as to optimize such a
representation based on data access and usage. Namely,
retrieval of the data, or the creation of a real-time
visualization, can be based on the dynamic pattern of
data usage (how often, what type of data, how much
detail, etc.), recorded and analyzed while the biometric
system is being used for recognition and identification
purposes.



Figure 3. Complex formation and attack patterns evolved:
(a) Deception pattern; (b) Distraction pattern;
(c) Organic motion pattern

Figure 4. Delaunay triangulation based technique for
fingerprint matching

In addition to using this information for optimized
data representation and retrieval, we also propose to
incorporate intelligent learning techniques to predict
the most likely patterns of system usage and to represent
and organize data accordingly.

On the practical side, to achieve our goal, we propose a
novel way to represent complex biometric data through
the organization of the data in a hierarchical tree-like
structure. Such an organization is similar in principle
to the Adaptive Memory Subdivision (AMS), capable of
representing and retrieving varying amounts of
information at the level of detail that needs to be
represented. A spatial quad-tree is used to hold the
information about the system, as well as the instructions
on how to process this information. Expansion is realized
through the spatial subdivision technique, which refines
the data and increases the level of detail, and
collapsing is realized through the merge operation, which
simplifies the data representation and makes it more
compact. A greedy strategy is used to adapt to the best
representation based on the user requirements, the amount
of available data and resources, the required resolution
and so on. This powerful technique enables us to achieve
the goal of compact biometric data representation, which
makes it possible, for instance, to efficiently store
minor details of the modeled face (e.g., scars, wrinkles)
or detailed patterns of the iris.
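
The expand and collapse operations on a spatial quad-tree can
be sketched in a few lines. This is only an illustration of the
split/merge idea under stated assumptions: the class QuadNode,
the function adapt and the user-supplied detail measure are
hypothetical names, and the simple threshold rule stands in for
whatever refinement criterion a real system would use.

```python
class QuadNode:
    """A square region; a leaf has no children, an expanded node has four."""

    def __init__(self, x, y, size, depth=0):
        self.x, self.y, self.size, self.depth = x, y, size, depth
        self.children = []

    def is_leaf(self):
        return not self.children

    def split(self):
        """Expand: subdivide into four quadrants to increase detail."""
        if self.is_leaf():
            h = self.size / 2
            self.children = [
                QuadNode(self.x + dx * h, self.y + dy * h, h, self.depth + 1)
                for dy in (0, 1) for dx in (0, 1)
            ]

    def merge(self):
        """Collapse: drop the subtree to make the representation compact."""
        self.children = []

    def leaves(self):
        if self.is_leaf():
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


def adapt(node, detail, threshold, max_depth):
    """Greedily refine where detail(node) exceeds threshold, collapse elsewhere.

    `detail` is an assumed callable, e.g. a local error or usage measure.
    """
    if node.depth < max_depth and detail(node) > threshold:
        node.split()
        for child in node.children:
            adapt(child, detail, threshold, max_depth)
    else:
        node.merge()
```

For example, if detail returns a high value only for regions
that contain a scar on a modeled face, repeated calls to adapt
keep small cells around the scar and large cells elsewhere, and
collapse the tree again once the detail is no longer needed.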



FUTURE TRENDS 

In addition to data representation, the adaptive technique
can be highly useful in biometric feature extraction for
the purpose of fast and reliable retrieval and matching
of biometric data, and in implementing dynamic
changes to the model. The methodology has a high
potential to become one of the key approaches in
biometric data modeling and synthesis.



CONCLUSION 

The chapter reviewed the adaptive computational 
paradigm in application to surface modeling, evolu- 
tionary computing and biometric research. Some of the 
key future developments in the upcoming years will 
undoubtedly highlight the area, inspiring new genera- 
tions of intelligent biometric systems with adaptive 
behavior. 



REFERENCES 

Apu, R. & Gavrilova, M. (2005) Geo-Mass: Modeling
Massive Terrain in Real-Time, Geomatica J. 59(3),
313-322.

Apu, R. & Gavrilova, M. (2006) Battle Swarm: An
Evolutionary Approach to Complex Swarm Intelligence,
3IA Int. C. Comp. Graphics and AI, Limoges, France,
139-150.

Apu, R & Gavrilova, M. (2007) Fast and Efficient 
Rendering System for Real-Time Terrain Visualization, 
IJCSE Journal, 2(2), 5/6. 

Apu, R. & Gavrilova, M. (2006) An Efficient Swarm 
Neighborhood Management for a 3D Tactical Simulator, 
IEEE-CS proceedings, ISVD 2006, 85- 93 






Adaptive Algorithms for Intelligent Geometric Computing 



Asano, T. (2006) Aspect-Ratio Voronoi Diagram with 
Applications, ISVD 2006, IEEE-CS proceedings, 32- 
39 

Bebis, G., Deaconu, T. & Georgiopoulos, M. (1999)
Fingerprint Identification using Delaunay
Triangulation, ICIIS 99, Maryland, 452-459

Bonabeau, E., Dorigo, M. & Theraulaz, G. (1999) 
Swarm Intelligence: From Natural to Artificial Systems, 
NY: Oxford Univ. Press 

Broz, P., Kolingerova, I, Zitka, P., Apu R. & Gavrilova 
M. (2007) Path planning in dynamic environment using 
an adaptive mesh, SCCG 2007, Spring Conference on 
Computer Graphics 2007, ACM SIGGRAPH 

Capelli R, Maio, D, Maltoni D. (2002) Synthetic Fin- 
gerprint-Database Generation, ICPR 2002, Canada, 
vol 3, 369-376 

Duchaineau, M. et al. (1997) ROAMing Terrain:
Real-Time Optimally Adapting Meshes, IEEE
Visualization '97, 81-88

Gavrilova M.L. (2007) Computational Geometry and 
Image Processing in Biometrics : on the Path to Conver- 
gence, in Book Image Pattern Recognition: Synthesis 
and Analysis in Biometrics, Book Chapter 4, 103-133, 
World Scientific Publishers 

Gavrilova M.L. Computational Intelligence: A Geom- 
etry-Based Approach, in book series Studies in Com- 
putational Intelligence, Springer-Verlag, Ed. Janusz 
Kacprzyk, to appear. 

Gavrilova, M.L. (2006) IEEE-CS Book of the 3rd
International Symposium on Voronoi Diagrams in
Science and Engineering, IEEE-CS, Softcover, 2006,
270 pages.

Gavrilova, M.L. (2006) Geometric Algorithms in 3D 
Real-Time Rendering and Facial Expression Modeling, 
3IA'2006 Plenary Lecture, Eurographics, Limoges, 
France, 5-18 

Hoppe, H. (1997) View-Dependent Refinement of 
Progressive Meshes, SIGGRAPH '97 Proceedings, 
189-198 

Kennedy, J., Eberhart, R. C. & Shi, Y. (2001) Swarm
Intelligence, San Francisco: Morgan Kaufmann
Publishers



Li Sheng, Liu Xuehui & Wu Enhau, (2003) Feature- 
Based Visibility-Driven CLOD for Terrain, In Proc. 
Pacific Graphics 2003, 313-322, IEEE Press 

Li, S. & Jain, A. (2005) Handbook of Face Recogni- 
tion. Springer-Verlag 

Liang X.F. & Asano T. (2004) A fast denoising method 
for binary fingerprint image, IASTED, Spain, 309- 
313 

Lindstrom, P. & Koller, D. (1996) Real-time continuous
level of detail rendering of height fields, SIGGRAPH
1996 Proceedings, 109-118

Luo, Y., Gavrilova, M. & Sousa, M.C. (2006) NPAR by
Example: line drawing facial animation from
photographs, CGIV'06, IEEE, Computer Graphics, Imaging
and Visualization, 514-521

Raupp, S. & Thalmann, D. (2001) Hierarchical Model
for Real Time Simulation of Virtual Human Crowds,
IEEE Trans. on Visualization and Computer Graphics
7(2), 152-164

Shafae, M. & Pajarola, R. (2003) Dstrips: Dynamic 
Triangle Strips for Real-Time Mesh Simplification and 
Rendering, Pacific Graphics 2003, 271-280 

Wang, C, Luo, Y, Gavrilova M & Rokne J. (2007) 
Fingerprint Image Matching Using a Hierarchical 
Approach, in Book Computational Intelligence in 
Information Assurance and Security, Springer SCI 
Series, 175-198 

Wang, H, Gavrilova, M, Luo Y. & J. Rokne (2006) An 
Efficient Algorithm for Fingerprint Matching, ICPR 
2006, Int. C. on Pattern Recognition, Hong Kong, 
IEEE-CS, 1034-1037 

Wayman J, Jain A, Maltoni D & Maio D. (2005) Bio- 
metric Systems: Technology, Design and Performance 
Evaluation, Book, Springer 

Wecker L, Samavati, F & Gavrilova M (2005) Iris 
Synthesis: A Multi-Resolution Approach, GRAPHITE 
2005, ACM Press. 121-125 

Wen, Z. & Huang, T. (2004) 3D Face Processing: 
Modeling, Analysis and Synthesis, Kluwer 

Yanushkevich, S, Gavrilova M., Wang, P & Srihari 
S. (2007) Image Pattern Recognition: Synthesis and 
Analysis in Biometrics, Book World Scientific 






KEY TERMS 

Adaptive Geometric Model (AGM): A new approach
to geometric computing utilizing the adaptive
computation paradigm. The model employs a continuous
refinement criterion based on an error metric to
optimally adapt to a more accurate representation.

Adaptive Multi-Resolution Technique (AMRT): A
method for real-time terrain visualization that
dynamically optimizes the mesh for smooth, continuous
and highly efficient visualization.

Adaptive Spatial Memory (ASM): A hybrid 
method based on the combination of traditional hier- 
archical tree structure with the concept of expanding 
or collapsing tree nodes. 

Biometric Technology (BT): An area of study of 
physical and behavioral characteristics with the purpose 
of person authentication and identification. 



Delaunay Triangulation (DT): A computational
geometry data structure dual to the Voronoi diagram.

Evolutionary Paradigm (EP): The collective 
name for a number of problem solving methods utiliz- 
ing principles of biological evolution, such as natural 
selection and genetic inheritance. 

Swarm Intelligence (SI): The property of a system 
whereby the collective behaviors of unsophisticated 
agents interacting locally with their environment cause 
coherent functional global patterns to emerge. 

Topology-Based Techniques (TBT): A group of
methods using geometric properties of a set of objects
in space and their proximity.

Voronoi Diagram (VD): A fundamental computational
geometry data structure that stores topological
information for a set of objects.



INTRODUCTION

Since the computer age dawned on mankind, one of 
the most important areas in information technology 
has been that of "decision support." Today, this area 
is more important than ever. Working in dynamic and 
ever-changing environments, modern-day managers 
are responsible for an assortment of far-reaching
decisions: Should the company increase or decrease its
workforce? Enter new markets? Develop new products?
Invest in research and development? The list goes on. 
But despite the inherent complexity of these issues and 
the ever-increasing load of information that business 
managers must deal with, all these decisions boil down 
to two fundamental questions: 

What is likely to happen in the future? 
What is the best decision right now? 

Whether we realize it or not, these two questions 
pervade our everyday lives — both on a personal and 
professional level. When driving to work, for instance, 
we have to make a traffic prediction before we can 
choose the quickest driving route. At work, we need 
to predict the demand for our product before we can 
decide how much to produce. And before investing in 
a foreign market, we need to predict future exchange 
rates and economic variables. It seems that regardless 
of the decision being made or its complexity, we first 
need to make a prediction of what is likely to happen 
in the future, and then make the best decision based on 
that prediction. This fundamental process underpins the 
basic premise of Adaptive Business Intelligence. 



BACKGROUND 

Simply put, Adaptive Business Intelligence is the
discipline of combining prediction, optimization, and
adaptability into a system capable of answering these
two fundamental questions: What is likely to happen
in the future? and What is the best decision right now?
(Michalewicz et al. 2007). To build such a system, we
first need to understand the methods and techniques that
enable prediction, optimization, and adaptability (Dhar
and Stein, 1997). At first blush, this subject matter is
nothing new, as hundreds of books and articles have
already been written on business intelligence (Vitt et
al., 2002; Loshin, 2003), data mining and prediction
methods (Weiss and Indurkhya, 1998; Witten and
Frank, 2005), forecasting methods (Makridakis et al.,
1988), optimization techniques (Deb 2001; Coello et
al. 2002; Michalewicz and Fogel, 2004), and so forth.
However, none of these has explained how to combine
these various technologies into a software system that is
capable of predicting, optimizing, and adapting.
Adaptive Business Intelligence addresses this very issue.

Clearly, the future of the business intelligence in- 
dustry lies in systems that can make decisions, rather 
than tools that produce detailed reports (Loshin 2003). 
As most business managers now realize, there is a 
world of difference between having good knowledge 
and detailed reports, and making smart decisions. 
Michael Kahn, a technology reporter for Reuters in 
San Francisco, makes a valid point in the January 16, 
2006 story entitled "Business intelligence software 
looks to future":

"But analysts say applications that actually answer
questions rather than just present mounds of data is
the key driver of a market set to grow 10 per cent in
2006 or about twice the rate of the business software
industry in general.

'Increasingly you are seeing applications being
developed that will result in some sort of action,' said
Brendan Barnacle, an analyst at Pacific Crest Equities.
'It is a relatively small part now, but it is clearly
where the future is. That is the next stage of business
intelligence.'"



MAIN FOCUS OF THE CHAPTER

"The answer to my problem is hidden in my data . . . but 
I cannot dig it up!" This popular statement has been 
around for years as business managers gathered and 
stored massive amounts of data in the belief that they 
contain some valuable insight. But business manag- 
ers eventually discovered that raw data are rarely of 
any benefit, and that their real value depends on an 
organization's ability to analyze them. Hence, the need 
emerged for software systems capable of retrieving, 
summarizing, and interpreting data for end-users (Moss 
and Atre, 2003). 

This need fueled the emergence of hundreds of 
business intelligence companies that specialized in 
providing software systems and services for extract- 
ing knowledge from raw data. These software systems 
would analyze a company's operational data and provide 
knowledge in the form of tables, graphs, pies, charts, 
and other statistics. For example, a business intelligence 
report may state that 57% of customers are between the 
ages of 40 and 50, or that product X sells much better 
in Florida than in Georgia. 1 

Consequently, the general goal of most business 
intelligence systems was to: (1) access data from a 
variety of different sources; (2) transform these data 
into information, and then into knowledge; and (3) 
provide an easy-to-use graphical interface to display 
this knowledge. In other words, a business intelligence 
system was responsible for collecting and digesting data, 
and presenting knowledge in a friendly way (thus en- 
hancing the end-user's ability to make good decisions). 
The diagram in Figure 1 illustrates the processes that 
underpin a traditional business intelligence system. 

Although different texts have illustrated the
relationship between data and knowledge in different ways
(e.g., Davenport and Prusak, 2006; Prusak, 1997;
Shortliffe and Cimino, 2006), the commonly accepted
distinction between data, information, and knowledge is:

• Data are collected on a daily basis in the form of
bits, numbers, symbols, and "objects."
• Information is "organized data," which are
preprocessed, cleaned, arranged into structures, and
stripped of redundancy.
• Knowledge is "integrated information," which
includes facts and relationships that have been
perceived, discovered, or learned.
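
The three levels can be illustrated with a toy example. The
records, field names and the functions to_information and
to_knowledge below are invented for illustration only; the
sketch simply shows raw data being cleaned and de-duplicated
into information, and then summarized into a statement of the
kind a business intelligence report might contain.

```python
# Data: raw records as they might be collected, including a duplicate,
# inconsistent formatting and a missing value (all values are made up).
raw = [
    {"id": 1, "age": "45", "state": "FL"},
    {"id": 1, "age": "45", "state": "FL"},    # duplicate record
    {"id": 2, "age": "  52", "state": "ga"},  # stray spaces, lower case
    {"id": 3, "age": None, "state": "FL"},    # missing age
    {"id": 4, "age": "41", "state": "FL"},
]

def to_information(records):
    """Information: cleaned, typed, structured and stripped of redundancy."""
    seen, clean = set(), []
    for r in records:
        if r["id"] in seen or r["age"] is None:
            continue
        seen.add(r["id"])
        clean.append({"id": r["id"],
                      "age": int(r["age"]),          # int() ignores the spaces
                      "state": r["state"].upper()})
    return clean

def to_knowledge(info):
    """Knowledge: a discovered fact, here the share of customers aged 40-50."""
    in_band = sum(1 for r in info if 40 <= r["age"] <= 50)
    return {"share_aged_40_to_50": in_band / len(info)}
```

Note that the result of to_knowledge is still only a fact about
the customers; it does not by itself say what decision should be
made.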

Because knowledge is such an essential component 
of any decision-making process (as the old saying 
goes, "Knowledge is power!"), many businesses have 
viewed knowledge as the final objective. But it seems 
that knowledge is no longer enough. A business may 
"know" a lot about its customers — it may have hun- 
dreds of charts and graphs that organize its customers 
by age, preferences, geographical location, and sales 
history — but management may still be unsure of 
what decision to make! And here lies the difference 
between "decision support" and "decision making": 
all the knowledge in the world will not guarantee the 
right or best decision. 

Moreover, recent research in psychology indicates
that widely held beliefs can actually hamper the
decision-making process. For example, common beliefs like
"the more knowledge we have, the better our decisions
will be," or "we can distinguish between useful and
irrelevant knowledge," are not supported by empirical
evidence. Having more knowledge merely increases
our confidence, but it does not improve the accuracy of
our decisions. Similarly, people supplied with "good"
and "bad" knowledge often have trouble distinguishing
between the two, which indicates that irrelevant
knowledge decreases our decision-making effectiveness.

Figure 1. The processes that underpin a traditional
business intelligence system

Today, most business managers realize that a 
gap exists between having the right knowledge and 
making the right decision. Because this gap affects 
management's ability to answer fundamental business 
questions (such as "What should be done to increase 
profits? Reduce costs? Or increase market share?"), 
the future of business intelligence lies in systems that 
can provide answers and recommendations, rather than 
mounds of knowledge in the form of reports. The future 
of business intelligence lies in systems that can make 
decisions! As a result, there is a new trend emerging in 
the marketplace called Adaptive Business Intelligence. 
In addition to performing the role of traditional busi- 
ness intelligence (transforming data into knowledge), 
Adaptive Business Intelligence also includes the deci- 
sion-making process, which is based on prediction and 
optimization as shown in Figure 2. 

While business intelligence is often defined as "a
broad category of application programs and
technologies for gathering, storing, analyzing, and
providing access to data," the term Adaptive Business
Intelligence can be defined as "the discipline of using
prediction and optimization techniques to build
self-learning 'decisioning' systems" (as the above
diagram shows). Adaptive Business Intelligence systems
include elements of data mining, predictive modeling,
forecasting, optimization, and adaptability, and are
used by business managers to make better decisions.

This relatively new approach to business intelligence
is capable of recommending the best course of action
(based on past data), but it does so in a very special way:
An Adaptive Business Intelligence system incorporates
prediction and optimization modules to recommend
near-optimal decisions, and an "adaptability module"
for improving future recommendations. Such systems
can help business managers make decisions that
increase efficiency, productivity, and competitiveness.
Furthermore, the importance of adaptability cannot be
overemphasized. After all, what is the point of using a
software system that produces subpar schedules,
inaccurate demand forecasts, and inferior logistic plans,
time after time? Would it not be wonderful to use a
software system that could adapt to changes in the
marketplace? A software system that could improve with
time?
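
The interplay of the three modules can be sketched as a small
loop. Everything in this sketch is an assumption made for
illustration: exponential smoothing is used as the prediction
technique, a search over candidate order quantities as the
optimization step, and re-selection of the smoothing factor as
the adaptability step. The function names, prices and candidate
values are hypothetical and do not describe any particular
Adaptive Business Intelligence product.

```python
def forecast(history, alpha):
    """Prediction module: simple exponential smoothing of past demand."""
    level = history[0]
    for d in history[1:]:
        level = alpha * d + (1 - alpha) * level
    return level

def best_quantity(predicted_demand, price, unit_cost, candidates):
    """Optimization module: choose the candidate order quantity with the
    highest profit, assuming demand turns out as predicted."""
    def profit(q):
        return price * min(q, predicted_demand) - unit_cost * q
    return max(candidates, key=profit)

def retune_alpha(history, alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Adaptability module: re-select the smoothing factor that would have
    given the smallest one-step-ahead squared error on the data so far."""
    def sse(alpha):
        return sum((forecast(history[:t], alpha) - history[t]) ** 2
                   for t in range(1, len(history)))
    return min(alphas, key=sse)

def run(demands, price=10.0, unit_cost=4.0, warmup=3):
    """Predict, decide, observe the outcome, then adapt; repeat."""
    history, alpha, decisions = list(demands[:warmup]), 0.5, []
    for actual in demands[warmup:]:
        predicted = forecast(history, alpha)
        decisions.append(
            best_quantity(predicted, price, unit_cost, range(0, 201, 10)))
        history.append(actual)           # new data arrive
        alpha = retune_alpha(history)    # adapt before the next decision
    return decisions, alpha
```

When demand shifts abruptly, retune_alpha moves toward larger
smoothing factors, so later forecasts, and therefore later
decisions, follow the new level more quickly than a fixed model
would.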



FUTURE TRENDS 

The concept of adaptability is certainly gaining popu- 
larity, and not just in the software sector. Adaptability 
has already been introduced in everything from auto- 
matic car transmissions (which adapt their gear-change 
patterns to a driver's driving style), to running shoes 
(which adapt their cushioning level to a runner's size 
and stride), to Internet search engines (which adapt 
their search results to a user's preferences and prior 
search history). These products are very appealing for 
individual consumers, because, despite their mass pro- 
duction, they are capable of adapting to the preferences 
of each unique owner after some period of time. 

Figure 2. Adaptive business intelligence system

The growing popularity of adaptability is also
underscored by a recent publication of the US
Department of Defense, which lists 19 important research
topics for the next decade. Many of them include
the term "adaptive": Adaptive Coordinated Control in
the Multi-agent 3D Dynamic Battlefield, Control for
Adaptive and Cooperative Systems, Adaptive System
Interoperability, Adaptive Materials for
Energy-Absorbing Structures, and Complex Adaptive
Networks for Cooperative Control.

Adaptability was recognized as an important
component of intelligence quite some time ago: Alfred
Binet (born 1857), the French psychologist and inventor
of the first usable intelligence test, defined intelligence
as "... judgment, otherwise called good sense,
practical sense, initiative, the faculty of adapting
one's self to circumstances." Adaptability is a vital
component
of any intelligent system, as it is hard to argue that a 
system is "intelligent" if it does not have the capacity 
to adapt. For humans, the importance of adaptability 
is obvious: our ability to adapt was a key element in 
the evolutionary process. In psychology, a behavior or 
trait is adaptive when it helps an individual adjust and 
function well within a changing social environment. 
In the case of artificial intelligence, consider a chess 
program capable of beating the world chess master: 
Should we call this program intelligent? Probably 
not. We can attribute the program's performance to its 
ability to evaluate the current board situation against a 
multitude of possible "future boards" before selecting 
the best move. However, because the program cannot 
learn or adapt to new rules, the program will lose its 
effectiveness if the rules of the game are changed or 
modified. Consequently, because the program is inca- 
pable of learning or adapting to new rules, the program 
is not intelligent. 

The same holds true for any expert system. No one 
questions the usefulness of expert systems in some en- 
vironments (which are usually well defined and static), 
but expert systems that are incapable of learning and 
adapting should not be called "intelligent." Some expert 
knowledge was programmed in, that is all. 

So, what are the future trends for Adaptive Business
Intelligence? In the words of Jim Goodnight, the CEO of
SAS Institute (Collins et al. 2007):

"Until recently, business intelligence was limited to
basic query and reporting, and it never really provided
that much intelligence ..."

However, this is about to change. Keith Collins, the
Chief Technology Officer of SAS Institute (Collins et
al. 2007), believes that:

"A new platform definition is emerging for business
intelligence, where BI is no longer defined as simple
query and reporting. [...] In the next five years, we'll
also see a shift in performance management to what
we're calling predictive performance management,
where analytics play a huge role in moving us beyond
just simple metrics to more powerful measures."

Further, Jim Davis, the VP of Marketing at SAS
Institute (Collins et al. 2007), stated:

"In the next three to five years, we'll reach a tipping
point where more organizations will be using BI to
focus on how to optimize processes and influence the
bottom line ..."

Finally, it would be important to incorporate adapt- 
ability in prediction and optimization components of 
the future Adaptive Business Intelligence systems. 

There are some recent, successful implementations 
of Adaptive Business Intelligence systems reported 
(e.g., Michalewicz et al. 2005), which provide daily 
decision support for large corporations and result in 
multi-million dollars return on investment. There are 
also companies (e.g., www.solveitsoftware.com) which 
specialize in development of Adaptive Business Intelli- 
gence tools. However, further research effort is required. 
For example, most of the research in machine learning 
has focused on using historical data to build prediction 
models. Once the model is built and evaluated, the goal 
is accomplished. However, because new data arrive at 
regular intervals, building and evaluating a model is just 
the first step in Adaptive Business Intelligence. Because 
these models need to be updated regularly (something 
that the adaptability module is responsible for), we 
expect to see more emphasis on this updating process 
in machine learning research. Also, the frequency of 
updating the prediction module, which can vary from 
seconds (e.g., in real-time currency trading systems), 
to weeks and months (e.g., in fraud detection systems) 
may require different techniques and methodologies. 
In general, Adaptive Business Intelligence systems 
would include the research results from control theory, 
statistics, operations research, machine learning, and 
modern heuristic methods, to name a few. We also 
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expect that major advances will continue to be made in 
modern optimization techniques. In the years to come, 
more and more research papers will be published on 
constrained and multi-objective optimization problems, 
and on optimization problems set in dynamic environ- 
ments. This is essential, as most real-world business 
problems are constrained, multi-objective, and set in 
a time-changing environment. 
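
The updating process described above can be illustrated with a small sketch. It assumes a hypothetical one-variable linear prediction model that is refreshed by one gradient step each time a new observation arrives; the class name, the learning rate, and the two data regimes are illustrative assumptions, not part of any system cited here.

```python
class OnlineLinearModel:
    """Linear predictor y = w*x + b, updated one observation at a time."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """One gradient step on the squared error of the newest observation."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error
        return error


model = OnlineLinearModel()

# Regime 1 (hypothetical historical data): y = 2x.
for _ in range(300):
    for k in range(10):
        x = k / 10
        model.update(x, 2 * x)
w_before, b_before = model.w, model.b

# Regime 2 (the environment has changed): y = -x + 3.
for _ in range(300):
    for k in range(10):
        x = k / 10
        model.update(x, -x + 3)
w_after, b_after = model.w, model.b
```

A model built once on the first regime would keep predicting y = 2x; the incremental updates let the parameters follow the change. How often `update` is called (seconds or months) is the design choice the text refers to.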



CONCLUSION

Intelligence is all about. Systems based on Adaptive
Business Intelligence aim at solving real-world business
problems that have complex constraints, are set
in time-changing environments, have several (possibly
conflicting) objectives, and where the number of possible
solutions is too large to enumerate. Solving these
problems requires a system that incorporates modules
for prediction, optimization, and adaptability.



TERMS AND DEFINITIONS

Adaptive Business Intelligence: The discipline of
using prediction and optimization techniques to build
self-learning "decisioning" systems.

Business Intelligence: A collection of tools, 
methods, technologies, and processes needed to 
transform data into actionable knowledge. 

Data: Pieces collected on a daily basis in the form 
of bits, numbers, symbols, and "objects." 

Data Mining: The application of analytical methods 
and tools to data for the purpose of identifying patterns, 
relationships, or obtaining systems that perform useful 
tasks such as classification, prediction, estimation, or 
affinity grouping. 

Information: "Organized data," which are prepro- 
cessed, cleaned, arranged into structures, and stripped 
of redundancy. 

Knowledge: "Integrated information," which in- 
cludes facts and relationships that have been perceived, 
discovered, or learned. 

Optimization: Process of finding the solution that 
is the best fit to the available resources. 

Prediction: A statement or claim that a particular 
event will occur in the future. 



ENDNOTE 

1 Note that business intelligence can be defined both 
as a "state" (a report that contains knowledge) and 
a "process" (software responsible for converting 
data into knowledge).



INTRODUCTION

Artificial neural networks (ANNs) (McCulloch & Pitts,
1943; Haykin, 1999) were developed as models of their
biological counterparts, aiming to emulate real neural
systems and to mimic the structural organization and
function of the human brain. Their applications are based
on their ability to self-design a solution to a problem by
learning it from data. A comparative study of
neural implementations running principal component
analysis (PCA) and independent component analysis
(ICA) was carried out. Artificially generated data,
additively corrupted with white noise in order to enforce
randomness, were employed to critically evaluate and
assess the reliability of data projections. Analysis in both
time and frequency domains showed the superiority of
the estimated independent components (ICs) relative
to principal components (PCs) in faithful retrieval of
the genuine (latent) source signals.

Neural computation belongs to information processing
dealing with adaptive, parallel, and distributed
(localized) signal processing. In data analysis, a common
task consists in finding an adequate subspace of
multivariate data for subsequent processing and
interpretation. Linear transforms are frequently employed
in data model selection due to their computational and
conceptual simplicity. Some common linear transforms
are PCA, factor analysis (FA), projection pursuit (PP),
and, more recently, ICA (Comon, 1994). The latter
emerged as an extension of nonlinear PCA (Hotelling,
1993) and developed in the context of blind source
separation (BSS) (Cardoso, 1998) in signal and array
processing. ICA is also related to recent theories of
the visual brain (Barlow, 1991), which assume that
consecutive processing steps lead to a progressive
reduction in the redundancy of representation (Olshausen
and Field, 1996).

This contribution is an overview of the PCA and
ICA neuromorphic architectures and their associated
algorithmic implementations, which are increasingly used
as exploratory techniques. The discussion is conducted on
artificially generated sub- and super-Gaussian source
signals.



BACKGROUND

In neural computation, transforming methods amount
to unsupervised learning, since the representation is
learned only from data without any external control.
Irrespective of the nature of learning, the neural
adaptation may be formally conceived as an optimization
problem: an objective function describes the task to be
performed by the network, and a numerical optimization
procedure adapts the network parameters (e.g.,
connection weights, biases, internal parameters). This
process amounts to a search, or nonlinear programming,
in a quite large parameter space. However, any prior
knowledge available on the solution might be efficiently
exploited to narrow the search space. In supervised
learning, the additional knowledge is incorporated in
the net architecture or learning rules (Gold, 1996).
Less extensive research has focused on unsupervised
learning. In this respect, the mathematical methods
usually employed are drawn from classical constrained
multivariate nonlinear optimization and rely on the
Lagrange multipliers method, the penalty or barrier
techniques, and the classical numerical algebra
techniques, such as deflation/renormalization (Fiori, 2000),
the Gram-Schmidt orthogonalization procedure, or the
projection over the orthogonal group (Yang, 1995).
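
As an illustration of one of the numerical algebra techniques named above, the following is a minimal sketch of Gram-Schmidt orthogonalization applied to the columns of a weight matrix. The function name `gram_schmidt` and the example matrix are assumptions made for illustration; the sketch uses the modified (numerically more stable) variant of the procedure.

```python
import numpy as np


def gram_schmidt(W):
    """Return a matrix whose columns form an orthonormal basis of span(W)."""
    W = np.asarray(W, dtype=float)
    Q = np.zeros_like(W)
    for k in range(W.shape[1]):
        v = W[:, k].copy()
        # Remove the components along the already orthonormalized columns.
        for j in range(k):
            v -= (Q[:, j] @ v) * Q[:, j]
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        Q[:, k] = v / norm
    return Q


# Hypothetical 3 x 2 weight matrix with non-orthogonal columns.
W = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q = gram_schmidt(W)
```

After the call, `Q.T @ Q` is the identity matrix up to rounding, which is the orthonormality constraint that the learning rules discussed later impose on the weight vectors.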

PCA and ICA Models 

Mathematically, the linear stationary PCA and ICA models
can be defined on the basis of a common data model.
Suppose that some stochastic processes are represented
by three random (column) vectors $x(t), n(t) \in \mathbb{R}^N$
and $s(t) \in \mathbb{R}^M$ with zero mean and finite covariance,
with the components of $s(t) = \{s_1(t), s_2(t), \ldots, s_M(t)\}$
being statistically independent and at most one Gaussian.
Let A be a rectangular constant full column rank
$N \times M$ matrix with at least as many rows as columns
($N \geq M$), and denote by t the sample index (i.e., time
or sample point) taking the discrete values t = 1, 2, ...,
T. We postulate the existence of a linear relationship
among these variables like:

$$x(t) = A s(t) + n(t) = \sum_{i=1}^{M} s_i(t)\, a_i + n(t) \qquad (1)$$

Here s(t), x(t), n(t), and A are the sources,
the observed data, the (unknown) noise in data, and
the (unknown) mixing matrix, respectively, whereas
$a_i$, i = 1, 2, ..., M are the columns of A. Mixing is
supposed to be instantaneous, so there is no time delay
between a (latent) source variable $s_i(t)$ mixing into
an observable (data) variable $x_j(t)$, with i = 1, 2, ...,
M and j = 1, 2, ..., N.

Consider that the stochastic vector process
$\{x(t)\} \in \mathbb{R}^N$ has the mean $E\{x(t)\} = 0$ and the
covariance matrix $C_x = E\{x(t)\, x(t)^T\}$. The goal of PCA is
to identify the dependence structure in each dimension
and to come out with an orthogonal transform matrix
W of size $L \times N$ from $\mathbb{R}^N$ to $\mathbb{R}^L$, $L < N$, such that
the L-dimensional output vector $y(t) = W x(t)$ sufficiently
represents the intrinsic features of the input
data, and where the covariance matrix $C_y$ of $\{y(t)\}$
is a diagonal matrix D with the diagonal elements arranged
in descending order, $d_{i,i} \geq d_{i+1,i+1}$. The restoration
of $\{x(t)\}$ from $\{y(t)\}$, say $\{\hat{x}(t)\}$, is consequently
given by $\hat{x}(t) = W^T W x(t)$ (Figure 1). For a given
L, PCA aims to find an optimal value of W, such as
to minimize the error function $J = E\{\|x(t) - \hat{x}(t)\|^2\}$.
The rows in W are the PCs of the stochastic process
$\{x(t)\}$ and the eigenvectors $c_j$, j = 1, 2, ..., L of the input
covariance matrix $C_x$. The subspace spanned by the
principal eigenvectors $\{c_1, c_2, \ldots, c_L\}$ with $L < N$, is
called the PCA subspace of dimensionality L.
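
The PCA model can be checked numerically with a batch (non-neural) computation. The sketch below assumes synthetic zero-mean Gaussian data with hand-picked variances and uses NumPy's symmetric eigensolver; the sizes T, N, L and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, L = 2000, 3, 2

# Synthetic data with one low-variance direction (illustrative values).
X = rng.normal(size=(N, T)) * np.array([[3.0], [1.0], [0.1]])
X -= X.mean(axis=1, keepdims=True)           # enforce E{x(t)} = 0

C_x = X @ X.T / T                            # sample covariance C_x
eigvals, eigvecs = np.linalg.eigh(C_x)       # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # indices in descending order
W = eigvecs[:, order[:L]].T                  # L x N, rows = principal eigenvectors

Y = W @ X                                    # y(t) = W x(t)
X_hat = W.T @ Y                              # restoration W^T W x(t)
C_y = Y @ Y.T / T                            # should be diagonal, descending
J = np.mean(np.sum((X - X_hat) ** 2, axis=0))  # mean squared restoration error
```

In this sketch `C_y` comes out diagonal with decreasing entries, and `J` equals the sum of the discarded eigenvalues of `C_x`, which is why keeping the L largest eigenvalues minimizes the restoration error.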

The ICA problem can be formulated as follows:
given T realizations of x(t), estimate both the matrix
A and the corresponding realizations of s(t). In BSS
the task is somewhat relaxed to finding the waveforms
$\{s_i(t)\}$ of the sources knowing only the (observed)
mixtures $\{x_j(t)\}$. If no suppositions are made about
the noise, the additive noise term is omitted in (1). A
practical strategy is to include noise in the signals as
supplementary term(s): hence the ICA model (Fig. 2)
becomes:

$$x(t) = A s(t) = \sum_{i=1}^{M} a_i\, s_i(t) \qquad (2)$$



Figure 1. Schematic of the PCA model

Figure 2. Schematic of the ICA model

Figure 3. A simple feed-forward ANN performing PCA and ICA

The source separation consists in updating an unmixing
matrix B(t), without resorting to any information
about the spatial mixing matrix A, so that the output vector
$y(t) = B(t)\, x(t)$ becomes an estimate $y(t) = \hat{s}(t)$
of the original independent source signals s(t). The
separating matrix B(t) is divided in two parts dealing
with dependencies in the first two moments, i.e.,
the whitening matrix V(t), and the dependencies in
higher-order statistics, i.e., the orthogonal separating
matrix W(t) in the whitened space (Fig. 2). If we assume
zero-mean observed data x(t), then we get by
whitening a vector $v(t) = V(t)\, x(t)$ with decorrelated
components. The subsequent linear transform W(t)
seeks the solution by an adequate rotation in the space
of component densities and yields $y(t) = W(t)\, v(t)$
(Fig. 2). The total separation matrix between the input
and the output layer turns out to be $B(t) = W(t)\, V(t)$.
In the standard stationary case, the whitening and
the orthogonal separating matrices converge to some
constant values after a finite number of iterations during
learning, that is, $B(t) \rightarrow B = W V$.
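
The two-stage structure B = W V can be illustrated with a batch computation. The sketch below assumes two synthetic sources, an arbitrary mixing matrix, and a whitening matrix obtained from the eigendecomposition of the sample covariance (one common choice, not necessarily the one used by the algorithms reviewed here); the rotation angle `theta` is an arbitrary placeholder rather than a learned solution.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000

# One sub-Gaussian (uniform) and one super-Gaussian (Laplacian) source.
S = np.vstack([rng.uniform(-1.0, 1.0, T), rng.laplace(size=T)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                   # illustrative mixing matrix
X = A @ S
X -= X.mean(axis=1, keepdims=True)           # zero-mean observed data

# Whitening: V = D^(-1/2) E^T, with C_x = E D E^T.
C_x = X @ X.T / T
d, E = np.linalg.eigh(C_x)
V = np.diag(d ** -0.5) @ E.T
Z = V @ X                                    # whitened vector v(t) = V x(t)
C_v = Z @ Z.T / T

# Any orthogonal W keeps the data white; here an arbitrary rotation.
theta = 0.3
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = W @ V                                    # total separation matrix
Y = B @ X
C_y = Y @ Y.T / T
```

Both `C_v` and `C_y` equal the identity matrix up to rounding. Second-order statistics therefore cannot distinguish between rotations, which is why the orthogonal matrix W has to be found from higher-order statistics.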



NEURAL IMPLEMENTATIONS 

A neural approach to BSS entails a network that has 
mixtures of the source signals as input and produces 
approximations of the source signals as output (Figure 
3). As a prerequisite, the input signals must be mutually
uncorrelated, a requirement usually fulfilled by
PCA. The output signals must nevertheless be mutually 
independent, which leads in a natural way from PCA 
to ICA. The higher order statistics required by source 
separation can be incorporated into computations either 
explicitly or by using suitable nonlinearities. ANNs 
better fit the latter approach (Karhunen, 1996). 

The core of the large class of neural adaptive algorithms
consists in a learning rule and its associated
optimization criterion (objective function). These two
items differentiate the algorithms, which are actually
families of algorithms parameterized by the nonlinear
function used. An update rule is specified by the iterative
incremental change $\Delta W$ of the rotation matrix W,
which gives the general form of the learning rule:

$$W \rightarrow W + \Delta W \qquad (3)$$

Neural PCA


First, consider a single artificial neuron receiving an
M-dimensional input vector x. It gradually adapts its
weight vector w so that the function $E\{f(w^T x)\}$ is
maximized, where E is the expectation with respect to
the (unknown) probability density of x and f is a continuous
objective function. The function f is bounded by
keeping the Euclidean norm of w constant. A constrained
gradient ascent learning rule based on a sequence of
sample functions for relatively small learning rates
$\alpha(t)$ is then (Oja, 1995):

$$w(t+1) = w(t) + \alpha(t)\,\bigl(I - w(t)\, w(t)^T\bigr)\, x(t)\, g\bigl(w(t)^T x(t)\bigr) \qquad (4)$$

where $g = f'$. Any PCA learning rule tends to find the
direction in the input space along which the data have
maximal variance. If all directions in the input space
have equal variance, the one-unit case with a suitable
nonlinearity approximately minimizes the kurtosis
of the neuron input. It means that the weight vector of
the unit will be determined by the direction in the input
space on which the projection of the input data is mostly
clustered and deviates significantly from normality. This
task is essentially the goal in the PP technique.
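
A minimal sketch of the one-unit rule (4) follows, assuming the linear choice g(y) = y, for which the rule reduces to the classical Oja rule and the weight vector approaches the principal eigenvector of the data covariance. The data set, the constant learning rate, the initial weight vector, and the function name `one_unit_rule` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000

# Zero-mean 2-D data whose direction of maximal variance is rotated by 30 degrees.
angle = np.pi / 6
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = R @ (rng.normal(size=(2, T)) * np.array([[2.0], [0.5]]))


def one_unit_rule(X, alpha=0.005, g=lambda y: y):
    """Iterate w <- w + alpha * (I - w w^T) x g(w^T x) over the samples."""
    w = np.array([0.0, 1.0])                 # arbitrary unit-norm start
    identity = np.eye(X.shape[0])
    for x in X.T:
        y = w @ x
        w = w + alpha * (identity - np.outer(w, w)) @ x * g(y)
    return w


w = one_unit_rule(X)
u = R[:, 0]                                  # true principal direction
alignment = abs(w @ u) / np.linalg.norm(w)   # |cosine| between w and u
```

With these settings `alignment` ends up close to 1 and the norm of `w` stays close to 1. Replacing `g` by a nonlinear function (for example `np.tanh`) gives the nonlinear variant discussed in the text.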

In the case of single layer ANNs consisting of L
parallel units, with each unit i having the same
M-element input vector x and its own weight vector
$w_i$, which together comprise an $M \times L$ weight matrix
$W = [w_1, w_2, \ldots, w_L]$, the following training rule
obtained from (4) is a generalization of the linear PCA
learning rule (in matrix form):

$$W(t+1) = W(t) + \alpha(t)\,\bigl(I - W(t)\, W(t)^T\bigr)\, x(t)\, g\bigl(x(t)^T W(t)\bigr) \qquad (5)$$



Due to the instability of the above nonlinear Hebbian
learning rule for the multi-unit case, a different
approach based on optimizing two criteria simultaneously
was introduced (Oja, 1982):

$$W(t+1) = W(t) + \mu(t)\, x(t)\, g\bigl(y(t)^T\bigr) + \gamma(t)\,\bigl(I - W(t)\, W(t)^T\bigr)\, W(t) \qquad (6)$$

Here $\mu(t)$ is chosen positive or negative depending
on our interest in maximizing or minimizing, respectively,
the objective function $J_1(w_i) = E\{f(x^T w_i)\}$.
Similarly, $\gamma(t)$ is another gain parameter that is
always positive and constrains the weight vectors to
orthonormality, which is imposed by an appropriate
penalty function such as:

$$J_2(w_i) = \frac{1}{2}\bigl(1 - w_i^T w_i\bigr)^2 + \frac{1}{2}\sum_{j \neq i} \bigl(w_i^T w_j\bigr)^2$$

This is the bigradient algorithm, which is iterated until
the weight vectors have converged with the desired
accuracy. This algorithm can use normalized Hebbian
or anti-Hebbian learning in a unified formula. Starting
from the one-unit rule, the multi-unit bigradient algorithm
can simultaneously extract several robust counterparts
of the principal or minor eigenvectors of the data
covariance matrix (Wang, 1996).
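
The role of the second gain parameter can be seen in isolation. The sketch below iterates only an orthonormalizing term of the form gamma * (I - W W^T) W, with the Hebbian term switched off (mu = 0), on a random M x L matrix; the matrix size, the gain value, and the function name `orthonormalize_step` are illustrative assumptions.

```python
import numpy as np


def orthonormalize_step(W, gamma):
    """One step W <- W + gamma * (I - W W^T) W pushing the columns of W
    towards orthonormality."""
    M = W.shape[0]
    return W + gamma * (np.eye(M) - W @ W.T) @ W


rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(4, 2))            # M = 4 inputs, L = 2 units
for _ in range(200):
    W = orthonormalize_step(W, gamma=0.2)

gram = W.T @ W                               # approaches the L x L identity
```

Each singular value s of W is mapped to s * (1 + gamma * (1 - s^2)), so for a small positive gain it converges to 1 while the subspace spanned by the columns is unchanged. In the full algorithm this term runs together with the (anti-)Hebbian term, which selects the subspace.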

In the case of multilayered ANNs, the transfer 
functions of the hidden nodes can be expressed by 
radial basis functions (RBF), whose parameters could 
be learnt by a two-stage gradient descent strategy. A 
new growing RBF-node insertion strategy with different 
RBF is used in order to improve the net performance.
The learning strategy is reported to save computational
time and memory space in approximation of continuous
and discontinuous mappings (Esposito et al., 2000).

Neural ICA 

Various forms of unsupervised learning have been
implemented in ANNs beyond standard PCA like
nonlinear PCA and ICA. Data whitening can be neurally
emulated by PCA with a simple iterative algorithm that
updates the sphering matrix V(t):

$$V(t+1) = V(t) - \alpha(t)\,\bigl(v v^T - I\bigr) \qquad (7)$$

After getting the decorrelation matrix V(t), the basic
task for ICA algorithms remains to come out with an
orthogonal matrix W(t), which is equivalent to a suitable
rotation of the decorrelated data $v(t) = V(t)\, x(t)$
aiming to maximize the product of the marginal densities
of its components. There are various neural approaches
to estimate the rotation matrix W(t). An important class
of algorithms is based on maximization of network
entropy (Bell, 1995). The BS nonlinear information
maximization (infomax) algorithm performs online
stochastic gradient ascent in mutual information (MI)
between outputs and inputs of a network. By minimizing
the MI between outputs, the network factorizes
the inputs into independent components. Considering
a network with the input vector x(t), a weight matrix
W(t), and a monotonically transformed output vector
$y = g(Wx + w_0)$, then the resulting learning rules for
the weights and bias-weights, respectively, are:

$$\Delta W = [W^T]^{-1} + x\,(1 - 2y)^T \quad \text{and} \quad \Delta w_0 = 1 - 2y \qquad (8)$$

In the case of bounded variables, the interplay
between the anti-Hebbian term $x\,(1 - 2y)^T$ and the
antidecay term $[W^T]^{-1}$ produces an output density
that is close to the flat constant distribution, which
corresponds to the maximum entropy distribution. Amari,
Cichocki, and Yang (Amari, 1996) altered the BS
infomax algorithm by using the natural gradient instead
of the stochastic gradient to reduce the complexity of
neural computations and significantly improve the
speed of convergence. The update rule proposed for
the separating matrix is:

$$\Delta W = \bigl[I - g(Wx)\,(Wx)^T\bigr]\, W \qquad (9)$$

Lee et al. (Lee, 2000) extended to both sub- and
super-Gaussian distributions the learning rule developed
from the infomax principle satisfying a general
stability criterion and preserving the simple initial
architecture of the network. Applying either natural
or relative gradient (Cardoso, 1996) for optimization,
their learning rule yields results that compete with
fixed-point batch computations.

The equivariant adaptive separation via independence
(EASI) algorithm introduced by Cardoso and
Laheld (1996) is a nonlinear decorrelation method. The
objective function $J(W) = E\{f(Wx)\}$ is subject to
minimization with the orthogonal constraint imposed
on W and the nonlinearity $g = f'$ chosen according to
data kurtosis. Its basic update rule equates to:

$$\Delta W = -\lambda\,\bigl(y y^T - I + g(y)\, y^T - y\, g(y^T)\bigr)\, W \qquad (10)$$

Fixed-point (FP) algorithms are searching the
ICA solution by minimizing mutual information (MI)
among the estimated components (Hyvarinen, 1997).
The FastICA learning rule finds a direction w so that
the projection $w^T x$ maximizes a contrast function
of the form $J_G(w) = \bigl[E\{f(w^T x)\} - E\{f(v)\}\bigr]^2$ with
v standing for the standardized Gaussian variable. The
learning rule is basically a Gram-Schmidt-like
decorrelation method.
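
To make the natural-gradient update of the form Delta W = [I - g(Wx)(Wx)^T] W concrete, the following sketch applies it to a toy two-source problem. Several choices are assumptions made for illustration and are not taken from the cited papers: the expectation is replaced by a batch average over all samples (rather than a one-sample online step), the nonlinearity is g = tanh (a usual choice for super-Gaussian sources), and the sources, mixing matrix, step size `eta`, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000

S = rng.laplace(size=(2, T))                 # two super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                   # illustrative mixing matrix
X = A @ S                                    # observed mixtures x(t) = A s(t)

W = np.eye(2)                                # initial separating matrix
eta = 0.1                                    # step size (assumed)
for _ in range(500):
    U = W @ X                                # current outputs Wx
    G = np.tanh(U) @ U.T / T                 # batch estimate of E{g(Wx)(Wx)^T}
    W = W + eta * (np.eye(2) - G) @ W        # natural-gradient step

Q = np.abs(W @ A)                            # overall transform, ideally a scaled permutation
sorted_rows = np.sort(Q, axis=1)
crosstalk = sorted_rows[:, 0] / sorted_rows[:, 1]   # small/large entry per row
```

When separation succeeds, each row of `Q` is dominated by a single entry and `crosstalk` is small. The scale and order of the recovered components remain arbitrary, as usual in ICA.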



ALGORITHM ASSESSMENT 

We comparatively ran both PCA and ICA neural
algorithms using synthetically generated time series
additively corrupted with some white noise to alleviate
strict determinism (Table 1 and Fig. 4). Neural PCA
was implemented using the bigradient algorithm since
it works for both minimization and maximization of
the criterion $J_1$ under the normality constraints enforced
by the penalty function $J_2$.

The neural ICA algorithms were the extended infomax
of Bell and Sejnowski, a semi-adaptive fixed-point
FastICA algorithm (Hyvarinen & Oja, 1997), an adapted
variant of the EASI algorithm optimized for real data, and
the extended generalized lambda distribution (EGLD)
maximum likelihood-based algorithm.
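
The synthetic sources of Table 1 can be generated along the following lines. This is a sketch under stated assumptions: the sample index runs over t = 1, ..., T, `rem` is translated to `np.remainder`, the square-wave expression is kept exactly as printed in the table, the uniform random numbers are drawn from (0, 1] to keep the logarithm finite, and the function name `make_sources`, the `noise_std` parameter, and the seed are hypothetical.

```python
import numpy as np


def make_sources(T, noise_std=0.0, seed=0):
    """Return a 6 x T array with the six source signals of Table 1,
    optionally corrupted by additive white Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1, dtype=float)
    s1 = 2 * np.sin(t / 149) * np.cos(t / 8)               # modulated sinusoid
    s2 = np.sign(np.sin(12 * t + 9 * np.cos(2 / 29)))      # square waves (as printed)
    s3 = (np.remainder(t, 79) - 17) / 23                   # saw-tooth
    s4 = ((np.remainder(t, 23) - 11) / 9) ** 5             # impulsive curve
    s5 = 5 * np.exp(-t / 121) * np.cos(37 * t)             # exponential decay
    signs = (rng.random(T) < 0.5) * 2 - 1                  # random +/-1
    s6 = signs * np.log(1.0 - rng.random(T))               # spiky noise
    S = np.vstack([s1, s2, s3, s4, s5, s6])
    if noise_std > 0:
        S = S + noise_std * rng.normal(size=S.shape)
    return S


S = make_sources(1000)
```

Mixing these rows with a full-rank matrix A then yields observations x(t) = A s(t) on which the algorithms can be compared.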

Table 1. The analytical form of the signal sources

Signal source       Analytical form
Modulated sinusoid  S(1) = 2 * sin(t/149) * cos(t/8)
Square waves        S(2) = sign(sin(12 * t + 9 * cos(2/29)))
Saw-tooth           S(3) = (rem(t,79) - 17)/23
Impulsive curve     S(4) = ((rem(t,23) - 11)/9)^5
Exponential decay   S(5) = 5 * exp(-t/121) * cos(37 * t)
Spiky noise         S(6) = ((rand(1,T) < .5) * 2 - 1) * log(rand(1,T))

Figure 4. Sub-Gaussian (left) and super-Gaussian (right) source signals and their corresponding histograms
(bottom)

In the case of artificially generated sources, the accuracy
of separating the latent sources by an algorithm
performing ICA can be measured by means of some
quantitative indexes. The first we used was defined as
the signal-to-interference ratio (SIR):



$$\mathrm{SIR} = \frac{1}{N} \sum_{i=1}^{N} 10 \log_{10} \frac{\max(Q_i)^2}{Q_i^T Q_i - \max(Q_i)^2} \qquad (11)$$

where Q = BA is the overall transforming matrix of
the latent source components, $Q_i$ is the i-th column
of Q, $\max(Q_i)$ is the maximum element of $Q_i$, and
N is the number of the source signals. The higher the
SIR is, the better the separation performance of the
algorithm.

The second index employed was the distance between
the overall transforming matrix Q and an ideal
permutation matrix, which is interpreted as the
cross-talking error (CTE):

$$\mathrm{CTE} = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} \frac{|Q_{ij}|}{\max|Q_i|} - 1 \right) + \sum_{j=1}^{N} \left( \sum_{i=1}^{N} \frac{|Q_{ij}|}{\max|Q_j|} - 1 \right) \qquad (12)$$

Above, $Q_{ij}$ is the ij-th element of Q, $\max|Q_i|$ is
the maximum absolute valued element of the row i
in Q, and $\max|Q_j|$ is the maximum absolute valued
element of the column j in Q. A permutation matrix is
defined so that on each of its rows and columns, only
one of the elements equals unity while all the other
elements are zero. It means that the CTE attains its
minimum value zero for an exact permutation matrix
(i.e., perfect decomposition) and goes positively higher
the more Q deviates from a permutation matrix (i.e.,
decomposition of lower accuracy).

We defined the relative signal retrieval error
(SRE) as the Euclidean distance between the
source signals and their best matching estimated
components normalized to the number of source
signals, times the number of time samples,
and times the module of the source signals:

(13)

The lower the SRE is, the better the estimates
approximate the latent source signals.
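
The SIR and CTE indexes can be computed directly from the overall transforming matrix Q = BA. The sketch below assumes the column-wise reading of SIR and the row-plus-column reading of CTE described in the text; the function names `sir_db` and `cte` and the example matrix are illustrative assumptions.

```python
import numpy as np


def sir_db(Q):
    """Mean signal-to-interference ratio (dB) over the columns of Q."""
    Q = np.asarray(Q, dtype=float)
    peak = np.max(Q, axis=0) ** 2            # max(Q_i)^2 for each column
    total = np.sum(Q ** 2, axis=0)           # Q_i^T Q_i for each column
    return float(np.mean(10 * np.log10(peak / (total - peak))))


def cte(Q):
    """Cross-talking error: zero for an exact permutation matrix."""
    absQ = np.abs(np.asarray(Q, dtype=float))
    rows = np.sum(absQ / absQ.max(axis=1, keepdims=True), axis=1) - 1
    cols = np.sum(absQ / absQ.max(axis=0, keepdims=True), axis=0) - 1
    return float(rows.sum() + cols.sum())


# Nearly perfect separation with 10% residual cross-talk (illustrative).
Q = np.array([[1.0, 0.1],
              [0.1, 1.0]])
sir_value = sir_db(Q)
cte_value = cte(Q)
cte_perfect = cte(np.array([[0.0, 1.0],
                            [1.0, 0.0]]))
```

For this Q the SIR is 20 dB and the CTE is 0.4, while an exact permutation gives a CTE of zero. For an exact permutation the SIR denominator vanishes, so `sir_db` is only meaningful when some residual interference is present.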

The stabilized version of the FastICA algorithm is
attractive for its fast and reliable convergence, and for the
lack of parameters to be tuned. The natural gradient
incorporated in the BS extended infomax performs
better than the original gradient ascent and is
computationally less demanding. Though the BS algorithm
is theoretically optimal in the sense of dealing with
mutual information as objective function, like all neural
unsupervised algorithms, its performance heavily
depends on the learning rates and its convergence is
rather slow. The EGLD algorithm separates skewed
distributions, even for zero kurtosis. In terms of
computational time, the BS extended infomax algorithm
was the fastest, FastICA retrieved the sources most
faithfully among all algorithms under test, while the
EASI algorithm came out with a full transform matrix
Q that is the closest to unity.



FUTURE TRENDS 

Neuromorphic methods in exploratory analysis and 
data mining are rapidly emerging applications of unsu- 
pervised neural training. In recent years, new learning 
algorithms have been proposed, yet their theoretical 
properties, range of optimal applicability, and compara- 
tive assessment have remained largely unexplored. No 
convergence theorems are associated with the training 
algorithms in use. Moreover, algorithm convergence 
heavily depends on the proper choice of the learning 
rate(s) and, even when convergence is accomplished, 
the neural algorithms are relatively slow compared with 
batch-type computations. Nonlinear and nonstationary 
neural ICA is expected to be developed due to ANNs'
nonalgorithmic processing and their ability to learn
nonanalytical relationships if adequately trained.



CONCLUSION 

Both PCA and ICA share some common features like 
aiming at building generative models that are likely 
to have produced the observed data and performing 
information preservation and redundancy reduction. 
In a neuromorphic approach, the model parameters 
are treated as network weights that are changed during 
the learning process. The main difficulty in function 
approximation stems from choosing the network pa- 
rameters that have to be fixed a priori, and those that 
must be learnt by means of an adequate training rule. 
PCA and ICA have major applications in data 
mining and exploratory data analysis, such as signal 
characterization, optimal feature extraction, and data 
compression, as well as the basis of subspace classi- 
fiers in pattern recognition. ICA is much better suited 
than PCA to perform BSS, blind deconvolution, and 
equalization.



KEY TERMS

Artificial Neural Networks (ANNs): An informa- 
tion-processing synthetic system made up of several 
simple nonlinear processing units connected by ele- 
ments that have information storage and programming 
functions adapting and learning from patterns, which 
mimics a biological neural network. 

Blind Source Separation (BSS): Separation of 
latent nonredundant (e.g., mutually statistically inde- 
pendent or decorrelated) source signals from a set of 
linear mixtures, such that the regularity of each result- 
ing signal is maximized, and the regularity between 
the signals is minimized (i.e. statistical independence 
is maximized) without (almost) any information on 
the sources. 

Confirmatory Data Analysis (CDA): An approach
which, subsequent to data acquisition, proceeds with
the imposition of a prior model and analysis, estimation,
and testing model parameters.



Exploratory Data Analysis (EDA): An approach 
based on allowing the data itself to reveal its underly- 
ing structure and model heavily using the collection of 
techniques known as statistical graphics. 

Independent Component Analysis (ICA): An
exploratory method for separating a linear mixture of
latent signal sources into independent components as
optimal estimates of the original sources on the basis
of their mutual statistical independence and
non-Gaussianity.

Learning Rule: Weight change strategy in a connectionist
system aiming to optimize a certain objective
function. Learning rules are iteratively applied to the
training set inputs with error gradually reduced as the 
weights are adapting. 

Principal Component Analysis (PCA): An orthogonal
linear transform based on singular value
decomposition that projects data to a subspace that
preserves maximum variance.



INTRODUCTION

Fuzzy logic became the core of a different approach 
to computing. Whereas traditional approaches to 
computing were precise, or hard edged, fuzzy logic 
allowed for the possibility of a less precise or softer 
approach (Klir et al., 1995, pp. 212-242). An approach
where precision is not paramount is not only closer to
the way humans think, but may in fact be easier to
create as well (Jin, 2000). Thus was born the field of
soft computing (Zadeh, 1994). Other techniques were
added to this field, such as Artificial Neural Networks
(ANN) and genetic algorithms, both modeled on
biological systems. Soon it was realized that these tools
could be combined, and that by mixing them together, they
could cover their respective weaknesses while at the
same time generating something that is greater than the
sum of its parts, or in short, creating synergy.

Adaptive Neuro-fuzzy is perhaps the most prominent 
of these admixtures of soft computing technologies 
(Mitra et al., 2000). The technique was first created 
when artificial neural networks were modified to work 
with fuzzy logic, hence the Neuro-fuzzy name (Jang 
et al., 1997, pp. 1-7). This combination provides fuzzy 
systems with adaptability and the ability to learn. It 
was later shown that adaptive fuzzy systems could be 
created with other soft computing techniques, such 
as genetic algorithms (Yen et al., 1998, pp. 469-490), 
Rough sets (Pal et al., 2003; Jensen et al., 2004, Ang 
et al., 2005) and Bayesian networks (Muller et al., 
1995), but the Neuro-fuzzy name was widely used, so 
it stayed. In this chapter we are using the most widely 
used terminology in the field. 

Neuro-fuzzy is a blanket description of a wide
variety of tools and techniques used to combine any
aspect of fuzzy logic with any aspect of artificial neural
networks. For the most part, these combinations are
just extensions of one technology or the other. For
example, neural networks usually take binary inputs,
but use weights that vary in value from 0 to 1. Adding
fuzzy sets to ANN to convert a range of input values
into values that can be used as weights is considered a
Neuro-fuzzy solution. This chapter will pay particular
attention to the sub-field where the fuzzy logic rules are
modified by the adaptive aspect of the system.

The rest of this chapter is organized as
follows: in section 1 we examine models and techniques
used to combine fuzzy logic and neural networks 
together to create Neuro-fuzzy systems. Section 2 
provides an overview of the main steps involved in the 
development of adaptive Neuro-fuzzy systems. Section 
3 concludes this chapter with some recommendations 
and future developments. 



NEURO-FUZZY TECHNOLOGY 

Neuro-fuzzy Technology is a broad term used to describe 
a field of techniques and methods used to combine 
fuzzy logic and neural networks together (Jin, 2003, 
pp. 111-140). Fuzzy logic and neural networks each 
have their own sets of strengths and weaknesses, and 
most attempts to combine these two technologies have 
the goal of using each technique's strengths to cover
the other's weaknesses.

Neural networks are capable of self-learning,
classification and associating inputs with outputs. Neural
networks can also become a universal function
approximator (Kosko, 1997, pp. 299; Nauck et al., 1998;
Nauck et al., 1999). Given enough information about
an unknown continuous function, such as its inputs
and outputs, the neural network can be trained to
approximate it. The disadvantages of neural networks are
that they are not guaranteed to converge, that is, to be trained
properly, and that after they have been trained they cannot
give any information about why they take a particular
course of action when given a particular input.

Fuzzy logic inference systems can give human
readable and understandable information about why
a particular course of action was taken because they are
governed by a series of IF-THEN rules. Fuzzy logic
systems can adapt in a way that their rules and the
parameters of the fuzzy sets associated with those rules
can be changed to meet some criteria. However, fuzzy
logic systems lack the capability for self-learning,
and must be modified by an external entity. Another
salient feature of fuzzy logic systems is that they are,
like artificial neural networks, capable of acting as
universal approximators.
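
A small sketch may help to show why a rule-based fuzzy system is readable and where its adjustable parameters sit. It assumes a hypothetical temperature-to-fan-speed controller with triangular fuzzy sets and constant rule outputs combined by a weighted average; the rule base, the set parameters, and the function names are all invented for illustration.

```python
def triangle(x, left, peak, right):
    """Triangular membership function returning a degree in [0, 1]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical rule base: IF temperature is <set> THEN fan speed is <output> (%).
# The (left, peak, right) triples are the fuzzy-set parameters that an
# adaptive (e.g., neural) component could tune.
RULES = [
    ("cold", (-10.0, 0.0, 20.0), 0.0),
    ("warm", (10.0, 20.0, 30.0), 50.0),
    ("hot", (20.0, 40.0, 50.0), 100.0),
]


def infer(temperature, rules=RULES):
    """Return the crisp output and the firing strength of every rule."""
    fired = [(name, triangle(temperature, *params), out)
             for name, params, out in rules]
    total = sum(strength for _, strength, _ in fired)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    output = sum(strength * out for _, strength, out in fired) / total
    return output, fired


speed_20, fired_20 = infer(20.0)
speed_25, fired_25 = infer(25.0)
```

The list of fired rules is the human-readable explanation of the decision: at 25 degrees the "warm" rule fires with strength 0.5 and the "hot" rule with strength 0.25. Nothing in this sketch changes the parameters by itself, which is the lack of self-learning noted above.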

The common feature of being able to act as a universal
approximator is the basis of most attempts to
merge these two technologies. Not only can it be used
to approximate a function, but it can also be used by both
neural networks and fuzzy logic systems to approximate
each other (Pal et al., 1999, pp. 66).

Universal approximation is the ability of a system 
to replicate a function to some degree. Both neural 
networks and fuzzy logic systems do this by using a 
non-mathematical model of the system (Jang et al., 
1997, pp. 238; Pal et al., 1999, pp. 19). The term
approximate is used as the model does not have to match
the simulated function exactly, although it is sometimes
possible to do so if enough information about the
function is available. In most cases it is not necessary or
even desirable to perfectly simulate a function, as this
takes time and resources that may not be available, and
a close approximation is often good enough.

Categories of Neuro-Fuzzy Systems 

Efforts to combine fuzzy logic and neural networks have
been underway for several years and many methods
have been attempted and implemented. These methods
are of two major categories:

• Fuzzy Neural Networks (FNN): neural networks
  that can use fuzzy data, such as fuzzy rules, sets
  and values (Jin, 2003, pp. 205-220).
• Neural-Fuzzy Systems (NFS): fuzzy systems
  "augmented" by neural networks (Jin, 2003,
  pp. 111-140).

There are also four main architectures used for
implementing neuro-fuzzy systems:

• Fuzzy Multi-layer networks (Jang, 1993; Mitra et
  al., 1995; Mitra et al., 2000; Mamdani et al., 1999;
  Sugeno et al., 1988; Takagi et al., 1985).
• Fuzzy Self-Organizing Map networks (Drobics
  et al., 2000; Kosko, 1997, pp. 98; Haykin, 1999,
  pp. 443).
• Black-Box Fuzzy ANN (Bellazzi et al., 1999; Qiu,
  2000; Monti, 1996).
• Hybrid Architectures (Zatwarnicki, 2005; Borzemski
  et al., 2003; Marichal et al., 2001; Rahmoun et
  al., 2001; Koprinska et al., 2000; Wang et al., 1999;
  Whitfort et al., 1995).



DEVELOPMENT OF ADAPTIVE
NEURO-FUZZY SYSTEMS

Developing an Adaptive Neuro-fuzzy system is a process
that is similar to the procedures used to create fuzzy
logic systems and neural networks. One advantage of
this combined approach is that it is usually no more
complicated than either approach taken individually.

As noted above, there are two methods of creating
a Neuro-fuzzy system: integrating fuzzy logic into a
neural network framework (FNN), and implementing
neural networks into a fuzzy logic system (NFS). A
fuzzy neural network is just a neural network with some
fuzzy logic components; hence it is generally trained like
a normal neural network.

Training Process: The training regimen for a NFS
differs slightly from that used to create a neural network
and a fuzzy logic system in some key ways, while at
the same time incorporating many improvements over
those training methods.

The training process of a Neuro-fuzzy system has
five main steps (Von Altrock, 1995, pp. 71-75):



Obtain Training Data: The data must cover all 
possible inputs and output, and all the critical 
regions of the function if it is to model it in an 
appropriate manner. 






Adaptive Neuro-Fuzzy Systems 



• Create a Fuzzy Logic System: The fuzzy system
may be an existing system that is known to work,
such as one that has been in production for some
time, or one that has been created by following
expert system development methodologies.
• Define the Neural Fuzzy Learning: This phase
deals with defining what the system is to
learn. This allows greater control over the learning
process while still allowing for rule knowledge
discovery.
• Training Phase: The training algorithm is run.
The algorithm may have parameters that can be
adjusted to control how the system is
modified during training.
• Optimization and Verification: Validation can
take many forms, but will usually involve feeding
the system a series of known inputs to determine
whether it generates the desired output and/or
stays within acceptable parameters. Furthermore, the
rules and membership functions may be extracted
so they can be examined by human experts for
correctness.
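
The five steps can be sketched on a toy problem. The Python code below is an illustration under stated assumptions, not the procedure of Von Altrock (1995): the target function y = x^2, the five fixed triangular fuzzy sets, the delta-rule (LMS) update and all names (`memberships`, `predict`, `train`) are choices made for this example. Only the rule consequents are learned; the fuzzy sets stay fixed.

```python
def memberships(x, centers):
    """Degrees of membership of x in evenly spaced triangular fuzzy sets."""
    step = centers[1] - centers[0]
    return [max(0.0, 1.0 - abs(x - c) / step) for c in centers]

def predict(x, centers, consequents):
    """Weighted average of the rule consequents."""
    mu = memberships(x, centers)
    return sum(m * w for m, w in zip(mu, consequents)) / sum(mu)

def train(data, centers, consequents, rate=0.5, epochs=200):
    """Tune the rule consequents with a delta-rule update, as in a neural network."""
    for _ in range(epochs):
        for x, target in data:
            mu = memberships(x, centers)
            total = sum(mu)
            error = target - predict(x, centers, consequents)
            for i, m in enumerate(mu):
                consequents[i] += rate * error * m / total
    return consequents

# 1. Obtain training data covering the whole input range (toy target: y = x^2).
training = [(i / 40, (i / 40) ** 2) for i in range(41)]
# 2. Create an initial fuzzy system: five fuzzy sets, consequents not yet known.
centers = [0.0, 0.25, 0.5, 0.75, 1.0]
consequents = [0.0] * len(centers)
# 3. Define what may be learned: here only the consequents are adjusted.
# 4. Training phase.
train(training, centers, consequents)
# 5. Verification on inputs that were not used for training.
validation = [(i + 0.5) / 40 for i in range(40)]
worst = max(abs(predict(x, centers, consequents) - x ** 2) for x in validation)
print(consequents, worst)
```

After training, each consequent can be read directly as a rule ("IF x is near 0.5 THEN y is about 0.25"), which is the kind of inspection by human experts that the last step refers to.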



CONCLUSION AND FUTURE 
DEVELOPMENTS 

Advantages of ANF systems: Although there are many
ways to implement a Neuro-fuzzy system, the advantages
described for these systems are remarkably uniform
across the literature. The advantages attributed to
Neuro-fuzzy systems as compared to ANNs are usually
related to the following aspects:

• Faster to train: An ANN has a massive number
of connections, each with a non-trivial number of
associated calculations. In addition, most neural fuzzy
systems can be
trained by going through the data once, whereas a
neural network may need to be exposed to the same
training data many times before it converges.
• Fewer computational resources: A neural fuzzy
system is smaller and contains fewer internal
connections than a comparable ANN; hence it is
faster and uses significantly fewer resources.
• Possibility of extracting the rules: This
is a major advantage over ANNs in that the rules
governing a system can be communicated to the
human users in an easily understandable form.



Limitation of ANF systems: The greatest limitation
in creating adaptive systems is known as the "Curse of
Dimensionality", named after the exponential
growth in the number of features that the model
has to keep track of as the number of input attributes
increases. Each attribute in the model is a variable in
the system, which corresponds to an axis in the
multidimensional graph that the function is mapped into. The
connections between different attributes correspond to
the number of potential rules in the system, as given
by the formula:

$$N_{\text{rules}} = \left(L_{\text{linguistic terms}}\right)^{\text{variables}}$$

(Gorrostieta et al., 2006)



This formula becomes more complicated if different
numbers of linguistic terms (fuzzy sets) cover
each attribute dimension. Fortunately, there are
ways around this problem. As the neural fuzzy system
is only approximating the function being modeled, the
system may not need all the attributes to achieve the
desired results.
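
A short Python sketch makes the growth concrete. It assumes, as is usual for a full rule grid, that the general case is the product of the numbers of fuzzy sets on each input; the function name `potential_rules` is hypothetical.

```python
from math import prod

def potential_rules(terms_per_variable):
    """Number of potential rules: the product of the fuzzy-set counts of all inputs."""
    return prod(terms_per_variable)

print(potential_rules([3] * 2))    # 3 terms on each of 2 inputs  -> 9
print(potential_rules([3] * 10))   # 3 terms on each of 10 inputs -> 59049
print(potential_rules([3, 5, 7]))  # different counts per input   -> 105
```

Dropping attributes that the approximation does not need reduces the exponent, which is why attribute selection is an effective way around the problem.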

Another area of criticism in the Neuro-fuzzy field is
related to aspects that cannot be learned or approximated.
One of the best-known is the caveat attached
to universal approximation: the
function being approximated has to be continuous, that is,
it must have no abrupt jumps and no
singularities (points where it goes to infinity). Other
functions that Adaptive Neuro-fuzzy systems may have
problems learning include encryption algorithms,
which are purposely designed to be resistant to this
type of analysis.

Future developments: Predicting the future has
always been hard; for ANF technology, however,
future expansion is made easier by the
widespread use of its underlying technologies (neural networks
and fuzzy logic). Mixing these technologies creates
synergies, as each remedies the other's weaknesses.
ANF technology allows complex systems to be grown
instead of someone having to build them.

One of the most promising areas for ANF systems
is System Mining. There are many cases where we
wish to automate a system that cannot be systematically
described in a mathematical manner. This means there is
no way of creating the system using classical development
methodologies (e.g., programming a simulation). If we
have an adequately large set of examples of inputs and
their corresponding outputs, ANF can be used to obtain
a model of the system. The rules and their associated
fuzzy sets can then be extracted from this model and
examined for details about how the system works. This
knowledge can be used to build the system directly.
One interesting application of this technology is auditing
existing complex systems. The extracted rules could
be used to determine whether they match the expectations
of what the system is supposed to do, and even to detect
fraudulent actions. Alternatively, the extracted model may
reveal an alternative and/or more efficient way of
implementing the system.
KEY TERMS

Artificial Neural Networks (ANN): An artificial 
neural network, often just called a "neural network" 
(NN), is an interconnected group of artificial neurons 
that uses a mathematical model or computational model 
for information processing based on a connectionist 
approach to computation. Knowledge is acquired by
the network from its environment through a learning
process, and interneuron connection strengths (synaptic
weights) are used to store the acquired knowledge.

Evolving Fuzzy Neural Network (EFuNN): An
Evolving Fuzzy Neural Network is a dynamic architecture
where the rule nodes grow if needed and shrink
by aggregation. New rule units and connections can
be added easily without disrupting existing nodes.
The learning scheme is often based on the concept of
the "winning rule node".

Fuzzy Logic: Fuzzy logic is an application area of 
fuzzy set theory dealing with uncertainty in reasoning. 
It utilizes concepts, principles, and methods developed 
within fuzzy set theory for formulating various forms 
of sound approximate reasoning. Fuzzy logic allows
set membership values to range (inclusively) between
0 and 1 and, in its linguistic form, allows imprecise concepts
like "slightly", "quite" and "very". Specifically, it
allows partial membership in a set.

Fuzzy Neural Networks (FNN): Fuzzy neural networks
are neural networks enhanced with fuzzy logic capabilities,
such as the use of fuzzy data, fuzzy rules, sets and values.

Neuro-Fuzzy Systems (NFS): A neuro-fuzzy system
is a fuzzy system that uses a learning algorithm
derived from or inspired by neural network theory to
determine its parameters (fuzzy sets and fuzzy rules)
by processing data samples.

Self-Organizing Map (SOM): The self-organizing
map is a subtype of artificial neural network. It
is trained using unsupervised learning to produce a
low-dimensional representation of the training samples while
preserving the topological properties of the input space.
The self-organizing map is a single-layer feed-forward
network whose output neurons are arranged in a
low-dimensional (usually 2D or 3D) grid. Each input is
connected to all output neurons. Attached to every neuron
is a weight vector with the same dimensionality
as the input vectors. The number of input dimensions
is usually much higher than the output grid dimension.
SOMs are mainly used for dimensionality reduction
rather than expansion.

Soft Computing: Soft Computing refers to a 
partnership of computational techniques in computer 
science, artificial intelligence, machine learning and 
some engineering disciplines, which attempt to study, 
model, and analyze complex phenomena. The principal
partners at this juncture are fuzzy logic, neuro-computing,
probabilistic reasoning, and genetic algorithms.
Thus the principle of soft computing is to exploit the
tolerance for imprecision, uncertainty, and partial truth
to achieve tractability, robustness, low solution cost,
and better rapport with reality.
INTRODUCTION

Before the advent of software engineering, the lack 
of memory space in computers and the absence of 
established programming methodologies led early 
programmers to use self-modification as a regular 
coding strategy. 

Although unavoidable and valuable for that class
of software, solutions using self-modification proved
inadequate as programs grew in size and complexity,
and as security and reliability became major
requirements.

Software engineering, in the 1970s, almost led to the
disappearance of self-modifying software, whose use
was afterwards limited to small low-level machine-language
programs with very special requirements.

Nevertheless, recent research in this area,
and the modern need for powerful and effective ways
to represent and handle complex phenomena in
high-technology computers, are leading self-modification
to be considered again as an implementation choice
in several situations.

Artificial intelligence contributed strongly to this
scenario by developing and applying non-conventional
approaches, e.g. heuristics, knowledge representation
and handling, inference methods, evolving software/hardware,
genetic algorithms, neural networks, fuzzy
systems, expert systems, machine learning, etc.

In this publication, another alternative is proposed 
for developing Artificial Intelligence applications: the 
use of adaptive devices, a special class of abstractions 
whose practical application in the solution of current 
problems is called Adaptive Technology. 

The behavior of adaptive devices is defined by a
dynamic set of rules. In this case, knowledge may be
represented, stored and handled within that set of rules:
rules are added or removed to reflect the addition
or elimination of the information they represent.

Because of the explicit way adopted for representing
and acquiring knowledge, adaptivity provides a very
simple abstraction for the implementation of artificial
learning mechanisms: knowledge may be conveniently
gathered by inserting and removing rules, and handled
by tracking the evolution of the set of rules and by
interpreting the collected information as the representation
of the knowledge encoded in the rule set.



MAIN FOCUS OF THIS ARTICLE 

This article provides concepts and foundations of
adaptivity and adaptive technology, gives a general
formulation for the adaptive abstractions in use, and
indicates their main applications.

It shows how rule-driven devices may be turned into
adaptive devices to be applied in modeling learning
systems, and introduces a recently formulated kind of
adaptive abstraction that has adaptive subjacent devices.
This novel feature may be valuable for implementing
meta-learning, since it enables adaptive devices to
change dynamically the way they modify their own
set of defining rules.

A significant amount of information concerning 
adaptivity and related subjects may be found at the 
(LTA Web site). 



BACKGROUND 

This section summarizes the foundations of adaptivity
and establishes a general formulation for adaptive
rule-driven devices (Neto, 2001), non-adaptivity being the
only restriction imposed on the subjacent device.

Some theoretical background is desirable for
study and research on adaptivity and Adaptive
Technology: formal languages, grammars, automata,
computation models, rule-driven abstractions and
related subjects.

Nevertheless, either for programming purposes or
for an initial contact with the theme, the basics of
adaptivity can be grasped even without
prior expertise in computer-theoretical subjects.

In adaptive abstractions, adaptivity may be achieved
by attaching adaptive actions to selected rules chosen
from the rule set defining some subjacent non-adaptive
device.

Adaptive actions enable adaptive devices to dynamically
change their behavior without external help, by
modifying their own set of defining rules whenever
their subjacent rule is executed.

For practical reasons, up to two adaptive actions are
allowed: one to be performed prior to the execution of
its underlying rule, and the other after it.

An adaptive device behaves as if it were piecewise
non-adaptive: starting with the configuration of its initial
underlying device, it iterates the following two steps
until reaching some well-defined final configuration:

• While no adaptive action is executed, run the
underlying device;
• Modify the set of rules defining the device by
executing an adaptive action.

Rule-Driven Devices 

A rule-driven device is any formal abstraction whose 
behavior is described by a rule set that maps each pos- 
sible configuration of the device into a corresponding 
next one. 

A device is deterministic when, for any configuration
and any input, a single next configuration is possible.
Otherwise, it is said to be non-deterministic.

Non-deterministic devices allow multiple valid 
possibilities for each move, and require backtracking, 
so deterministic equivalents are usually preferable in 
practice. 

Assume that:

• $D$ is some rule-driven device, defined as
$D = (C, R, S, c_0, A)$.
• $C$ is its set of possible configurations.
• $R \subseteq C \times (S \cup \{\varepsilon\}) \times C$ is the set of rules
describing its behavior, where $\varepsilon$ denotes the empty stimulus,
representing no events at all.
• $S$ is its set of valid input stimuli.
• $c_0 \in C$ is its initial configuration.
• $A \subseteq C$ is its set of final configurations.

Let $c_i \Rightarrow^{(r)} c_{i+1}$ (for short, $c_i \Rightarrow c_{i+1}$) denote the
application of some rule $r = (c_i, s, c_{i+1}) \in R$ to the current
configuration $c_i$ in response to some input stimulus
$s \in S \cup \{\varepsilon\}$, yielding its next configuration $c_{i+1}$.

Successive applications of rules in response to a
stream $w \in S^*$ of input stimuli, starting from the initial
configuration $c_0$ and leading to some final configuration
$c \in A$, are denoted $c_0 \Rightarrow^*_w c$. (The star postfix operator in
the formulae denotes the Kleene closure: its preceding
element may be re-instantiated or reapplied an arbitrary
number of times.)

We say that $D$ defines a sentence $w$ if, and only if,
$c_0 \Rightarrow^*_w c$ holds for some $c \in A$. The collection $L(D)$ of all
such sentences is called the language defined by $D$:

$$L(D) = \{\, w \in S^* \mid c_0 \Rightarrow^*_w c,\ c \in A \,\}$$
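
The definition can be sketched in Python as follows. This is a simplified illustration, not part of the cited formulation: configurations are reduced to plain state names, the empty stimulus is encoded as an empty string, and the class name `RuleDrivenDevice` and the sample device for the language a*b are hypothetical. Non-determinism is handled by following all reachable configurations at once rather than by backtracking.

```python
EMPTY = ""  # stands for the empty stimulus

class RuleDrivenDevice:
    """D = (C, R, S, c0, A), with R given as a set of (c, s, c_next) triples."""

    def __init__(self, rules, initial, finals):
        self.rules = set(rules)
        self.initial = initial
        self.finals = set(finals)

    def _closure(self, configs):
        """Add every configuration reachable through empty-stimulus rules."""
        seen, pending = set(configs), list(configs)
        while pending:
            c = pending.pop()
            for (src, s, dst) in self.rules:
                if src == c and s == EMPTY and dst not in seen:
                    seen.add(dst)
                    pending.append(dst)
        return seen

    def defines(self, w):
        """True iff w leads from c0 to some configuration in A, i.e. w is in L(D)."""
        current = self._closure({self.initial})
        for stimulus in w:
            moved = {dst for (src, s, dst) in self.rules
                     if src in current and s == stimulus}
            current = self._closure(moved)
        return bool(current & self.finals)

# Hypothetical device over S = {a, b} defining the language a*b.
d = RuleDrivenDevice(
    rules={("q0", "a", "q0"), ("q0", "b", "q1")},
    initial="q0",
    finals={"q1"},
)
print(d.defines("aab"), d.defines("aba"))  # True False
```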



Adaptive (Rule-Driven) Devices 

An adaptive rule-driven device $AD = (ND_0, AM)$
associates an initial subjacent rule-driven device
$ND_0 = (C_0, NR_0, S, c_0, A)$ with some adaptive mechanism
$AM$ that can dynamically change its behavior by
modifying its defining rules.

That is accomplished by executing non-null adaptive
actions chosen from a set $AA$ of adaptive actions,
which includes the null adaptive action $a^0$.

A built-in counter $t$ starts at 0 and is
self-incremented upon the execution of any adaptive action. Let $X_j$
denote the value of $X$ after $j$ executions of adaptive
actions by $AD$.

Adaptive actions in $AA$ call functions that map the current
set $AR_t$ of adaptive rules of $AD$ into $AR_{t+1}$ by inserting
adaptive rules $ar$ into $AM$ and removing them from it.

Let $AR$ be the set of all possible sets of adaptive
rules for $AD$. Any $a^k \in AA$ maps the current set of rules
$AR_t \in AR$ into $AR_{t+1} \in AR$:

$$a^k : AR \rightarrow AR$$

$AM$ associates with each rule $nr^p \in NR$ of the underlying
device $ND$ of $AD$ a pair of adaptive actions $ba^p, aa^p \in AA$:

$$AM \subseteq AA \times NR \times AA$$
Notation

When writing elementary adaptive actions, $?[ar]$, $+[ar]$
and $-[ar]$ respectively denote searching, inserting and
eliminating adaptive rules that follow template $ar$.

Note that $ar$ may contain references to parameters,
variables and generators, in order to allow cross-referencing
among elementary adaptive actions inside an
adaptive function.
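
The three elementary actions can be pictured as operations on a set of rules. The Python sketch below is a simplification made for illustration: rules are plain (origin, stimulus, destination) triples, a template is a triple in which `ANY` marks a free position, and the function names `query`, `insert` and `remove` are hypothetical. Parameters, variables and generators are not modeled.

```python
ANY = None  # wildcard position in a rule template

def matches(rule, template):
    """True when every fixed position of the template equals the rule's value."""
    return all(t is ANY or t == r for r, t in zip(rule, template))

def query(rules, template):
    """?[ar]: search for the rules that follow the template."""
    return {r for r in rules if matches(r, template)}

def insert(rules, rule):
    """+[ar]: insert a rule into the rule set."""
    rules.add(rule)

def remove(rules, template):
    """-[ar]: eliminate every rule that follows the template."""
    rules.difference_update(query(rules, template))

rules = {(1, "a", 1), (1, "", 2), (2, "", 3)}
found = query(rules, (ANY, "", ANY))   # all empty transitions
remove(rules, (2, "", ANY))            # drop the empty transition leaving state 2
insert(rules, (2, "b", 3))             # replace it with a transition consuming b
print(sorted(found), sorted(rules))
```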

Given an underlying rule $nr^p \in NR$, we define an
adaptive rule $ar^p \in AM$ as:

$$ar^p = (ba^p, nr^p, aa^p)$$

For each $AD$ move, $AM$ applies some $ar^p$ in three
steps:

a. execution of adaptive action $ba^p$ before applying
the subjacent rule $nr^p$;

b. application of the underlying non-adaptive rule
$nr^p$;

c. execution of adaptive action $aa^p$.



The following algorithm sketches the overall operation
of $AD$:

1. Initialize $c_0$, $w$;
2. If $w$ is exhausted, go to 7, else get the next event $s_t$;
3. For the current configuration $c_t$, determine the set
$CR$ of $c_t$-compatible rules;
a. if $CR = \emptyset$, reject $w$.
b. if $CR = \{(c_t, s, c')\}$, apply $(c_t, s, c')$ as in steps
4-6, leading $AD$ to $c_{t+1} = c'$.
c. if $CR = \{r^k = (c_t, s, c^k) \mid c^k \in C,\ k = 1, \ldots, n,\ n > 1\}$,
apply all rules $r^k$ in parallel, as in steps 4-6, leading
$AD$ to $c^1, c^2, \ldots, c^n$, respectively.
4. If $ba^p = a^0$, go to 5, else apply first $ba^p$. If rule $ar^p$
was removed by $ba^p$, go to 3, aborting $ar^p$; else
$AD$ has reached an intermediate configuration, then
go to 5.
5. Apply $nr^p$ to the current (intermediate) configuration,
yielding a new intermediate configuration;
6. Apply $aa^p$, yielding the next (stable) configuration
for $AD$; go to 2.
7. If some $c_{t+1} \in A$ was reached, then $AD$ accepts
$w$, otherwise $AD$ rejects $w$; stop.
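
The steps above can be sketched in Python for the deterministic case only (cases 3a and 3b; the parallel case 3c and empty stimuli are left out). Everything in the sketch is an assumption made for illustration: the class name `AdaptiveDevice`, the encoding of an adaptive rule as a dictionary entry `(config, stimulus) -> (ba, next_config, aa)`, the use of `None` for the null adaptive action, and the small demonstration device for the language a^n b^n (n >= 1), whose adaptive action `grow` lengthens a chain of b-consuming rules each time an a is read.

```python
class AdaptiveDevice:
    """Deterministic sketch: rules map (config, stimulus) -> (ba, next_config, aa)."""

    def __init__(self, rules, initial, finals):
        self.rules = dict(rules)
        self.config = initial                      # step 1
        self.finals = set(finals)
        self.t = 0                                 # executed adaptive actions

    def _adapt(self, action):
        if action is not None:                     # None is the null action
            action(self)
            self.t += 1

    def accepts(self, w):
        for s in w:                                # step 2: next event
            key = (self.config, s)
            while True:
                rule = self.rules.get(key)         # step 3: compatible rule
                if rule is None:
                    return False                   # step 3a: reject w
                ba, nxt, aa = rule
                self._adapt(ba)                    # step 4: before-action
                if self.rules.get(key) == rule:
                    break                          # rule survived its own ba
            self.config = nxt                      # step 5: underlying rule
            self._adapt(aa)                        # step 6: after-action
        return self.config in self.finals          # step 7

def grow(dev):
    """After-action of the rule consuming 'a': add one more rule consuming 'b'."""
    into_final = [k for k, (_, nxt, _) in dev.rules.items() if nxt == "f"]
    if not into_final:
        dev.rules[(0, "b")] = (None, "f", None)
        return
    src = into_final[0][0]
    fresh = f"q{dev.t}"                            # state name not used before
    dev.rules[(src, "b")] = (None, fresh, None)    # redirect the last b rule
    dev.rules[(fresh, "b")] = (None, "f", None)    # and extend the chain

def anbn(w):
    device = AdaptiveDevice({(0, "a"): (None, 0, grow)}, initial=0, finals={"f"})
    return device.accepts(w)

print(anbn("aaabbb"), anbn("aab"))  # True False
```

The inner loop mirrors the "go to 3" of step 4: if a before-action removes or replaces the rule being applied, the choice of rule is made again. A before-action that always replaces its own rule would loop forever, so a production implementation would need a guard.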

Hierarchical Multi-Level Adaptive 
Devices 

Let us define a more elaborate adaptive device by
generalizing the definition above. Call non-adaptive
devices level-0 devices; define level-1 devices as those
having subjacent level-0 devices, to each of whose rules
a pair of level-1 adaptive actions is attached.

Let the subjacent device be some level-$k$ adaptive
device. One may construct a level-$(k+1)$ device by attaching
a pair of level-$(k+1)$ adaptive actions to each of its
rules. This is the induction step for the definition of
hierarchically structured multi-level adaptive devices.

For $k > 0$, besides the set of rules defining the subjacent
level-$k$ device, there are also the adaptive functions that the
subjacent device performs at its own level; level-$(k+1)$
adaptive actions may be used to modify the behavior of these
level-$k$ adaptive functions.

So, for $k > 0$, level-$(k+1)$ devices can change the way
their subjacent level-$k$ devices modify themselves. That
also holds for $k = 0$, since even for $k = 0$ the (empty)
set of adaptive functions still exists.

Notation 

The absence of adaptive actions in non-adaptive rules
$nr$ is explicitly expressed by stating all level-0 rules $r_0$ in
the form $(a^0\ nr\ a^0)$. Therefore, level-$k$ rules $r_k$ take the
general format $(b^k\ r_{k-1}\ a^k)$, with both $b^k$ and $a^k$ level-$k$
adaptive actions, for any adaptive level $k > 0$.

So, level-$k$ adaptive devices have all their defining
rules stated in the standard form

$$(b^k (b^{k-1} (\ldots (b^1 (a^0 (c, s, c') a^0) a^1) \ldots) a^{k-1}) a^k)$$

with

$$(b^{k-1} (\ldots (b^1 (a^0 (c, s, c') a^0) a^1) \ldots) a^{k-1})$$

representing one of the rules defining the subjacent
level-$(k-1)$ adaptive device.
Hence, level-$i$ adaptive actions can modify both the
set of level-$i$ adaptive rules and the set of elementary
adaptive actions defining level-$(i-1)$ adaptive
functions.



A SIMPLE ILLUSTRATIVE EXAMPLE 

In the following example, graphical notation is used 
for clarity and conciseness. When drawing automata, 
(as usual) circles represent states; double-line circles 
indicate final states; arrows indicate transitions; labels 
on the arrows indicate tokens consumed by the transition 
and (optionally) an associated adaptive action. When 
representing adaptive functions, automata fragments in 
brackets stand for a group of transitions to be added (+) 
or removed (-) when the adaptive action is applied. 

Figure 1 shows the starting shape of an adaptive
automaton that accepts $a^n b^{2n} c^{3n}$, $n > 0$. At state 1, it
includes a transition consuming $a$, which performs
adaptive action $\mathcal{A}()$.

Figure 1. Initial configuration of the illustrative adaptive
automaton

Figure 2 defines how $\mathcal{A}()$ operates:

• Using state 2 as reference, eliminate the empty
transitions that use states $x$ and $y$;
• Add a sequence starting at $x$, with two transitions
consuming $b$;
• Append the sequence of two empty transitions
sharing state 2;
• Append a sequence with three transitions consuming $c$,
ending at $y$.

Figure 3 shows the first two shape changes of this
automaton after consuming the first two symbols $a$
(at state 1) in sentence $a^2 b^4 c^6$. In its last shape, the
automaton trivially consumes the remaining $b^4 c^6$, and
does not change any more.

There are many other examples of adaptive devices 
in the references. This almost trivial and intuitive case 
was shown here for illustration purposes only. 
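
The behavior described for this example can be reproduced with a short Python sketch. Since the figures are not available here, the initial shape is an assumption: a transition consuming a at state 1 that triggers the adaptive action, followed by the empty transitions 1 -> 2 -> 3, with state 3 final. The names `make_automaton`, `adaptive_action`, `closure` and `accepts` are hypothetical. With this assumed initial shape the empty sentence (n = 0) is accepted as well.

```python
import itertools

EPS = ""  # empty transition

def make_automaton():
    # Assumed initial shape: 1 --a / A()--> 1, 1 --eps--> 2 --eps--> 3 (final).
    return {(1, "a", 1), (1, EPS, 2), (2, EPS, 3)}

def adaptive_action(rules, fresh):
    """Adaptive action: rewrite the automaton around reference state 2."""
    x = next(src for (src, sym, dst) in rules if sym == EPS and dst == 2)
    y = next(dst for (src, sym, dst) in rules if sym == EPS and src == 2)
    rules.difference_update({(x, EPS, 2), (2, EPS, y)})   # eliminate empty transitions
    g = [next(fresh) for _ in range(5)]                   # five new states
    rules.update({
        (x, "b", g[0]), (g[0], "b", g[1]),                # two transitions consuming b
        (g[1], EPS, 2), (2, EPS, g[2]),                   # two empty ones sharing state 2
        (g[2], "c", g[3]), (g[3], "c", g[4]), (g[4], "c", y),  # three consuming c
    })

def closure(rules, states):
    """States reachable from `states` through empty transitions."""
    seen, pending = set(states), list(states)
    while pending:
        q = pending.pop()
        for (src, sym, dst) in rules:
            if src == q and sym == EPS and dst not in seen:
                seen.add(dst)
                pending.append(dst)
    return seen

def accepts(w):
    rules, fresh = make_automaton(), itertools.count(4)
    current = closure(rules, {1})
    for sym in w:
        moved = {dst for (src, s, dst) in rules if src in current and s == sym}
        if sym == "a" and 1 in current:
            adaptive_action(rules, fresh)     # action attached to (1, a, 1)
        current = closure(rules, moved)
    return 3 in current

print(accepts("aabbbbcccccc"), accepts("aabbccc"))  # True False
```

After the a's have been consumed, the number of b-consuming transitions in the rule set is exactly 2n, so the value of n can be read off the automaton itself.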

Knowledge Representation 

The preceding example illustrates how adaptive devices 
use the set of rules as their only element for represent- 
ing and handling knowledge. 

A rule (here, a transition) may handle parametric 
information in its components (here, the transition's 
origin and destination states, the token labeling the 
transition, the adaptive function it calls, etc.). 

Rules may be combined together in order to represent 
some non-elementary information (here, the sequences 
of transitions consuming tokens "b" and "c" keep track 
of the value of n in each particular sentence). This way, 
rules and their components may work and may be in- 
terpreted as low-level elements of knowledge. 

Figure 2. Adaptive function $\mathcal{A}()$

Adaptive Technology and Its Applications

Figure 3. Configurations of the adaptive automaton after executing $\mathcal{A}()$ once and twice

Although it is impossible to impose rules on how
to represent and handle knowledge in systems represented
with adaptive devices, the details of the learning
process may be chosen according to the particular needs
of each system being modeled.

In practice, the learning behavior of an adaptive 
device may be identified and measured by tracking 
the progress of the set of rules during its operation and 
interpreting the dynamics of its changes. 

In the above example, when transitions are added to
the automaton by executing adaptive action $\mathcal{A}()$, one
may interpret the length of the sequence of transitions
consuming "b" (or "c") as a manifestation of the
knowledge that is being gathered by the adaptive automaton
on the value of $n$ (its exact value becomes available
after the sub-string of tokens "a" is consumed).



FUTURE TRENDS 

Adaptive abstractions represent a significant theoretical
advance in Computer Science, by introducing and
exploring powerful non-classical concepts such as
time-varying behavior, autonomously dynamic rule
sets, multi-level hierarchy, and static and dynamic
adaptive actions.

Those concepts allow establishing a modeling style
suitable for describing complex learning systems, for
efficiently solving traditionally hard problems, for
dealing with self-modifying learning methods, and
for providing computer languages and environments
for the convenient development of quality programs with
dynamically variant behavior.



All those features are vital for conceiving, modeling, 
designing and implementing applications in Artificial 
Intelligence, which benefits from adaptivity while 
expressing traditionally difficult-to-describe Artificial 
Intelligence facts. 

Listed below are features Adaptive Technology 
offers to several fields of Computation, especially to 
Artificial Intelligence-related ones, indicating their 
main impacts and applications. 

• Adaptive Technology provides a true computation
model, constructed around formal foundations.
Most Artificial Intelligence techniques in use
are very hard to express and follow, since the
connection between elements of the models and
the information they represent is often implicit, so
the reasoning behind their operation is difficult for a human
to track and plan. Adaptive rule-driven devices
concentrate all stored knowledge in their rules,
and the whole logic that handles such information
in their adaptive actions. Such properties open for
Artificial Intelligence the possibility to observe,
understand and control phenomena modeled with adaptive
devices. By following and interpreting how
and why changes occur in the device's set of rules,
and by tracking the semantics of adaptive actions, one
can infer the reasoning behind the model's reactions to
its input.
• Adaptive devices have enough processing power
to model complex computations. In (Neto, 2000),
some successful use cases are shown, with
simple and efficient adaptive devices used instead
of complex traditional formulations.
• Adaptive Devices are Turing Machine-equivalent
computation models that may be used in the
construction of single-notation full specifications
of programming languages, including lexical,
syntactical and context-dependent static-semantic
issues, language built-in features such as arithmetic
operations, libraries, semantics, code generation
and optimization, run-time code interpreting,
etc.

• Adaptive devices are well suited for representing
complex languages, including idioms. Natural
languages in particular require several features to
be expressed and handled, such as word inflections,
orthography, multiple syntax forms, phrase ordering,
ellipsis, permutation, ambiguities, anaphora and
others. A few simple techniques allow adaptive
devices to deal with such elements, greatly
simplifying the effort of representing and processing
them. Applications are wide-ranging, including machine
translation, data mining, text-to-voice and voice-to-text
conversion, etc.
• Computer art is another fascinating potential
application of adaptive devices. Music and other
artistic expressions are forms of human language.
Given some language descriptions, computers can
capture human skills and automatically generate
interesting outputs. Successful experiments
were carried out in the field of music, with excellent
results (Basseto, 1999).
• Decision-making systems may use Adaptive Decision
Tables and Trees for constructing intelligent
systems that accept training patterns, learn how
to classify them, and can therefore classify unknown
patterns. Successful experiments include
classifying geometric patterns, decoding sign
languages, locating patterns in images, generating
diagnoses from symptoms and medical data,
etc.

• Language inference uses Adaptive Devices to
generate formal descriptions of languages from
samples, by identifying and collecting structural
information and generalizing on the evidence
of repetitive or recursive constructs (Matsuno,
2006).
• Adaptive Devices can be used for learning purposes
by storing as rules the information gathered
on some monitored phenomenon. In educational
systems, the behavior of both students and trainers
can be inferred and used to decide how to
proceed.
• One can construct Adaptive Devices whose
underlying abstraction is a computer language.
Statements in such languages may be considered
as rules defining the behavior of a program. By
attaching adaptive rules to statements, the program
becomes self-modifiable. Adaptive languages are
needed for adaptive applications to be expressed
naturally. For adaptivity to become a true
programming style, techniques and methods must be
developed to construct good adaptive software,
since the adaptive applications developed so far have
usually been produced in a strictly ad hoc way.



CONCLUSION 

Adaptive Technology concerns techniques, methods and 
subjects referring to actual application of adaptivity. 

Adaptive automata (Neto, 1994) were first proposed
for the practical representation of context-sensitive
languages (Rubinstein, 1995). Adaptive grammars (Iwai,
2000) were employed as their generative counterpart
(Burshteyn, 1990), (Christiansen, 1990), (Cabasino,
1992), (Shutt, 1993), (Jackson, 2006).

For the specification and analysis of real-time reactive
systems, works were developed based on adaptive
versions of statecharts (Almeida Jr., 1995), (Santos,
1997). An interesting confirmation of the power and
usability of adaptive devices for modeling complex
systems (Neto, 2000) was the successful use of Adaptive
Markov Chains in a computer music-generating
device (Basseto, 1999).

Adaptive Decision Tables (Neto, 2001) and Adaptive
Decision Trees (Pistori, 2006) are currently being
tested in decision-making applications.

Experiments have been reported that explore the 
potential of adaptive devices for constructing language 
inference systems (Neto, 1998), (Matsuno, 2006). 

An important area in which adaptive devices show
their strength is the specification and processing of natural
languages (Neto, 2003). Many other results are being
achieved in representing the syntactical context-dependencies
of natural language.

Simulation and modeling of intelligent systems are
other concrete applications of adaptive formalisms, as
illustrated by the description of the control mechanism
of an intelligent autonomous vehicle that collects
information from its environment and builds maps
for navigation.

Many other applications for adaptive devices are 
possible in several fields. 



REFERENCES 

(* or ** - downloadable from LTA Website; ** - in
Portuguese only)

Almeida Jr., J.R. (1995)**. STAD - Uma ferramenta
para representação e simulação de sistemas através de
statecharts adaptativos. São Paulo, 202p. Doctoral Thesis.
Escola Politécnica, Universidade de São Paulo.

Basseto, B.A., Neto, J.J. (1999)*. A stochastic musical
composer based on adaptive algorithms. Anais do
XIX Congresso Nacional da Sociedade Brasileira de
Computação. SBC-99, Vol. 3, pp. 105-13.

Burshteyn, B. (1990). Generation and recognition
of formal languages by modifiable grammars. ACM
SIGPLAN Notices, v.25, n.12, p.45-53, 1990.

Cabasino, S.; Paolucci, P.S.; Todesco, G.M. (1992).
Dynamic parsers and evolving grammars. ACM
SIGPLAN Notices, v.27, n.11, p.39-48, 1992.

Christiansen, H. (1990). A survey of adaptable grammars.
ACM SIGPLAN Notices, v.25, n.11, p.33-44.

Iwai, M.K. (2000)**. Um formalismo gramatical
adaptativo para linguagens dependentes de contexto. São
Paulo 2000, 191p. Doctoral Thesis. Escola Politécnica,
Universidade de São Paulo.

Jackson, Q.T. (2006). Adapting to Babel - Adaptivity
and context-sensitivity parsing: from a^n b^n c^n to RNA - A
Thotic Technology Partners Research Monograph.

LTA Website: http://www.pcs.usp.br/~lta

Matsuno, I.P. (2006)**. Um Estudo do Processo de
Inferência de Gramáticas Regulares e Livres de Contexto
Baseados em Modelos Adaptativos. M.Sc. Dissertation,
Escola Politécnica, Universidade de São Paulo.

Neto, J.J.; Moraes, M.de. (2003)*. Using Adaptive
Formalisms to Describe Context-Dependencies in
Natural Language. Computational Processing of the
Portuguese Language 6th International Workshop,
PROPOR 2003, LNAI Volume 2721, Faro, Portugal,
June 26-27, Springer-Verlag, 2003, pp 94-97.

Neto, J.J. (2001)*. Adaptive Rule-Driven Devices -
General Formulation and Case Study. Lecture Notes
in Computer Science. Watson, B.W. and Wood, D.
(Eds.): Implementation and Application of Automata
- 6th International Conference, CIAA 2001, Vol.2494,
Pretoria, South Africa, July 23-25, Springer-Verlag,
2001, pp. 234-250.

Neto, J.J. (1994)*. Adaptive automata for context- 
dependent languages. ACM SIGPLAN Notices, v.29, 
n.9, p.115-24, 1994. 

Neto, J.J. (2000)*. Solving Complex Problems Ef- 
ficiently with Adaptive Automata. CIAA 2000 - Fifth 
International Conference on Implementation and Ap- 
plication of Automata - London, Ontario, Canada. 

Neto, J.J., Iwai, M.K. (1998)*. Adaptive automata for
syntax learning. XXIV Conferencia Latinoamericana
de Informatica CLEI'98, Quito - Ecuador, tomo 1,
pp.135-146.

Pistori, H.; Neto, J.J.; Pereira, M.C. (2006)* Adaptive 
Non-Deterministic Decision Trees: General Formula- 
tion and Case Study. INFOCOMP Journal of Computer 
Science, Lavras, MG. 

Rubinstein, R.S.; Shutt, J.N. (1995). Self-modifying
finite automata: An introduction. Information Processing
Letters, v.56, n.4, 24, p. 185-90.

Santos, J.M.N. (1997)**. Um formalismo adaptativo
com mecanismos de sincronizacao para aplicacoes
concorrentes. Sao Paulo, 98p. M.Sc. Dissertation.
Escola Politecnica, Universidade de Sao Paulo.

Shutt, J.N. (1993). Recursive adaptable grammar. 
M.S. Thesis, Computer Science Department, Worcester 
Polytechnic Institute, Worcester MA. 

KEY TERMS 

Adaptivity: Property exhibited by structures that 
dynamically and autonomously change their own be- 
havior in response to input stimuli. 

Adaptive Computation Model: Turing-powerful
abstraction that mimics the behavior of potentially
self-modifying complex systems.

Adaptive Device: Structure with dynamic behavior,
made up of some subjacent device and an adaptive
mechanism.

Adaptive Functions and Adaptive Actions: Adaptive
actions are calls to adaptive functions, which can
determine changes to be performed on their own layer's
rule set and on the adaptive functions of the immediately
subjacent layer.

Adaptive Mechanism: Alteration discipline associated
with an adaptive device's rule set that changes
the behavior of its subjacent device by performing
adaptive actions.

Adaptive Rule-Driven Device: Adaptive device 
whose behavior is defined by a dynamically changing 
set of rules, e.g. adaptive automata, adaptive gram- 
mars, etc. 

Context-Dependency: Reinterpretation of terms, 
due to conditions occurring elsewhere in a sentence, e.g. 
agreement rules in English, type-checking in Pascal. 



Context-Sensitive (-Dependent) Formalism:
Abstraction capable of representing Chomsky type-1
or type-0 languages. Adaptive Automata and Adaptive
Context-free Grammars are well suited to express such
languages.

Hierarchical (Multilevel) Adaptive Device:
Stratified adaptive structure in which the adaptive
actions of an enclosing layer can modify both that
layer's own rules and the adaptive functions of the
underlying layer.

Subjacent (or Underlying) Device: Any device
used as basis to formulate adaptive devices. The
innermost layer of a multilevel subjacent device must be
non-adaptive.

INTRODUCTION

Since its introduction to the research community in
1988, the Cellular Neural Network (CNN) (Chua
& Yang, 1988) paradigm has become fertile ground
for engineers and physicists, producing over 1,000
published scientific papers and books in less than 20
years (Chua & Roska, 2002), mostly related to Digital
Image Processing (DIP). This Artificial Neural Network
(ANN) offers a remarkable ability to integrate
complex computing processes into compact, real-time
programmable analogic VLSI circuits such as the ACE16k
(Rodriguez et al., 2004) and, more recently, into FPGA
devices (Perko et al., 2000).

CNN is the core of the revolutionary Analogic
Cellular Computer (Roska et al., 1999), a programmable
system based on the so-called CNN Universal
Machine (CNN-UM) (Roska & Chua, 1993). Analogic
CNN computers mimic the anatomy and physiology of
many sensory and processing biological organs (Chua
& Roska, 2002).

This article continues the review started in this 
Encyclopaedia under the title Basic Cellular Neural 
Network Image Processing. 



BACKGROUND 

The standard CNN architecture consists of an M x N
rectangular array of cells C(i,j) with Cartesian
coordinates (i,j), i = 1, 2, ..., M, j = 1, 2, ..., N. Each cell or
neuron C(i,j) is bounded to a sphere of influence S_r(i,j)
of positive integer radius r, defined by:

$$S_r(i,j) = \left\{ C(k,l) \;\middle|\; \max_{1 \le k \le M,\, 1 \le l \le N} \{ |k-i|, |l-j| \} \le r \right\} \tag{1}$$

This set is referred to as a (2r + 1) x (2r + 1)
neighbourhood. The parameter r controls the connectivity
of a cell. When r > N/2 and M = N, a fully connected
CNN is obtained, a case that corresponds to the classic
Hopfield ANN model.

The state equation of any cell C(i,j) in the M x N
array structure of the standard CNN may be described
by:

$$C \frac{dz_{ij}(t)}{dt} = -\frac{1}{R} z_{ij}(t) + \sum_{C(k,l) \in S_r(i,j)} \left[ A(i,j;k,l)\, y_{kl}(t) + B(i,j;k,l)\, x_{kl} \right] + I_{ij} \tag{2}$$

where C and R are values that control the transient
response of the neuron circuit (just like an RC filter), I
is generally a constant value that biases the state matrix
Z = {z_ij}, and S_r is the local neighbourhood defined in
(1), which controls the influence of the input data X =
{x_ij} and the network output Y = {y_ij} for time t.

This means that both input and output planes interact
with the state of a cell through the definition of a set of
real-valued weights, A(i,j; k,l) and B(i,j; k,l), whose
size is determined by r. The cloning templates A and
B are called the feedback and feed-forward operators,
respectively.

An isotropic CNN is typically defined with constant
values for r, I, A and B, implying that for an input image
X, a neuron C(i,j) is provided for each pixel (i,j), with
constant weighted circuits defined by the feedback and
feed-forward templates A and B. The neuron state value
z_ij is adjusted with the bias parameter I, and passed as
input to an output function of the form:

$$y_{ij} = \frac{1}{2} \left( |z_{ij}(t) + 1| - |z_{ij}(t) - 1| \right) \tag{3}$$

The vast majority of the templates defined in the
CNN-UM template compendium (Chua & Roska,
2002) are based on this isotropic scheme, using r = 1
and binary images in the input plane. If no feedback
(i.e. A = 0) is used, then the CNN behaves as a convolution
network, using B as a spatial filter, I as a threshold
and the piecewise linear output (3) as a limiter. Thus,
virtually any spatial filter from DIP theory can be
implemented on such a feed-forward CNN, with
binary output stability ensured by choosing a central
feedback coefficient whose absolute value is greater than 1.
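
The behaviour described above can be tried numerically. The following is a minimal sketch, not the authors' implementation: it integrates the state equation (2) with a plain Euler scheme, assuming C = R = 1, r = 1, a zero initial state, zero-valued cells outside the array, and a feedback template that has only a central weight. The function names `output` and `simulate_cnn`, the step size and the number of steps are illustrative choices.

```python
def output(z):
    """Piecewise linear output (3): saturates at -1 and +1."""
    return 0.5 * (abs(z + 1.0) - abs(z - 1.0))


def simulate_cnn(x, a_center, B, I, steps=200, dt=0.05):
    """Euler integration of the state equation (2), assuming C = R = 1.

    x        : input image as a list of rows
    a_center : central feedback weight (all other feedback weights are 0)
    B        : 3 x 3 feed-forward template (r = 1)
    I        : bias
    Cells outside the array contribute 0 (assumed boundary condition).
    """
    M, N = len(x), len(x[0])
    # The feed-forward term does not change in time, so compute it once.
    u = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            s = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < M and 0 <= l < N:
                        s += B[di + 1][dj + 1] * x[k][l]
            u[i][j] = s + I
    z = [[0.0] * N for _ in range(M)]  # assumed zero initial state
    for _ in range(steps):
        z = [[z[i][j] + dt * (-z[i][j] + a_center * output(z[i][j]) + u[i][j])
              for j in range(N)] for i in range(M)]
    return [[output(v) for v in row] for row in z]


# Identity feed-forward template, no bias, central feedback 2:
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
y = simulate_cnn([[0.3, -0.4]], 2.0, identity, 0.0)
```

With a central feedback of 2 the two cells leave the linear region and settle at the saturation levels, so a positive input ends near +1 and a negative input near -1, which is the binary output stability mentioned in the text.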



ADVANCED CNN IMAGE PROCESSING 

In this section, more complex CNN models are
described in order to provide a deeper insight
into CNN design, including multi-layer structures and
nonlinear templates, and also to illustrate its powerful
DIP capabilities.

Nonlinear Templates 

A problem often addressed in DIP edge detection is
robustness against noise (Jain, 1989). In this sense, the
EDGE CNN detector for grey-scale images given by

$$A = 2, \quad B = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}, \quad I = -0.5 \tag{4}$$

is a typical example of a filter that is weak against noise, as a
result of a fixed linear feed-forward template combined
with excitatory feedback. One way to provide the
detector with more robustness against noise is via the
definition of a nonlinear B template of the form:



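$$B = \begin{bmatrix} b & b & b \\ b & 0 & b \\ b & b & b \end{bmatrix}, \quad \text{where } b = \begin{cases} 0.5, & |x_{ij} - x_{kl}| > th \\ -1, & \text{otherwise} \end{cases} \tag{5}$$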



This nonlinear template actually defines different
coefficients for the surrounding pixels prior to performing
the spatial filtering of the input image X. Thus, a CNN
defined with nonlinear templates is generally dependent
on X, and cannot be treated as an isotropic model.

Only two values are allowed for the surrounding
coefficients of B: an excitatory one for luminance
differences with the central pixel greater than a threshold
th (i.e. edge pixels), and an inhibitory one, doubled in
absolute value, for similar pixels, where th is usually set
around 0.5. The feedback template A = 2 remains unchanged,
but the value for the bias I must be chosen from the
following analysis:

For a given state element z_ij, the contribution w_ij
of the feed-forward nonlinear filter of (5) may be
expressed as:



$$\begin{aligned} w_{ij} &= -1.0 \cdot p_s + 0.5 \cdot p_e \\ &= -(8 - p_e) + 0.5 \cdot p_e \\ &= -8 + 1.5 \cdot p_e \end{aligned} \tag{6}$$



where p_s is the number of similar pixels in the 3 x 3
neighbourhood and p_e the number of remaining, edge
pixels. For example, if the central pixel has 8 edge
neighbours, w_ij = 12 - 8 = 4, whereas if all its
neighbours are similar to it, then w_ij = -8. Thus, a pixel
will be selected as edge depending on the number of its
edge neighbours, providing the possibility of noise
reduction. For instance, edge detection for pixels with at
least 3 edge neighbours requires $I \in (4, 5)$.
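
The dependence of the bias on the number of edge neighbours can be tabulated with a few lines of code. This is a sketch under a simplifying assumption that is not stated in the source: the cell is taken to be selected as edge exactly when w_ij + I is positive (zero initial state, with the feedback A = 2 only reinforcing that sign). The function names are illustrative.

```python
def feedforward_contribution(p_e):
    """w_ij from (6) for a pixel with p_e edge neighbours out of 8."""
    p_s = 8 - p_e
    return -1.0 * p_s + 0.5 * p_e


def bias_interval(min_edge_neighbours):
    """Open interval (lower, upper) of bias values I such that
    w_ij + I > 0 for p_e >= min_edge_neighbours and
    w_ij + I < 0 for p_e <  min_edge_neighbours.

    Assumes the sign of w_ij + I alone decides the output.
    """
    k = min_edge_neighbours
    lower = -feedforward_contribution(k)      # from w(k) + I > 0
    upper = -feedforward_contribution(k - 1)  # from w(k - 1) + I < 0
    return lower, upper


# Contribution for every possible number of edge neighbours
table = {p_e: feedforward_contribution(p_e) for p_e in range(9)}
interval = bias_interval(3)
```

Under this assumption the interval for at least 3 edge neighbours comes out as (3.5, 5.0); the interval (4, 5) quoted in the text lies inside it, so a bias chosen from the text's interval also satisfies the simplified condition.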

The main result is that the inclusion of nonlinearities
in the definition of B coefficients and, by extension,
the pixel-wise definition of the main CNN parameters
gives rise to more powerful and complex DIP filters
(Chua & Roska, 1993).

Morphologic Operators 

Mathematical Morphology is an important contributor 
to the DIP field. In the classic approach, every morpho- 
logic operator is based on a series of simple concepts 
from Set Theory. Moreover, all of them can be divided 
into combinations of two basic operators: erosion and 
dilation (Serra, 1982). Both operators take two pieces of 
data as input: the binary input image and the so-called 
structuring element, which is usually represented by 
a 3x3 template. 

A pixel belongs to an object if it is active (i.e. its
value is 1 or black), whereas the remaining pixels are
classified as background, zero-valued elements. Basic
morphologic operators are defined using only object
pixels, marked as 1 in the structuring element. If a
pixel is not used in the match, it is left blank. Both
dilation and erosion operators may be defined by the
structuring elements

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} & 1 & \\ 1 & 1 & 1 \\ & 1 & \end{bmatrix} \tag{7}$$

for 8 or 4-neighbour connectivity, respectively. In
dilation, the structuring element is placed over each
input pixel. If any of the 9 (or 5) pixels considered in
(7) is active, then the output pixel will also be active
(Jain, 1989). The erosion operator can be defined as
the dual of dilation, i.e. a dilation performed over the
background.
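
The two basic operators can be sketched directly from this description. The code below is an illustration on 0/1 images stored as lists of rows, not a CNN implementation; blank positions of the structuring element are written as 0, pixels outside the image are ignored by `dilate` (an assumed border convention), and the names `dilate`, `erode`, `SE8` and `SE4` are hypothetical.

```python
SE8 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # 8-neighbour connectivity
SE4 = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]  # 4-neighbour connectivity


def dilate(img, se):
    """Output pixel is 1 if any input pixel under a 1 of se is active."""
    M, N = len(img), len(img[0])
    out = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if (se[di + 1][dj + 1] and 0 <= k < M and 0 <= l < N
                            and img[k][l]):
                        out[i][j] = 1
    return out


def erode(img, se):
    """Erosion as the dual of dilation: dilate the background and
    complement the result (valid as written for symmetric se)."""
    background = [[1 - v for v in row] for row in img]
    return [[1 - v for v in row] for row in dilate(background, se)]


single = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
cross = dilate(single, SE4)   # the single pixel grows into a cross
back = erode(cross, SE4)      # eroding the cross recovers the pixel
```

Because erosion is computed through the complement, pixels outside the image behave as object pixels for `erode`; a different border convention would need explicit padding.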

More complex morphologic operators are based
on structuring elements that also contain background
pixels. This is the case of the Hit and Miss Transform
(HMT), a generalized morphologic operator used to
identify certain local pixel configurations. For instance,
the structuring elements defined by

and

(8)

are used to find 90° convex corner object pixels within
the image. A pixel will be selected as active in the output
image if its local neighbourhood exactly matches
that defined by the structuring element. However, in
order to obtain a full, non-orientated corner detector
it will be necessary to perform 8 HMTs, one for each
rotated version of (8), OR-ing the 8 intermediate output
images to obtain the final image (Fisher et al., 2004).
In the CNN context, the HMT may be obtained in
a straightforward manner by:



$$A = 2, \quad B = \{b_{ij}\}, \quad b_{ij} = \begin{cases} 1, & s_{ij} = 1 \\ -1, & \text{otherwise} \end{cases}, \quad I = 0.5 - p_s \tag{9}$$



where S = {s_ij} is the structuring element and p_s is the
total number of active pixels in it.

Since the input template B of the HMT CNN is
defined via the structuring element S, and given that
there are 2^9 = 512 distinct 3 x 3 possible structuring
elements, there will also be 512 different hit-and-miss
erosions. For achieving the opposite result, i.e.
hit-and-miss dilation, the threshold must be the opposite of that
in (9) (Chua & Roska, 2002).
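
A short numerical check shows why the bias 0.5 - p_s singles out exact matches. The sketch below rests on assumptions that go beyond the source: pixels take the values 0 and 1, the feed-forward weights are +1 where the structuring element is 1 and -1 elsewhere, pixels outside the image count as background, and the steady-state output is taken as the sign of the feed-forward sum plus the bias (the feedback A = 2 is assumed only to push the cell to saturation). The name `hmt_cnn` is hypothetical.

```python
def hmt_cnn(img, se):
    """Hit-and-miss match on a 0/1 image with a 3 x 3 structuring element.

    The feed-forward sum equals p_s on an exact match and drops by at
    least 1 for every mismatching pixel, so adding the bias 0.5 - p_s
    leaves a positive value only for exact matches.
    """
    M, N = len(img), len(img[0])
    p_s = sum(sum(row) for row in se)   # active pixels in se
    bias = 0.5 - p_s
    out = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            w = bias
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    x = img[k][l] if 0 <= k < M and 0 <= l < N else 0
                    b = 1 if se[di + 1][dj + 1] == 1 else -1
                    w += b * x
            out[i][j] = 1 if w > 0 else 0
    return out


# Structuring element for an isolated object pixel
isolated = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
image = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]
result = hmt_cnn(image, isolated)   # only the pixel at (1, 1) matches
```

Running the same function once per rotated structuring element and OR-ing the outputs gives the non-orientated detector described in the text.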



Dynamic Range Control CNN and 
Piecewise Linear Mappings 

DIP techniques can be classified by the domain where
they operate: the image or spatial domain or the
transform domain (e.g. the Fourier domain). Spatial
domain techniques are those that operate directly on
the pixels of an image (e.g. on their intensity levels). A
generic spatial operator can be defined by

$$Y(i,j) = T[X(i,j)] \tag{10}$$



where X and Y are the input and output images,
respectively, and T is a spatial operator defined over a
neighbourhood S_r around each pixel X(i,j), as defined
in (1). Based on this neighbourhood, spatial operators
can be grouped into two types: Single Point Processing
Operators, also known as Mapping Operators, and
Local Processing Operators, which can be defined by
a spatial filter (i.e. 2D-discrete convolution) mask
(Jain, 1989).

The simplest form of T is obtained when S_r is 1 pixel
in size. In this case, Y only depends on the intensity value
of X for every pixel and T becomes an intensity level
transformation function, or mapping, of the form

$$s = T(r) \tag{11}$$

where r and s are variables that represent the grey
level in X and Y, respectively.

According to this formulation, mappings can be
achieved by direct application of a function over a
range of input intensity levels. By properly choosing
the form of T, a number of effects can be obtained, such
as grey-level inversion, dynamic range compression
or expansion (i.e. contrast enhancement), and threshold
binarization for obtaining binary masks used in analysis
and morphologic DIP.

A mapping is linear if its function T is also linear.
Otherwise, T is not linear and the mapping is also
nonlinear. An example of nonlinear mapping is the CNN
output function (3). It consists of three linear segments:
two saturated levels, -1 and +1, and the central linear
segment with unitary slope that connects them. This
function is said to be piecewise linear and is closely
related to the well-known sigmoid function utilized in
the Hopfield ANN (Chua & Roska, 1993). It performs
a mapping of the intensity values stored in Z into the
[-1, +1] range. The bias I controls the midpoint of the
input range, where the output function gives a
zero-valued outcome.

Starting from the original CNN cell or neuron
(1)-(3), a brief review of the Dynamic Range Control
(DRC) CNN model first defined in (Fernandez et al.,
2006) follows. This network is designed to perform a
piecewise linear mapping T over X, with input range
[m-d, m+d] and output range [a, b]. Thus,

$$T[X(i,j)] = \begin{cases} a, & -\infty < X(i,j) \le m-d \\ \dfrac{b-a}{2d}\,\bigl(X(i,j)-m\bigr) + \dfrac{b+a}{2}, & m-d < X(i,j) < m+d \\ b, & m+d \le X(i,j) < +\infty \end{cases} \tag{12}$$
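
The mapping (12) can be checked numerically. The sketch below evaluates T directly and also as a cascade of two saturating feed-forward cells, using the cell parameters the article gives in (14) and (15) and assuming that each cell has reached its steady state with R = 1, so that its output is the piecewise linear function (3) applied to B x + I. The function names and the sample values of f and g are illustrative.

```python
def sat(z):
    """Piecewise linear CNN output (3)."""
    return 0.5 * (abs(z + 1.0) - abs(z - 1.0))


def drc_direct(x, m, d, a, b):
    """Mapping T of (12): a below m - d, b above m + d, linear between."""
    if x <= m - d:
        return a
    if x >= m + d:
        return b
    return (b - a) / (2.0 * d) * (x - m) + (b + a) / 2.0


def drc_two_cells(x, m, d, a, b):
    """Same mapping as two cascaded feed-forward cells (A = 0)."""
    y1 = sat(x / d - m / d)                          # B1 = 1/d, I1 = -m/d
    return sat((b - a) / 2.0 * y1 + (b + a) / 2.0)   # B2 = (b-a)/2, I2 = (b+a)/2


# Contrast stretching of an image whose grey levels span [f, g] = [0.2, 0.6]
f, g = 0.2, 0.6
m, d, a, b = (g + f) / 2.0, (g - f) / 2.0, 0.0, 1.0
samples = [0.0, 0.2, 0.3, 0.4, 0.6, 0.9]
direct = [drc_direct(x, m, d, a, b) for x in samples]
cascade = [drc_two_cells(x, m, d, a, b) for x in samples]
```

Both lists agree up to rounding: grey levels at or below f map to 0, those at or above g map to 1, and the midpoint m maps to 0.5. The cascade only reproduces (12) while a and b stay inside [-1, +1], since the second cell saturates outside that range.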



In order to be able to implement this function in
a multi-layer CNN, the following constraints must
be met:

$$|b - a| < 2 \quad \text{and} \quad d < 1 \tag{13}$$

A CNN cell which controls the desired input range
can be defined with the following parameters:

$$A_1 = 0, \quad B_1 = 1/d, \quad I_1 = -m/d \tag{14}$$

This network performs a linear mapping between
[m-d, m+d] and [-1, +1]. Its output is the input of a
second CNN whose parameters are:

$$A_2 = 0, \quad B_2 = (b - a)/2, \quad I_2 = (b + a)/2 \tag{15}$$

The output of this second network is exactly the
mapping T defined in (12), bounded by the constraints
of (13).

One of the simplest techniques used in grey-scale
image contrast enhancement is contrast stretching or
normalization. This technique maximizes the dynamic
range of the intensity levels within the image from
suitable estimates of the maximum and minimum
intensity values (Fisher et al., 2004). Thus, in the case
of normalized grey-scale images, where the minimum
(i.e. black) and maximum (i.e. white) intensity levels
are represented by the values 0 and 1, respectively, if such
an image with dynamic intensity range $[f, g] \subset [0, +1]$
is fed into the input of the 2-layer CNN defined by (14)
and (15), the following parameters will achieve the
desired linear dynamic range maximization:

$$a = 0, \quad b = 1, \quad m = (g + f)/2, \quad d = (g - f)/2 \tag{16}$$

The DRC network can be easily applied to a first-order
piecewise polynomial approximation of nonlinear,
continuous mappings. One of the valid possibilities is
the multi-layer DRC CNN implementation of
error-controlled Chebyshev polynomials, as described in
(Fernandez et al., 2006). The possible mappings include,
among many others, the absolute value, logarithmic,
exponential, radial basis and integer and real-valued
power functions.



FUTURE TRENDS

Engineers and specialists are on a continuous quest
to compete with and imitate nature, especially some
"smart" animals. Vision is one particular area which
computer engineers are interested in. In this context, the
so-called Bionic Eye (Werblin et al., 1995) embedded
in the CNN-UM architecture is ideal for implementing
many spatio-temporal neuromorphic models.

With its powerful image processing toolbox and
a compact VLSI implementation (Rodriguez et al.,
2004), the CNN-UM can be used to program or mimic
different models of retinas and even combinations of
them. Moreover, it can combine biologically based
models, biologically inspired models, and analogic
artificial image processing algorithms. This combination
is expected to lead to a broader range of applications
and developments.



CONCLUSION 

A number of other advances in the definition and
characterization of CNN have been researched in the
past decade. These include the definition of methods
for designing and implementing larger than 3 x 3
neighbourhoods in the CNN-UM (Kek & Zarandy, 1998),
the CNN implementation of some image compression
techniques (Venetianer et al., 1995) and the design of
a CNN-based Fast Fourier Transform algorithm over
analogic signals (Perko et al., 1998), among many
others.

In this article, a general review of the main properties
and features of the Cellular Neural Network model has
been presented, focusing on its DIP applications. The
CNN is now a fundamental and powerful toolkit for
real-time nonlinear image processing tasks, mainly due
to its versatile programmability, which has powered its
hardware development for visual sensing applications
(Roska et al., 1999).

KEY TERMS

Bionics: The application of methods and systems 
found in nature to the study and design of engineering 
systems. The word seems to have been formed from 
"biology" and "electronics" and was first used by J. 
E. Steele in 1958. 

Chebyshev Polynomial: An important type of 
polynomials used in data interpolation, providing the 
best approximation of a continuous function under the 
maximum norm. 

Dynamic Range: A term used to describe the ratio 
between the smallest and largest possible values of a 
variable quantity. 

FPGA: Acronym that stands for Field-Programmable
Gate Array, a semiconductor device invented
in 1984 by R. Freeman that contains programmable
interfaces and logic components called "logic blocks"
used to perform the function of basic logic gates (e.g.
XOR) or more complex combination functions such
as decoders.

Piecewise Linear Function: A function f(x) that
can be split into a number of linear segments, each of
which is defined for a non-overlapping interval of x.

Spatial Convolution: A term used to identify the 
linear combination of a series of discrete 2D data (a 
digital image) with a few coefficients or weights. In 
the Fourier theory, a convolution in space is equivalent 
to (spatial) frequency filtering. 

Template: Also known as kernel, or convolution 
kernel, is the set of coefficients used to perform a spa- 
tial filter operation over a digital image via the spatial 
convolution operator. 

VLSI: Acronym that stands for Very Large Scale
Integration. It is the process of creating integrated
circuits by combining thousands (nowadays hundreds of
millions) of transistor-based circuits into a single chip.
A typical VLSI device is the microprocessor.

INTRODUCTION

An intelligent system is a system that has, similar to 
a living organism, a coherent set of components and 
subsystems working together to engage in goal-driven 
activities. In general, an intelligent system is able to 
sense and respond to the changing environment; gather 
and store information in its memory; learn from earlier 
experiences; adapt its behaviors to meet new challenges; 
and achieve its pre-determined or evolving objectives. 
The system may start with a set of predefined stimulus-
response rules. Those rules may be revised and improved
through learning. Any time the system encounters a
situation, it evaluates and selects the most appropriate
rules from its memory to act upon.

Most human organizations such as nations, 
governments, universities, and business firms, can 
be considered as intelligent systems. In recent years, 
researchers have developed frameworks for building 
organizations around intelligence, as opposed 
to traditional approaches that focus on products, 
processes, or functions (e.g., Liang, 2002; Gupta and 
Sharma, 2004). Today's organizations must go beyond 
traditional goals of efficiency and effectiveness; they 
need to have organizational intelligence in order to adapt 
and survive in a continuously changing environment 
(Liebowitz, 1999). The intelligent behaviors of those 
organizations include monitoring of operations, 
listening and responding to stakeholders, watching 
the markets, gathering and analyzing data, creating 
and disseminating knowledge, learning, and effective 
decision making. 

Modeling intelligent systems has been a challenge
for researchers. Intelligent systems, in particular
those that involve multiple intelligent players, are complex
systems where system dynamics does not follow
clearly defined rules. Traditional system dynamics
approaches or statistical modeling approaches rely on
rather restrictive assumptions such as homogeneity of
individuals in the system. Many complex systems have
components or units which are also complex systems.
This fact has significantly increased the difficulty of
modeling intelligent systems. Agent-based modeling
of complex systems such as ecological systems, stock
markets, and disaster recovery has recently garnered
significant research interest from a wide spectrum of
fields, from politics, economics, sociology, mathematics,
computer science, and management to information systems.
Agent-based modeling is well suited for intelligent
systems research as it offers a platform to study systems
behavior based on individual actions and interactions. In
the following, we present the concepts and illustrate how
intelligent agents can be used in modeling intelligent
systems.

We start with basic concepts of intelligent agents. 
Then we define agent-based modeling (ABM) and 
discuss strengths and weaknesses of ABM. The next 
section applies ABM to intelligent system modeling. We 
use an example of technology diffusion for illustration. 
Research issues and directions are discussed next, 
followed by conclusions. 



INTELLIGENT AGENT 

Intelligent agents, also known as software agents, are
computer applications that autonomously sense and
respond to their environment in the pursuit of certain designed
objectives (Wooldridge and Jennings, 1995). Intelligent
agents exhibit some level of intelligence. They can be
used to assist the user in performing non-repetitive tasks,
such as seeking information, shopping, scheduling,
monitoring, control, negotiation, and bargaining.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Agent-Based Intelligent System Modeling

Intelligent agents may come in various shapes and 
forms such as knowbots, softbots, taskbots, personal 
agents, shopbots, information agents, etc. No matter 
what shape or form they have, intelligent agents exhibit 
one or more of the following characteristics: 

Autonomous: Being able to exercise control over their own actions.

Adaptive/Learning: Being able to learn and adapt to their external environment.

Social: Being able to communicate, bargain, collaborate, and compete with other agents on behalf of their masters (users).

Mobile: Being able to migrate themselves from one machine/system to another in a network, such as the Web.

Goal-oriented: Being able to act in accordance with built-in goals and objectives.

Communicative: Being able to communicate with people or other agents through protocols such as agent communication language (ACL).

Intelligent: Being able to exhibit intelligent behavior such as reasoning, generalizing, learning, dealing with uncertainty, using heuristics, and natural language processing.



AGENT-BASED MODELING 

Using intelligent agents and their actions and
interactions in a given environment to simulate the
complex dynamics of a system is referred to as agent-
based modeling. ABM research is closely related to the
research in complex systems, emergence, computational
sociology, multi-agent systems, evolutionary
programming, and intelligent organizations. In ABM,
system behavior results from individual behaviors and
collective behaviors of the agents. Researchers of ABM
are interested in how macro phenomena emerge
from micro level behaviors among a heterogeneous
set of interacting agents (Holland, 1992). Every agent
has its attributes and its behavior rules. When agents
meet in the agent society, each agent individually
assesses the situation and makes decisions on the basis
of its behavior rules. In general, individual agents do
not have global awareness in the multi-agent system.

Agent-based modeling allows a researcher to set
different parameters and behavior rules of individual
agents. The modeler makes assumptions that are most
relevant to the situation at hand, and then watches
phenomena emerge from the interactions of the agents.
Various hypotheses can be tested by changing agent
parameters and rules. The emergent collective pattern
of the agent society often leads to results that may not
have been predicted.

One of the main advantages of ABM over traditional 
mathematical equation based modeling is the ability 
to model individual styles and attributes, rather than 
assuming homogeneity of the whole population. 
Traditional models based on analytical techniques often
become intractable as the systems reach a real-world level
of complexity. ABM is particularly suitable for studying
system dynamics that are generated from interactions 
of heterogeneous individuals. In recent years, ABM has 
been used in studying many real world systems, such 
as stock markets (Castiglione 2000), group selection 
(Pepper 2000), and workflow and information diffusion 
(Neri 2004). Bonabeau (2002) presents a good summary 
of ABM methodology and the scenarios where ABM 
is appropriate. 

ABM is, however, not immune from criticism. Per
Bonabeau (2002), "an agent-based model will only
be as accurate as the assumptions and data that went
into it, but even approximate simulations can be very
valuable". It has also been observed that ABM relies on
simplified models of rule-based human behavior that
often fail to take into consideration the complexity of
human cognition. Besides, it suffers from the "unwrapping"
problem, as the solution is built into the program and
thus prevents the occurrence of new or unexpected events
(Macy, 2002).



ABM FOR INTELLIGENT SYSTEMS 

An intelligent system is a system that can sense and 
respond to its environment in pursuing its goals 
and objectives. It can learn and adapt based on past 
experience. Examples of intelligent systems include, 
but not limited to, the following: biological life such 
as human beings, artificial intelligence applications, 
robots, organizations, nations, projects, and social 
movements. 
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Walter Fritz (1997) suggests that the key components 
of an intelligent system include objectives, senses, 
concepts, growth of a concept, present situation, 
response rules, mental methods, selection, actions, 
reinforcement, memory and forgetting, sleeping, 
and patterns (high level concepts). It is apparent that 
traditional analytical modeling techniques are not 
able to model many of the components of intelligent 
systems, let alone the complete system dynamics. 
However, ABM lends itself well to such a task. All 
those components can be models as agents (albeit 
some in abstract sense). An intelligent system is thus 
made of inter-related and interactive agents. ABM is 
especially suitable for intelligent systems consist of a 
large number of heterogeneous participants, such as a 
human organization. 



designed or empirically grounded. In practice, a study 
may start with simple models, often with designed 
agents and environments, to explore certain specific 
dynamics of the system. 

The design model is refined through the calibration 
process, in which design parameters are modified to 
improve the desired characteristics of the model. The 
final step in the modeling process is validation where 
we check the agent individual behavior, interactions, 
and emergent properties of the system against expected 
design features. Validation usually involves comparison 
of model outcomes, often at the macro-level, with 
comparable outcomes in the real world (Midgley, el 
at., 2007). Figure 1 shows the complete modeling 
process. A general tutorial on ABM is given by Macal 
and North (2005). 



Modeling Processes 



Agent-based modeling for intelligent systems starts with
a thorough analysis of the intelligent systems. Since
the system under consideration may exhibit complex
behaviors, we need to identify one or a few key features
to focus on. Given a scenario of the target intelligent
system, we first establish a set of objectives that we
aim to achieve via the simulation of the agent-based
representation of the intelligent system. The objectives
of the research can be expressed as a set of questions
to which we seek answers (Doran, 2006).

A conceptual model is created to lay out the
requirements for achieving the objectives. This includes
defining the entities, such as agents, environment,
resources, processes, and relationships. The conceptual
modeling phase answers the question of what is
needed. The design model determines how the
requirements can be implemented, including defining
the features and relevant behaviors of the agents
(Brown, 2006).

Depending on the goals of a particular research project, a
model may involve the use of designed or empirically
grounded agents. Designed agents are those endowed
with characteristics and behaviors that represent
conditions for testing specific hypotheses about the
intelligent systems. When the agents are empirically
grounded, they are used to represent real-world entities,
such as individuals or processes in an organization.
Empirically grounded agents are feasible only when
data about the real-world entities are available. Similarly,
the environment within which the agents act can be



ABM for Innovation Diffusion



We present an example of using agent-based intelligent
system modeling for studying the acceptance and 
diffusion of innovative ideas or technology. Diffusion 
of innovation has been studied extensively over the 
last few decades (Rogers, 1995). However, traditional
research in innovation diffusion has been grounded on
case-based analysis and analytical systems modeling
(e.g., using differential and difference equations).
Agent-based modeling for diffusion of innovation is
relatively new. Our example is adapted from a model
created by Michael Samuels (2007), implemented with
NetLogo, a popular agent modeling system.

Figure 1. Agent-based modeling process

The objective of innovation diffusion modeling is 
to answer questions such as how an idea or technology 
is adopted in a population, how different people (e.g., 
innovators, early adopters, and change agents) influence 
each other, and under what condition an innovation 
will be accepted or rejected by the population. In the 
conceptual modeling, we identify various factors that 
influence an individual's propensity for adopting the 
innovation. Those factors are broadly divided into two
categories: internal influences (e.g., word-of-mouth)
and external influences (e.g., mass media). Any factor
that exerts its influence through individual contact is 
considered internal influence. 

Individuals in the target population are divided 
into four groups: adopter, potential (adopter), change 
agent, and disrupter. Adopters are those who have 
adopted the innovation, while potentials are those 
who have certain likelihood to adopt the innovation. 
Change agents are the champions of the innovation. 
They are very knowledgeable and enthusiastic about the
innovation, and often play a critical role in facilitating
its diffusion. Disrupters play the opposite role to
change agents. They are against the current
innovation, oftentimes because they favor an even
newer innovation that they perceive as better. The four
groups of agents and their relationships are depicted in
Figure 2. It is common, although not necessary, to assume
that those four groups make up the entire population.

Figure 2. Agents and influences

In a traditional diffusion model, such as the Bass 
model (Bass, 1996), the diffusion rate depends only 
on the number of adopters (and potential adopters, 
given fixed population size). Characteristics of 
individuals in the population are ignored. Even in those 
models where it is assumed that potential adopters
have varying thresholds for adopting an innovation
(Abrahamson and Rosenkopf, 1997), the individuality
is very limited. However, in agent-based modeling, the
types of individuals and individual characteristics are
essentially unbounded. For example, we can easily
divide adopters into innovators, early adopters,
late adopters, etc. If necessary, various demographic
and socioeconomic features can be bestowed on
individual agents. Furthermore, both internal influence
and external influence can be further attributed to more
specific causes. For example, internal influence through
social networks can be divided into traditional social
networks that consist of friends and acquaintances and
virtual social networks formed online. Table 1 lists
typical factors that affect the propensity of adopting 
an innovation. 
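
For reference, the aggregate Bass model mentioned above is commonly written as the differential equation below. The symbols are not defined in this article and are introduced here only for illustration: N(t) is the cumulative number of adopters at time t, M is the fixed population size, p is the coefficient of external influence (e.g., mass media), and q is the coefficient of internal influence (e.g., word-of-mouth).

```latex
\frac{dN(t)}{dt} = \left( p + q\,\frac{N(t)}{M} \right) \bigl( M - N(t) \bigr)
```

The adoption rate depends only on N(t) and the constants p, q, and M, which is why individual characteristics play no role in this formulation.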

An initial study of innovation diffusion, such as the
one in Samuels (2007), can simply aggregate
all internal influences into "word-of-mouth" and all
external influences into "mass media". Each potential
adopter's tendency to convert to an adopter is
influenced by chance encounters with other agents. If a
potential adopter meets a change agent, who is an avid
promoter of the innovation, he or she becomes more
knowledgeable about the advantages of the innovation,
and more likely to adopt. An encounter with a disrupter
creates the opposite effect, as a disrupter favors a
different type of innovation.

Table 1. Typical internal and external influences

In order for the simulated model to accurately reflect 
a real-world situation, the model structure and parameter 
values should be carefully selected. For example, we
need to decide how much influence each encounter
exerts, what the probability of encountering a
change agent or a disrupter is, and how much influence
comes from the mass media. We can get these
values through surveys, statistical analysis of empirical 
data, or experiments specifically designed to elicit data 
from real world situations. 
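
The encounter-based scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NetLogo model of Samuels (2007): the function name `simulate_diffusion`, the propensity scale, the conversion threshold, and all default parameter values are hypothetical and would have to be calibrated from surveys or empirical data as discussed above.

```python
import random


def simulate_diffusion(n_potential=200, n_adopters=5, n_change_agents=10,
                       n_disrupters=10, steps=100, word_of_mouth=0.05,
                       change_agent_boost=0.15, disrupter_penalty=0.15,
                       mass_media=0.005, threshold=1.0, seed=42):
    """Return the number of adopters after each time step.

    All parameter values are illustrative assumptions, not calibrated data.
    """
    rng = random.Random(seed)
    # Each potential adopter carries an individual propensity to adopt.
    propensity = [rng.uniform(0.0, 0.5) for _ in range(n_potential)]
    adopters = n_adopters
    # Total population is constant: potentials only ever become adopters.
    population = n_potential + n_adopters + n_change_agents + n_disrupters
    history = [adopters]
    for _ in range(steps):
        remaining = []
        converted = 0
        for p in propensity:
            # External influence (mass media) reaches every potential adopter.
            p += mass_media
            # Internal influence: one chance encounter per step, with the
            # partner type drawn in proportion to the group sizes.
            r = rng.random() * population
            if r < adopters:
                p += word_of_mouth
            elif r < adopters + n_change_agents:
                p += change_agent_boost
            elif r < adopters + n_change_agents + n_disrupters:
                p = max(0.0, p - disrupter_penalty)
            # Otherwise the partner is another potential adopter: no effect.
            if p >= threshold:
                converted += 1
            else:
                remaining.append(p)
        propensity = remaining
        adopters += converted
        history.append(adopters)
    return history


if __name__ == "__main__":
    curve = simulate_diffusion()
    print("adopters at steps 0, 25, 50, 100:",
          curve[0], curve[25], curve[50], curve[100])
```

Because each agent keeps its own propensity, heterogeneity (for example, different starting propensities for innovators and late adopters) can be added by changing how the `propensity` list is initialized, which is not possible in an aggregate model.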



TRENDS AND RESEARCH ISSUES 

As illustrated through the example of modeling the 
diffusion of innovation in an organization, industry, or 
society, agent-based modeling can be used to model 
the adaptation of intelligent systems that consist of 
intelligent individuals. As most intelligent systems 
are complex in both structure and system dynamics, 
traditional modeling tools that require too many 
unrealistic assumptions have become less effective 
in modeling intelligent systems. In recent years, 
agent-based modeling has found a wide spectrum of 
applications such as in business strategic solutions, 
supply chain management, stock markets, power 
economy, social evolution, military operations, security, 
and ecology (North and Macal, 2007). As ABM tools 
and resources become more accessible, research and 
applications of agent-based intelligent system modeling 
are expected to increase in the near future. 

Some challenges remain, though. Using ABM
to model intelligent systems is a research area that
draws theories from other fields, such as economics,
psychology, and sociology, but lacks its own
well-established theoretical foundation. ABM has four key
assumptions (Macy and Willer, 2002): agents act
locally with little or no central authority; agents are
interdependent; agents follow simple rules; and agents
are adaptive. However, some of those assumptions may
not be applicable to intelligent system modeling. Central
authorities, or central authoritative information such as
mass media in the innovation diffusion example, may
play an important role in intelligent organizations. Not
all agents are alike in an intelligent system. Some may
be independent, non-adaptive, or follow complex
behavior rules.



ABM uses a "bottom-up" approach, creating 
emergent behaviors of an intelligent system through 
"actors" rather than "factors". However, macro-level 
factors have direct impact on macro behaviors of the 
system. Macy and Willer (2002) suggest that bringing
those macro-level factors back will make agent-based 
modeling more effective, especially in intelligent 
systems such as social organizations. 

Recent intelligent systems research has developed the 
concept of integrating human and machine-based data, 
knowledge, and intelligence. Kirn (1996) postulates
that the organization of the 21st century will involve
artificial agent-based systems highly intertwined with
the human intelligence of the organization. Thus, a new
challenge for agent-based intelligent system modeling
is to develop models that account for interaction,
aggregation, and coordination of intelligent agents and
human agents. The ABM will represent not only the
human players in an intelligent system, but also the 
intelligent agents that are developed in real-world 
applications in those systems. 



CONCLUSION 

Modeling intelligent systems involving multiple 
intelligent players has been difficult using traditional 
approaches. We have reviewed recent developments
in agent-based modeling and suggest that agent-based
modeling is well suited for studying intelligent
systems, especially those systems with sophisticated
and heterogeneous participants. Agent-based modeling
allows us to model system behaviors based on the
actions and interactions of individuals in the system.
Although most ABM research focuses on local rules
and behaviors, it is possible to integrate global
influences into the models. ABM represents a novel
approach to modeling intelligent systems. Combined with
traditional modeling approaches (for example, micro- 
level simulation as proposed in MoSeS), ABM offers 
researchers a promising tool to solve complex and 
practical problems and to broaden research endeavors 
(Wu, 2007). 

KEY TERMS

Agent-Based Modeling: Using intelligent agents
and their actions and interactions in a given environment 
to simulate the complex dynamics of a system. 

Diffusion of Innovation: Popularized by Everett 
Rogers, it is the study of the process by which an 
innovation is communicated and adopted over time 
among the members of a social system. 

Intelligent Agent: An autonomous software 
program that is able to learn and adapt to its environment 
in order to perform certain tasks delegated to it by its 
master. 



Intelligent System: A system that has a coherent 
set of components and subsystems working together 
to engage in goal-driven activities. 

Intelligent System Modeling: The process of 
construction, calibration, and validation of models of 
intelligent systems. 

Multi-Agent System: A distributed system with a
group of intelligent agents that communicate, bargain, 
compete, and cooperate with other agents and the 
environment to achieve goals designated by their 
masters. 

Organizational Intelligence: The ability of an 
organization to perceive, interpret, and select the most 
appropriate response to the environment in order to 
advance its goals. 

INTRODUCTION

A briefing (Allen, 2004) demonstrates the breadth and
depth of complexity required to address real diplomatic,
information, military, economic (DIME) factors for the
propagation/evolution of ideas through defined
populations. An open mind would conclude that
multiple approaches may be required for multiple
decision makers in multiple scenarios. However, it is
in the interests of multiple decision makers to rely as
much as possible on the same generic model for actual
computations. Many users would have to trust that the
coded model processes their inputs faithfully.

Similar to DIME scenarios, sophisticated competitive
marketing requires assessments of responses of
populations to new products.

Many large financial institutions are now trading at
speeds barely limited by the speed of light. They
co-locate their servers close to exchange floors to be able
to turn quotes into orders to be executed within
milliseconds. Clearly, trading at these speeds requires
automated algorithms for processing and making decisions.
These algorithms are based on "technical" information
derived from price, volume and quote (Level II) information.
The next big hurdle to automated trading is to turn
"fundamental" information into technical indicators,
e.g., to include new political and economic news into
such algorithms.



BACKGROUND 

The concept of "memes" is an example of an approach
to deal with DIME factors (Situngkir, 2004). The meme
approach, using a reductionist philosophy of evolution
among genes, is reasonably contrasted to approaches
emphasizing the need to include relatively global
influences of evolution (Thurtle, 2006).

There are multiple other alternative works being
conducted worldwide that must at least be kept in
mind while developing and testing models of
evolution/propagation of ideas in defined populations. A
study on a simple algebraic model of opinion formation
concluded that the only final opinions are extremal
ones (Aletti et al., 2006). A study of the influence of
chaos on opinion formation, using a simple algebraic
model, concluded that contrarian opinion could persist
and be crucial in close elections, albeit the authors
were careful to note that most real populations
probably do not support chaos (Borghesi & Galam, 2006).
A limited review of work in social networks illustrates
that there are about as many phenomena to be explored
as there are disciplines ready to apply their network
models (Sen, 2006).

Statistical Mechanics of Neocortical 
Interactions (SMNI) 

A class of AI algorithms that has not yet been developed
in this context takes advantage of information known
about real neocortex. It seems appropriate to base an
approach for propagation of ideas on the only system
so far demonstrated to develop and nurture ideas, i.e.,
the neocortical brain. A statistical mechanical model of
neocortical interactions, developed by the author and
tested successfully in describing short-term memory
(STM) and electroencephalography (EEG) indicators,
is the proposed bottom-up model. Ideas by Statistical
Mechanics (ISM) is a generic program to model
evolution and propagation of ideas/patterns throughout
populations subjected to endogenous and exogenous
interactions (Ingber, 2006). ISM develops subsets of
macrocolumnar activity of multivariate stochastic
descriptions of defined populations, with macrocolumns
defined by their local parameters within specific regions
and with parameterized endogenous inter-regional
and exogenous external connectivities. Parameters of
subsets of macrocolumns will be fit to patterns
representing ideas. Parameters of external and inter-regional
interactions will be determined that promote or inhibit
the spread of these ideas. Fitting such nonlinear systems
requires the use of sampling techniques.

The author's approach uses guidance from his
statistical mechanics of neocortical interactions (SMNI),
developed in a series of about 30 published papers
from 1981 to 2001 (Ingber, 1983; Ingber, 1985; Ingber,
1992; Ingber, 1994; Ingber, 1995; Ingber, 1997). These
papers also address long-standing issues of
information measured by electroencephalography (EEG) as
arising from bottom-up local interactions of clusters
of thousands to tens of thousands of neurons (interacting
via short-ranged fibers), or top-down influences of
global interactions (mediated by long-ranged
myelinated fibers). SMNI does this by including both local
and global interactions as being necessary to develop
neocortical circuitry.

Statistical Mechanics of Financial 
Markets (SMFM) 

Tools of financial risk management, developed to
process correlated multivariate systems with differing
non-Gaussian distributions using modern copula
analysis, enable bona fide correlations and uncertainties
of success and failure to be calculated. Since 1984,
the author has published about 20 papers developing a
Statistical Mechanics of Financial Markets (SMFM),
many available at http://www.ingber.com. These are
relevant to ISM, to properly deal with real-world
distributions that arise in such varied contexts.

Gaussian copulas are developed in a project Trading
in Risk Dimensions (TRD) (Ingber, 2006). Other
copula distributions are possible, e.g., Student-t
distributions. These alternative distributions can be quite
slow because inverse transformations typically are not
as quick as for the present distribution. Copulas are
cited as an important component of risk management
not yet widely used by risk management practitioners
(Blanco, 2005).
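
The idea behind a Gaussian copula can be illustrated with a short sketch: each non-Gaussian variable is mapped into a Gaussian space through its cumulative distribution, and correlations are computed there. The code below is an assumption-laden illustration, not the TRD implementation: it uses empirical (rank-based) marginal distributions, and the function names `normal_scores` and `copula_correlation` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def normal_scores(xs):
    """Map samples to Gaussian space through their empirical CDF ranks."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    std_normal = NormalDist()
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) lies strictly inside (0, 1), so inv_cdf is finite.
        z[i] = std_normal.inv_cdf(rank / (n + 1))
    return z


def correlation(a, b):
    """Pearson correlation (normalized second moment) of two samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def copula_correlation(xs, ys):
    """Correlation computed after transforming both variables to Gaussian space."""
    return correlation(normal_scores(xs), normal_scores(ys))


if __name__ == "__main__":
    rng = random.Random(0)
    rho = 0.8
    xs, ys = [], []
    for _ in range(4000):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xs.append(math.exp(z1))   # skewed, lognormal marginal
        ys.append(z2 ** 3)        # heavy-tailed marginal
    print("raw correlation:   ", correlation(xs, ys))
    print("copula correlation:", copula_correlation(xs, ys))
```

Because the transformation depends only on ranks, the correlation in the Gaussian space is unaffected by the monotonic distortions applied to the marginals, whereas the raw correlation of the non-Gaussian variables is not.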

Sampling Tools 

Computational approaches developed to process
different approaches to modeling phenomena must not
be confused with the models of these phenomena. For
example, the meme approach lends itself well to a
computational scheme in the spirit of genetic algorithms
(GA). The cost/objective function that describes the
phenomena of course could be processed by any other
sampling technique such as simulated annealing (SA).
One comparison (Ingber & Rosen, 1992) demonstrated
the superiority of SA over GA on cost/objective
functions used in a GA database. That study used Very Fast
Simulated Reannealing (VFSR), created by the author
for military simulation studies (Ingber, 1989), which
has evolved into Adaptive Simulated Annealing (ASA)
(Ingber, 1993). However, it is the author's experience
that the art and science of sampling complex systems
requires tuning expertise of the researcher as well as
good codes, and GA or SA likely would do as well on
cost functions for this study.
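
As a point of reference for the sampling techniques named above, the following is a minimal sketch of textbook simulated annealing with a Metropolis acceptance rule and geometric cooling. It is not VFSR or ASA, whose annealing schedules and reannealing steps differ; the function names, the one-dimensional test cost function, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(cost, x0, lower, upper, steps=20000,
                        t0=1.0, cooling=0.9995, seed=0):
    """Minimize a 1-D cost function on [lower, upper]; return (best_x, best_cost)."""
    rng = random.Random(seed)
    x, fx = x0, cost(x0)
    best_x, best_f = x, fx
    temperature = t0
    step = 0.1 * (upper - lower)  # fixed proposal width (an assumption)
    for _ in range(steps):
        # Propose a neighbor and clip it to the allowed range.
        candidate = min(max(x + rng.gauss(0.0, step), lower), upper)
        fc = cost(candidate)
        # Always accept improvements; accept uphill moves with
        # probability exp(-(fc - fx) / temperature).
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / temperature):
            x, fx = candidate, fc
            if fx < best_f:
                best_x, best_f = x, fx
        temperature *= cooling  # geometric cooling schedule
    return best_x, best_f


def rastrigin_1d(x):
    """A multimodal test function with its global minimum of 0 at x = 0."""
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))


if __name__ == "__main__":
    best_x, best_f = simulated_annealing(rastrigin_1d, 4.5, -5.12, 5.12)
    print("best x:", best_x, "best cost:", best_f)
```

Accepting occasional uphill moves at high temperature is what lets the sampler escape local minima; how quickly the temperature is lowered is the main tuning decision, which is the kind of expertise the text refers to.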

If there are no analytic or relatively standard math
functions for the transformations required, then these
transformations must be performed explicitly
numerically in code such as TRD. Then, the ASA_PARALLEL
OPTIONS already existing in ASA (developed as part of
the 1994 National Science Foundation Parallelizing ASA
and PATHINT Project (PAPP)) would be very useful to
speed up real-time calculations (Ingber, 1993). Below,
only a few topics relevant to ISM are discussed. More
details are in a previous report (Ingber, 2006).



SMNI AND SMFM APPLIED TO 
ARTIFICIAL INTELLIGENCE 

Neocortex has evolved to use minicolumns of neurons
interacting via short-ranged interactions in
macrocolumns, and interacting via long-ranged interactions
across regions of macrocolumns. This common
architecture processes patterns of information within
and among different regions of sensory, motor,
associative cortex, etc. Therefore, the premise of this
approach is that this is a good model to describe and
analyze evolution/propagation of ideas among defined
populations.

Relevant to this study is that a spatial-temporal
lattice-field short-time conditional multiplicative-noise
(nonlinear in drifts and diffusions) multivariate
Gaussian-Markovian probability distribution is
developed faithful to neocortical function/physiology.
Such probability distributions are a basic input into
the approach used here. The SMNI model was the first
physical application of a nonlinear multivariate calculus
developed by other mathematical physicists in the late
1970s to define a statistical mechanics of multivariate
nonlinear nonequilibrium systems (Graham, 1977;
Langouche et al., 1982).

SMNI Tests on STM and EEG

SMNI builds from synaptic interactions to minicolumnar,
macrocolumnar, and regional interactions in
neocortex. Since 1981, a series of SMNI papers has been
developed to model columns and regions of neocortex,
spanning mm to cm of tissue. Most of these papers
have dealt explicitly with calculating properties of STM
and scalp EEG in order to test the basic formulation
of this approach (Ingber, 1983; Ingber, 1985; Ingber
& Nunez, 1995).

The SMNI modeling of local mesocolumnar
interactions (convergence and divergence between
minicolumnar and macrocolumnar interactions) was
tested on STM phenomena. The SMNI modeling of
macrocolumnar interactions across regions was tested
on EEG phenomena.

SMNI Description of STM

SMNI studies have detailed that maximal numbers of
attractors lie within the physical firing space of both
excitatory and inhibitory minicolumnar firings,
consistent with experimentally observed capacities of
auditory and visual STM, when a "centering" mechanism
is enforced by shifting background noise in synaptic
interactions, consistent with experimental observations
under conditions of selective attention (Ingber, 1985;
Ingber, 1994).

These calculations were further supported by
high-resolution evolution of the short-time
conditional-probability propagator using PATHINT (Ingber & Nunez,
1995). SMNI correctly calculated the stability and
duration of STM, the primacy versus recency rule,


Figure 1. Illustrated are three biophysical scales of neocortical interactions: (a)-(a*)-(a') microscopic neurons;
(b)-(b') mesocolumnar domains; (c)-(c') macroscopic regions (Ingber, 1983). SMNI has developed appropriate
conditional probability distributions at each level, aggregating up from the smallest levels of interactions. In
(a*) synaptic inter-neuronal interactions, averaged over by mesocolumns, are phenomenologically described by
the mean and variance of a distribution Ψ. Similarly, in (a) intraneuronal transmissions are phenomenologically
described by the mean and variance of Γ. Mesocolumnar averaged excitatory (E) and inhibitory (I) neuronal
firings M are represented in (a'). In (b) the vertical organization of minicolumns is sketched together with their
horizontal stratification, yielding a physiological entity, the mesocolumn. In (b') the overlap of interacting
mesocolumns at locations r and r' from times t and t + τ is sketched. In (c) macroscopic regions of neocortex
are depicted as arising from many mesocolumnar domains. (c') sketches how regions may be coupled by
long-ranged interactions.







AI and Ideas by Statistical Mechanics



random access to memories within tenths of a second 
as observed, and the observed 7±2 capacity rule of 
auditory memory and the observed 4±2 capacity rule 
of visual memory. 

SMNI also calculates how STM patterns (e.g.,
from a given region or even aggregated from multiple
regions) may be encoded by dynamic modification of
synaptic parameters (within experimentally observed
ranges) into long-term memory patterns (LTM)
(Ingber, 1983).

SMNI Description of EEG 

Using the power of this formal structure, sets of EEG
and evoked potential data from a separate NIH study,
collected to investigate genetic predispositions to
alcoholism, were fitted to an SMNI model on a lattice
of regional electrodes to extract brain "signatures"
of STM (Ingber, 1997). Each electrode site was
represented by an SMNI distribution of independent
stochastic macrocolumnar-scaled firing variables,
interconnected by long-ranged circuitry with delays
appropriate to long-fiber communication in
neocortex. The global optimization algorithm ASA was used
to perform maximum likelihood fits of Lagrangians
defined by path integrals of multivariate conditional
probabilities. Canonical momenta indicators (CMI)
were thereby derived for individuals' EEG data. The
CMI give better signal recognition than the raw data,
and were used to advantage as correlates of behavioral
states. In-sample data were used for training (Ingber,
1997), and out-of-sample data were used for testing
these fits. The architecture of ISM is modeled using
scales similar to those used for local STM and global
EEG connectivity.

Generic Mesoscopic Neural Networks 

SMNI was applied to a parallelized generic mesoscopic
neural network (MNN) (Ingber, 1992), adding
computational power to a similar paradigm proposed for
target recognition.

"Learning" takes place by presenting the MNN with 
data, and parametrizing the data in terms of the firings,
or multivariate firings. The "weights," or coefficients
of functions of firings appearing in the drifts and
diffusions, are fit to incoming data, considering the joint
"effective" Lagrangian (including the logarithm of the
prefactor in the probability distribution) as a dynamic
cost function. This program of fitting coefficients in the
Lagrangian uses methods of ASA. "Prediction" takes
advantage of a mathematically equivalent
representation of the Lagrangian path-integral algorithm, i.e.,
a set of coupled Langevin rate-equations. A coarse
deterministic estimate to "predict" the evolution can
be applied using the most probable path, but PATHINT
has been used. PATHINT, even when parallelized,
typically can be too slow for "predicting" evolution of
these systems. However, PATHTREE is much faster.

Figure 2. Scales of interactions among minicolumns
are represented, within macrocolumns, across
macrocolumns, and across regions of macrocolumns
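
The Langevin rate-equation representation mentioned above can be illustrated with a generic one-variable equation dM = f(M) dt + g(M) dW, integrated with the Euler-Maruyama scheme (an Itô, prepoint discretization). This is only a sketch under stated assumptions: the drift f and diffusion g used here are hypothetical placeholders, not the SMNI or MNN functional forms, and it is not a substitute for PATHINT or PATHTREE, which evolve the full probability distribution rather than single sample paths.

```python
import math
import random


def euler_maruyama(drift, diffusion, m0, dt, steps, rng):
    """Integrate dM = drift(M) dt + diffusion(M) dW and return the sample path."""
    m = m0
    path = [m]
    for _ in range(steps):
        # Wiener increment over one time step has variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        m = m + drift(m) * dt + diffusion(m) * dw
        path.append(m)
    return path


if __name__ == "__main__":
    rng = random.Random(0)

    def drift(m):
        # Hypothetical relaxation toward zero firing.
        return -m

    def diffusion(m):
        # Hypothetical multiplicative noise (depends on the state).
        return 0.3 * math.sqrt(1.0 + m * m)

    path = euler_maruyama(drift, diffusion, m0=1.0, dt=0.01, steps=500, rng=rng)
    print("final value of one sample path:", path[-1])
```

Setting the diffusion to zero reduces the scheme to a plain Euler integration of the drift, which gives a coarse deterministic estimate of the evolution.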

Architecture for Selected ISM Model 

The primary objective is to deliver a computer model
that contains the following features: (1) A multivariable
space will be defined to accommodate populations.
(2) A cost function over the population variables in
(1) will be defined to explicitly define a pattern that
can be identified as an Idea. A very important issue
for this project is to develop cost functions, not only
how to fit or process them. (3) Subsets of the
population will be used to fit parameters (e.g., coefficients
of variables, connectivities to patterns, etc.) to an
Idea, using the cost function in (2). (4) Connectivity
of the population in (3) will be made to the rest of the
population. Investigations will be made to determine
what endogenous connectivity is required to stop or
promote the propagation of the Idea into other regions
of the population. (5) External forces, e.g., acting only
on specific regions of the population, will be introduced,
to determine how these exogenous forces may stop or
promote the propagation of an Idea.

Application of SMNI Model 

The approach is to develop subsets of
Ideas/macrocolumnar activity of multivariate stochastic descriptions of
defined populations (of reasonable but small population
samples, e.g., of 100-1000), with macrocolumns
defined by their local parameters within specific regions
(larger samples of populations) and with parameterized
long-ranged inter-regional and external connectivities.
Parameters of a given subset of macrocolumns will
be fit using ASA to patterns representing Ideas, akin
to acquiring hard-wired long-term (LTM) patterns.
Parameters of external and inter-regional interactions
will be determined that promote or inhibit the spread
of these Ideas, by determining the degree of fits and
overlaps of probability distributions relative to the
seeded macrocolumns.

That is, the same Ideas/patterns may be represented
in other than the seeded macrocolumns by local
confluence of macrocolumnar and long-ranged firings, akin
to STM, or by different hard-wired parameter LTM
sets that can support the same local firings in other
regions (possible in nonlinear systems). SMNI also
calculates how STM can be dynamically encoded into
LTM (Ingber, 1983).

Small populations in regions will be sampled to 
determine if the propagated Idea(s) exists in its pattern 
space where it did exist prior to its interactions with the 
seeded population. SMNI derives nonlinear functions
as arguments of probability distributions, leading to
multiple STM, e.g., 7±2 for auditory memory
capacity. Some investigation will be made into nonlinear
functional forms other than those derived for SMNI,
e.g., to have capacities of tens or hundreds of patterns
for ISM.

Application of TRD Analysis 

This approach includes application of methods of
portfolio risk analysis to such statistical systems,
correcting two kinds of errors committed in multivariate risk
analyses: (E1) Although the distributions of variables
being considered are not Gaussian (or not tested to see
how close they are to Gaussian), standard statistical
calculations appropriate only to Gaussian
distributions are employed. (E2) Either correlations among
the variables are ignored, or the mistakes committed
in (E1), incorrectly assuming variables are Gaussian,
are compounded by calculating correlations as if all
variables were Gaussian.

It should be understood that any sampling algorithm
processing a huge number of states can find
multiple optima. ASA's MULTI_MIN OPTIONS are
used to save multiple optima during sampling. Some
algorithms might label these states as "mutations" of
optimal states. It is important to be able to include them
in final decisions, e.g., to apply additional metrics of
performance specific to applications. Experience with
risk-managing portfolios shows that all criteria are
not best considered by lumping them all into one cost
function, but rather good judgment should be applied to
multiple stages of pre-processing and post-processing
when performing such sampling, e.g., adding additional
metrics of performance.



FUTURE TRENDS 

Given financial and political motivations to merge
information discussed in the Introduction, it is inevitable
that many AI algorithms will be developed, and many
current AI algorithms will be enhanced, to address
these issues.



CONCLUSION 

It seems appropriate to base an approach for
propagation of generic ideas on the only system so far
demonstrated to develop and nurture ideas, i.e., the
neocortical brain. A statistical mechanical model of
neocortical interactions, developed by the author and
tested successfully in describing short-term memory and
EEG indicators, underlies the proposed model, Ideas by
Statistical Mechanics (ISM) (Ingber, 2006). ISM develops
subsets of macrocolumnar activity of multivariate
stochastic descriptions of defined populations, with
macrocolumns defined by their local parameters within
specific regions and with parameterized endogenous
inter-regional and exogenous external connectivities.
Tools of financial risk management, developed to
process correlated multivariate systems with differing
non-Gaussian distributions using modern copula
analysis, importance-sampled using ASA, will enable
bona fide correlations and uncertainties of success and
failure to be calculated.

KEY TERMS

Copula Analysis: This transforms non-Gaussian
probability distributions to a common appropriate
space (usually a Gaussian space) where it makes sense
to calculate correlations as second moments.

DIME: Represents diplomatic, information,
military, and economic aspects of information that must
be merged into a coherent pattern.

Global Optimization: Refers to a collection of
algorithms used to statistically sample a space of
parameters or variables to optimize a system, but also
often used to sample a huge space for information.
There are many variants, including simulated
annealing, genetic algorithms, ant colony optimization,
hill-climbing, etc.

ISM: An acronym for Ideas by Statistical
Mechanics in the context of the noun defined as: A belief
(or system of beliefs) accepted as authoritative by some
group or school. A doctrine or theory; especially, a wild
or visionary theory. A distinctive doctrine, theory,
system, or practice.

Meme: Alludes to a technology originally defined
to explain social evolution, which has been refined
to mean a gene-like analytic tool to study cultural
evolution.

Memory: This may have many forms and
mechanisms. Here, two major processes of neocortical memory
are used for AI technologies, short-term memory (STM)
and long-term memory (LTM).

Simulated Annealing (SA): A class of algorithms
for sampling a huge space, which has a mathematical
proof of convergence to global optimal minima. Most
SA algorithms applied to most systems do not fully take
advantage of this proof, but the proof often is useful
to give confidence that the system will avoid getting
stuck for a long time in local optimal regions.

Statistical Mechanics: A branch of mathematical
physics dealing with systems with a large number of
states. Applications of nonequilibrium nonlinear
statistical mechanics are now common in many fields, ranging
from physical and biological sciences, to finance, to
computer science, etc.
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INTRODUCTION 

Biological systems can be viewed as information man- 
agement systems, with a basic instruction set stored 
in each cell's DNA as "genes." For most genes, their 
information is enabled when they are transcribed into 
RNA which is subsequently translated into the proteins 
that form much of a cell's machinery. Although details 
of the process for individual genes are known, more 
complex interactions between elements are yet to be 
discovered. What we do know is that diseases can result 
if there are changes in the genes themselves, in the 
proteins they encode, or if RNAs or proteins are made 
at the wrong time or in the wrong quantities. 

Recent advances in biotechnology led to the de- 
velopment of DNA microarrays, which quantitatively 
measure the expression of thousands of genes simul- 
taneously and provide a snapshot of a cell's response 
to a particular condition. Finding patterns of gene ex- 
pression that provide insight into biological endpoints 
offers great opportunities for revolutionizing diagnostic 
and prognostic medicine and providing mechanistic 
insight in data-driven research in the life sciences, an 
area with a great need for advances, given the urgency 
associated with diseases. However, microarray data 
analysis presents a number of challenges, from noisy 
data to the curse of dimensionality (large number of 
features, small number of instances) to problems with 
no clear solutions (e.g. real world mappings of genes 
to traits or diseases that are not yet known). 

Finding patterns of gene expression in microarray 
data poses problems of class discovery, comparison, 
prediction, and network analysis which are often 
approached with AI methods. Many of these methods have 
been successfully applied to microarray data analysis 
in a variety of applications ranging from clustering of 
yeast gene expression patterns (Eisen et al., 1998) to 
classification of different types of leukemia (Golub et al., 
1999). Unsupervised learning methods (e.g. hierarchical 
clustering) explore clusters in data and have been used 
for class discovery of distinct forms of diffuse large 
B-cell lymphoma (Alizadeh et al., 2000). Supervised 
learning methods (e.g. artificial neural networks) utilize 
a previously determined mapping between biological 
samples and classes (i.e. labels) to generate models for 
class prediction. A k-nearest neighbor (k-NN) approach 
was used to train a gene expression classifier of different 
forms of brain tumors and its predictions were able 
to distinguish biopsy samples with different prognosis, 
suggesting that microarray profiles can predict clinical 
outcome and direct treatment (Nutt et al., 2003). 
Bayesian networks constructed from microarray data 
hold promise for elucidating the underlying biological 
mechanisms of disease (Friedman et al., 2000). 



BACKGROUND 

Cells dynamically respond to their environment by 
changing the set and concentrations of active genes by 
altering the associated RNA expression. Thus "gene 
expression" is one of the main determinants of a cell's 
state, or phenotype. For example, we can investigate the 
differences between a normal cell and a cancer cell by 
examining their relative gene expression profiles. 

Microarrays quantify gene expression levels in various 
conditions (such as disease vs. normal) or across 
time points. For n genes and m instances (biological 
samples), microarray measurements are stored in an 
n by m matrix where each row is a gene, each column 
is a sample and each element in the matrix is the 
expression level of a gene in a biological sample, where 
samples are instances and genes are features describing 
those instances. Microarray data is available through 
many public online repositories (Table 1). In addition, 
the Kent-Ridge repository (http://sdmc.i2r.a-star.edu.sg/rp/) 
contains pre-formatted data ready to use with 
the well-known machine learning tool Weka (Witten 
& Frank, 2000). 

Table 1. Some public online repositories of microarray data 

Name of the repository                                         URL 
ArrayExpress at the European Bioinformatics Institute          http://www.ebi.ac.uk/arrayexpress/ 
Gene Expression Omnibus at the National Institutes of Health   http://www.ncbi.nlm.nih.gov/geo/ 
Stanford microarray database                                   http://smd.stanford.edu/ 
Oncomine                                                       http://www.oncomine.org/main/index.jsp 
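
The n by m layout described above can be sketched in a few lines. The gene names, sample names and expression values below are invented for illustration only; the point is simply that many learning tools expect one row per instance, so the gene-by-sample matrix is usually transposed before use.

```python
# Hypothetical toy data: 4 genes (rows) by 3 samples (columns).
genes = ["g1", "g2", "g3", "g4"]
samples = ["tumor_1", "tumor_2", "normal_1"]
expression = [
    [2.1, 1.9, 0.3],   # g1 across the three samples
    [0.5, 0.4, 0.6],   # g2
    [1.2, 1.1, 3.0],   # g3
    [0.0, 0.2, 0.1],   # g4
]


def transpose(matrix):
    """Turn an n-by-m (genes x samples) matrix into m-by-n (samples x genes)."""
    return [list(column) for column in zip(*matrix)]


# One row per instance (sample), one column per feature (gene).
instances = transpose(expression)
```

Here `instances[0]` is the feature vector of the first sample, with one value per gene.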

Microarray data presents some unique challenges for 
AI such as a severe case of the curse of dimensionality 
due to the scarcity of biological samples (instances). 
Microarray studies typically measure tens of thousands 
of genes in only tens of samples. This low case to 
variable ratio increases the risk of detecting spurious 
relationships. This problem is exacerbated because 
microarray data contains multiple sources of within- 
class variability, both technical and biological. The high 
levels of variance and low sample size make feature 
selection difficult. Testing thousands of genes creates 
a multiple testing problem, which can result in under- 
estimating the number of false positives. Given data 
with these limitations, constructing models becomes 
under-determined and therefore prone to over-fitting. 
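
To make the multiple testing problem concrete, the sketch below first computes how many false positives are expected when every gene is tested at a fixed cut-off, and then applies one common correction, the Benjamini-Hochberg procedure. That procedure is not named in the text above and is used here only as an illustrative assumption; the p-values are invented.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null hypotheses rejected by chance alone."""
    return n_tests * alpha


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [i for i, p in enumerate(p_values) if p <= 0.05]
rejected = benjamini_hochberg(p_values, q=0.05)
```

With 20,000 genes and a cut-off of 0.05, about 1,000 genes would be called significant by chance, which is why an uncorrected gene list can look far more convincing than it is.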

From biology, it is also clear that genes do not act 
independently. Genes interact in the form of pathways 
or gene regulatory networks. For this reason, we need 
models that can be interpreted in the context of path- 
ways. Researchers have successfully applied AI meth- 
ods to microarray data preprocessing, clustering, feature 
selection, classification, and network analysis. 



MINING MICROARRAY DATA: 
CURRENT TECHNIQUES, CHALLENGES 
AND OPPORTUNITIES FOR AI 

Data Preprocessing 

After obtaining microarray data, normalization is per- 
formed to account for systematic measurement biases 
and to facilitate between-sample comparisons (Quack- 
enbush, 2002). Microarray data may contain missing 
values that may be replaced by mean replacement or 
k-NN imputation (Troyanskaya et al., 2001). 
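
The two replacement strategies mentioned above can be sketched as follows. This is a minimal illustration, not the published k-NN imputation method: the distance measure, the choice of k, and the toy matrix (rows are genes, `None` marks a missing value) are all assumptions, and the code assumes every row has at least one observed value and every missing entry has at least one donor row.

```python
import math


def row_mean_impute(matrix):
    """Replace each missing value by the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        filled.append([mean if v is None else v for v in row])
    return filled


def distance(a, b):
    """Root-mean-square difference over the columns observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def knn_impute(matrix, k=2):
    """Replace each missing value by the average over the k most similar rows."""
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for r, other in enumerate(matrix)
                      if r != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical 4-gene by 3-sample matrix with one missing measurement.
matrix = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
by_mean = row_mean_impute(matrix)
by_knn = knn_impute(matrix, k=2)
```

In this toy case the row mean fills the gap with 1.5, whereas the neighbor-based estimate borrows from the two genes with the most similar profiles and gives a value close to 3.1.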

Feature Selection 

The goal of feature selection is to find genes (features) 
that best distinguish groups of instances (e.g. disease 
vs. normal) to reduce the dimensionality of the dataset. 
Several statistical methods including t-test, significance 
analysis of microarrays (SAM) (Tusher et al., 2001), 
and analysis of variance (ANOVA) have been applied 
to select features from microarray data. 
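
As an illustration of statistical gene ranking, the sketch below scores each gene with a two-sample t-statistic and keeps the top-ranked genes. The unequal-variance (Welch) form of the statistic is an assumption chosen for brevity, the expression values are invented, and each group is assumed to have at least two samples with non-zero variance.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t-statistic that does not assume equal variances."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, labels, top=2):
    """Return the indices of the `top` genes with the largest |t|."""
    scores = []
    for g, row in enumerate(expression):
        group1 = [v for v, y in zip(row, labels) if y == 1]
        group0 = [v for v, y in zip(row, labels) if y == 0]
        scores.append((abs(welch_t(group1, group0)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top]]


# Hypothetical data: 3 genes (rows), 3 disease and 3 normal samples (columns).
expression = [
    [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],   # strongly different between groups
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],   # no difference
    [0.1, 0.3, 0.2, 0.4, 0.6, 0.5],   # moderately different
]
labels = [1, 1, 1, 0, 0, 0]
selected = rank_genes(expression, labels, top=2)
```

Each gene is scored on its own here, which is why such univariate rankings cannot detect redundancy between genes.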

In classification experiments, feature selection 
methods generally aim to identify relevant gene subsets 
to construct a classifier with good performance (Inza 
et al., 2004). Features are considered to be relevant 
when they can affect the class; the strongly relevant 
are indispensable to prediction and the weakly relevant 
may only sometimes contribute to prediction. 

Filter methods evaluate feature subsets regardless 
of the specific learning algorithm used. The statistical 
methods for feature selection discussed above, as well 
as rankers such as information gain rankers, are filters for 
the features to be included. These methods ignore the 
fact that there may be redundant features (features that 
are highly correlated with each other, so that one 
can be used to replace the other) and so do not seek 
to find a set of features which could perform similarly 
with fewer variables while retaining the same predictive 
power (Guyon & Elisseeff, 2003). For this reason 
multivariate methods are more appropriate. 

As an alternative, wrappers consider the learning 
algorithm as a black-box and use prediction accuracy to 
evaluate feature subsets (Kohavi & John, 1997). Wrap- 
pers are more direct than filter methods but depend on the 
particular learning algorithm used. The computational 
complexity associated with wrappers is prohibitive 
due to the curse of dimensionality, so typically filters are 
used with forward selection (starting with an empty set 
and adding features one by one) instead of backward 
elimination (starting with all features and removing 
them one by one). Dimension reduction approaches 
are also used for multivariate feature selection. 

Dimension Reduction Approaches 

Principal component analysis (PCA) is widely used for 
dimension reduction in machine learning (Wall et al., 
2003). The idea behind PCA is quite intuitive: correlated 
objects can be combined to reduce data "dimensionality". 
Relationships between gene expression profiles in 
a data matrix can be expressed as a linear combination 
such that collinear variables are regressed onto a new 
set of coordinates. PCA, its underlying method Singular 
Value Decomposition (SVD), related approaches such as 
correspondence analysis (COA), and multidimensional 
scaling (MDS) have been applied to microarray data 
and are reviewed by Brazma & Culhane (2005). Studies 
have reported that COA or other dual scaling dimension 
reduction approaches such as spectral map analysis may 
be more appropriate than PCA for decomposition of 
microarray data (Wouters et al., 2003). 
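
The idea of combining correlated variables into a new coordinate can be illustrated with a small sketch. Instead of a full SVD, it finds only the first principal component by power iteration on the covariance matrix, a simpler technique substituted here to keep the example dependency-free. The data are invented, rows are samples and columns are genes, and the starting vector is assumed not to be orthogonal to the leading component.

```python
import math


def center(rows):
    """Subtract the column mean from every column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical data: gene 2 is exactly twice gene 1, gene 3 is constant.
rows = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
centered = center(rows)
component = first_component(covariance(centered))
# Project each sample onto the component to get a one-dimensional summary.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]
```

Because the first two genes are perfectly collinear, a single coordinate captures all the variation in this toy data set, and the constant gene receives zero weight.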

While PCA considers the variance of the whole 
dataset, clustering approaches examine the pairwise 
distance between instances or features. Therefore, these 
methods are complementary and are often both used 
in exploratory data analysis. However, difficulties in 
interpreting the results in terms of discrete genes limit 
the application of these methods. 

Clustering 

What we see as one disease is often a collection of 
disease subtypes. Class discovery aims to discover 
these subtypes by finding groups of instances with 
similar expression patterns. Hierarchical clustering is 
an agglomerative method which starts with each data 
point as a singleton cluster and groups similar data 
points using some distance measure, such that the two 
data points that are most similar are grouped together 
in a cluster by making them children of a parent node 
in the tree. This process is repeated in a bottom-up 
fashion until all data points belong to a single cluster 
(corresponding to the root of the tree). 
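
The bottom-up procedure described above can be sketched as follows. Euclidean distance and average linkage are assumptions made for this illustration (the text leaves the distance measure open), and the five two-dimensional points are invented.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start from singletons
    merges = []                                    # record of the tree
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical expression profiles of five samples (two features each).
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]]
clusters, merges = agglomerate(points, n_clusters=2)
```

Calling `agglomerate(points, n_clusters=1)` continues merging up to the root; the recorded merges then describe the full tree, which can be cut at any level to obtain candidate subtypes.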

Hierarchical and other clustering approaches, 
including K-means, have been applied to microarray 
data (Causton et al., 2003). Hierarchical clustering 
was applied to study gene expression in samples from 
patients with diffuse large B-cell lymphoma (DLBCL) 
resulting in the discovery of two subtypes of the dis- 
ease. These groups were found by analyzing microar- 
ray data from biopsy samples of patients who had not 
been previously treated. These patients continued to 
be studied after chemotherapy, and researchers found 
that the two newly discovered disease subtypes had 
different survival rates, confirming the hypothesis that 
the subtypes had significantly different pathologies 
(Alizadeh et al., 2000). 

While clustering simply groups the given data based 
on pair-wise distances, when information (i.e. labels) 
is known a priori about some or all of the data, a 
supervised approach can be used to obtain a classifier 
that can predict the label of new instances. 

Classification (Supervised Learning) 

The large dimensionality of microarray data means that 
all classification methods are susceptible to over-fitting. 
Several supervised approaches have been applied to 
microarray data including Artificial Neural Networks 
(ANNs), Support Vector Machines (SVMs), and k-NNs 
among others (Hastie et al., 2001). 

A very challenging and clinically relevant problem 
is the accurate diagnosis of the primary origin of 
metastatic tumors. Bloom et al. (2004) applied ANNs 
to the microarray data of 21 tumor types and predicted, 
with 88% accuracy, the primary site of origin of 
metastatic cancers with unknown origin. A classification 
accuracy of 84% was obtained on an independent test set, 
with important implications for diagnosing cancer origin 
and directing therapy. 

In a comparison of different SVM approaches, 
multicategory SVMs were reported to outperform other 
popular machine learning algorithms such as k-NNs 
and ANNs (Statnikov et al., 2005) when applied to 11 
publicly available microarray datasets related to cancer. 
It is worth noting that feature selection can significantly 
improve classification performance. 

Cross-Validation 

Cross-validation (CV) is appropriate in microarray studies, 
which are often limited by the number of instances 
(e.g. patient samples). In k-fold CV, the training set is 
divided into k subsets of equal size. In each iteration 
k-1 subsets are used for training and one subset is 
used for testing. This process is repeated k times and 
the mean accuracy is reported. Unfortunately, some 
published studies have applied CV only partially, by 
applying CV on the creation of the prediction rule 
while excluding feature selection. This introduces a 
bias in the estimated error rates and over-estimates 
the classification accuracy (Simon et al., 2003). As a 
consequence, results from many studies are contro- 
versial due to methodological flaws (Dupuy & Simon, 
2007). Therefore, models must be evaluated carefully 
to prevent selection bias (Ambroise & McLachlan, 
2002). Nested CV is recommended, with an inner CV 
loop to perform the tuning of the parameters and an 
outer CV to compute an estimate of the error (Varma 
& Simon, 2006). 
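
The selection bias discussed above is avoided when gene selection is repeated inside every training fold. The sketch below shows that arrangement for a plain k-fold CV. It does not include the inner tuning loop of nested CV, and the difference-of-means ranking, the nearest-centroid classifier, the interleaved fold assignment and the toy data are all assumptions made for brevity rather than methods from the cited studies. Each training fold is assumed to contain both classes.

```python
from statistics import mean


def select_genes(X, y, n_genes):
    """Rank genes by absolute difference of class means (training data only)."""
    scores = []
    for g in range(len(X[0])):
        pos = mean(row[g] for row, label in zip(X, y) if label == 1)
        neg = mean(row[g] for row, label in zip(X, y) if label == 0)
        scores.append((abs(pos - neg), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:n_genes]]


def fit_centroids(X, y, genes):
    """Mean profile of each class over the selected genes."""
    return {label: [mean(row[g] for row, lab in zip(X, y) if lab == label)
                    for g in genes]
            for label in set(y)}


def predict(centroids, row, genes):
    """Assign the class whose centroid is closest to the sample."""
    def sq_dist(center):
        return sum((row[g] - c) ** 2 for g, c in zip(genes, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k=3, n_genes=1):
    """k-fold CV in which feature selection is redone in every fold."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(len(X)) if i not in test_idx]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        genes = select_genes(X_train, y_train, n_genes)  # never sees test data
        centroids = fit_centroids(X_train, y_train, genes)
        correct += sum(predict(centroids, X[i], genes) == y[i]
                       for i in test_idx)
    return correct / len(X)


# Hypothetical data: 6 samples (rows) by 3 genes (columns).
X = [[5.0, 1.0, 0.2], [1.0, 1.1, 0.1], [5.2, 0.9, 0.3],
     [0.9, 1.0, 0.2], [4.8, 1.2, 0.1], [1.1, 0.8, 0.3]]
y = [1, 0, 1, 0, 1, 0]
accuracy = cross_validate(X, y, k=3, n_genes=1)
```

The biased variant criticized in the text would call `select_genes` once on all six samples before splitting, so the held-out samples would already have influenced which genes the classifier sees.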

Several studies which have examined similar bio- 
logical problems have reported poor overlap in gene 
expression signatures. Brenton et al. (2005) compared 
two gene lists predictive of breast cancer prognosis 
and found only 3 genes in common. Even though the 
intersection of specific gene lists is poor, the highly 
correlated nature of microarray data means that many 
gene lists may have similar prediction accuracy (Ein-Dor 
et al., 2004). Gene signatures identified from different 
breast cancer studies with few genes in common 
were shown to have comparable success in predicting 
patient survival (Buyse et al., 2006). 

Commonly used supervised learning algorithms 
yield black box models prompting the need for interpre- 
table models that provide insights about the underlying 
biological mechanism that produced the data. 

Network Analysis 

Bayesian networks (BNs), derived from an alliance 
between graph theory and probability theory, can 
capture dependencies among many variables (Pearl, 
1988; Heckerman, 1996). 



Friedman et al. (2000) introduced a multinomial 
model framework for BNs to reverse-engineer networks 
and showed that this method differs from clustering in 
that it can discover gene interactions other than cor- 
relation when applied to yeast gene expression data. 
Spirtes et al. (2002) highlight some of the difficulties of 
applying this approach to microarray data. Nevertheless, 
many extensions of this research direction have been 
explored. Correlation is not necessarily a good predictor 
of interactions, and weak interactions are essential to 
understand disease progression. Identifying the biologi- 
cally meaningful interactions from the spurious ones is 
challenging, and BNs are particularly well-suited for 
modeling stochastic biological processes. 

The exponential growth of data produced by microarray 
technology, as well as other high-throughput 
data (e.g. protein-protein interactions), calls for novel AI 
approaches as the paradigm shifts from a reductionist 
to a mechanistic systems view in the life sciences. 



FUTURE TRENDS 

Uncovering the underlying biological mechanisms 
that generate these data is harder than prediction and 
has the potential to have far-reaching implications for 
understanding disease etiologies. Time series analysis 
(Bar-Joseph, 2004) is a first step to understanding the 
dynamics of gene regulation, but, eventually, we need to 
use the technology not only to observe gene expression 
data but also to direct intervention experiments (Pe'er 
et al., 2001; Yoo et al., 2002) and develop methods to 
investigate the fundamental problem of distinguishing 
correlation from causation. 



CONCLUSION 

We have reviewed AI methods for pre-processing, 
clustering, feature selection, classification and mechanistic 
analysis of microarray data. The clusters, gene 
lists, molecular fingerprints and network hypotheses 
produced by these approaches have already shown 
impact, from discovering new disease subtypes and 
biological markers to predicting clinical outcome for 
directing treatment and unraveling gene networks. 
From the AI perspective, this field offers challenging 
problems and may have a tremendous impact on biology 
and medicine. 
KEY TERMS 

Curse of Dimensionality: A situation where the 
number of features (genes) is much larger than the 
number of instances (biological samples), which is 
known in statistics as the p » n problem. 

Feature Selection: A problem of finding a subset (or 
subsets) of features so as to improve the performance 
of learning algorithms. 

Microarray: A microarray is an experimental assay 
which measures the abundances of mRNA (intermedi- 
ary between DNA and proteins) corresponding to gene 
expression levels in biological samples. 

Multiple testing problem: A problem that occurs 
when a large number of hypotheses are tested 
simultaneously using a user-defined α cut-off p-value, which 
may lead to rejecting a non-negligible number of null 
hypotheses by chance. 

Over-Fitting: A situation where a model learns 
spurious relationships and as a result can predict training 
data labels but not generalize to predict future data. 

Supervised Learning: A learning algorithm that 
is given a training set consisting of feature vectors as- 
sociated with class labels and whose goal is to learn 
a classifier that can predict the class labels of future 
instances. 

Unsupervised Learning: A learning algorithm that 
tries to identify clusters based on similarity between 
features or between instances or both but without tak- 
ing into account any prior knowledge. 
INTRODUCTION 

This work provides a review of real-life 
practical applications of Artificial Intelligence (AI) 
methods. We focus on the use of Machine Learning (ML) 
methods applied to real problems rather than synthetic 
problems with a standard, controlled environment. 
In particular, we describe the following problems 
in the next sections: 

• Optimization of Erythropoietin (EPO) dosages 
  in anaemic patients undergoing Chronic Renal 
  Failure (CRF). 
• Optimization of a recommender system for citizen 
  web portal users. 
• Optimization of a marketing campaign. 

The choice of these problems is due to their 
relevance and their heterogeneity. This heterogeneity 
shows the capabilities and versatility of ML methods 
to solve real-life problems in very different fields of 
knowledge. The following methods will be mentioned 
during this work: 

• Artificial Neural Networks (ANNs): Multilayer 
  Perceptron (MLP), Finite Impulse Response (FIR) 
  Neural Network, Elman Network, Self-Organizing 
  Maps (SOMs) and Adaptive Resonance Theory 
  (ART). 
• Other clustering algorithms: K-Means, 
  Expectation-Maximization (EM) algorithm, Fuzzy 
  C-Means (FCM), Hierarchical Clustering 
  Algorithms (HCA). 
• Generalized Auto-Regressive Conditional 
  Heteroskedasticity (GARCH). 
• Support Vector Regression (SVR). 
• Collaborative filtering techniques. 
• Reinforcement Learning (RL) methods. 



BACKGROUND 

The aim of this communication is to emphasize the 
capabilities of ML methods to deliver practical and 
effective solutions in difficult real-world applications. To 
make the work easy to read, we treat each of 
the three domains separately, namely Pharmacokinetics 
(PK), Web Recommender Systems and Marketing. 

Pharmacokinetics 

Clinical decision-making support systems have used 
Artificial Intelligence (AI) methods since the end of 
the fifties. Nevertheless, it was only during the nineties 
that decision support systems were routinely used in 
clinical practice on a significant scale. In particular, 
ANNs have been widely used in medical applications 
over the last two decades (Lisboa, 2002). One of the first 
relevant studies involving ANNs and Therapeutic Drug 
Monitoring was (Gray, Ash, Jacobi, & Michel, 1991). 
In this work, an ANN-based drug interaction warning 
system was developed with a computerized real-time 
entry medical records system. A reference work in this 
field is found in (Brier, Zurada, & Aronoff, 1995), in 
which the capabilities of ANNs and NONMEM are 
benchmarked. 
Focusing on problems that are closer to the real-life 
application described in the next section, 
there are also a number of recent works involving the 
use of ML for drug delivery in kidney disease. For 
instance, a comparison of renal-related adverse drug 
reactions between rofecoxib and celecoxib, based on 
the WHO/Uppsala Monitoring Centre safety database, 
was carried out by (Zhao, Reynolds, Lejkowith, 
Whelton, & Arellano, 2001). Disproportionality in the 
association between a particular drug and renal-related 
adverse drug reactions was evaluated using a Bayesian 
confidence propagation neural network method. A 
study of prediction of cyclosporine dosage in patients 
after kidney transplantation using neural networks and 
kernel-based methods was carried out in (Camps et al., 
2003). In (Gaweda, Jacobs, Brier, & Zurada, 2003), a 
pharmacodynamic population analysis in CRF patients 
using ANNs was performed. Such models allow for 
adjusting the dosing regime. Finally, in (Martin et al., 
2003), the use of neural networks was proposed for 
the optimization of EPO dosage in patients suffering from 
anaemia connected with CRF. 

Web Recommender Systems 

Recommender systems are widely used in web sites 
including Google. The main goal of these systems is to 
recommend objects which a user might be interested in. 
Two main approaches have been used: content-based 
and collaborative filtering (Zukerman & Albrecht, 
2001), although other kinds of techniques have also 
been proposed (Burke, 2002). 

Collaborative recommenders aggregate ratings 
or recommendations of objects, find user similarities 
based on their ratings, and finally provide new 
recommendations based on inter-user comparisons. 
Some of the most relevant systems using this technique 
are GroupLens/NetPerceptions and Recommender. 
The main advantage of collaborative techniques is 
that they are independent from any machine-readable 
representation of the objects, and that they work well 
for complex objects where subjective judgements are 
responsible for much of the variation in preferences. 
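
A minimal sketch of the collaborative scheme described above follows: ratings are aggregated, users are compared through their ratings, and unseen objects are scored from the opinions of similar users. The cosine similarity measure, the similarity-weighted average, and all user names, item names and ratings are assumptions made for illustration; they are not taken from the systems cited.

```python
import math

# Hypothetical ratings on a 1-5 scale; each user has rated only some items.
ratings = {
    "ana":   {"news": 5, "taxes": 4, "sports": 1},
    "ben":   {"news": 4, "taxes": 5, "library": 5},
    "carla": {"sports": 5, "library": 1},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings):
    """Score items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return ranked, predicted


ranked, predicted = recommend("ana", ratings)
```

Nothing in the code inspects what the items are, which reflects the independence from any machine-readable representation of the objects noted in the text.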

Content-based learning is used when a user's past 
behaviour is a reliable indicator of his/her future 
behaviour. It is particularly suitable for situations in 
which users tend to exhibit idiosyncratic behaviour. 
However, this approach requires a system to collect 
relatively large amounts of data from each user in order 
to enable the formulation of a statistical model. Examples 
of systems of this kind are text recommendation systems 
like the newsgroup filtering system, NewsWeeder, 
which uses words from its texts as features. 

Marketing 

The latest marketing trends are more concerned with 
maintaining current customers and optimizing their 
behaviour than with getting new ones. For this reason, 
relational marketing focuses on what a company must 
do to achieve this objective. The relationship between a 
company and its customers follows an action-response 
sequence, in which the customers can modify their 
behaviour in accordance with the marketing actions 
developed by the company. 

The development of a good and individualized 
policy is not easy because there are many variables 
to take into account. Applications of this kind can 
be viewed as a Markov chain problem, in which a 
company decides what action to take once the customer 
properties in the current state (time t) are known. 
Reinforcement Learning (RL) can be used to solve this 
task since previous applications have demonstrated its 
suitability in this area. In (Sun, 2003), RL was applied 
to analyse mailing by studying how an action in time 
t influences actions in following times. In (Abe et al., 
2002) and (Pednault, Abe, & Zadrozny, 2002), several 
RL algorithms were benchmarked in mailing problems. 
In (Abe, 2004), RL was used to optimize cross channel 
marketing. 



AI CONTRIBUTIONS IN REAL-LIFE 
APPLICATIONS 

The previous section reviewed related work. 
In this section, we focus on the authors' 
experience in using AI to solve real-life problems. In 
order to show the versatility of AI methods, we 
focus on particular applications from three different 
fields of knowledge, the same ones that were reviewed in 
the previous section. 

Pharmacokinetics 

Although we have also worked on other 
pharmacokinetic problems, in this work we focus 
on perhaps the most relevant one, which is the 
optimization of EPO dosages in patients within a 
haemodialysis program. Patients who suffer from 
CRF tend to suffer from an associated anaemia as 
well. EPO is the treatment of choice for this kind of 
anaemia. The use of this drug has greatly reduced 
cardiovascular problems and the necessity of multiple 
transfusions. However, EPO is expensive, making the 
already costly CRF program even more so. Moreover, 
there are significant risks associated with EPO, such 
as thrombo-embolisms and vascular problems, if 
Haemoglobin (Hb) levels are too high or increase 
too fast. Consequently, optimizing dosage is critical 
to ensure adequate pharmacotherapy as well as a 
reasonable treatment cost. 

Population models, widely used by pharmacokinetics 
researchers, are not suitable for this problem 
since the response to the treatment with EPO is highly 
dependent on the patient. The same dosages may have 
very different responses in different patients, most 
notably the so-called EPO-resistant patients, who do 
not respond to EPO treatment, even after receiving 
high dosages. Therefore, it is preferable to focus on 
an individualized treatment. 

Our first approach to this problem was based on 
predicting the Hb level given a certain administered 
dose of EPO. Although the final goal is to individualize 
EPO doses, we did not predict EPO dose but Hb 
level. The reason is that EPO predictors would model 
the physician's protocol whereas Hb predictors model 
the body's response to the treatment, hence being a more 
"objective" approach. In particular, the following 
models were used: GARCH (Hamilton, 1994), MLP, 
FIR neural network, Elman's recurrent neural network 
and SVR (Haykin, 1999). Accurate prediction models 
were obtained, especially when using ANNs and SVR. 
Dynamic neural networks (i.e., FIR and recurrent) did 
not notably outperform the static MLP, probably due to 
the short length of the time series (Martin et al., 2003). 
An easy-to-use software application was developed 
for clinicians: after the patient's data and a certain 
EPO dose were filled in, the predicted Hb level for 
the next month was shown. 

Although the prediction models were accurate, we 
realized that this prediction approach had a major 
flaw: we had not 
yet achieved a straightforward way to transfer the 
extracted knowledge to daily clinical practice, because 
clinicians had to "play" with different doses to find 
the best solution to attain a certain Hb level. It would 
be better to have an automatic model that suggests 
the actions to be taken in order to attain the targeted 
range of Hb, rather than this "indirect" approach. This 
reflection led us to research new models, and we 
came up with the use of RL (Sutton & Barto, 1998). 
We are currently working on this topic but we have 
already achieved promising results, finding policies 
(sequences of actions) that appear to be better than those 
followed in the hospital, i.e., there is a higher number 
of patients within the desired target of Hb at the end 
of the treatment (Martin et al., 2006a). 

Web Recommender Systems 

A completely different application is described in 
this subsection, namely, the development of web 
recommender systems. The authors proposed a new 
approach to develop recommender systems based on 
collaborative filtering, but also including an analysis of 
the feasibility of the recommender by using a prediction 
stage (Martin et al., 2006b). 

The basic idea was to use clustering algorithms 
in order to find groups of similar users. The following 
clustering algorithms were taken into account: 
K-Means, FCM, HCA, EM algorithm, SOMs and ART. 
New users were assigned to one of the groups found 
by these clustering algorithms, and were then 
recommended web services that were usually 
accessed by other users of the same group but 
had not yet been accessed by these new users (in order 
to maximize the usefulness of the approach). Using 
controlled data sets, the study concluded that ART and 
SOMs performed very well on data sets 
of very different characteristics, whereas HCA and 
EM performed acceptably provided that 
the dimensionality of the data set was not too high and 
the overlap was slight. Algorithms based on K-Means 
achieved the most limited success in the acceptance of 
offered recommendations. 
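
The recommendation step that follows clustering can be sketched as below. The groups are taken as given (they could come from any of the clustering algorithms listed), and the nearest-centroid assignment, the 0.5 usage threshold, the service names and the access vectors are assumptions made for illustration, not details of the published system.

```python
def centroid(vectors):
    """Fraction of group members that accessed each service."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_group(user, centroids):
    """Index of the group whose centroid is closest to the user's vector."""
    def sq_dist(c):
        return sum((u - x) ** 2 for u, x in zip(user, c))
    return min(range(len(centroids)), key=lambda g: sq_dist(centroids[g]))


def recommend(user, centroids, services, threshold=0.5):
    """Services commonly used in the user's group that the user has not used."""
    g = nearest_group(user, centroids)
    return [s for s, used, freq in zip(services, user, centroids[g])
            if not used and freq >= threshold]


# Hypothetical binary access vectors (1 = the user accessed the service).
services = ["taxes", "library", "sports", "jobs"]
groups = [
    [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 1]],   # group found by clustering
    [[0, 0, 1, 0], [0, 1, 1, 0]],                 # another group
]
centroids = [centroid(g) for g in groups]
new_user = [1, 0, 0, 0]
suggestions = recommend(new_user, centroids, services)
```

Services the new user has already accessed are filtered out, matching the aim of maximizing the usefulness of each recommendation.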

Even though the use of RL has been studied only 
briefly, it seems to be a suitable choice for this 
problem, since the internal dynamics of the problem 
are easily tackled by RL, and moreover the interference 
between the recommendation interface and the user can 
be minimized with an adequate definition of the rewards 
(Hernandez, Gaudioso, & Boticario, 2004). 
Marketing 

The last application that will be mentioned in this 
communication is related to marketing. One way to 
increase the loyalty of customers is by offering them 
the opportunity to obtain some gifts as the result of 
their purchases from a certain company. The company 
can give virtual credits to anyone who buys certain 
articles, typically those that the company is interested 
in promoting. After a certain number of purchases, the 
customers can exchange their virtual credits for the gifts 
offered by the company. The problem is to establish the 
appropriate number of virtual credits for each promoted 
item. In accordance with the company policy, it is 
expected that the higher the credit assignment, the higher 
the amount of purchases. However, the company's 
profits are lower since the marketing campaign adds 
an extra cost to the company. The goal is to achieve a 
trade-off by establishing an optimal policy. 

We proposed an RL approach to optimize this 
marketing campaign. This particular application, whose 
characteristics are described below, is much more 
difficult than the other RL approaches to marketing 
mentioned in the Background Section. This is basically 
because there are many more different actions that can 
be taken. The information used for the study corresponds 
to five months of the campaign, involving 1,264,862 
transactions, 1,004 articles and 3,573 customers. 

RL can deal with intrinsic dynamics and, besides, it 
has the attractive advantage that it is able to maximize the 
so-called long-term reward. This is especially relevant 
in this application since the company is interested in 
maximizing the profits at the end of the campaign, 
and a customer who does not produce much profit in 
the first months of the campaign may nevertheless make 
many profitable transactions in the future. 
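
The difference between immediate and long-term reward can be illustrated with a toy model. The sketch below applies the standard tabular Q-learning update to a hypothetical two-state customer model; the states, actions, profits, discount factor and learning rate are all invented, and the experience is generated by sweeping deterministically over every state-action pair rather than by interacting with real customers. It is not the model used in the campaign study.

```python
# Hypothetical model: (state, action) -> (immediate profit, next state).
MODEL = {
    ("new", "low_credit"): (2.0, "new"),
    ("new", "high_credit"): (1.0, "loyal"),
    ("loyal", "low_credit"): (5.0, "loyal"),
    ("loyal", "high_credit"): (4.0, "loyal"),
}
STATES = ["new", "loyal"]
ACTIONS = ["low_credit", "high_credit"]


def q_learning(model, gamma=0.9, alpha=0.5, sweeps=500):
    """Tabular Q-learning updates applied to every transition in turn."""
    q = {key: 0.0 for key in model}
    for _ in range(sweeps):
        for (state, action), (reward, next_state) in model.items():
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q


def greedy_policy(q):
    """Action with the highest learned long-term value in each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


def myopic_policy(model):
    """Action with the highest immediate profit in each state."""
    return {s: max(ACTIONS, key=lambda a: model[(s, a)][0]) for s in STATES}


q = q_learning(MODEL)
policy = greedy_policy(q)
myopic = myopic_policy(MODEL)
```

In this toy model the myopic rule gives new customers few credits because that pays more today, whereas the learned policy accepts a lower immediate profit in order to move them to the more profitable loyal state.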

Our first results showed that using a policy 
based on RL, instead of the policy followed by the 
company so far, could even double long-term profits 
at the end of the campaign (Gomez et al., 2005). 



CONCLUSION AND FUTURE TRENDS 

This paper has shown the capabilities and versatility 
of different AI methods applied to real-life 
problems, illustrated with three specific applications 
in different domains. Clearly, the methodology is 
generic and applies equally well to many other fields, 
provided that the information contained in the data is 
sufficiently rich to require non-linear modelling and is 
capable of supporting a predictive performance that is 
of practical value. 

As a next future trend, it should be emphasized 
that Al methods are increasingly popular for business 
applications in recent years, challenging classical 
business models. 

In the particular case of RL, the commercial potential 
of this powerful methodology has been significantly 
underestimated, as it is applied almost exclusively to 
Robotics. We feel that it is a methodology still to be 
exploited in many real applications, as we have shown 
in this paper.

KEY TERMS

Agent: In RL terms, it is the entity responsible for
making decisions according to observations of its
environment.

Environment: In RL terms, it is every condition
external to the agent.

Exploration-Exploitation Dilemma: It is a classical
RL dilemma, in which a trade-off solution must be
achieved. Exploration means a random search for new
actions in order to achieve a likely (but as yet unknown)
better reward than all the known ones, while exploitation
is focused on exploiting the current knowledge for the
maximization of the reward (greedy approach).
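
One common way to strike this trade-off is the epsilon-greedy rule. The short Python sketch below is only an illustration and is not taken from the applications described in this paper; the function name, the reward estimates and the value of epsilon are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: try a random action, possibly a worse one.
        return rng.randrange(len(q_values))
    # Exploitation: take the action with the best current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)            # seeded so that the run is repeatable
q = [0.2, 0.8, 0.5]               # hypothetical reward estimates, three actions
greedy_choice = epsilon_greedy(q, epsilon=0.0, rng=rng)   # pure exploitation
choices = [epsilon_greedy(q, epsilon=0.3, rng=rng) for _ in range(1000)]
```

With epsilon set to 0 the rule is purely greedy; larger values spend more decisions on exploration.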

Life-Time Value: It is a measure widely used in 
marketing applications that offers the long-term result 
that has to be maximized. 

Reward: In RL terms, the immediate reward is 
the value returned by the environment to the agent 
depending on the taken action. The long-term reward 
is the sum of all the immediate rewards throughout a 
complete decision process. 

Sensitivity: A measure similar to specificity that gives
the ratio of positives that are correctly classified by
the model. (Refer to Specificity.)

Specificity: Success rate measure in a classification 
problem. If there are two classes (namely, positive and 
negative), specificity measures the ratio of negatives 
that are correctly classified by the model.

INTRODUCTION

Association Rule Mining (ARM) is one of the important
data mining tasks. It has been extensively researched
by the data-mining community and has found wide
applications in industry. An Association Rule is a pattern that
implies co-occurrence of events or items in a database.
Knowledge of such relationships in a database can be
employed in strategic decision making in both
commercial and scientific domains.

A typical application of ARM is market basket
analysis, where associations between the different items
are discovered to analyze the customer's buying habits.
The discovery of such associations can help to develop
better marketing strategies. ARM has been extensively
used in other applications such as spatial-temporal data,
health care, bioinformatics, web data, etc. (Hipp J., Güntzer U.,
Nakhaeizadeh G. 2000).

An association rule is an implication of the form
$X \to Y$ where $X$ and $Y$ are disjoint sets of
attributes/items. An association rule indicates that if a set
of items X occurs in a transaction record then the set of
items Y also occurs in the same record. X is called the
antecedent of the rule and Y is called the consequent of
the rule. Processing massive datasets for discovering
co-occurring items and generating interesting rules in
reasonable time is the objective of all ARM algorithms.
The task of discovering co-occurring sets of items
cannot be easily accomplished using SQL, as a little
reflection will reveal. Use of a 'Count' aggregate query
requires the condition to be specified in the where
clause, which finds the frequency of only one set of
items at a time. In order to find all sets of co-occurring
items in a database with n items, the number
of queries that need to be written is exponential in n.
This is the prime motivation for designing algorithms
for efficient discovery of co-occurring sets of items,
which are required to find the association rules.

In this article we focus on the algorithms for
association rule mining (ARM) and the scalability issues
in ARM. We assume that the reader is familiar with
the motivation and applications of association rule
mining.



BACKGROUND 

Let $I = \{i_1, i_2, \ldots, i_n\}$ denote a set of items and $D$ denote
a database of $N$ transactions. A typical transaction $T \in D$
may contain a subset $X$ of the entire set of items $I$ and
is associated with a unique identifier TID. An item-set
is a set of one or more items, i.e. $X$ is an item-set if
$X \subseteq I$. A $k$-item-set is an item-set of cardinality $k$. A
transaction is said to contain an item-set $X$ if $X \subseteq T$.
Support of an item-set $X$, also called Coverage, is the
fraction of transactions that contain $X$. It denotes the
probability that a transaction contains $X$.

$$\mathrm{Support}(X) = P(X) = \frac{\text{No. of transactions containing } X}{N}$$



An item-set having support greater than the
user-specified support threshold (ms) is known as a frequent
item-set.
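
The definition of support can be illustrated with a short Python sketch. The five-transaction database below is hypothetical: it is an assumption made for illustration, as are the function names. The threshold test uses >= rather than a strict inequality so that an item-set sitting exactly on the threshold counts as frequent; replace it with > for the strict reading.

```python
from fractions import Fraction

# Hypothetical database of five transactions (an assumption, not the
# article's own example table).
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def is_frequent(itemset, transactions, ms):
    """True when the support reaches the threshold ms (>= is an assumption)."""
    return support(itemset, transactions) >= ms

MS = Fraction(40, 100)
support_b = support({"B"}, TRANSACTIONS)                 # 4 of 5 transactions
a_frequent = is_frequent({"A"}, TRANSACTIONS, MS)        # 2/5 reaches 40%
ab_frequent = is_frequent({"A", "B"}, TRANSACTIONS, MS)  # 1/5 does not
```

Exact fractions are used instead of floating-point numbers so that comparisons against the threshold are not affected by rounding.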

An association rule is an implication of the form
$X \to Y$ [Support, Confidence], where $X \subset I$, $Y \subset I$ and
$X \cap Y = \emptyset$, and where Support and Confidence are rule
evaluation metrics. Support of a rule $X \to Y$ in $D$ is 'S' if S% of
transactions in $D$ contain $X \cup Y$. It is computed as:

$$\mathrm{Support}(X \to Y) = P(X \cup Y) = \frac{\text{No. of transactions containing } X \cup Y}{N}$$

Support indicates the prevalence of a rule. In a
typical market basket analysis application, rules with 
very low support values represent rare events and are 
likely to be uninteresting or unprofitable. Confidence 
of a rule measures its strength and provides an indica- 
tion of the reliability of prediction made by the rule. 
A rule $X \to Y$ has a confidence 'C' in $D$ if C% of
transactions in $D$ that contain $X$ also contain $Y$.
Confidence is computed as the conditional probability of
$Y$ occurring in a transaction, given that $X$ is present in the
same transaction, i.e.

$$\mathrm{Confidence}(X \to Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)} = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$



A rule generated from frequent item-sets is strong
if its confidence is greater than the user-specified
confidence threshold (mc). Fig. 1 shows an example
database of five transactions and the computation
of support and confidence of a rule.
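
The same computation can be sketched in Python. The database below is hypothetical: the original table of Fig. 1 is not reproduced in this text, so five transactions were chosen to agree with the values the figure quotes for the rule B -> D (support 3/5, confidence 3/4). The helper names are also assumptions.

```python
from fractions import Fraction

# Hypothetical stand-in for the five-transaction example database.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = count(set(antecedent) | set(consequent), transactions)
    sup = Fraction(both, len(transactions))
    conf = Fraction(both, count(antecedent, transactions))
    return sup, conf

sup_bd, conf_bd = rule_metrics({"B"}, {"D"}, TRANSACTIONS)
# ms = 40% and mc = 70%, as in the figure.
strong = sup_bd >= Fraction(40, 100) and conf_bd >= Fraction(70, 100)
```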

The objective of Association Rule Mining algo- 
rithms is to discover the set of strong rules from a given 
database as per the user specified ms and mc thresholds. 
Algorithms for ARM essentially perform two distinct 
tasks: (1) Discover frequent item-sets. (2) Generate 
strong rules from frequent item-sets. 

The first task requires counting of item-sets in 
the database and filtering against the user specified 
threshold (ms). The second task of generating rules 
from frequent item-sets is a straightforward process 
of generating subsets and checking for the strength. 
We describe below the general approaches for finding 
frequent item-sets in association rule mining algorithms. 
The second task is trivial as explained in the last sec- 
tion of the article. 



APPROACHES FOR GENERATING 
FREQUENT ITEM-SETS 

If we apply a brute force approach to discover frequent
item-sets, the algorithm needs to maintain counters for
all $2^n - 1$ item-sets. For large values of n, which are common
in the datasets being targeted for mining, maintaining
such a large number of counters is a daunting task. Even if
we assume availability of such large memory, indexing
of these counters also presents a challenge. Data mining
researchers have developed numerous algorithms for
efficient discovery of frequent item-sets.
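
The growth in the number of counters is easy to see in a small sketch. The Python code below enumerates every non-empty item-set over a hypothetical five-item database (the data and the function name are assumptions for illustration) and keeps one counter per item-set.

```python
from itertools import combinations

# Hypothetical database over n = 5 items.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def brute_force_counts(transactions):
    """One counter for each of the 2^n - 1 non-empty item-sets."""
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            # Every counter costs a full pass over the transactions here.
            counts[combo] = sum(1 for t in transactions if set(combo) <= t)
    return counts

counts = brute_force_counts(TRANSACTIONS)
n_counters = len(counts)          # 2**5 - 1 for five items
```

Five items already need 31 counters; each additional item doubles the number and adds one more, which is why this approach does not scale.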

The earlier algorithms for ARM discovered all
frequent item-sets. Later it was shown by three
independent groups of researchers (Pasquier N., Bastide
Y., Taouil R. & Lakhal L. 1999), (Zaki M.J. 2000),
(Stumme G., 1999) that it is sufficient to discover
frequent closed item-sets (FCI) instead of all frequent
item-sets (FI). FCI are the frequent item-sets whose support
is not equal to the support of any of their proper supersets.
FCI is a reduced, complete and lossless representation
of frequent item-sets. Since FCI are far fewer in
number than FI, the computational expense for ARM is
drastically reduced.
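
As an illustration of the reduction, the Python sketch below filters the frequent item-sets of a small hypothetical database down to the closed ones by applying the definition directly. The data, the minimum count of 2 (40% of five transactions) and the function names are assumptions; a real algorithm would avoid enumerating all item-sets first.

```python
from itertools import combinations

# Hypothetical database of five transactions.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def frequent_itemsets(transactions, min_count):
    """All item-sets contained in at least `min_count` transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = sum(1 for t in transactions if set(combo) <= t)
            if c >= min_count:
                result[frozenset(combo)] = c
    return result

def closed_only(frequent):
    """Keep item-sets with no proper superset of equal support."""
    return {
        x: c for x, c in frequent.items()
        if not any(x < y and c == cy for y, cy in frequent.items())
    }

fi = frequent_itemsets(TRANSACTIONS, min_count=2)
fci = closed_only(fi)
```

On this data 12 frequent item-sets reduce to 6 frequent closed item-sets, and the support of any dropped item-set can be read off its smallest closed superset.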

Figure 2 summarizes different approaches used for 
ARM. We briefly describe these approaches. 

Discovery of Frequent Item-Sets 

Level-Wise Approach 

Level-wise algorithms start with finding the item-sets of
cardinality one and gradually work up to the frequent
item-sets of higher cardinality. These algorithms use the
anti-monotonic property of frequent item-sets, according
to which no superset of an infrequent item-set can
be frequent.

Figure 1. Computation of support and confidence of a rule in an example database

Let ms = 40%, mc = 70%
Consider the association rule $B \to D$,
support($B \to D$) = 3/5 = 60%
confidence($B \to D$) = support($B \cup D$)/support($B$) = 3/4 = 75%
The rule $B \to D$ is a strong rule.

Figure 2. Approaches for ARM algorithms

Agarwal et al. (Agarwal, R., Imielinski T., & Swami
A. 1993), (Agarwal, R., & Swami A., 1994) proposed the
Apriori algorithm, which is the most popular iterative
algorithm in this category. It starts with finding the
frequent item-sets of size one and goes up level by
level, finding candidate item-sets of size k by joining
item-sets of size k-1. Two item-sets, each of size k-1,
join to form an item-set of size k if and only if they have
their first k-2 items in common. At each level the algorithm
prunes the candidate item-sets using the anti-monotonic
property and subsequently scans the database to find
the support of the pruned candidate item-sets. The process
continues as long as the set of frequent item-sets found at
the current level is non-empty. Since each iteration requires
a database scan, the maximum number of database scans
required is the same as the size of the maximal item-set.
Fig. 3 and Fig. 4 give the pseudo code of the Apriori
algorithm and a running example respectively.
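
The join, prune and count steps can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the database is hypothetical, item-sets are kept as sorted tuples, and the threshold is expressed as a minimum count (2 of 5 transactions for ms = 40%).

```python
from itertools import combinations

# Hypothetical database of five transactions.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def apriori(transactions, min_count):
    """Return {sorted item tuple: count} for all frequent item-sets."""
    counts = {}
    for t in transactions:                      # scan 1: single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = sorted(level)
        prev_set = set(prev)
        candidates = []
        for i, l1 in enumerate(prev):
            for l2 in prev[i + 1:]:
                # Join: same first k-2 items; sorting gives l1[-1] < l2[-1].
                if l1[:-1] == l2[:-1]:
                    cand = l1 + (l2[-1],)
                    # Prune: every (k-1)-subset must itself be frequent.
                    if all(s in prev_set for s in combinations(cand, k - 1)):
                        candidates.append(cand)
        level = {}
        for cand in candidates:                 # one database scan per level
            n = sum(1 for t in transactions if set(cand) <= t)
            if n >= min_count:
                level[cand] = n
        frequent.update(level)
        k += 1
    return frequent

frequent = apriori(TRANSACTIONS, min_count=2)
```

On this data the candidate BCE is pruned at level 3 because its subset CE is infrequent, and level 4 produces no candidates at all, so the loop stops.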

Two of the major bottlenecks in Apriori algorithm 
are i) number of passes and ii) number of candidates 
generated. The first is likely to cause I/O bottleneck 
and the second causes heavy load on memory and CPU 
usage. Researchers have proposed solutions to these 
problems with considerable success. Although detailed 
discussion of these solutions is beyond the scope of 
this article, a brief mention is necessary. 

Hash techniques reduce the number of candidates
by making a hash table and discarding a bucket if it has
support less than the ms. Thus at each level the memory
requirement is reduced because of the smaller candidate
set. The reduction is most significant at lower levels.
Maintaining a list of transaction ids for each candidate
set reduces the database access. The Dynamic Item-set
Counting algorithm reduces the number of scans by
counting candidate sets of different cardinality in a
single scan (Brin S., Motwani R., Ullman J.D., & Tsur
S. 1997). The Pincer Search algorithm uses a bi-directional
strategy to prune the candidate set from top (maximal)
and bottom (1-itemset) (Lin D. & Kedem Z.M. 1998).
Partitioning and Sampling strategies have also been
proposed to speed up the counting task. An excellent
comparison of the Apriori algorithm and its variants has
been given in (Hipp J., Güntzer U., Nakhaeizadeh G.
2000).
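
The idea of keeping transaction-id lists can be sketched as follows. The Python code is an illustrative assumption (hypothetical data, hypothetical function names): after one pass that records, for each item, the ids of the transactions containing it, the count of any item-set is obtained by intersecting lists rather than rescanning the database.

```python
# Hypothetical database; transaction ids run from 1 to 5.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def tid_lists(transactions):
    """Single scan: map each item to the set of ids of its transactions."""
    tids = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    return tids

def support_count(itemset, tids):
    """Count a non-empty item-set by intersecting the per-item id lists."""
    lists = [tids.get(item, set()) for item in itemset]
    return len(set.intersection(*lists))

tids = tid_lists(TRANSACTIONS)
count_bd = support_count(["B", "D"], tids)
```

The trade-off is memory: the id lists can be as large as the database itself when items are very common.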

Tree Based Algorithms 

Tree-based algorithms have been proposed to overcome
the problem of multiple database scans. These
algorithms compress the database (sometimes lossily) into a
tree data structure and reduce the number of database
scans appreciably. Subsequently the tree is used to mine
for the support of all frequent item-sets.

Set-Enumeration tree used in Max Miner algorithm 
(Bayardo R.J. 1998) orders the candidate sets while 
searching for maximal frequent item-sets. The data 
structure facilitates quick identification of long frequent 
item-sets based on the information gathered during each 
pass. The algorithm is particularly suitable for dense 
databases with maximal item-sets of high cardinality. 

Figure 3. Apriori algorithm

Input: Database D of N transactions
       ms, mc
Output: Set L of frequent item-sets

Procedure
1. scan the database and find L_1 = {frequent 1-item-sets}
2. k = 2
3. while L_{k-1} ≠ ∅
4.     C_k = gen_candidate(L_{k-1})
5.     L_k = prune(C_k, L_{k-1})
6.     k++
7. return L = ∪_k L_k

gen_candidate(L_{k-1})
1. for each l_1 ∈ L_{k-1}
2.     for each l_2 ∈ L_{k-1}
3.         if (l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ ... ∧ (l_1[k-2] = l_2[k-2]) ∧
              (l_1[k-1] < l_2[k-1]) then
4.             C_k = C_k ∪ { l_1[1] l_1[2] ... l_1[k-2] l_1[k-1] l_2[k-1] }

prune(C_k, L_{k-1})
// remove candidate item-sets having infrequent subsets
1. for each c ∈ C_k
2.     for each (k-1)-subset s of c
3.         if s ∉ L_{k-1} then
4.             remove c from C_k
5. for each c ∈ C_k
6.     scan the database to find support of c
7.     add c to L_k if support(c) > ms
8. return L_k

Han et al. (Han, J., Pei, J., & Yin, Y. 2000) proposed
the Frequent Pattern (FP)-growth algorithm, which
performs a database scan and finds frequent item-sets
of cardinality one. It arranges these frequent items in
a table (header) in the descending order of their supports.
During the second database scan, the algorithm
constructs an in-memory data structure called FP-Tree
by inserting each transaction after rearranging it in
descending order of the support. A node in the FP-Tree
stores a single attribute so that each path in the tree
represents and counts the corresponding record in the
database. A link from the header connects all the nodes
of an item. This structural information is used while
mining the FP-Tree. The FP-Growth algorithm recursively
generates sub-trees from FP-Trees corresponding to
each frequent item-set.
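
The two-scan construction of the FP-Tree can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the database is hypothetical, ties in support are broken alphabetically, and the header keeps a plain list of nodes per item in place of chained node links. The recursive mining step is not shown.

```python
from collections import Counter

# Hypothetical database of five transactions.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

class Node:
    """One FP-tree node: a single item and the number of records through it."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: support counts of single items; keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Header order: descending support, ties broken alphabetically.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}   # item -> all nodes holding it
    # Second scan: insert each transaction, reordered by header rank.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1                # shared prefixes are counted once
            node = child
    return root, header

root, header = build_fp_tree(TRANSACTIONS, min_count=2)
support_d = sum(n.count for n in header["D"])   # support read from the tree
```

Because transactions that share a prefix share a path, the tree is usually much smaller than the database, and item supports can be recovered by summing the counts along the header entries.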

Coenen et al. (Coenen F., Leng P., & Ahmed
S. 2004) proposed the Total Support Tree (T-Tree) and
Partial Support Tree (P-Tree) data structures, which
offer significant advantage in terms of storage and
execution. These data structures are compressed set
enumeration trees that are constructed after one scan
of the database and store all the item-sets as distinct
records in the database.

Discovery of Frequent Closed Item-Sets 

Level Wise Approach 

Pasquier et al. (Pasquier N., Bastide Y., Taouil R.
& Lakhal L. 1999) proposed the Close method to find
Frequent Closed Item-sets (FCI). This method finds
closures based on Galois closure operators and
computes the generators. The Galois closure operator $h(X)$ for
some $X \subseteq I$ is defined as the intersection of the transactions
in $D$ containing item-set $X$. An item-set $X$ is a closed
item-set if and only if $h(X) = X$. One of the smallest
arbitrarily chosen item-sets $p$ such that $h(p) = X$ is
known as a generator of $X$.
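
The closure operator can be written directly from this definition. The Python sketch below uses a hypothetical database and an assumed function name; it also adopts the usual convention that an item-set contained in no transaction has the full item set as its closure.

```python
# Hypothetical database of five transactions.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def closure(itemset, transactions):
    """h(X): intersection of all transactions that contain X."""
    itemset = frozenset(itemset)
    covering = [set(t) for t in transactions if itemset <= t]
    if not covering:
        # Convention (an assumption here): empty intersection = all items.
        return frozenset().union(*transactions)
    return frozenset(set.intersection(*covering))

h_d = closure({"D"}, TRANSACTIONS)        # D always occurs together with B
h_e = closure({"E"}, TRANSACTIONS)
is_bd_closed = closure({"B", "D"}, TRANSACTIONS) == frozenset({"B", "D"})
```

In this data {D} is not closed because its closure is {B, D}, and {E} is a generator of the closed item-set {B, D, E}.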

The Close method is based on the Apriori algorithm. It
starts from 1-item-sets, finds the closure based on the
Galois closure operator, and goes up level by level
computing generators and their closures (i.e. FCI) at each
level. At each level, candidate generator item-sets of
size k are found by joining generator item-sets of size
k-1 using the combinatorial procedure used in the Apriori
algorithm. The candidate generators are pruned using
two strategies: i) remove candidate generators that have
an infrequent subset; ii) remove a candidate
generator if the closure of one of its subsets is a superset of
the generator. Subsequently the algorithm finds the support
of the pruned candidate generators. Each iteration requires
one pass over the database to construct the set of FCI
and count their support.

Figure 4. Running example of the Apriori algorithm for finding frequent item-sets (ms = 40%). The per-iteration tables are not reproduced here; Iteration 4 yields no candidate item-sets.

Tree Based Approach 

Wang et al. (Wang J., Han J. & Pei J. 2003) proposed the
Closet+ algorithm to compute FCI and their supports
using the FP-tree structure. The algorithm is based on a
divide-and-conquer strategy and computes the local frequent
items of a certain prefix by building and scanning its
projected database.



Concept Lattice Based Approach 

Concept lattice is a core structure of Formal Concept
Analysis (FCA). FCA is a branch of mathematics based
on Concepts and Concept hierarchies. A Concept (A,B) is
defined as a pair of a set of objects A (known as the extent)
and a set of attributes B (known as the intent) such that the set
of all attributes belonging to extent A is the same as B
and the set of all objects containing attributes of intent
B is the same as A. In other words, no object other than
the objects of set A contains all attributes of B and no
attribute other than the attributes in set B is contained in all
objects of set A. Concept lattice is a complete lattice of
all Concepts. Stumme G. (1999) discovered that the intent
B of the Concept (A,B) represents the closed item-set,
which implies that all algorithms for finding Concepts
can be used to find closed item-sets. Kuznetsov S.O.,
& Obiedkov S.A. (2002) provide a comparison of the
performance of various algorithms for concepts. The
naive method to compute Concepts, proposed by Ganter,
is given in Exhibit A.

Exhibit A.

add extent {all transactions} to the list of extents
For each item i ∈ I
    for each set X in the list of extents
        find X ∩ {set of transactions containing i}
        include in the list of extents if not included earlier
    EndFor
EndFor

This method generates all the Concepts, i.e. all closed
item-sets. Closed item-sets generated using this method
in Example 1 are {A}, {B}, {C}, {A,B}, {A,C}, {B,D},
{B,C,D}, {B,D,E}, {B,C,D,E}. Frequent closed item-sets
are {A}, {B}, {C}, {B,D}, {B,C,D}, {B,D,E}.
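
The naive procedure of Exhibit A translates almost line for line into Python. In the sketch below the database is hypothetical: the table of Example 1 is not reproduced in this text, so five transactions were chosen that yield exactly the closed item-sets listed above. The function names are assumptions as well.

```python
# Hypothetical database; transaction ids run from 1 to 5.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def naive_extents(transactions):
    """Exhibit A: close the item extents under intersection."""
    all_tids = frozenset(range(1, len(transactions) + 1))
    items = sorted(set().union(*transactions))
    extents = [all_tids]                       # start with {all transactions}
    for i in items:
        with_i = frozenset(t for t in all_tids if i in transactions[t - 1])
        for x in list(extents):                # iterate over a snapshot
            new = x & with_i
            if new not in extents:
                extents.append(new)
    return extents

def intent(extent, transactions):
    """Items common to every transaction of the extent."""
    common = frozenset().union(*transactions)  # empty extent -> all items
    for t in extent:
        common &= transactions[t - 1]
    return common

extents = naive_extents(TRANSACTIONS)
concepts = [(e, intent(e, TRANSACTIONS)) for e in extents]
closed_itemsets = {b for _, b in concepts}
```

The result also contains the two extreme concepts, ({1,2,3,4,5}, {}) and ({}, {A,B,C,D,E}); the remaining nine intents are the closed item-sets quoted in the text.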

Concept lattice for frequent closed item-sets is 
given in Figure 5. 

Figure 5. Concept lattice (top concept ({1,2,3,4,5}, {}); bottom concept ({}, {A,B,C,D,E}))

Generating Association Rules

Once all frequent item-sets are known, association rules
can be generated in a straightforward manner by finding
all subsets of an item-set and testing the strength
(Han J., & Kamber M., 2006). The pseudo code for
this algorithm is given in Exhibit B.

Based on the above algorithm, strong rules generated
from frequent item-set BCD in Example 1 are:

$BC \to D$, conf = 100%
$CD \to B$, conf = 100%
where mc = 70%
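
The subset-and-test procedure of Exhibit B can be sketched in Python as follows. The database is hypothetical (the table of Example 1 is not reproduced in this text; the transactions were chosen so that item-set BCD yields the two rules quoted above), the function names are assumptions, and the confidence test uses >= as in the exhibit. Only proper subsets are used as antecedents, so that the consequent is never empty.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical stand-in for the Example 1 database.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]

def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from(itemset, transactions, mc):
    """Strong rules s -> (itemset - s) for a frequent item-set."""
    itemset = frozenset(itemset)
    whole = count(itemset, transactions)
    rules = []
    for k in range(1, len(itemset)):            # non-empty proper subsets
        for s in combinations(sorted(itemset), k):
            s = frozenset(s)
            conf = Fraction(whole, count(s, transactions))
            if conf >= mc:
                rules.append((s, itemset - s, conf))
    return rules

rules_bcd = rules_from("BCD", TRANSACTIONS, mc=Fraction(70, 100))
```

The antecedent count is never zero here because every subset of a frequent item-set occurs at least as often as the item-set itself.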

There are two ways to find association rules from 
frequent closed item-sets: 

i) compute frequent item-sets from FCI and then 

find the association rules 
ii) generate rules directly using FCI. 

The Close method uses the first approach, which
generates a lot of redundant rules, while the method proposed
by Zaki (Zaki M.J., 2000), (Zaki, M.J., & Hsiao C.J.,
2005) uses the second approach and derives rules
directly from the Concept lattice. The association rules
thus derived are non-redundant rules. For example, the set
of strong rules generated using the Close method in
Example 1 is {$BC \to D$, $CD \to B$, $D \to B$, $E \to B$, $E \to D$,
$E \to BD$, $BE \to D$, $DE \to B$}. For the same example, the set
of non-redundant strong rules generated using the Concept
Lattice approach is {$D \to B$, $E \to BD$, $BC \to D$, $CD \to B$}.
We can observe here that all rules can be derived
from the reduced non-redundant set of rules.
Exhibit B.

For each frequent item-set I
    generate all non-empty subsets of I
    For every non-empty subset s of I
        Output the rule s → (I − s) if support(I) / support(s) >= mc
    EndFor
EndFor

Scalability Issues in Association Rule Mining

Scalability issues in ARM have motivated the
development of incremental and parallel algorithms.
Incremental algorithms for ARM preserve the counts
of selected item-sets and reuse this knowledge later to
discover frequent item-sets from the augmented database.
The Fast Update algorithm (FUP) is the earliest algorithm
based on this idea. Later, different algorithms based
on sampling were presented (Hipp J., Güntzer U., &
Nakhaeizadeh G., 2000).

Parallel algorithms partition either the dataset for
counting or the set of counters across different
machines to achieve scalability (Hipp J., Güntzer U., &
Nakhaeizadeh G., 2000). Algorithms that partition
the dataset exchange counters, while algorithms
that partition the counters exchange datasets,
incurring high communication cost.
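
The first of these two strategies, partitioning the dataset and exchanging counters, can be illustrated with a single-process Python sketch. Everything here is an assumption made for illustration: the data, the split into two partitions, the candidate list and the function name. In a real deployment each partition would sit on its own machine and only the small counter tables would travel over the network.

```python
from collections import Counter

# Hypothetical database, split into two partitions ("machines").
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D", "E"},
    {"B", "D", "E"},
    {"B", "C", "D"},
]
PARTITIONS = [TRANSACTIONS[:3], TRANSACTIONS[3:]]

def local_counts(partition, candidates):
    """Count every candidate against one partition only."""
    counts = Counter()
    for t in partition:
        for cand in candidates:
            if cand <= t:
                counts[cand] += 1
    return counts

# All workers hold the same candidate item-sets.
candidates = [frozenset("BD"), frozenset("BC"), frozenset("AB")]

# "Exchange counters": the per-partition tables are summed into global counts.
global_counts = sum((local_counts(p, candidates) for p in PARTITIONS), Counter())
```

The same sum also hints at how incremental algorithms work: counts kept for the old data can be added to counts taken over newly arrived transactions.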



FUTURE TRENDS 

Discovery of Frequent Closed Item-sets (FCI) is a
major advance in ARM algorithms. With the current growth
rate of databases and the increasing use of ARM
in various scientific and commercial domains, we
envisage tremendous scope for research in parallel,
incremental and distributed algorithms for FCI. Use of
a lattice structure for FCI offers promise of scalability.
Online mining on streaming datasets using the FCI approach
is an interesting direction to work on.



CONCLUSION 

The article presents the basic approach for Association 
Rule Mining, focusing on some common algorithms 
for finding frequent item-sets and frequent closed 
item-sets. Various approaches have been discussed to 
find such item-sets. Formal Concept Analysis approach 
for finding frequent closed item-sets is also discussed. 
Generation of rules from frequent item-sets and
frequent closed item-sets is briefly discussed. The article
addresses the scalability issues involved in various
algorithms.

KEY TERMS

Association Rule: An association rule is an
implication of the form $X \to Y$ where $X \subset I$, $Y \subset I$ and
$X \cap Y = \emptyset$; $I$ denotes the set of items.

Data Mining: Extraction of interesting, non-trivial, 
implicit, previously unknown and potentially useful 
information or patterns from data in large databases. 

Formal Concept: A formal context $K = (G, M, I)$
consists of two sets $G$ (objects) and $M$ (attributes)
and a relation $I$ between $G$ and $M$. For a set $A \subseteq G$ of
objects we define

$A' = \{m \in M \mid gIm \text{ for all } g \in A\}$ (the set of all attributes
common to the objects in A). Correspondingly, for a
set $B$ of attributes we define

$B' = \{g \in G \mid gIm \text{ for all } m \in B\}$ (the set of objects
common to the attributes in B).

A formal concept of the context $(G, M, I)$ is a pair $(A, B)$
with $A \subseteq G$, $B \subseteq M$, $A' = B$ and $B' = A$.
$A$ is called the extent and $B$ is the intent of the concept
$(A, B)$.



Frequent Closed Item-Set: An item-set $X$ is a closed
item-set if there exists no item-set $X'$ such that:

i. $X'$ is a proper superset of $X$,
ii. Every transaction containing $X$ also contains
$X'$.

A closed item-set $X$ is frequent if its support exceeds
the given support threshold.

Galois Connection: Let $D = (\mathcal{O}, \mathcal{I}, R)$ be a data
mining context where $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects
(transactions) and items respectively. $R \subseteq \mathcal{O} \times \mathcal{I}$ is a
binary relation between objects and items. For $O \subseteq \mathcal{O}$
and $I \subseteq \mathcal{I}$, we define $f$ and $g$ as shown in Exhibit C.

Exhibit C.

$$f : 2^{\mathcal{O}} \to 2^{\mathcal{I}}, \qquad f(O) = \{\, i \in \mathcal{I} \mid \forall o \in O,\ (o, i) \in R \,\}$$

$$g : 2^{\mathcal{I}} \to 2^{\mathcal{O}}, \qquad g(I) = \{\, o \in \mathcal{O} \mid \forall i \in I,\ (o, i) \in R \,\}$$

$f(O)$ associates with $O$ the items common to all
objects $o \in O$, and $g(I)$ associates with $I$ the objects
related to all items $i \in I$. The couple of applications
$(f, g)$ is a Galois connection between the power set of
$\mathcal{O}$ (i.e. $2^{\mathcal{O}}$) and the power set of $\mathcal{I}$ (i.e. $2^{\mathcal{I}}$).

The operators $h = f \circ g$ in $2^{\mathcal{I}}$ and $h' = g \circ f$ in $2^{\mathcal{O}}$ are
Galois closure operators. An item-set $C \subseteq \mathcal{I}$ from $D$ is
a closed item-set iff $h(C) = C$.

Generator Item-Set: A generator $p$ of a closed
item-set $c$ is one of the smallest item-sets such that
$h(p) = c$.

Non-Redundant Association Rules: Let $R_i$ denote
the rule $X_1^i \to X_2^i$, where $X_1^i, X_2^i \subseteq I$. Rule $R_1$ is more general
than rule $R_2$ provided $R_2$ can be generated by adding
additional items to either the antecedent or consequent
of $R_1$. Rules having the same support and confidence as
more general rules are the redundant association rules.
The remaining rules are non-redundant rules.

INTRODUCTION

In recent years much research and development effort
has been directed towards the broad field of ambient
intelligence (AmI), and this trend is set to continue for
the foreseeable future. AmI aims at seamlessly integrating
services within smart infrastructures to be used at
home, at work, in the car, on the move, and generally in
most environments inhabited by people. It is a relatively
new paradigm rooted in ubiquitous computing, which
calls for the integration and convergence of multiple
disciplines, such as sensor networks, portable devices,
intelligent systems, human-computer and social interactions,
as well as many techniques within artificial intelligence,
such as planning, contextual reasoning, speech
recognition, language translation, learning, adaptability,
and temporal and hypothetical reasoning.

The term AmI was coined by the European Commission,
when in 2001 one of its Programme Advisory
Groups launched the AmI challenge (Ducatel et al.,
2001), later updated in 2003 (Ducatel et al., 2003). But
although the term AmI originated in Europe, the goals
of the work have been adopted worldwide; see for
example (The Aware Home, 2007), (The Oxygen Project,
2007), and (The Sony Interaction Lab, 2007).

The foundations of AmI infrastructures are based on
the impressive progress we are witnessing in wireless
technologies, sensor networks, display capabilities,
processing speeds and mobile services. These
developments help provide much useful (raw) information
for AmI applications. Further progress is needed in
taking full advantage of such information in order
to provide the degree of intelligence, flexibility and
naturalness envisaged. This is where artificial
intelligence and multi-agent techniques have important
roles to play.

In this paper we will review the progress that has
been made in intelligent systems, discuss the role of
artificial intelligence and agent technologies and focus
on the application of AmI for independent living.



BACKGROUND 

Ambient intelligence is a vision of the information 
society where normal working and living environments 
are surrounded by embedded intelligent devices that 
can merge unobtrusively into the background and 
work through intuitive interfaces. Such devices, each 
specialised in one or more capabilities, are intended to 
work together within an infrastructure of intelligent 
systems, to provide a multitude of services aimed at 
generally improving safety and security and improving 
quality of life in ordinary living, travelling and work- 
ing environments. 

The European Commission identified four AmI
scenarios (Ducatel et al. 2001, 2003) in order to stimulate
imagination and initiate and structure research in this
area. We summarise two of these to provide the flavour
of AmI visions.

AmI Scenarios:

1. Dimitrios is taking a coffee break and prefers not to
be disturbed. He is wearing on his clothes or body a
voice-activated digital avatar of himself, known as
Digital Me (D-Me). D-Me is both a learning device,
learning about Dimitrios and his environment, and
an acting device offering communication, processing
and decision-making functionalities. During
the coffee break D-Me answers the incoming calls
and emails of Dimitrios. It does so smoothly in
the necessary languages, with a reproduction of
Dimitrios' voice and accent. Then D-Me receives
a call from Dimitrios' wife, recognises its urgency
and passes it on to Dimitrios. At the same time it
catches a message from an older person's D-Me,
located nearby. This person has left home without
his medication and would like to find out where
to access similar drugs. He has asked his D-Me,
in natural language, to investigate this. Dimitrios
happens to suffer from a similar health problem
and uses the same drugs. His D-Me processes the
incoming request for information, and decides
neither to reveal Dimitrios' identity nor offer direct
help, but to provide the elderly person's D-Me with
a list of the closest medicine shops and potential
contact with a self-help group.
2. Carmen plans her journey to work. She asks AmI, by
voice command, to find her someone with whom
she can share a lift to work in half an hour. She then
plans the dinner party she is to give that evening.
She wishes to bake a cake, and her e-fridge flashes
a recipe on the e-fridge screen and highlights the
ingredients that are missing. Carmen completes
her shopping list on the screen and asks for it to
be delivered to the nearest distribution point in her
neighbourhood. All goods are smart tagged, so she
can check the progress of her virtual shopping from
any enabled device anywhere, and make alterations.
Carmen makes her journey to work in a car with
dynamic traffic guidance facilities and traffic
systems that dynamically adjust speed limits depending
on congestion and pollution levels. When she
returns home the AmI welcomes her and suggests
that on the next day she should telework, as a big
demonstration is planned downtown.



The demands that drive AmI and provide opportunities
are for improvement of safety and quality of life,
enhancements of productivity and quality of products
and services, including public services such as hospitals,
schools, military and police, and industrial innovation.
AmI is intended to facilitate human contact and
community and cultural enhancement, and ultimately it
should inspire trust and confidence.

Some of the technologies required for AmI are
summarised in Figure 1.

AmI work builds on ubiquitous computing and sensor
network and mobile technologies. To provide the
intelligence and naturalness required, it is our view that
significant contributions can come from advances in
artificial intelligence and agent technologies. Artificial
intelligence has a long history of research on planning,
scheduling, temporal reasoning, fault diagnosis,
hypothetical reasoning, and reasoning with incomplete
and uncertain information. All of these are techniques
that can contribute to AmI, where actions and decisions
have to be taken in real time, often with dynamic and
uncertain knowledge about the environment and the
user. Agent technology research has concentrated on
agent architectures that combine several, often cognitive,
capabilities, including reactivity and adaptability,
as well as the formation of agent societies through
communication, norms and protocols.

Recent work has attempted to exploit these
techniques for AmI. In (Augusto and Nugent 2004) the
use of temporal reasoning combined with active
databases is explored in the context of smart homes. In
(Sadri 2007) the use of temporal reasoning together
with agents is explored to deal with similar scenarios,
where information observed in a home environment is
evaluated, deviations from normal behaviour and risky
situations are recognised and compensating actions are
recommended.

Figure 1. Components of Ambient Intelligence

The relationship of AmI to cognitive agents is
motivated by (Stathis and Toni 2004), who argue that
computational logic elevates the level of the system to
that of a user. They advocate the KGP agent model
(Kakas et al 2004) to investigate how to assist a
traveller to act independently and safely in an unknown
environment using a personal communicator. (Augusto
et al 2006) address the process of taking decisions in
the presence of conflicting options. (Li and Ji 2005)
offer a new probabilistic framework based on Bayesian
Networks for dealing with ambiguous and uncertain
sensory observations and users' changing states, in
order to provide correct assistance.

(Amigoni et al 2005) address the goal-oriented as- 
pect of Ami applications, and in particular the planning 
problem within Ami. They conclude that a combination
of centralised and distributed planning capabilities is
required, due to the distributed nature of Ami and the
participation of heterogeneous agents with different
capabilities. They offer an approach based on
Hierarchical Task Networks, taking the perspective of a
multi-agent paradigm for Ami.

The paradigm of embedded agents for Ami en- 
vironments with a focus on developing learning and 
adaptation techniques for the agents is discussed in 
(Hagras et al 2004, and Hagras and Callaghan 2005). 
Each agent is equipped with sensors and effectors and 
uses a learning system based on fuzzy logic. A real Ami 
environment in the form of an "intelligent dormitory" 
is used for experimentation. 

Privacy and security in the context of Ami appli- 
cations at home, at work, and in the health, shopping 
and mobility domains are discussed in (Friedewald et 
al 2007). For such applications they consider security 
threats such as surveillance of users, identity theft and 
malicious attacks, as well as the potential of the digital 
divide amongst communities and social pressures. 



AMBIENT INTELLIGENCE FOR 
INDEPENDENT LIVING 

One major use of Ami is to support services for in- 
dependent living, to prolong the time people can 
live decently in their own homes by increasing their 
autonomy and self-confidence. This may involve the 
elimination of monotonous everyday activities, moni- 
toring and caring for the elderly, provision of security, 
or saving resources. The aim of such Ami applications 
is to help: 

• maintain safety of a person by monitoring his
  environment and recognizing and anticipating risks,
  and taking appropriate actions,
• provide assistance in daily activities and
  requirements, for example, by reminding and advising
  about medication and nutrition, and
• improve quality of life, for example by providing
  personalized information about entertainment and
  social activities.

This area has attracted a great deal of attention in 
recent years, because of increased longevity and the 
aging population in many parts of the world. For such 
an Ami system to be useful and accepted it needs to be 
versatile, adaptable, capable of dealing with changing 
environments and situations, transparent and easy, and 
even pleasant, to interact with. 

We believe that it would be promising to explore 
an approach based on providing an agent architec- 
ture consisting of a society of heterogeneous, intel- 
ligent, embedded agents, each specialised in one or 
more functionalities. The agents should be capable 
of sharing information through communication, and 
their dialogues and behaviour should be governed by 
context-dependent and dynamic norms. 

The basic capabilities for intelligent agents in- 
clude: 

• Sensing: to allow the agent to observe the
  environment
• Reactivity: to provide context-dependent dynamic
  behaviour and the ability to adapt to changes in
  the environment
• Planning: to provide goal-directed behaviour
• Goal Decision: to allow dynamic decisions about
  which goals have higher priorities
• Action execution: to allow the agent to affect the
  environment.

All of these functionalities also require reasoning 
about spatio-temporal constraints reflecting the envi- 
ronment in which an Ami system operates. 
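
As a rough illustration of how these capabilities fit together, the following minimal sketch runs one sense, react, decide and act cycle per observation. It is not the KGP model or any published architecture: the class names, the smoke rule, the priorities and the action names are all assumptions made for the example, and planning is reduced to a plan stored with each goal.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Goal:
    name: str
    priority: int        # higher value = more urgent (illustrative scale)
    plan: list           # ordered action names assumed to achieve the goal


@dataclass
class Agent:
    beliefs: dict = field(default_factory=dict)
    goals: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def sense(self, observation: dict) -> None:
        """Sensing: merge an observation of the environment into the beliefs."""
        self.beliefs.update(observation)

    def react(self) -> None:
        """Reactivity: a context-dependent rule adds a goal when needed."""
        pending = any(g.name == "handle_smoke" for g in self.goals)
        if self.beliefs.get("smoke") and not pending:
            # Hypothetical rule and plan, for illustration only.
            self.goals.append(Goal("handle_smoke", 10, ["sound_alarm", "call_help"]))

    def decide(self) -> Optional[Goal]:
        """Goal decision: choose the pending goal with the highest priority."""
        return max(self.goals, key=lambda g: g.priority, default=None)

    def act(self, goal: Goal) -> None:
        """Action execution: carry out the plan (here, only record it)."""
        self.log.extend(goal.plan)
        self.goals.remove(goal)

    def step(self, observation: dict) -> None:
        """One cycle: sense, react, decide, act."""
        self.sense(observation)
        self.react()
        goal = self.decide()
        if goal is not None:
            self.act(goal)


agent = Agent(goals=[Goal("remind_medication", 3, ["announce_reminder"])])
agent.step({"smoke": True})    # the reactive goal outranks the reminder
agent.step({"smoke": False})   # the reminder is handled on the next cycle
print(agent.log)
```

In this sketch the reactive safety goal pre-empts the routine reminder because of its higher priority; a real system would also need the spatio-temporal reasoning mentioned above, which is left out here.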

Most of these functionalities have been integrated 
in the KGP model (Kakas et al, 2004), whose archi- 
tecture is shown in Figure 2 and implemented in the 
PROSOCS system (Bracciali et al, 2006). The use of 
reactivity for communication and dialogue policies 
has also been discussed in, for example, (Sadri et al, 
2003). The inclusion of normative behaviour has been 
discussed in (Sadri et al, 2006) where we also consider 
how to choose amongst different types of goals, depend- 
ing on the governing norms. For a general discussion 
on the importance of norms in artificial societies see 
(Pitt, 2005). 

KGP agents are situated in the environment via 
their physical capabilities. Information received from 
the environment (including other agents) updates the 
agent's state and provides input to its dynamic cycle
theory, which, in turn, determines the next steps in terms 
of its transitions, using its reasoning capabilities. 



Figure 2. The architecture of a KGP agent



FUTURE TRENDS

Like most other information and communication
technologies, Ami is not likely to be good or bad on its
own, but its value will be judged by the different
ways the technology will be used to improve people's
lives. In this section we discuss new opportunities and
challenges for the integration of Ami with what people
do in ordinary settings. We abstract away from hardware
trends and we focus on areas that are software related
and are likely to play an important role in the adoption
of Ami technologies.

A focal point is the observation that people discover
and understand the world through visual and
conversational interactions. As a result, in the coming
years we expect the design of Ami systems to focus on
ways of allowing humans to interact naturally, using
their common skills such as speaking, gesturing and
glancing. This kind of natural interaction (Leibe et al
2000) will complement existing interfaces and will
require that Ami systems be capable of representing
virtual objects, possibly in 3D, as well as capturing
people's movements in the environment and identifying
which of these movements are directed to virtual objects.

We also expect to see new research directed towards 
processing of sensor data with different information 
(Massaro and Friedman 1990) and different kinds of
formats such as audio, video, and RFID. Efficient 
techniques to index, search, and structure these data 
and ways to transform them to the higher-level semantic 
information required by cognitive agents will be an 
important area for future work. Similarly, the reverse 
of this process is likely to be of equal importance, 
namely, how to translate high-level information to 
the lower-level signals required by actuators that are 
situated in the environment. 

Given that sensors and actuators will provide the 
link with the physical environment, we also anticipate 
further research to address the general linking of Ami 
systems to already existing computing infrastructures 
such as the semantic web. This work will create hybrid 
environments that will need to combine useful informa- 
tion from existing wired technologies with information 
from wireless ones (Stathis et al 2007). To enable the 
creation of such environments we imagine the need 
to build new frameworks and middleware to facilitate 
integration of heterogeneous Ami systems and make 
the interoperation more flexible. 

Another important issue is how the human experi- 
ence in Ami will be managed in a way that will be as 
unobtrusive as possible. In this we foresee that develop- 
ments in cognitive systems will play a very important 
role. Although there will be many areas of cognitive
system behaviour that will need to be addressed, we
anticipate that the development of agent models that
adapt and learn (Sutton and Barto 1998) will be of great
importance. The challenge here will be how to integrate
the output of these adaptive and learning capabilities
into the reasoning and decision processes of the agent.
The resulting cognitive behaviour must differentiate 
between newly learned concepts and existing ones, as 
well as discriminate between normal behaviour and 
exceptions. 

We expect that Ami will emerge with the formation 
of user communities who live and work in a particular 
locality (Stathis et al 2006). The issue then becomes 
how to manage all the information that is provided and 
captured as the system evolves. We foresee research to 
address issues such as semantic annotations of content, 
and partitioning and ownership of information. 

Linking in local communities with smart homes, 
e-healthcare, mobile commerce, and transportation 
systems will eventually give rise to a global Ami sys- 
tem. For applications in such a system to be embraced 
by people we will need to see specific human factors 
studies to decide how unobtrusive, acceptable and 
desirable the actions of the Ami environment seem 
to people who use them. Some human factors studies 
should focus on issues of presentation of objects and 
agents in a 3D setting, as well as on the important is- 
sues of privacy, trust and security. 

To make possible the customization of system in- 
teractions to different classes of users, it is required to 
acquire and store information about these users. Thus 
for people to trust Ami interactions in the future we 
must ensure that the omnipresent intelligent environment
maintains privacy in an ethical manner. Ethical
or, better, normative behaviour can be ensured not only
at the cognitive level (Sadri et al 2006), but also at the
lower, implementation level of the Ami platform. In
this context, ensuring that communicated information 
is encrypted, certified, and follows transparent security 
policies will be required to build systems less vulner- 
able to malicious attacks. Finally, we also envisage 
changes to business models that would characterise 
Ami interactions (Hax and Wielde 2001). 



CONCLUSION 

The successful adoption of Ami is predicated on the 
suitable combination of ubiquitous computing, artificial 
intelligence and agent technologies. A useful class of
applications that can test such a combination is Ami
supporting independent living. For such applications
we have identified the trends that are likely to play an
important role in the future.



TERMS AND DEFINITIONS

Artificial Societies: Complex systems consisting
of a, possibly large, set of agents whose interactions
are constrained by norms and by the roles the agents
are responsible for playing.

Cognitive Agents: Software agents endowed with 
high-level mental attitudes, such as beliefs, goals and 
plans. 

Context Awareness: Refers to the idea that computers
can both sense and react according to the state of
the environment in which they are situated. Devices may have
information about the circumstances under which they 
are able to operate and react accordingly. 

Natural Interaction: The investigation of the re- 
lationships between humans and machines aiming to 
create interactive artifacts that respect and exploit the 
natural dynamics through which people communicate 
and discover the real world. 

Smart Homes: Homes equipped with intelligent
sensors and devices within a communications
infrastructure that allows the various systems and devices
to communicate with each other for monitoring and
maintenance purposes.

Ubiquitous Computing: A model of human-computer
interaction in which information processing is
integrated into everyday objects and activities. Unlike
the desktop paradigm, in which a single user chooses to
interact with a single device for a specialized purpose,
with ubiquitous computing a user interacts with many
computational devices and systems simultaneously, in
the course of ordinary activities, and may not
necessarily even be aware of doing so.

Wireless Sensor Networks: Wireless networks
consisting of spatially distributed autonomous devices
using sensors to cooperatively monitor physical or
environmental conditions, such as temperature, sound,
vibration, pressure, motion or pollutants, at different
locations.



INTRODUCTION

The trend towards hardware cost reduction and
miniaturization allows computing devices to be included
in many objects and environments (embedded
systems). Ambient Intelligence (Ami) deals with a new
world where computing devices are spread everywhere
(ubiquity), allowing the human being to interact in
physical world environments in an intelligent and
unobtrusive way. These environments should be aware
of the needs of people, customizing requirements and
forecasting behaviours.

Ami environments are very diverse, and include
homes, offices, meeting rooms, schools, hospitals,
control centres, transport, tourist attractions, stores,
sports facilities, and music devices.

Ambient Intelligence involves many different
disciplines, like automation (sensors, control, and actuators),
human-machine interaction and computer graphics,
communication, ubiquitous computing, embedded
systems, and, obviously, Artificial Intelligence. On the
Artificial Intelligence side, research aims to include
more intelligence in Ami environments, giving better
support to the human being and access to the knowledge
that is essential for making better decisions when
interacting with these environments.



BACKGROUND 

Ambient Intelligence (Ami) is a concept developed
by the European Commission's IST Advisory Group
ISTAG (ISTAG, 2001)(ISTAG, 2002). ISTAG believes
that it is necessary to take a holistic view of Ambient
Intelligence, considering not just the technology, but
the whole of the innovation supply-chain from science
to end-user, and also the various features of the
academic, industrial and administrative environment
that facilitate or hinder realisation of the Ami vision
(ISTAG, 2003). Due to the large number of technologies
involved in the Ambient Intelligence concept we
may find several works that appeared even before the
ISTAG vision pointing in the direction of Ambient
Intelligence trends.

As far as Artificial Intelligence (AI) is concerned,
Ambient Intelligence is a new and meaningful step in the
evolution of AI (Ramos, 2007). AI has walked side by side
with the evolution of Computer Science and Engineering.
The building of the first artificial neural models and
hardware, with the work of Walter Pitts and Warren
McCulloch (Pitts & McCulloch, 1943) and the SNARC
system of Marvin Minsky and Dean Edmonds, corresponds
to the first step. Computer-based Intelligent Systems, like
the MYCIN Expert System (Shortliffe, 1976), and
network-based Intelligent Systems, like AUTHORIZER'S
ASSISTANT (Rothi, 1990), used by American Express for
authorizing transactions by consulting several databases,
are the kind of systems of the second step of AI. From
the 1980s, Intelligent Agents and Multi-Agent Systems
have established the third step, leading more recently
to Ontologies and the Semantic Web. From hardware to
the computer, from the computer to the local network,
from the local network to the Internet, and from the
Internet to the Web, Artificial Intelligence has been at
the state of the art of computing, most of the time a
little ahead of the limits of the technology.

Now the centre is no longer in the hardware, or in
the computer, or even in the network. Intelligence must
be provided to the environments we use daily. We are
aware of the push in the direction of Intelligent Homes,
Intelligent Vehicles, Intelligent Transportation Systems,
Intelligent Manufacturing Systems, even Intelligent
Cities. This is the reason why the Ambient Intelligence
concept is so important nowadays (Ramos, 2007).

Ambient Intelligence is not possible without Artificial
Intelligence. On the other hand, AI researchers must
be aware of the need to integrate their techniques with
other scientific communities' techniques (e.g. Automation,
Computer Graphics, Communications). Ambient
Intelligence is a tremendous challenge, needing the
best effort of different scientific communities.

There is a miscellany of concepts and technologies
related to Ambient Intelligence. Ubiquitous
Computing, Pervasive Computing, Embedded Systems,
and Context Awareness are the most common.
However, these concepts are different from Ambient
Intelligence.

The concept of Ubiquitous Computing (UbiComp) 
was introduced by Mark Weiser during his tenure as 
Chief Technologist of the Palo Alto Research Center 
(PARC) (Weiser, 1991). Ubiquitous Computing means 
that we have access to computing devices anywhere in 
an integrated and coherent way. Ubiquitous Computing
was mainly driven by the communications and computing
devices communities, but it now involves other
research areas. Ambient Intelligence differs
from Ubiquitous Computing because sometimes the
environment where Ambient Intelligence is considered
is simply local. Another difference is that Ambient
Intelligence places more emphasis on intelligence
than Ubiquitous Computing does. However, ubiquity is a
real need today and Ambient Intelligence systems are
taking this feature into account.

A concept that is sometimes seen as a synonym
of Ubiquitous Computing is Pervasive Computing.
According to Teresa Dillon, Ubiquitous Computing
is best considered as the underlying framework, the
embedded systems, networks and displays which are
invisible and everywhere, allowing us to 'plug-and-play'
devices and tools. On the other hand, Pervasive
Computing is related to all the physical parts of
our lives: the mobile phone, hand-held computer or smart
jacket (Dillon, 2006).

Embedded Systems means that electronic and
computing devices are embedded in common objects or
goods. Today goods like cars are equipped with
microprocessors; the same is true for washing machines,
refrigerators, and toys. The Embedded Systems field
is driven more by the electronics and automation
communities. Current efforts aim to include
electronic and computing devices in the most usual
and simple objects we use, like furniture or mirrors.
Ambient Intelligence differs from Embedded Systems
since computing devices may be clearly visible in Ami
scenarios. However, there is a clear trend to involve
more embedded systems in Ambient Intelligence.

Context Awareness means that the system is aware
of the current situation we are dealing with. An example
is the automatic detection of the current situation in a
Control Centre. Are we in the presence of a normal
situation, or are we dealing with a critical situation, or
even an emergency? In this Control Centre the intelligent
alarm processor will exhibit different outputs according
to the identified situation (Vale, Moura, Fernandes,
Marques, Rosado, Ramos, 1997). The automobile industry
is also investing in Context Aware systems, like
near-accident detection. The Human-Computer Interaction
community is paying a lot of attention to Context
Awareness. Context Awareness is one of the most desired
concepts to include in Ambient Intelligence, since
identifying the context is important for deciding how to
act in an intelligent way.
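
The control-centre example can be sketched in a few lines of code. The sketch below is only an illustration of the idea of context-dependent output and is not taken from the cited system: the indicators, the threshold of 20 alarms and the message formats are assumptions chosen for the example.

```python
from enum import Enum


class Situation(Enum):
    NORMAL = "normal"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


def classify(active_alarms: int, load_lost: bool) -> Situation:
    """Identify the context from two illustrative indicators.

    The threshold of 20 alarms is an arbitrary value for the example.
    """
    if load_lost:
        return Situation.EMERGENCY
    if active_alarms > 20:
        return Situation.CRITICAL
    return Situation.NORMAL


def present(alarms: list, situation: Situation) -> list:
    """Adapt the alarm processor's output to the identified situation."""
    if situation is Situation.NORMAL:
        return list(alarms)              # show every message as received
    if situation is Situation.CRITICAL:
        return sorted(set(alarms))       # collapse repeated messages
    # Emergency: a single summary line instead of a flood of messages.
    return [f"EMERGENCY: {len(alarms)} alarms received"]


alarms = ["breaker B2 open", "line L1 overload", "breaker B2 open"]
for situation in Situation:
    print(situation.value, present(alarms, situation))
```

The same list of alarms therefore produces three different presentations, depending only on the context that was identified.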

There are different views of the importance of other
concepts and technologies in the Ambient Intelligence
field. Usually these differences derive from the
home scientific community of the authors. ISTAG sees
the technology research requirements from different
points of view (Components, Integration, System, and
User/Person). In (ISTAG, 2003) the following ambient
components are mentioned: smart materials; MEMS
and sensor technologies; embedded systems; ubiquitous
communications; I/O device technology; adaptive
software. In the same document ISTAG lists the following
intelligence components: media management and
handling; natural interaction; computational intelligence;
context awareness; and emotional computing.

Recently Ambient Intelligence has been receiving
significant attention from the Artificial Intelligence
community. We may refer to the Ambient Intelligence
Workshops organized by Juan Augusto and Daniel Shapiro
at ECAI 2006 (European Conference on Artificial
Intelligence) and IJCAI 2007 (International Joint
Conference on Artificial Intelligence) and the Special
Issue on Ambient Intelligence, coordinated by Carlos
Ramos, Juan Augusto and Daniel Shapiro, to appear
in the March/April 2008 issue of the IEEE Intelligent
Systems magazine.



AMBIENT INTELLIGENCE PROTOTYPES
AND SYSTEMS

Here we will analyse some examples of Ambient
Intelligence prototypes and systems, divided by the area of
application.

Ami at Home

Domotics is a consolidated area of activity. After the
first experiences with Domotics in homes there was a
trend to refer to the Intelligent Home concept. However,
Domotics is too centred on automation, giving the
user the capability to control the household devices from
anywhere. We are still far from real Ambient
Intelligence in homes, at least at the commercial level.
In (Wichert, Hellschimidt, 2006) there is an interesting
example from the EMBASSI project: with a gesture a
woman commands the TV to be brighter, but the TV is
already at its brightest level, so the lights are dimmed
and the windows close, showing an example of context
awareness in the environment.
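
The behaviour in this example, serving the user's underlying goal when the direct command cannot be carried out, can be sketched as follows. This is a hypothetical illustration, not code from EMBASSI: the `Room` fields, the percentage scales and the step of 10 are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Room:
    tv_brightness: int = 100     # percent; 100 is assumed to be the maximum
    light_level: int = 80        # percent
    windows_closed: bool = False


def make_picture_brighter(room: Room, step: int = 10) -> str:
    """Serve the goal 'the picture should look brighter' by the means left."""
    if room.tv_brightness < 100:
        room.tv_brightness = min(100, room.tv_brightness + step)
        return "raised TV brightness"
    # The TV cannot go any brighter, so reduce the ambient light instead.
    room.light_level = max(0, room.light_level - step)
    room.windows_closed = True
    return "dimmed lights and closed windows"


room = Room()
print(make_picture_brighter(room), room)
```

The point of the sketch is that the system reasons about the state of the whole environment, not only about the device the gesture was addressed to.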

Several organizations are doing experiments to 
achieve the Intelligent Home concept. Some examples 
are HomeLab from Philips, MIT House_n, Georgia 
Tech Aware Home, Microsoft Concept Home, and e2 
Home from Electrolux and Ericsson. 

Ami in Vehicles and Transports 

Since the first experiences with NAVLAB 1 (Thorpe,
Herbert, Kanade, Shafer, 1988) Carnegie Mellon
University has developed several prototypes for
Autonomous Vehicle Driving and Assistance. The latest one,
NAVLAB 11, is an autonomous Jeep. Most car
manufacturers are doing research in the area of
Intelligent Vehicles for tasks such as parking
assistance or pre-collision detection.

Another example of Ami application is related
to Transports, namely in connection with Intelligent
Transportation Systems (ITS). The ITS Joint Program of
the US Department of Transportation identified several
areas of application, namely: arterial management;
freeway management; transit management; incident
management; emergency management; electronic
payment; traveller information; information management;
crash prevention and safety; roadway operations and
management; road weather management; commercial
vehicle operations; and intermodal freight. Ambient
Intelligence can be used in all these application areas.



Ami in Elderly and Health Care

Several studies point to the aging of the population
over the next decades. While this is a welcome result of
increasing life expectancy, it also implies some
problems. The percentage of the population with health
problems will increase and it will be very difficult for
hospitals to care for all patients. Our society is faced
with the responsibility to care for these people in the
best possible social and economic ways. So, there is
a clear interest in creating Ambient Intelligence devices
and environments that allow patients to be followed
in their own homes or during their day-to-day life.

The medical control support devices may be embedded
in clothes, like T-shirts, collecting vital-sign
information from sensors (e.g. blood pressure,
temperature). Patients will be monitored remotely.
The surrounding environment, for example the patient's
home, may be aware of the results from the clinical
data and even make emergency calls to order an
ambulance service.

For instance, we may refer to the IST Vivago® system
(IST International Security Technology Oy, Helsinki,
Finland), an active social alarm system, which combines
intelligent social alarms with continuous remote
monitoring of the user's activity profile (Sarela, Korhonen,
Lotjonen, Sola, Myllymaki, 2003).

Ami in Tourism and Cultural Heritage

Tourism and Cultural Heritage are good application
areas for Ambient Intelligence. Tourism is a growing
industry. In the past tourists were satisfied with
pre-defined tours, the same for everyone. However,
there is a trend towards customization, and the same tour
can be adapted to tourists according to their
preferences.

Immersive tour post is an example of such an
experience (Park, Nam, Shi, Golub, Van Loan, 2006). MEGA
is a user-friendly virtual guide to assist visitors in the
Parco Archeologico della Valle dei Templi in Agrigento,
an archaeological area with ancient Greek temples in
Sicily, Italy (Pilato, Augello,
Santangelo, Gentile, Gaglio, 2006). DALICA has been
used for constructing and updating the user profile of
visitors of Villa Adriana in Tivoli, near Rome, Italy
(Constantini, Inverardi, Mostarda, Tocchio, Tsintza,
2007).



Ami at Work 

People spend considerable time in workplaces
such as offices, meeting rooms, manufacturing
plants and control centres.

SPARSE is a project initially created for helping
Power Systems Control Centre Operators in the
diagnosis and restoration of incidents (Vale, Moura, Fernandes,
Marques, Rosado, Ramos, 1997). It is a good example
of context awareness since the developed system is
aware of the on-going situation, acting in different
ways according to the normal or critical situation of the
power system. This system is evolving into an Ambient
Intelligence framework applied to Control Centres.

Decision Making is one of the most important
activities of the human being. Nowadays decisions
require considering many different points of view, so
decisions are commonly taken by formal or informal
groups of persons. Groups exchange ideas or engage
in a process of argumentation and counter-argumentation,
negotiate, cooperate, collaborate or even discuss
techniques and/or methodologies for problem solving.
Group Decision Making is a social activity in which
the discussion and results consider a combination of
rational and emotional aspects. ArgEmotionAgents
is a project that applies Ambient Intelligence to
group argumentation and decision support, taking
emotional aspects into account. It runs in
the Laboratory of Ambient Intelligence for Decision
Support (LAID), a kind of Intelligent Decision Room
seen in Figure 1 (Marreiros, Santos,
Ramos, Neves, Novais, Machado, Bulas-Cruz, 2007).
This work also has a part involving ubiquity support.

Ami in Sports 

Sports involve high-level athletes and many more
practitioners. Many sports are practised without the help
of any associated devices, which opens a clear opportunity
for Ambient Intelligence to create sports assistance
devices and environments.



FlyMaster NAV+ is an on-board pilot assistant for
free flight (e.g. gliding, paragliding), using the FlyMaster
F1 module with access to GPS and sensor
information. FlyMaster Avionics S.A., a spin-off, was created
to commercialize these products (see Figure 2).



AMBIENT INTELLIGENCE PLATFORMS 

Some companies and academic institutions are invest- 
ing in the creation of Ambient Intelligence generation 
platforms. 

The Endeavour project is developed by the University
of California, Berkeley (http://endeavour.cs.berkeley.edu/).
The project aims to specify, design, and implement
prototypes of a planet-scale, self-organizing and
adaptive "Information Utility".

Oxygen enables pervasive, human-centred computing
through a combination of specific user and system
technologies (http://www.oxygen.lcs.mit.edu/). This
project provides speech and vision technologies that
enable us to communicate with Oxygen as if we were
interacting with another person, saving much time and
effort (Rudolph, 2001).

The Portolano project was developed at the
University of Washington and seeks to create a testbed for
research into the emerging field of invisible computing
(http://portolano.cs.washington.edu/). Invisible
computing is possible with devices so highly optimized
for particular tasks that they blend into the world and
require little technical knowledge from the users (Esler,
Hightower, Anderson, Borrielo, 1999).

The EasyLiving project of the Microsoft Research
Vision Group corresponds to a prototype architecture
and associated technologies for building intelligent
environments (Brumitt, Meyers, Krumm, Kern, Shafer,
2000). The goal of EasyLiving is to facilitate the interaction
of people with other people, with computers, and with
devices (http://research.microsoft.com/easyliving/).

Figure 1. Ambient Intelligence for decision support, LAID Laboratory

Figure 2. FlyMaster Pilot Assistant device, from FlyMaster Avionics S.A.



FUTURE TRENDS 

Ambient Intelligence deals with a futuristic notion for
our lives. Most practical experience with Ambient
Intelligence is still at a very early stage, because the
concept is so recent. Today, the separation between the
computer and the environment is not clear. However,
for new generations things will
be more transparent, and environments with Ambient
Intelligence will be more widely accepted.

In the area of transport, Ami will cover several
aspects. The first relates to the vehicle itself.
Several capabilities are starting to become available,
like the automatic identification of the situation (e.g.
pre-collision identification, identification of the
driver's condition). Other aspects relate to traffic
information. Today, GPS devices are widespread, but they
deal with static information. Adding on-line traffic
conditions will enable the driver to avoid roads with
accidents. Technology is making good progress towards
automatic vehicle driving, but in the near future the
systems developed will be seen more as driver
assistants than as autonomous driving systems.

Another area where Ami will experience strong
development is Health Care, especially
Elderly Care. Patients will receive this support to
allow a more autonomous life in their homes. Moreover,
the automatic acquisition of vital signs (e.g. blood
pressure, temperature) will allow automatic emergency
calls to be made when the patient's health is in
significant trouble. The person will also be monitored
in his/her home, in order to detect deviations from
expected situations and habits.
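
A minimal sketch of the vital-sign part of this scenario is given below. The numeric limits and the rule of three consecutive abnormal readings are assumptions made purely for illustration; they are not clinical guidance and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    systolic_mmhg: int
    temperature_c: float


# Illustrative limits only; NOT clinical guidance.
SYSTOLIC_RANGE = (90, 180)
TEMPERATURE_RANGE = (35.0, 39.5)


def out_of_range(v: Vitals) -> list:
    """Return the names of the vital signs that fall outside the limits."""
    problems = []
    if not SYSTOLIC_RANGE[0] <= v.systolic_mmhg <= SYSTOLIC_RANGE[1]:
        problems.append("blood pressure")
    if not TEMPERATURE_RANGE[0] <= v.temperature_c <= TEMPERATURE_RANGE[1]:
        problems.append("temperature")
    return problems


def needs_emergency_call(readings: list, consecutive: int = 3) -> bool:
    """Return True once `consecutive` abnormal readings occur in a row.

    Requiring a run of abnormal readings keeps a single noisy sample
    from triggering a call.
    """
    streak = 0
    for v in readings:
        streak = streak + 1 if out_of_range(v) else 0
        if streak >= consecutive:
            return True
    return False


normal = Vitals(120, 36.8)
high = Vitals(200, 36.8)
print(needs_emergency_call([normal, high, high, high]))
```

Detecting deviations from expected habits would need a model of the person's routine, which is outside the scope of this sketch.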

Home support will reach ordinary personal
and family life. Intelligent Homes will be a reality. The
home residents will pay less attention to normal home
management aspects, for example, how many bottles
of red wine are available for the week's meals or whether
all the ingredients for a cake are available.

Ami for job support is also expected. Decision
Support Systems will be oriented to on-the-job
environments. This will be clear in offices, meeting rooms,
call centres, control centres, and plants.



CONCLUSION 

This article presents the state of the art of the
Ambient Intelligence field. After the history of
the concept, we defined some related concepts
and illustrated them with examples. There
is a long way to go in order to achieve the Ambient
Intelligence concept; however, in the future this
concept will be regarded as one of the landmarks in the
development of Artificial Intelligence.



TERMS AND DEFINITIONS

Ambient Intelligence: Ambient Intelligence (Ami) 
deals with a new world where computing devices are 
spread everywhere, allowing the human being to interact 
in physical world environments in an intelligent and 
unobtrusive way. These environments should be aware 
of the needs of people, customizing requirements and 
forecasting behaviours. 

Context Awareness: Context Awareness means
that the system is aware of the current situation it
is dealing with.

Embedded Systems: Embedded Systems means
that electronic and computing devices are embedded
in everyday objects or goods.

Intelligent Decision Room: A decision-making
space, e.g. a meeting room or a control center, equipped
with intelligent devices and/or systems to support
decision-making processes.

Intelligent Home: A home equipped with several 
electronic and interactive devices to help residents to 
manage conventional home decisions. 

Intelligent Transportation Systems: Intelligent 
Systems applied to the area of Transports, namely to 
traffic and travelling issues. 

Intelligent Vehicles: A vehicle equipped with sen- 
sors and decision support components. 

Pervasive Computing: Pervasive Computing
relates to all the physical parts of our lives; the user
may have no notion of the computing devices and
details related to these physical parts.

Ubiquitous Computing: Ubiquitous Computing 
means that we have access to computing devices any- 
where in an integrated and coherent way. 

INTRODUCTION

Accdrnig to rscheearch at Cmabrigde Uinervtisy, it
deosn't mttaer in what oredr the ltteers in a wrod are,
the olny iprmoetnt tihng is that the frist and lsat ltteer
be at the rghit pclae. Tihs is bcuseae the human mnid
deos not raed ervey lteter by istlef, but the wrod as a
wlohe. 1

Unfortunately, computing systems are not yet as
smart as the human mind. Over the last couple of years
a significant number of researchers have been focusing
on noisy text analytics. Noisy text data is found in
informal settings (online chat, SMS, e-mails, message
boards, among others) and in text produced through
automated speech recognition or optical character
recognition systems. Noise can degrade the performance
of other information processing algorithms such as
classification, clustering, summarization and
information extraction. We will identify some of the
key research areas for noisy text and give a brief
overview of the state of the art. These areas are (i)
classification of noisy text, (ii) correcting noisy text,
and (iii) information extraction from noisy text. We
will cover the first one in this chapter and the latter
two in the next chapter.

We define noise in text as any kind of difference 
in the surface form of an electronic text from the in- 
tended, correct or original text. We see such noisy text 
everyday in various forms. Each of them has unique 
characteristics and hence requires special handling. 
We introduce some such forms of noisy textual data 
in this section. 

Online Noisy Documents: E-mails, chat logs,
scrapbook entries, newsgroup postings, threads in
discussion fora, blogs, etc., fall under this category.
People are typically less careful about the correctness
of written content in such informal modes of
communication. These are characterized by frequent
misspellings, commonly and not so commonly used
abbreviations, incomplete sentences, missing
punctuation and so on. Almost always, noisy documents
are human interpretable, if not by everyone, at least
by the intended readers.

SMS: Short Message Services are becoming more
and more common. Language usage over SMS text
differs significantly from the standard form of the
language. An urge towards shorter message length,
which facilitates faster typing, and the need for
semantic clarity shape the structure of this
non-standard form, known as the texting language
(Choudhury et al., 2007).

Text Generated by ASR Devices: ASR is the
process of converting a speech signal to a sequence
of words. An ASR system takes a speech signal such
as monologues, discussions between people, telephone
conversations, etc. as input and produces a string of
words, typically not demarcated by punctuation, as
the transcript. An ASR system consists of an acoustic
model, a language model and a decoding algorithm.
The acoustic model is trained on speech data and the
corresponding manual transcripts. The language model
is trained on a large monolingual corpus. An ASR
system converts audio into text by searching the
acoustic model and language model space using the
decoding algorithm. Most conversations between agents
and customers at contact centers today are recorded.
To process this data to obtain customer intelligence,
it is necessary to convert the audio into text.

Text Generated by OCR Devices: Optical character
recognition, or 'OCR', is a technology that allows digital
images of typed or handwritten text to be transferred
into an editable text document. It takes the picture of
text and translates the text into Unicode or ASCII. For
handwritten optical character recognition, the rate of
recognition is 80% to 90% with clean handwriting.

Call Logs in Contact Centers: Today's contact
centers (also known as call centers, BPOs, KPOs) produce
huge amounts of unstructured data in the form of call
logs, apart from emails, call transcriptions, SMS, chat
transcripts etc. Agents are expected to summarize an
interaction as soon as they are done with it and before
picking up the next one. As the agents work under
immense time pressure, the summary logs are very
poorly written and sometimes difficult even for humans
to interpret. Analysis of such call logs is important
to identify problem areas, agent performance, evolving
problems etc.

In this chapter we will focus on automatic
classification of noisy text. Automatic text
classification refers to segregating documents into
different topics depending on content, for example,
categorizing customer emails according to topics such
as billing problem, address change, product enquiry
etc. It has important applications in email
categorization, building and maintaining web
directories (e.g. DMoz), spam filtering, automatic call
and email routing in contact centers, pornographic
material filtering and so on.



NOISY TEXT CATEGORIZATION

The text classification task is one of learning
models for a given set of classes and applying these
models to new unseen documents for class assignment.
This is an important component in many knowledge
extraction tasks: real-time sorting of email or files
into folder hierarchies, topic identification to support
topic-specific processing operations, structured search
and/or browsing, or finding documents corresponding
to long-term standing interests or more dynamic
task-based interests. Two types of classifiers are
commonly found, viz. statistical classifiers and
rule-based classifiers.

In statistical techniques a model is typically trained
on a corpus of labelled data and, once trained, the system
can be used for automatic assignment of unseen data. A
survey of text classification can be found in the work
by Aas & Eikvil (Aas & Eikvil, 1999). Given a training
document collection $D = \{d_1, d_2, \ldots, d_M\}$ with true
classes $\{y_1, y_2, \ldots, y_M\}$, the task is to learn a model.
This model is used for categorizing a new unlabelled
document $d_u$. Typically words appearing in the text are
used as features. Other applications, including search,
rely heavily on taking the markup or link structure of
documents into account, but classifiers only depend on
the content of the documents or the collection of words
present in the documents. Once features are extracted
from documents, each document is converted into a
document vector. Documents are represented in a vector
space; each dimension of this space represents a
single feature, and the importance of that feature in that
document gives the exact distance from the origin. The
simplest representation of document vectors uses the
binary event model, where if a feature $j \in V$ appears in
document $d_i$, then the $j$-th component of $d_i$ is 1; otherwise
it is 0. One of the most popular statistical classification
techniques is naive Bayes (McCallum, 1998). In the
naive Bayes technique the probability of a document
$d$ belonging to class $c$ is computed as:

\[
\Pr(c \mid d) = \frac{\Pr(c, d)}{\Pr(d)}
             = \frac{\Pr(c)\,\Pr(d \mid c)}{\Pr(d)}
             \propto \Pr(c)\,\Pr(d \mid c)
             \propto \Pr(c) \prod_{j} \Pr(d_j \mid c)
\]

where $d_j$ denotes the $j$-th component of $d$.

The final approximation of the above equation refers
to the naive part of such a model, i.e., the assumption
of word independence, which means the features are
assumed to be conditionally independent, given the
class variable.
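
The naive Bayes computation above can be illustrated with a minimal sketch under the binary event model. The function names, the toy training documents, and the add-one smoothing are assumptions made for this example; they are not taken from McCallum (1998).

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Estimate Pr(c) and Pr(d_j = 1 | c) from labelled documents.

    Add-one smoothing is an assumption of this sketch so that no
    probability is exactly 0 or 1.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = defaultdict(list)
    for d, y in zip(docs, labels):
        class_docs[y].append(set(d.split()))
    prior, cond = {}, {}
    for c, members in class_docs.items():
        prior[c] = len(members) / len(docs)
        cond[c] = {
            w: (sum(w in m for m in members) + 1) / (len(members) + 2)
            for w in vocab
        }
    return vocab, prior, cond


def classify(doc, vocab, prior, cond):
    """Return the class maximising log Pr(c) + sum_j log Pr(d_j | c)."""
    present = set(doc.split())
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            # Binary event model: each vocabulary word is present or absent.
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


train_docs = [
    "exciting offer get a free laptop",
    "free offer click now",
    "meeting agenda for monday",
    "please review the billing report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(train_docs, train_labels)
print(classify("get a free laptop now", *model))
```

Working in log space avoids numerical underflow when the product runs over a large vocabulary.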

Rule-based learning systems have been adopted in
the document classification problem since they have
considerable appeal. They perform well at finding simple
axis-parallel frontiers. A typical rule-based
classification scheme for a category, say C, has the form:

Assign category C if antecedent or

Do not assign category C if antecedent or

The antecedent in the premise of a rule usually
involves some kind of feature value comparison. A
rule is said to cover a document, or a document is said
to satisfy a rule, if all the feature value comparisons in
the antecedent of the rule are true for the document.
One of the well-known works in the rule-based text
classification domain is RIPPER. Like a standard
separate-and-conquer algorithm, it builds a rule set
incrementally. When a rule is found, all documents
covered by the rule, both positive and negative, are
discarded. The rule is then added to the rule set. The
remaining documents are used to build other rules in
the next iteration.
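
The separate-and-conquer loop can be sketched as follows. This is a deliberately simplified illustration, not RIPPER itself: each rule's antecedent is a single word, rules are scored by precision and then coverage, and there is no pruning or optimisation. The function name and toy data are assumptions for this example.

```python
def learn_rules(docs, labels, target):
    """Greedy separate-and-conquer with single-word antecedents."""
    remaining = [(set(d.split()), y == target) for d, y in zip(docs, labels)]
    rules = []
    while any(is_pos for _, is_pos in remaining):
        # Candidate antecedents: words seen in remaining positive documents.
        candidates = {w for words, is_pos in remaining if is_pos for w in words}
        if not candidates:
            break

        def quality(word):
            covered = [is_pos for words, is_pos in remaining if word in words]
            # Precision first, then number of positives covered; the word
            # itself only breaks ties deterministically.
            return (sum(covered) / len(covered), sum(covered), word)

        best = max(candidates, key=quality)
        rules.append(best)
        # Discard every document the new rule covers, positive or negative.
        remaining = [(ws, pos) for ws, pos in remaining if best not in ws]
    return rules


docs = [
    "sports match",
    "sports score",
    "play outdoor",
    "homework exam",
    "play homework",
]
labels = ["sports", "sports", "sports", "school", "school"]
print(learn_rules(docs, labels, "sports"))
```

Each learnt word stands for one rule of the form "Assign category C if the document contains the word"; the rule set is the disjunction of these rules.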

In both statistical and rule-based text classification
techniques, the content of the text is the sole
determiner of the category to be assigned. However,
noise in the text distorts the content, and hence one
can expect the categorization performance to be
affected by noise in the text. Classifiers are essentially
trained to identify correlations between extracted
features (words) and different categories, which can
later be used to categorize new documents. For example,
words like "exciting offer get a free laptop" might have
a stronger correlation with the category spam emails
than with non-spam emails. Noise in text distorts this
feature space: "excitinng ofer get frree lap top" will be
a new set of features and the categorizer will not be
able to relate it to the spam emails category. The
feature space explodes as the same feature can appear
in different forms due to spelling errors, poor
recognition, wrong transcription, etc. In the remaining
part of this section we will give an overview of how
people have approached the problem of categorizing
noisy text.
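
The growth of the feature space under noise can be seen in a small sketch. The noisy variants below are invented for illustration (the first is the example from the text), and words are taken as whitespace-separated tokens.

```python
def vocabulary(docs):
    """Return the set of distinct whitespace-separated features."""
    return {word for doc in docs for word in doc.split()}


# Three copies of the clean message versus three noisy renderings of it.
clean_docs = ["exciting offer get a free laptop"] * 3
noisy_docs = [
    "excitinng ofer get frree lap top",
    "exciting offr get a free laptop",
    "exiting offer gt a free laptop",
]

clean_vocab = vocabulary(clean_docs)
noisy_vocab = vocabulary(noisy_docs)
print(len(clean_vocab), len(noisy_vocab))

# Features of the first noisy message that a classifier trained on the
# clean text has actually seen.
print(vocabulary(noisy_docs[:1]) & clean_vocab)
```

The same six underlying words yield more than twice as many distinct features once noise is present, and the first noisy message shares only a single feature with the clean vocabulary.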

Categorization of OCRed Documents 

Electronically recognized handwritten documents and
documents generated by an OCR process are typical
examples of noisy text because of the errors introduced
by the recognition process. Vinciarelli (Vinciarelli,
2004) has studied the characteristics of noise present
in such data and its effects on categorization accuracy.
A subset of documents from the Reuters-21578 text
classification dataset was taken and noise was
introduced using two methods. In the first, a subset of
documents was manually written and recognized using
an offline handwriting recognition system. In the
second, the OCR-based extraction process was simulated
by randomly changing a certain percentage of
characters. According to this study, for recall values
up to 60-70 percent, depending on the sources, the
categorization system is robust to noise even when the
Term Error Rate is higher than 40 percent. It was also
observed that the results from the handwritten data
appeared to be lower than those obtained from OCR
simulations. Generic systems for text categorization
based on statistical analysis of representative text
corpora have been proposed (Bayer et al., 1998).
Features are extracted from training texts by selecting
substrings from actual word forms and applying
statistical information and general linguistic
knowledge, followed by dimensionality reduction by
linear transformation. The actual categorization
system is based on a minimum least-squares approach.
The system is evaluated on the tasks of categorizing
abstracts of paper-based German technical reports and
business letters concerning complaints. Approximately
80% classification accuracy is obtained and the system
is seen to be very robust against recognition or
typing errors.
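
The second noise-injection method, randomly changing a given percentage of characters, might be simulated along the following lines. This is a sketch under stated assumptions: only letters are replaced (by a different lowercase letter), and the term error rate is taken here as the fraction of whitespace-separated terms that differ from the clean text. Neither detail is taken from Vinciarelli (2004).

```python
import random
import string


def simulate_ocr_noise(text, char_error_rate, seed=0):
    """Replace each letter with a different random letter with the given
    probability. Non-letters are kept, so word boundaries are preserved."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < char_error_rate:
            choices = [c for c in string.ascii_lowercase if c != ch.lower()]
            out.append(rng.choice(choices))
        else:
            out.append(ch)
    return "".join(out)


def term_error_rate(clean, noisy):
    """Fraction of terms in `clean` whose counterpart in `noisy` differs.

    Assumes both texts have the same number of terms, which holds for
    the output of simulate_ocr_noise.
    """
    clean_terms, noisy_terms = clean.split(), noisy.split()
    wrong = sum(c != n for c, n in zip(clean_terms, noisy_terms))
    return wrong / len(clean_terms)


text = "the categorization system is robust to noise"
for rate in (0.0, 0.1, 0.3):
    noisy = simulate_ocr_noise(text, rate)
    print(rate, noisy, term_error_rate(text, noisy))
```

Because a single wrong character corrupts a whole term, the term error rate grows much faster than the character error rate, especially for long words.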

Issues with categorizing OCRed documents are also
discussed by many other authors (Brooks & Teahan,
2007), (Hoch, 1994) and (Taghva et al., 2001).

Categorization of ASRed Documents 

Automatic Speech Recognition (ASR) is simply the
process of converting an acoustic signal to a sequence of
words. Researchers have proposed different techniques
for speech recognition tasks based on Hidden Markov
Models (HMM), neural networks and Dynamic Time
Warping (DTW) (Trentin & Gori, 2001). The performance
of an ASR system is typically measured in terms of
Word Error Rate (WER), which is derived from the
Levenshtein distance, working at the word level instead
of the character level. WER can be computed as

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]

where S is the number of substitutions, D is the number
of deletions, I is the number of insertions, and
N is the number of words in the reference. Bahl et al.
(Bahl et al., 1995) have built an ASR system and
demonstrated its capability on benchmark datasets.
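
As an illustration of the WER formula, the sketch below computes the word-level Levenshtein distance with the standard dynamic programme and divides it by the reference length. The function name and the example sentences are assumptions made for this example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)


reference = "get a free laptop today"
hypothesis = "get free lap top today"
print(word_error_rate(reference, hypothesis))
```

Note that WER can exceed 1 when the hypothesis contains many insertions, since the denominator counts only the reference words.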

ASR systems give rise to word substitutions,
deletions and insertions, while OCR systems produce
essentially word substitutions. Moreover, ASR systems
are constrained by a lexicon and can output only
words belonging to it, while OCR systems can work
without a lexicon (this corresponds to the possibility
of transcribing any character string) and can output
sequences of symbols not necessarily corresponding
to actual words. Such differences are expected to have
a strong influence on the performance of systems
designed for categorizing ASRed documents in comparison
to the categorization of OCRed documents. A lot of work
has appeared on automatic call type classification for
the purpose of categorizing calls (Tang et al., 2003),
call routing (Kuo and Lee, 2003; Haffner et al., 2003),
obtaining call log summaries (Douglas et al., 2005), and
agent assisting and monitoring (Mishne et al., 2005).
Here calls are classified based on the transcription
from an ASR system. One interesting study of the
effect of ASR noise on text classification was done on
a subset of the benchmark text classification dataset
Reuters-21578 2 (Agarwal et al., 2007). The authors read
out and automatically transcribed 200 documents and
applied a text classifier trained on the clean
Reuters-21578 training corpus 3 . Surprisingly, in spite
of a high degree of noise, they did not observe much
degradation in accuracy.

Effect of Spelling Errors on Categorization

Spelling errors are an integral part of written text,
electronic as well as non-electronic. Every reader of
this book must have been scolded by their teacher
in school for spelling words wrongly! In this era of
electronic text people have become less careful while
writing, resulting in poorly written text containing
abbreviations, short forms, acronyms and wrong
spellings. Such electronic text documents, including
emails, chat logs, postings and SMSs, are sometimes
difficult to interpret even for human beings. It goes
without saying that text analytics on such noisy data
is a non-trivial task.

Wrong spellings can affect automatic classification
performance in multiple ways depending on the nature
of the classification technique being used. In the case
of statistical techniques, spelling differences distort the
feature space. If the training as well as the test data
corpus are noisy, while learning the model the
classifier will treat variants of the same words as
different features. As a result the observed joint
probability distribution will be different from the
actual distribution. If the proportion of wrongly spelt
words is high then the distortion can be significant and
will hurt the accuracy of the resultant classifier.
However, if the classifier is trained on a clean corpus
and the test documents are noisy, then wrongly spelt
words will be treated as unseen words and will not help
in classification. In an unlikely situation a wrongly
spelt word present in a test document may become a
different valid feature and, worse, may become a valid
indicative feature of a different class. A standard
technique in the text classification process is feature
selection, which happens after feature extraction and
before training. Feature selection typically employs
some statistical measures over the training corpus and
ranks features in order of the amount of information
(correlation) they have with respect to the class labels
of the classification task at hand. After the feature
set has been ranked, the top few features are retained
(typically of the order of hundreds or a few thousand)
and the others are discarded. Feature selection should
be able to eliminate wrongly spelt words present in the
training data provided (i) the proportion of wrongly
spelt words is not very large and (ii) there is no
regular pattern in spelling errors 4 . However, it has
been observed that even at a high degree of spelling
errors the classification accuracy does not suffer much
(Agarwal et al., 2007).
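
The ranking step of feature selection can be sketched as follows. Mutual information between a word's presence and the class label is used here as the statistical measure; this is one common choice and an assumption of this example, as are the function names and the toy corpus.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """Mutual information (in bits) between word presence and class label."""
    n = len(docs)
    joint = Counter((word in d.split(), y) for d, y in zip(docs, labels))
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_features(docs, labels, k):
    """Rank all words by mutual information and keep the top k."""
    vocab = {w for d in docs for w in d.split()}
    ranked = sorted(
        vocab, key=lambda w: (-mutual_information(docs, labels, w), w)
    )
    return ranked[:k]


docs = [
    "free offer",
    "free laptop",
    "frree offer",      # one misspelt occurrence of "free"
    "meeting monday",
    "meeting agenda",
    "meeting offer",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
print(select_features(docs, labels, 2))
```

The isolated misspelling "frree" occurs too rarely to carry much information about the class and falls below the cut-off, which matches condition (i) above. A misspelling that recurs systematically would score higher and could survive selection, which is the situation condition (ii) excludes.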

Rule based classification techniques also get nega- 
tively affected by spelling errors. If the training data 
contains spelling errors then some of the rules may 
not get the required statistical significance. Due to 
spelling errors present in the test data a valid rule may 
not fire and worse, an invalid rule may fire leading to 
a wrong categorization. Suppose RIPPER has learnt 
a rule set like: 

Assign category "sports" IF
(the document contains sports) OR
(the document contains exercise AND outdoor) OR
(the document contains exercise but not homework, exam) OR
(the document contains play AND rule) OR



A hypothetical test document containing repeated 
occurrences of exercise, but each time wrongly spelt as 
exarcise, will not be categorized to the sports category 
and hence lead to misclassification. 
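
The effect can be reproduced with a direct encoding of the four clauses listed above. This is a sketch only: the function name is hypothetical, the rule set in the text is open-ended and only the listed clauses are encoded, and "but not homework, exam" is read here as "contains neither homework nor exam".

```python
def is_sports(document):
    """Apply the four listed clauses of the example rule set."""
    words = set(document.lower().split())
    return (
        "sports" in words
        or {"exercise", "outdoor"} <= words
        # Assumed reading: exercise present, homework and exam both absent.
        or ("exercise" in words and not words & {"homework", "exam"})
        or {"play", "rule"} <= words
    )


print(is_sports("daily exercise keeps you fit"))   # correctly spelt
print(is_sports("daily exarcise keeps you fit"))   # misspelt: no rule fires
```

Because the antecedents test for exact word matches, a single changed character is enough to stop an otherwise valid rule from firing.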



CONCLUSION 

In this chapter we have looked at noisy text analytics. 
This topic is gaining in importance as more and more 
noisy data gets generated and needs processing. In 
particular we have looked at techniques for correcting 
noisy text and for doing classification. We have pre- 
sented a survey of existing techniques in the area and 
have shown that even though it is a difficult problem 
it is possible to address it with a combination of new 
and existing techniques. 

KEY TERMS

Automatic Speech Recognition: Machine recogni- 
tion and conversion of spoken words into text. 

Data Mining: The application of analytical methods 
and tools to data for the purpose of identifying patterns, 
relationships or obtaining systems that perform useful 
tasks such as classification, prediction, estimation, or 
affinity grouping. 

Information Extraction: Automatic extraction of 
structured knowledge from unstructured documents. 

Noisy Text: Text with any kind of difference in the 
surface form, from the intended, correct or original 
text. 

Optical Character Recognition: Translation of 
images of handwritten or typewritten text (usually 
captured by a scanner) into machine-editable text. 

Rule Induction: Process of learning, from cases
or instances, if-then rule relationships that consist of 
an antecedent (if-part, defining the preconditions or 
coverage of the rule) and a consequent (then-part, 
stating a classification, prediction, or other expres- 
sion of a property that holds for cases defined in the 
antecedent). 

Text Analytics: The process of extracting useful and 
structured knowledge from unstructured documents to 
find useful associations and insights. 

Text Classification (or Text Categorization): The
task of learning models for a given set of classes
and applying these models to new unseen documents
for class assignment.



ENDNOTES 

1 According to http://www.mrc-cbu.cam.ac.uk/%7Emattd/Cmabrigde/,
this is an internet hoax. However, we found it
interesting and hence included it here.

2 http://www.daviddlewis.com/resources/testcollections/

3 This dataset is available from
http://kdd.ics.uci.edu/databases/reuters_transcribed/reuters_transcribed.html

4 Note: this assumption may not hold true in the
case of cognitive errors.

INTRODUCTION

The importance of text mining applications is growing
in proportion to the exponential growth of electronic
text. Along with the growth of the internet, many other
sources of electronic text have become popular.
With the increasing penetration of the internet, many
forms of communication and interaction such as email,
chat, newsgroups, blogs, discussion groups, scraps etc.
have become increasingly popular. These generate huge
amounts of noisy text data every day. Apart from these,
the other big contributors to the pool of electronic text
documents are call centres and customer relationship
management organizations, in the form of call logs,
call transcriptions, problem tickets, complaint emails
etc., electronic text generated by the Optical Character
Recognition (OCR) process from handwritten and
printed documents, and mobile text such as Short
Message Service (SMS). Though the nature of each of
these documents is different, there is a common thread
between all of them: the presence of noise.

An example of information extraction is the
extraction of instances of corporate mergers, more
formally MergerBetween(company1, company2, date), from
an online news sentence such as: "Yesterday, New-York
based Foo Inc. announced their acquisition of Bar
Corp." Another is the extraction of opinions,
Opinion(product1, good), from a blog post such
as: "I absolutely liked the texture of SheetK quilts."

At a superficial level, there are two ways to extract
information from noisy text. The first is to clean the
text by removing noise and then apply existing
state-of-the-art techniques for information extraction.
Therein lies the importance of techniques for
automatically correcting noisy text. In this chapter, we
will first review some work in the area of noisy text
correction. The second approach is to devise extraction
techniques which are robust with respect to noise.
Later in this chapter, we will see how the task of
information extraction is affected by noise.

NOISY TEXT CORRECTION 

Before moving on to techniques for processing noisy
text we will briefly introduce methods for correcting
noisy text. One of the most common forms of noise in
text is wrong spelling. Kukich provides a comprehensive
survey of techniques pertaining to detecting and
correcting spelling errors (Kukich, 1992). According
to this survey, three types of nonword misspellings are
typically found, viz. typographic, such as teh and speel;
cognitive, such as recieve and conspeeracy; and phonetic,
such as abiss and nacherly. A distinction must be made
between automatically detecting such errors and
automatically correcting them. The latter is a much
harder problem. Most of the recent work in this area
is about correcting spelling mistakes automatically.
Golding and Roth (Golding & Roth, 1999) proposed
a combination of a variant of Winnow, a multiplicative
weight-update algorithm, and weighted majority voting
for context-sensitive spelling correction. Mangu and
Brill (Mangu & Brill, 1997) have shown that a small
set of human-understandable rules is more meaningful
than a large set of opaque features and weights. Hybrid
methods capturing the context using trigrams of
part-of-speech tags and a feature-based method have
also been proposed to handle context-sensitive spelling
correction (Golding & Schabes, 1996). There is a lot of
other work related to automatic correction of spelling
errors (Agirre et al., 1998), (Zamora et al., 1983),
(Golding, 1995). A complete bibliography of all the work
related to spelling error detection and correction can
be found in (Beebe, 2005). On a related note, automatic
spelling error correction techniques have been applied
to other applications such as semantic role labelling
(Sang et al., 2005).
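
To make the distinction between detection and correction concrete, the sketch below detects a nonword by lexicon lookup and corrects it by picking the most similar lexicon entry. This is a simple isolated-word baseline built on Python's standard `difflib` module, not one of the context-sensitive methods cited above; the tiny lexicon and the 0.6 similarity cutoff are assumptions for this example.

```python
import difflib

# Hypothetical miniature lexicon; a real system would use a full dictionary.
LEXICON = ["the", "spell", "receive", "conspiracy", "abyss", "naturally"]


def is_misspelt(word, lexicon=LEXICON):
    """Detection: a nonword is any token absent from the lexicon."""
    return word not in lexicon


def correct(word, lexicon=LEXICON, cutoff=0.6):
    """Correction: return the closest lexicon entry, or the word unchanged
    when no entry is similar enough."""
    if not is_misspelt(word, lexicon):
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


for noisy in ["teh", "recieve", "abiss", "nacherly"]:
    print(noisy, "->", correct(noisy))
```

Detection only needs the lexicon lookup, while correction must choose among candidates, and string similarity alone can fail on phonetic errors whose spelling is far from the intended word. This is one reason the context-sensitive methods above were developed.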

There is also recent work on correcting SMS text
(Aw et al., 2006) (Choudhury et al., 2007),
OCR errors (Nartker et al., 2003) and ASR errors
(Sarma & Palmer, 2004).



INFORMATION EXTRACTION FROM NOISY TEXT

The goal of Information Extraction (IE) is to
automatically extract structured information from
unstructured documents. The extracted structured
information has to be contextually and semantically
well-defined data from a given domain. A typical
application of IE is to scan a set of documents written
in natural language and populate a database with the
information extracted. The Message Understanding
Conference (MUC) was one effort at codifying the IE
task and expanding it (Chinchor, 1998).

There are two basic approaches to the design of IE 
systems. One comprises the knowledge engineering 
approach where a domain expert writes a set of rules 
to extract the sought after information. Typically the 
process of building the system is iterative whereby a 
set of rules is written, the system is run and the output 
examined to see how the system is performing. The 
domain expert then modifies the rules to overcome any 
under- or over-generation in the output. The second 
is the automatic training approach. This approach is 
similar to classification where the texts are appropriately 
annotated with the information being extracted. For 
example, if we would like to build a city name extractor, 
then the training set would include documents with all 
the city names marked. An IE system would be trained 
on this annotated corpus to learn the patterns that would 
help in extracting the necessary entities. 
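
The knowledge engineering approach can be illustrated with a single hand-written rule for merger announcements. The regular expression below is a hypothetical rule written for this example; it covers only one phrasing, recognises company names by a few suffixes, and ignores the date slot.

```python
import re

# Hand-written rule: "<Company> announced its/their acquisition of <Company>".
# Company names are assumed to be capitalised words ending in Inc, Corp or Ltd.
COMPANY = r"[A-Z]\w*(?: [A-Z]\w*)* (?:Inc|Corp|Ltd)\.?"
MERGER_PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:its|their) acquisition of "
    rf"(?P<target>{COMPANY})"
)


def extract_mergers(sentence):
    """Return (acquirer, target) pairs matched by the hand-written rule."""
    return [
        (m.group("acquirer"), m.group("target"))
        for m in MERGER_PATTERN.finditer(sentence)
    ]


clean = ("Yesterday, New-York based Foo Inc. announced their "
         "acquisition of Bar Corp.")
noisy = ("Yesterday, New-York based Foo Inc. anounced thier "
         "acquisition of Bar Corp.")
print(extract_mergers(clean))
print(extract_mergers(noisy))
```

The rule extracts the pair from the clean sentence but returns nothing for the noisy one, where two misspelt words break the pattern. In the iterative process described above, the domain expert would respond to such under-generation by revising the rule.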

An information extraction system typically consists 
of natural language processing steps such as morpho- 
logical processing, lexical processing and syntactic 
analysis. These include stemming to reduce inflected 
forms of words to their stem, parts of speech tagging 
to assign labels such as noun, verb, etc. to each word 
and parsing to determine the grammatical structure of 
sentences. 



Named Entity Annotation of Web Posts 

Extraction of named entities is a key IE task. It seeks 
to locate and classify atomic elements in the text into 
predefined categories such as the names of persons, or- 
ganizations, locations, expressions of times, quantities, 
monetary values, percentages, etc. Entity recognition 
systems either use rule based techniques or statistical 
models. Typically a parser or a parts of speech tagger 
identifies elements such as nouns, noun phrases, or 
pronouns. These elements along with surface forms 
of the text are used to define templates for extract- 
ing the named entities. For example, to tag company 
names it would be desirable to look at noun phrases 
that contain the words company or incorporated in 
them. These rules can be automatically learnt using 
a tagged corpus or could be defined manually. Most 
known approaches do this on clean well formed text. 
However, named entity annotation of web posts such
as online classifieds, product listings etc. is harder
because these texts are not grammatical or well written.
In such cases reference sets have been used to annotate
parts of the posts (Michelson & Knoblock, 2005). The
reference set is thought of as a relational set of data
with a defined schema and consistent attribute values.
Posts are then matched to their nearest records in the
reference set. In the biological domain, gene name
annotation, even though it is performed on well-written
scientific articles, can be thought of in the context of
noise, because many gene names overlap with common
English words or biomedical terms. There have been
studies on the performance of the gene name annotator
when trained on noisy data (Vlachos, 2006).
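
The idea of matching a post to its nearest record in a reference set can be sketched with approximate string matching. This is a much simplified illustration, not the method of Michelson & Knoblock (2005): the reference set, the example post, the token-level matching, and the 0.7 cutoff are all assumptions for this example.

```python
import difflib

# Hypothetical reference set with a fixed schema (make, model).
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]


def nearest_record(post, reference_set=REFERENCE_SET, cutoff=0.7):
    """Return the record with the most attribute values that approximately
    match some token of the post."""
    tokens = post.lower().replace(",", " ").split()

    def score(record):
        return sum(
            bool(difflib.get_close_matches(value.lower(), tokens,
                                           n=1, cutoff=cutoff))
            for value in record.values()
        )

    return max(reference_set, key=score)


post = "93 hnda civc 4 sale, runs grt"
print(nearest_record(post))
```

Once a post is linked to a record, the clean attribute values of that record can be used to annotate the corresponding noisy tokens in the post.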

Information Extraction from OCRed Documents

Documents obtained from OCR may have not only
unknown words and compound words, but also incorrect
words due to OCR errors. In their work Miller
et al. (Miller et al., 2000) have measured the effect
of OCR noise on IE performance. Many IE methods
work directly on the document image to avoid errors
resulting from conversion to text. They adopt keyword
matching by searching for string patterns and then use
global document models consisting of keyword models
and their logical relationships to achieve robustness
in matching (Lu & Tan, 2004). The presence of OCR
errors has a detrimental effect on information access
from these documents (Taghva et al., 2004). However,
techniques for post-processing these documents to
correct these errors exist and have been shown to give
large improvements.

Information Extraction from ASRed Documents

The output of an ASR system does not contain case
information or punctuation. It has been shown that
in the absence of punctuation the extraction of different
syntactic entities like parts of speech and noun phrases
is not accurate (Nasukawa et al., 2007). So IE from
ASRed documents becomes harder. Miller et al. (Miller
et al., 2000) have shown how IE performance varies
with ASR noise. It has also been shown that it is
possible to build aggregate models from ASR data (Roy &
Subramaniam, 2006). In that work topical models are
constructed by utilizing inter-document redundancy
to overcome the noise, and only a few natural
language processing steps are used. Phrases are
aggregated over the noisy collection to get to the
clean underlying text.



FUTURE TRENDS 

More and more data from sources like chat, conver- 
sations, blogs, discussion groups need to be mined 
to capture opinions, trends, issues and opportunities. 
These forms of communication encourage informal 
language which can be considered noisy due to spell- 
ing errors, grammatical errors and informal writing 
styles. Companies are interested in mining such data 
to observe customer preferences and improve customer 
satisfaction. Online agents need to be able to understand 
web posts to take actions and communicate with other 
agents. Customers are interested in collated product 
reviews from web posts of other users. The nature 
of the noisy text warrants moving beyond traditional 
text analytics techniques. There is a need for developing
natural language processing techniques that are robust
to noise. Techniques that implicitly and explicitly
tackle textual noise also need to be developed.



CONCLUSION 

In this chapter we have looked at information extraction
from noisy text. This topic is gaining in importance as
more and more noisy data gets generated and useful
information needs to be obtained from it. We have
presented a survey of existing information extraction
techniques. We have also presented some of the future
trends in noisy text analytics.

KEY TERMS



Automatic Speech Recognition: Machine recogni- 
tion and conversion of spoken words into text. 

Data Mining: The application of analytical methods 
and tools to data for the purpose of identifying patterns, 
relationships or obtaining systems that perform useful 
tasks such as classification, prediction, estimation, or 
affinity grouping. 

Information Extraction: Automatic extraction of 
structured knowledge from unstructured documents. 

Knowledge Extraction: Explicitation of the internal 
knowledge of a system or set of data in a way that is 
easily interpretable by the user. 

Noisy Text: Text with any kind of difference in the 
surface form, from the intended, correct or original 
text. 

Optical Character Recognition: Translation of 
images of handwritten or typewritten text (usually 
captured by a scanner) into machine-editable text. 

Rule Induction: Process of learning, from cases 
or instances, if-then rule relationships that consist of 
an antecedent (if-part, defining the preconditions or 
coverage of the rule) and a consequent (then-part, 
stating a classification, prediction, or other expres- 
sion of a property that holds for cases defined in the 
antecedent). 

Text Analytics: The process of extracting useful and 
structured knowledge from unstructured documents to 
find useful associations and insights. 
INTRODUCTION 

Heart-related pathologies are among the most frequent 
health problems in western society. Symptoms that point 
towards cardiovascular diseases are usually diagnosed 
with angiographies, which allow the medical expert 
to observe the blood flow in the coronary arteries and 
detect severe narrowing (stenosis). According to the 
severity, extension, and location of these narrowings, 
the expert pronounces a diagnosis, defines a treatment, 
and establishes a prognosis. 

The current modus operandi is for clinical experts to 
observe the image sequences and take decisions on the 
basis of their empirical knowledge. Various techniques 
and segmentation strategies now aim at objectivizing 
this process by extracting quantitative and qualitative 
information from the angiographies. 



BACKGROUND 

Segmentation is the process that divides an image into 
its constituent parts or objects. In the present context, it 
consists of separating the pixels that compose the 
coronary tree from the remaining "background" pixels. 

None of the currently applied segmentation methods 
is able to completely and perfectly extract the vascula- 
ture of the heart, because the images present complex 
morphologies and their background is inhomogeneous 
due to the presence of other anatomic elements and 
artifacts such as catheters. 

The literature presents a wide array of coronary tree 
extraction methods: some apply pattern recognition 
techniques based on pure intensity, such as thresholding 
followed by an analysis of connected components, 
whereas others apply explicit vessel models to extract 
the vessel contours. 
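
The intensity-based strategy mentioned above (thresholding followed by an analysis of connected components) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the image is a small list of grey-level rows, vessels are assumed to be darker than the background, 4-connectivity is used, and the function names (threshold, label_components, largest_component) are hypothetical and not taken from any cited method.

```python
from collections import deque

def threshold(image, level):
    """Binary mask: 1 where the pixel is darker than `level` (assumed vessel)."""
    return [[1 if px < level else 0 for px in row] for row in image]

def label_components(mask):
    """Label 4-connected foreground regions; returns (labels, count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:  # breadth-first flood of one component
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

def largest_component(labels, count):
    """Keep only the largest labelled region (assumed to be the vessel tree)."""
    sizes = [0] * (count + 1)
    for row in labels:
        for lab in row:
            sizes[lab] += 1
    sizes[0] = 0  # ignore the background label
    best = max(range(count + 1), key=lambda k: sizes[k])
    return [[1 if lab == best and best != 0 else 0 for lab in row] for row in labels]

image = [
    [200, 200, 50, 200, 200],
    [200, 60, 55, 200, 40],
    [200, 200, 50, 200, 200],
    [200, 200, 45, 200, 200],
]
labels, count = label_components(threshold(image, 100))
tree = largest_component(labels, count)
```

Keeping only the largest component is one simple way to discard isolated dark pixels; real angiographies would normally need preprocessing before a global threshold is usable.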

Depending on the quality and noise of the image, 
some segmentation methods may require image pre- 
processing prior to the segmentation algorithm; others 
may need postprocessing operations to eliminate the 
effects of a possible oversegmentation. 

The techniques and algorithms for vascular 
segmentation can be categorized as follows (Kirbas & 
Quek, 2004): 

1. Techniques for "pattern-matching" or pattern 
recognition 

2. Techniques based on models 

3. Techniques based on tracking 

4. Techniques based on artificial intelligence 

MAIN FOCUS 

This section describes the main features of the 
most commonly accepted coronary tree segmentation 
techniques. These techniques automatically detect 
objects and their characteristics, which is an easy and 
immediate task for humans, but an extremely complex 
process for artificial computational systems. 

Techniques Based on Pattern 
Recognition 

The pattern recognition approaches can be classified 
into four major categories: 
Figure 1. Region growing applied to an angiography 





Multiscale Methods 

The multiscale method extracts the vessels by 
means of images of varying resolutions. The main 
advantage of this technique resides in its high speed. 
Larger structures such as main arteries are extracted 
by segmenting low resolution images, whereas smaller 
structures are obtained from high resolution images. 
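
As a minimal sketch of how the images of varying resolutions could be produced, the following builds a resolution pyramid by averaging 2x2 blocks. The averaging scheme and the names downsample and build_pyramid are assumptions made for illustration; the text does not specify how the lower resolutions are obtained.

```python
def downsample(image):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def build_pyramid(image, levels):
    """Return [full resolution, half, quarter, ...] with at most `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        current = pyramid[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break  # cannot halve any further
        pyramid.append(downsample(current))
    return pyramid

image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 4, 4],
    [4, 4, 4, 4],
]
pyramid = build_pyramid(image, 3)
```

A segmentation routine would then be run on the coarse levels to find the main arteries and on the fine levels to recover the smaller branches.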

Methods Based on Skeletons 

The purpose of these methods is to obtain a skeleton 
of the coronary tree: a structure of smaller dimen- 
sions than the original that preserves the topological 
properties and the general shape of the detected object. 
Skeletons based on curves are generally used to recon- 
struct vascular structures (Nystrom, Sanniti di Baja & 
Svensson, 2001). Skeletonizing algorithms are also 
called "thinning algorithms". 

The first step of the process is to detect the central 
axis of the vessels, or "centerline". This axis is an 
imaginary line that runs along the middle of each vessel, 
i.e. a segment drawn perpendicular to the axis reaches 
the two edges of the vessel at the same distance on 
either side. The set of these lines constitutes 
the skeleton of the coronary tree. The methods that are 
used to detect the central axes can be classified into 
three categories: 

Methods Based on Crests 

One of the first methods to segment angiographic images 
on the basis of crests was proposed by Guo and 
Richardson (Guo & Richardson, 1998). This method 
treats angiographies as topographic maps in which 
the detected crests constitute the central axes of the 
vessels. 

The image is preprocessed by means of a median 
filter and smoothed with non-linear diffusion. The 
region of interest is then selected through thresholding, 
a process that eliminates the crests that do not correspond 
with the central axes. Finally, the candidate central axes 
are joined with curve relaxation techniques. 

Methods Based on Region Growing 

Taking a known point as seed point, these techniques 
segment images through the incremental inclusion of 
pixels in a region on the basis of an a priori established 
criterion. There are two especially important criteria: 
similarity of value and spatial proximity (Jain, 
Kasturi & Schunck, 1995). Pixels that are sufficiently 
close to others with similar grey levels are assumed to 
belong to the same object. The main disadvantage of 
this method is that it requires the intervention of the 
user to determine the seed points. 
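
The two criteria described above (similarity of value and spatial proximity) can be illustrated with a short sketch. It assumes a grey-level image stored as a list of rows, a user-supplied seed, 4-connected neighbours, and a fixed tolerance around the grey level of the seed; the name region_grow and the tolerance parameter are hypothetical.

```python
from collections import deque

def region_grow(image, seed, tolerance):
    """Grow a region from `seed`, adding 4-connected pixels whose grey level
    differs from the seed value by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and (ny, nx) not in region
                    and abs(image[ny][nx] - seed_value) <= tolerance):
                region.add((ny, nx))  # spatially adjacent and similar in value
                queue.append((ny, nx))
    return region

image = [
    [200, 200, 50, 200, 200],
    [200, 60, 55, 200, 40],
    [200, 200, 50, 200, 200],
    [200, 200, 45, 200, 200],
]
vessel = region_grow(image, seed=(0, 2), tolerance=15)
```

The pixel at (1, 4) has a vessel-like grey level but is not reached, because it is not connected to the seed; this is the spatial proximity criterion at work.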

O'Brien and Ezquerra (O'Brien & Ezquerra, 1994) 
propose the automatic extraction of the coronary vessels 
in angiograms on the basis of temporal, spatial, 
and structural constraints. The algorithm starts with 
a low-pass filter and the user's definition of a seed 
point. The system then extracts the central axes 
by means of the "globe test" mechanism, after which 
the detected regions are linked by means of graph 
theory. The applied test also makes it possible to discard 
the regions that are detected incorrectly and do not belong 
to the vascular tree. 
Methods Based on Differential Geometry 

The methods that are based on differential geometry 
treat images as hypersurfaces and extract their features 
using curvature and surface crests. The crest points of 
the hypersurface correspond to the central axis of 
the structure of a vessel. This method can be applied 
to two-dimensional as well as three-dimensional images; 
angiograms are two-dimensional images and are therefore 
modelled as three-dimensional hypersurfaces. 

Examples of reconstructions can be found in Prinet 
et al (Prinet, Mona & Rocchisani, 1995), who treat the 
images as parametric surfaces and extract their features 
by means of surfaces and crests. 

Matching Filter Methods 

The matching filter approach convolves the 
image with multiple matching filters so as to 
extract the regions of interest. The filters are designed 
to detect different sizes and orientations. 
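
A minimal sketch of this idea follows: a small bank of oriented kernels is applied at one pixel and the orientation with the strongest response is kept. The kernel shape (a zero-mean Gaussian cross-profile around a straight line), the four angles, and the names line_kernel, response and best_orientation are assumptions chosen for illustration; they do not reproduce the filters of any of the cited authors. Vessels are assumed to be brighter than the background here.

```python
import math

def line_kernel(angle_deg, size=5, sigma=1.0):
    """Zero-mean kernel with a Gaussian cross-profile around a line through
    the centre at the given angle."""
    half = size // 2
    theta = math.radians(angle_deg)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Perpendicular distance from (x, y) to the line of direction theta.
            dist = -x * math.sin(theta) + y * math.cos(theta)
            row.append(math.exp(-dist * dist / (2 * sigma * sigma)))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def response(image, kernel, r, c):
    """Correlate the kernel with the image at pixel (r, c); pixels outside
    the image are treated as zero."""
    half = len(kernel) // 2
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            y, x = r + dy, c + dx
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += kernel[dy + half][dx + half] * image[y][x]
    return total

def best_orientation(image, r, c, angles=(0, 45, 90, 135)):
    """Angle (in degrees) whose kernel responds most strongly at (r, c)."""
    scores = {a: response(image, line_kernel(a), r, c) for a in angles}
    return max(scores, key=scores.get)

horizontal = [[1 if r == 3 else 0 for c in range(7)] for r in range(7)]
vertical = [[1 if c == 3 else 0 for c in range(7)] for r in range(7)]
```

Filters for different vessel widths could be obtained in the same way by varying sigma.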

Poli and Valli (Poli & Valli, 1997) apply this 
technique with an algorithm that defines a series of 
multiorientation linear filters that are obtained as linear 
combinations of Gaussian "kernels". These filters are 
sensitive to different vessel widths and orientations. 

Mao et al. (Mao, Ruan, Bruno, Toumoulin, Collorec 
& Haigron, 1992) also use this type of filter in 
an algorithm based on visual perception models, which 
state that the relevant parts of the objects in noisy 
images normally appear grouped. 

Mathematical Morphology Methods 

Mathematical morphology defines a series of operators 
that apply structuring elements to the images so that 
their morphological features can be preserved and 
irrelevant elements eliminated. The main morphological 
operations are the following: 

Dilation: Expands objects, fills up empty spaces, 
and connects disjoint regions. 

Erosion: Contracts objects, separates regions. 

Closing: Dilation followed by erosion. 

Opening: Erosion followed by dilation. 

"Top hat" transformation: Extracts the structures 
with a linear shape. 

"Watershed" transformation: "Floods" the 
image, which is taken as a topographic map, and 
extracts the parts that are not "flooded". 
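
The first four operations in the list can be sketched on a binary mask as follows. This is a simplified illustration that assumes a 3x3 square structuring element and a fixed border convention (pixels outside the image count as background for dilation and as foreground for erosion); the helper name _apply is hypothetical.

```python
def _apply(mask, combine, pad):
    """Apply `combine` (min or max) over each 3x3 neighbourhood; pixels
    outside the image take the value `pad`."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[r + dr][c + dc]
                if (0 <= r + dr < rows and 0 <= c + dc < cols) else pad
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            ]
            out[r][c] = combine(window)
    return out

def dilate(mask):
    """A pixel becomes 1 if any pixel in its neighbourhood is 1."""
    return _apply(mask, max, 0)

def erode(mask):
    """A pixel stays 1 only if its whole neighbourhood is 1."""
    return _apply(mask, min, 1)

def opening(mask):
    """Erosion followed by dilation: removes small isolated objects."""
    return dilate(erode(mask))

def closing(mask):
    """Dilation followed by erosion: fills small holes and gaps."""
    return erode(dilate(mask))

noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
holed = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
```

Opening the first mask removes the isolated pixel while keeping the 3x3 block, and closing the second mask fills its central hole.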

Eiho and Qian (Eiho & Qian, 1997) use a purely 
morphological approach to define an algorithm that 
consists of the following steps: 

1. Application of the "top hat" operator to emphasize 
the vessels 

2. Erosion to eliminate the areas that do not 
correspond to vessels 

3. Extraction of the tree from a point provided by 
the user and on the basis of grey levels 

4. Thinning of the tree 

5. Extraction of edges through "watershed" 
transformation 



MODEL-BASED TECHNIQUES 

These approaches use explicit vessel models to extract 
the vascular tree. They can be divided into four 
categories: deformable models, parametric models, template 
matching models, and generalized cylinders. 

Figure 2. Morphological operators applied to an angiography 

Deformable Models 

Strategies based on deformable models can be classified 
in terms of the work by McInerney and Terzopoulos 
(McInerney & Terzopoulos, 1997). 

Algorithms that use deformable models (Merle, 
Finet, Lienard, & Magnin, 1997) are based on the 
progressive refining of an initial skeleton built with 
curves from a series of reference points: 

Root points: Starting points for the coronary tree. 

Bifurcation points: Points where a main branch 
divides into a secondary branch. 

End points: Points where a tree branch ends. 

These points have to be marked manually. 

Deformable Parametric Models: 
Active Contours 

These models use a set of parametric curves that 
adjust to the object's edges and are modified by both 
external forces, which drive the deformation, and internal 
forces, which resist change. The active contour models, 
or "snakes", are a special case of a more 
general technique that aims to fit deformable 
models by minimizing energy. 
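
The text does not state the energy that is minimized. As a point of reference only, the form commonly given for active contours in the literature is shown below; the symbols are introduced here for illustration: v(s) is the parametric curve, alpha and beta weight the internal (stretching and bending) terms, and E_ext is the external energy derived from the image.

```latex
E_{snake} = \int_{0}^{1} \left[ \tfrac{1}{2}\left( \alpha\,|v'(s)|^{2} + \beta\,|v''(s)|^{2} \right) + E_{ext}\big(v(s)\big) \right] ds
```

The first two terms play the role of the internal forces that resist change, while the external term attracts the curve towards the edges of the object.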

Klein et al. (Klein, Lee & Amini, 1997) propose 
an algorithm that uses "snakes" for 4D reconstruction: 
they trace the position of each point of the central axis 
of a skeleton in a sequence of angiograms. 

Deformable Geometric Models 

These models are based on topographic models that are 
adapted for shape recognition. Malladi et al. (Malladi, 
Sethian & Vemuri, 1995), for instance, adapt the "Level 
Set Method" (LSM) by representing an edge as the zero 
level set of a higher-dimensional hypersurface; the model 
evolves to reduce a metric defined by edge and 
curvature constraints, but less rigidly than in the case of 
the "snakes". This edge, which constitutes the zero level 
of the hypersurface, evolves by adjusting to the edges 
of the vessels, which are the structures to be detected. 



Propagation Methods 

Quek and Kirbas (Quek & Kirbas, 2001) developed 
a system of wave propagation combined with a back- 
tracking mechanism to extract the vessels from an- 
giographic images. This method basically labels each 
pixel according to its likelihood of belonging to a vessel 
and then propagates a wave through the pixels that are 
labeled as belonging to the vessel; it is this wave that 
definitively extracts the vessels according to the local 
features it encounters. 

Approaches Based on Deformable Template Matching 

This approach tries to recognize structural models 
(templates) in an image by using a template as context, 
i.e. as an a priori model. This template is generally 
represented as a set of nodes connected by segments. The 
initial structure is deformed until it adjusts optimally 
to the structures that were observed in the image. 

Petrocelli et al. (Petrocelli, Manbeck, & Elion, 1993) 
describe a method based on deformable templates that 
also incorporates additional previous knowledge into 
the deformation process. 

Parametric Models 

These models are based on the a priori knowledge 
of the artery's shape and are used to build models 
whose parameters depend on the profiles of the entire 
vessel; as such, they consider the global information 
of the artery instead of merely the local information. 
The value of these parameters is established after a 
learning process. 

The literature shows the use of models with circular 
sections (Shmueli, Brody, & Macovski, 1983) and 
elliptical sections (Pappas & Lim, 1984), because various 
studies by Brown and colleagues (Brown, Bolson, Frimer, 
& Dodge, 1977; Brown, Bolson, Frimer & Dodge, 1982) show 
that sections of healthy arteries tend to be circular and 
sections with stenosis are usually elliptical. However, 
both circular and elliptical shapes fail to approximate the 
irregular shapes caused by pathologies or bifurcations. 

This model has been applied to the reconstruction 
of vascular structures from two angiograms (Pellot, 
Herment, Sigelle, Horain, Maitre & Peronneau, 1994), 
where both healthy and stenotic sections are modeled 
by means of ellipses. This model is subsequently 
deformed until it corresponds to the shape associated 
with the birth of a new branch or a pathology. 
Figure 3. "Snakes" applied to a blood vessel (http://vislab.cs.vt.edu/review/extraction.html) 









Generalized Cylinder Models 

A generalized cylinder (GC) is a solid whose central 
axis is a 3D curve. Each point of that axis has a limited 
and closed section that is perpendicular to it. A GC is 
therefore defined in space by a spatial curve or axis 
and a function that defines the section along that axis. 
The section is usually an ellipse. Technically, GCs should 
be included in the parametric methods section, but the 
work that has been done in this field is so extensive that 
it deserves its own category. 

The construction of the coronary tree model requires 
only a single view to build the 2D tree and estimate the 
sections. However, a single view provides no information 
on the depth or the area of the sections, so a second 
projection is required. 



ARTERIAL TRACKING 

Contrary to the approaches based on pattern recognition, 
where local operators are applied to the entire image, 
techniques based on arterial tracking apply 
local operators to an area that presumably 
belongs to a vessel and follow it along its length. From a 
given starting point the operators detect the central 
axis and, by analyzing the pixels that are orthogonal 
to the tracking direction, the vessel's edges. There are 
various methods to determine the central axis and the 
edges: some methods carry out a sequential tracking 
and incorporate connectivity information after a 
simple edge detection operation, whereas other methods 
use this information to sequentially track the contours. 
There are also approaches based on the intensity of 
the crests, on fuzzy sets, or on graph 
representations, where the purpose lies in finding the 
optimal path in the graph that represents the image. 

Figure 4. Tracking applied to an angiography 

Lu and Eiho (Lu & Eiho, 1993) have described a 
tracking algorithm for the vascular edges in 
angiographies that takes branches into account and 
consists of three steps: 

1. Edge detection 

2. Branch search 

3. Tracking of sequential contours 

The user must provide the starting point, the 
direction, and the search range. The edge points are 
evaluated with a differential smoothing operator along a 
line that is perpendicular to the direction of the vessel. 
This operator also serves to detect the branches. 
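
The edge evaluation along a line perpendicular to the vessel can be sketched in one dimension. The example below is only an illustration and not the operator of the cited authors: it assumes that the grey levels have already been sampled along the perpendicular line, that the vessel is darker than the background, and that the smoothed derivative is a difference of two-sample means. The names smoothed_gradient and vessel_edges are hypothetical.

```python
def smoothed_gradient(profile):
    """Difference between the mean of the two samples to the right and the
    mean of the two samples to the left of each position."""
    grads = [0.0] * len(profile)
    for i in range(2, len(profile) - 2):
        right = (profile[i + 1] + profile[i + 2]) / 2.0
        left = (profile[i - 1] + profile[i - 2]) / 2.0
        grads[i] = right - left
    return grads

def vessel_edges(profile):
    """Indices of the steepest fall and the steepest rise along the profile,
    taken as the two edges of a dark vessel on a bright background."""
    grads = smoothed_gradient(profile)
    left = min(range(len(grads)), key=lambda i: grads[i])
    right = max(range(len(grads)), key=lambda i: grads[i])
    return left, right

# Grey levels sampled across a vessel that is four pixels wide.
profile = [200, 200, 200, 200, 60, 60, 60, 60, 200, 200, 200, 200]
left, right = vessel_edges(profile)
width = right - left
```

A tracking algorithm would repeat this measurement at successive positions along the vessel, using the midpoint between the two edges as the next point of the central axis.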



TECHNIQUES BASED ON ARTIFICIAL 
INTELLIGENCE 

Approaches based on Artificial Intelligence use 
high-level knowledge to guide the segmentation and 
delineation of vascular structures and sometimes use 
different types of knowledge from various sources. 

One possibility (Smets, Verbeeck, Suetens, & 
Oosterlinck, 1988) is to use rules that codify knowledge 
on the morphology of blood vessels; these rules 
are then used to formulate a hierarchy with which to 
create the model. This type of system does not offer 
good results in arterial bifurcations or in arteries 
with occlusions. 

Another approach (Stansfield, 1986) consists in 
formulating a rule-based Expert System to identify 
the arteries. During the first phase, the image is 
processed without making use of domain knowledge to 
extract segments of the vessels. It is only in the second 
phase that domain knowledge on cardiac anatomy and 
physiology is applied. 

The latter approach is more robust than the former, 
but it has the inconvenience of not combining all 
the segments into one vascular structure. 



FUTURE TRENDS 

It cannot be said that one technique has a more promising 
future than another, but the current tendency is to move 
away from the abovementioned classical segmentation 
algorithms towards 3D and even 4D reconstructions 
of the coronary tree. 

Other lines of research focus on obtaining angiographic 
images by means of new acquisition technologies 
such as Magnetic Resonance, High-Speed Computerized 
Tomography, or two-armed angiograph devices 
that achieve two simultaneous projections in 
combination with the use of intravascular ultrasound 
devices. This type of acquisition simplifies the creation 
of three-dimensional structures, either directly from the 
acquisition or after a simple processing of the 
two-dimensional images. 
KEY TERMS 

Angiography: Image of blood vessels obtained by 
any possible procedure. 

Artery: Each of the vessels that carry blood from 
the heart to the other parts of the body. 

Computerized Tomography: X-ray examination 
that produces detailed images of axial sections of the 
body. A CT scanner obtains many images by rotating around 
the body. A computer combines all these images into a 
final image that represents a cross-section of the body, 
like a slice. 

Expert System: Computer or computer program 
that can give responses that are similar to those of an 
expert. 

Segmentation: In computer vision, segmentation 
refers to the process of partitioning a digital image into 
multiple regions. The goal of segmentation is to sim- 
plify and/or change the representation of an image into 
something that is more meaningful and easier to analyze. 
Image segmentation is typically used to locate objects 
and boundaries (structures) in images, in this case, the 
coronary tree in digital angiography frames. 

Stenosis: A stenosis is an abnormal narrowing 
in a blood vessel or other tubular organ or structure. 
A coronary artery that's constricted or narrowed is 
called stenosed. Buildup of fat, cholesterol and other 
substances over time may clog the artery. Many heart 
attacks are caused by a complete blockage of a vessel 
in the heart, called a coronary artery. 

Thresholding: A technique for the processing of 
digital images that consists in applying a certain prop- 
erty or operation to those pixels whose intensity value 
exceeds a defined threshold. 
INTRODUCTION 

Artificial Intelligence (AI) mechanisms are more 
and more frequently applied to all sorts of civil 
engineering problems. New methods and algorithms 
which allow civil engineers to use these techniques 
in a different way on diverse problems are available 
or being made available. One AI technique stands 
out over the rest: Artificial Neural Networks (ANN). 
Their most remarkable traits are their ability to learn, 
their capacity for generalization and their tolerance 
of errors. These characteristics make their 
use viable and cost-efficient in any field in general, 
and in Structural Engineering in particular. The most 
widely used construction material nowadays is concrete, 
mainly because of its high strength and its adaptability 
to formwork during its fabrication process. Throughout 
this chapter we will find different applications of ANNs 
to structural concrete. 

Artificial Neural Networks 

Warren McCulloch and Walter Pitts are credited with the 
origin of Artificial Neural Networks in the 1940s, since 
they were the first to design an artificial neuron 
(McCulloch & Pitts, 1943). They proposed a binary (active 
or inactive) neuron model with a fixed threshold which 
must be surpassed for it to change state. Some of the 
concepts they introduced remain useful today. 

Artificial Neural Networks intend to simulate 
the properties found in biological neural systems 
through mathematical models by way of artificial 
mechanisms. A neuron is considered a formal element, 
or module, or basic network unit which receives 
information from other modules or the environment; it 
then integrates and computes this information to emit 
a single output which will be identically transmitted to 
multiple subsequent neurons (Wasserman, 1989). 

The output of an artificial neuron is determined by 
its propagation or excitation, activation and transfer 
functions. 

The propagation function is generally the 
summation of each input multiplied by the weight of 
its interconnection (net value): 



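$$n_i = \sum_{j=0}^{N-1} w_{ij}\, x_j \qquad (1)$$

where x_j is the j-th input of neuron i, w_ij is the 
weight of the corresponding interconnection, and N is 
the number of inputs. 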



The activation function modifies the latter, relating 
the neural input to the next activation state. 



$$a_i(t) = FA\left[a_i(t-1),\, n_i(t-1)\right] \qquad (2)$$



The transfer function is applied to the result of the 
activation function. It is used to bound the neuron's 
output and is generally given by the interpretation 
intended for the output. Some of the most commonly 
used transfer functions are the sigmoid (to obtain values 
in the [0,1] interval) and the hyperbolic tangent (to 
obtain values in the [-1,1] interval). 



$$out_i = FT\left(a_i(t)\right) \qquad (3)$$
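
The chain of propagation, activation and transfer functions described in this section can be sketched for a single neuron as follows. The sketch assumes that the activation function is the identity, so that the activation equals the net value, which is a common simplification and not something stated in the text; the function names are hypothetical.

```python
import math

def net_input(weights, inputs):
    """Propagation function: sum of each input times its connection weight."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(a):
    """Transfer function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(weights, inputs, transfer=sigmoid):
    """Output of one neuron. Assumption: the activation function is the
    identity, so the transfer function is applied directly to the net value."""
    activation = net_input(weights, inputs)
    return transfer(activation)

y_sigmoid = neuron_output([0.5, -0.5], [1.0, 1.0])
y_tanh = neuron_output([0.5, 0.25], [1.0, 2.0], transfer=math.tanh)
```

Passing math.tanh as the transfer function gives outputs between -1 and 1 instead, as described above for the hyperbolic tangent.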



Once each element in the process is defined, the type 
of network (network topology) to use must be designed. 
Topologies can be divided into feed-forward networks, 
where information moves in one direction only (from input 
to output), and networks with partial or total feedback, 
where information can flow in any direction. 

Finally, learning rules and training type must be 
defined. Learning rules are divided into supervised and 
non-supervised (Brown & Harris, 1994) (Lin & Lee, 
1996), and the latter are further divided into 
self-organizing learning 
and reinforcement learning (Hoskins & Himmelblau, 
1992). The type of training will be determined by the 
type of learning chosen. 

An Introduction to Concrete (Material 
and Structure) 

Structural concrete is a construction material created 
from the mixture of cement, water, aggregates and 
additions or admixtures with diverse functions. The goal 
is to create a material with rock-like appearance, with 
sufficient compressive strength and the ability to adopt 
adequate structural shapes. Concrete is moldable during 
its preparation phase, once the components have been mixed 
together to produce a fluid mass which conveniently 
fills the cavities of a mould called formwork. After 
a few hours, concrete hardens thanks to the chemical 
hydration reaction undergone by the cement, generating 
a paste which envelops the aggregates and gives the 
ensemble the appearance of an artificial rock somewhat 
similar to a conglomerate. 

Hardened concrete offers good compressive 
strength, but very low tensile strength. This is why 
structures created with this material must be reinforced 
by use of steel rebars, configured by rods which are 
placed (before pouring the concrete) along the lines 
where calculation predicts the highest tensile stresses. 
Cracking, which reduces the durability of the structure, 
is thus hindered, and sufficient resistance is guaranteed 
with a very low probability of failure. The entirety 
formed by concrete and rebar is referred to as Structural 
Concrete (Shah, 1993). 

Two phases thus characterize the evolution of 
concrete in time. In the first phase, concrete must be 
fluid enough to ensure ease of placement, and a time 
to initial set long enough to allow transportation from 
plant to worksite. Flowability depends basically on 
the type and quantity of the ingredients in the mixture. 
Special chemical admixtures (such as plasticizers and 
superplasticizers) guarantee flowability without grossly 
increasing the amount of water, whose ratio relative to 
the amount of cement (water/cement ratio, w/c) is 
inversely related to the strength attained. The science of 
rheology deals with the study of the behavior of fresh 
concrete. A variety of tests can be used to determine 
flowability of fresh concrete, the most popular amongst 
them being the Abrams cone (Abrams, 1922) or slump 
cone test (Domone, 1998). 

The second phase (and longest over time) is the 
hardened phase of concrete, which determines the 
behavior of the structure it gives shape to, from the point 
of view of serviceability (by imposing limitations on 
cracking and compliance) and resistance to failure (by 
imposing limitations on the minimal loads that can be 
resisted, as compared to the internal forces produced by 
external loading), always within the frame of sufficient 
durability for the service life foreseen. 

The study of structural concrete from every 
point of view has been undertaken following many 
different approaches. The experimental path has been very 
productive, generating over the past 50 years a database 
(with a tendency to scatter) which has been used to 
validate studies carried out along the second and third 
paths that follow. The analytical path also constitutes 
a fundamental tool to approach concrete behavior, 
both from the material and structural point of view. 
Development of theoretical behavior models goes back 
to the early 20th century, and theoretical equations 
developed since have been corrected through testing 
(as mentioned above) before becoming a part of codes 
and specifications. This method of analysis has been 
reinforced with the development of numerical methods 
and computational systems, capable of solving a great 
number of simultaneous equations. In particular, the 
Finite Element Method (and other methods in the same 
family) and optimization techniques have brought 
a remarkable capacity to approximate behavior of 
structural concrete, having their results benchmarked in 
many applications by the aforementioned experimental 
testing. 

Three basic lines of study are thus available. Being 
complementary to one another, they have played 
a decisive role in the production of national and 
international codes and rules which guide or regulate 
the design, execution and maintenance of structural 
concrete works. Concrete is a complex material, which 
presents a number of problems for analytical study, and 
so is an adequate field for the development of analysis 
techniques based on neural networks (Gonzalez, 
Martinez and Carro, 2006). 
Application of Artificial Neural Networks to 
problems in the field of structural concrete has unfolded 
in the past few years in two ways. On the one hand, 
analytical and structural optimization systems faster 
than traditional (usually iterative) methods have been 
generated starting from expressions and calculation 
rules. On the other, the numerous databases created 
from the large amount of tests published in the scientific 
community have allowed for the development of very 
powerful ANNs which have thrown light on various 
complex phenomena. In a few cases, specific design 
codes have been improved through the use of these 
techniques; some examples follow. 

Application of Artificial Neural Networks 
to Optimization Problems 

Design of concrete structures is based on the 
determination of two basic parameters: member 
thickness (effective depth d, depth of a beam or slab 
section measured from the compression face to the 
centroid of reinforcement) and amount of reinforcement 
(established as the total area A_s of steel in a section, 
materialized as rebars, or the reinforcement ratio, 
the ratio between steel area and concrete area in the 
section). Calculation methods are iterative, since a 
large number of conditions must be verified in the 
structure, and the aforementioned parameters are 
fixed as a function of three basic conditions which are 
sequentially followed: structural safety, maximum 
ductility at failure and minimal cost. Design rules, 
expressed through equations, allow for a first solution 
which is corrected to meet all calculation scenarios, 
finally converging when the difference between input 
and output parameters is negligible. 

In some cases it is possible to develop optimization 
algorithms, whose analytical formulation opens the way 
to the generation of a database. Hadi (Hadi, 2003) has 
performed this work for simply supported reinforced 
concrete beams, and the expressions obtained after 
the optimization process determine the parameters 
specified above, while simultaneously assigning the cost 
associated to the optimal solution (related to the cost 
of materials and formwork). With these expressions, 
Hadi develops a database with the following variables: 
applied flexural moment (M), compressive strength 
of concrete (f_c), steel strength (f_y), section width (b), 
section depth (h), and unit costs of concrete (C_c), steel 
(C_s) and formwork (C_f). 



Network parameters used are as follows. The number 
of training samples is 550; number of input layer neurons 
is 8; number of hidden layer neurons is 10; number of 
output layer neurons is 4; type of backpropagation is 
Levenberg-Marquardt backpropagation; activation 
function is sigmoidal function; learning rate is 0.01; 
number of epochs is 3000; sum-square error achieved 
is 0.08. The network was tested with 50 samples 
and yielded an average error of 6.1%. 
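
To make the reported layer sizes concrete, the following sketch runs a forward pass through a network with 8 inputs, 10 hidden neurons and 4 outputs, using sigmoidal units. It is not Hadi's network: the weights are random and untrained, a bias term per neuron is assumed, the input vector is a placeholder for normalized variables, and no training algorithm (such as Levenberg-Marquardt) is implemented. All names are hypothetical.

```python
import math
import random

def make_layer(n_inputs, n_neurons, rng):
    """One weight row per neuron; the last entry of each row is a bias
    (the bias is an assumption, the text does not mention it)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]

def forward_layer(layer, inputs):
    """Weighted sum plus bias, followed by a sigmoid, for every neuron."""
    outputs = []
    for weights in layer:
        net = sum(w * x for w, x in zip(weights[:-1], inputs)) + weights[-1]
        outputs.append(1.0 / (1.0 + math.exp(-net)))
    return outputs

def forward(network, inputs):
    """Feed the signal through the layers in order (feed-forward topology)."""
    for layer in network:
        inputs = forward_layer(layer, inputs)
    return inputs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [make_layer(8, 10, rng), make_layer(10, 4, rng)]
outputs = forward(network, [0.5] * 8)  # placeholder normalized inputs
```

With sigmoidal output units, the four outputs lie between 0 and 1, so the target quantities would have to be scaled to that range before training.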

Hadi studies various factors when choosing network 
architecture and backpropagation algorithm type. When 
two layers of hidden neurons are used, precision is not 
improved while computation time is increased. The 
number of samples depends on the complexity of the 
problem and the number of input and output parameters. 
If a value is fixed for the input costs, there are no 
noticeable precision improvements between training 
the network with 200 or 1000 samples. When costs are 
introduced as input parameters, 100 samples are not 
enough to achieve convergence in training. Finally, the 
training algorithm is also checked, studying the range 
between pure backpropagation (too slow for training), 
backpropagation with momentum and with adaptive 
learning, backpropagation with Levenberg-Marquardt 
updating rule and fast learning backpropagation. The 
latter is finally retained since it requires less time to 
get the network to converge while providing very good 
results (Demuth & Beale, 1995). 

Application of Artificial Neural Networks 
to Prediction of Concrete Physical 
Parameters Measurable Through 
Testing: Concrete Strength and 
Consistency 

Other neural network applications are supported by 
large experimental databases, created through years of 
research, which allow for the prediction of phenomena 
with complex analytical formulation. 

One of these cases is the determination of two basic 
concrete parameters: its workability when mixed, 
necessary for ease of placement in concrete, and its 
compressive strength once hardened, which is basic 
to the evaluation of the capacity of the structure. 
The variables that necessarily determine these two 
parameters are the components of concrete: amounts of 
cement, water, fine aggregate (sand), coarse aggregate 
(small gravel and large gravel), and other components 
such as pozzolanic additions (which bring soundness 
and delayed strength increase, especially in the case of 
fly ash and silica fume) and admixtures (which fluidify 
the fresh mixture allowing the use of reduced amounts of 
water). There are still no analytical or numerical models 
that faithfully predict fresh concrete consistency (related 
to flowability, and usually evaluated by the slump 
of a molded concrete cone) or compressive strength 
(determined by crushing of prismatic specimens in a 
press). 

Öztaş et al. (Öztaş, Pala, Özbay, Kanca, Çağlar & 
Bati, 2006) have developed a neural network from 187 
concrete mixes, for which all parameters are known, using 
169 of them for training and 18, randomly selected, for 
verification. Database variables are sometimes taken as 
a ratio between them, since there is available knowledge 
about the dependency of slump and strength on such 
parameters. The established range for the 7 parameter 
set is shown in Table 1. 

The network architecture consists of 7 input 
neurons and two hidden layers of 5 and 3 neurons, 
respectively. 

The back-propagation learning algorithm was used in a 
feed-forward network with two hidden layers. The learning 
algorithm used in the study is the scaled conjugate gradient 
algorithm (SCGA), the activation function is sigmoidal, 
and the number of epochs is 10,000. The prediction 
capacity of the network is better in the "Compressive 
Strength" output (maximum error of 6%) than in the 



Table 1. Input parameter range 

Input parameters        Minimum    Maximum 
W/B (ratio, %) (a)      18         45 
W (kg/m³) (b)           140        165 
s/a (ratio, %) (c)      35         52 
FA (ratio, %) (d)                  20 
AE (kg/m³) (e)          0.036      0.078 
SF (ratio, %) (f)       5          25 
SP (kg/m³) (g)          1.89       36.5 



(a) [Water]/[binder] ratio, considering binder as the lump sum of 
cement, fly ash and silica fume 

(b) Amount of water 

(c) [Amount of sand]/[Total aggregate (sand+small gravel+large 
gravel)] 

(d) Percentage of cement substituted by fly ash 

(e) Amount of air-entraining agent 

(f) Percentage of cement substituted by silica fume 

(g) Amount of superplasticizer 



"Slump" output (errors up to 25%). This is due to the 
fact that the relation between the chosen variables and 
strength is much stronger than in the case of slump, 
which is influenced by other variables not considered 
(e.g. type and power of concrete mixer, mixing order 
of components, aggregate moisture) and by the method 
for measuring consistency, whose adequacy for 
the particular type of concrete used in the database is 
questioned by some authors. 
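As an illustration of how the seven inputs of Table 1 relate to a mix design, the following sketch derives them from raw quantities and checks them against the tabulated ranges. The function names and the sample mix are hypothetical, and the fly ash and silica fume ratios are assumed here to be percentages of total binder mass, which is one possible reading of footnotes (d) and (f).

```python
# Ranges from Table 1; the FA minimum is not legible in the source, so only
# its maximum is checked (None means "no lower bound available").
RANGES = {
    "W/B": (18.0, 45.0),
    "W": (140.0, 165.0),
    "s/a": (35.0, 52.0),
    "FA": (None, 20.0),
    "AE": (0.036, 0.078),
    "SF": (5.0, 25.0),
    "SP": (1.89, 36.5),
}


def mix_inputs(cement, fly_ash, silica_fume, water, sand,
               small_gravel, large_gravel, air_entrainer, superplasticizer):
    """Derive the seven network inputs from quantities given in kg/m3."""
    binder = cement + fly_ash + silica_fume            # footnote (a)
    aggregate = sand + small_gravel + large_gravel     # footnote (c)
    return {
        "W/B": 100.0 * water / binder,
        "W": water,
        "s/a": 100.0 * sand / aggregate,
        "FA": 100.0 * fly_ash / binder,       # assumption: share of binder
        "AE": air_entrainer,
        "SF": 100.0 * silica_fume / binder,   # assumption: share of binder
        "SP": superplasticizer,
    }


def out_of_range(inputs):
    """Names of inputs that fall outside the range the network was trained on."""
    flagged = []
    for name, value in inputs.items():
        low, high = RANGES[name]
        if (low is not None and value < low) or value > high:
            flagged.append(name)
    return flagged


# Hypothetical mix, not taken from the database.
mix = dict(cement=350.0, fly_ash=50.0, silica_fume=50.0, water=150.0,
           sand=700.0, small_gravel=500.0, large_gravel=500.0,
           air_entrainer=0.05, superplasticizer=10.0)
inputs = mix_inputs(**mix)
wet_inputs = mix_inputs(**{**mix, "water": 200.0})
print(out_of_range(inputs), out_of_range(wet_inputs))
```

A range check of this kind matters because a network trained on these 187 mixes cannot be expected to extrapolate reliably outside them.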

Application of Artificial Neural Networks 
to the Development of Design Formulae 
and Codes 

The last application presented in this paper is the 
response analysis to shear forces in concrete beams. 
These forces generate transverse tensile stresses in 
concrete beams which require placement of rebars 
perpendicular to the beam axis, known as hoops or 
ties. Analytical determination of failure load from the 
variables that intervene in this problem is very complex, 
and in general most of the formulae used today are based 
on experimental interpolations with no dimensional 
consistency. Cladera and Mari (Cladera & Mari, 2004) 
have studied the problem through laboratory testing, 
developing a neural network for the strength analysis 
of beams with no shear reinforcement. They rely on a 
database compiled by Bentz (Bentz, 2000) and Kuchma 
(Kuchma, 2002), where the variables are effective depth 
(d), beam width (b, though introduced as d/b), shear 
span (a/d, see Figure 1), longitudinal reinforcement 
ratio (ρ_l = A_s/bd) and compressive strength of concrete 
(f_c). The failure load is provided for each of 
the 177 tests found in the database. They use 147 
tests to train the network and 30 for verification, on 
a one-layer architecture with 10 hidden neurons and 
a backpropagation learning mechanism. The ranges 



Table 2. Input parameter ranges 

Parameter       Minimum    Maximum 
d (mm)          101.6      1090 
d/b             0.37       7.17 
ρ_l (%)         0.50       6.64 
f_c (MPa)       14.7       101.8 
a/d             2.48       7.86 
Failure load    19.52      332.14 

Figure 1. Span loading a of a beam (Gonzalez, 2002) 



Table 3. Comparison between available codes and proposed equations for shear strength. 

Procedure            ACI 11-5  ACI 11-3  MC-90   EC-2    AASHTO  Eq.(7)  Eq.(8) 
Average              1.16      1.29      1.15    1.02    1.28    1.15    1.13 
Median               1.15      1.25      1.16    0.99    1.25    1.14    1.12 
Standard deviation   0.31      0.40      0.19    0.23    0.22    0.18    0.19 
CoV (%)              26.89     31.21     16.57   22.03   16.80   15.73   16.42 
Minimum              0.42      0.42      0.65    0.57    0.86    0.73    0.78 
Maximum              2.14      2.47      1.78    1.78    2.14    1.69    1.85 



for the variables are shown in Table 2. Almost 8000 
iterations were required to attain the best results. 

The adjustment provided by training presents an 
average ratio V_test/V_pred of 0.99, and 1.02 in validation. 
The authors have effectively created a laboratory 
with a neural network, in which they "test" (within 
the parameter range) new beams by changing exclusively 
one parameter each time. Finally, they come up with 
two alternative design formulae that noticeably improve 
on any formula developed up to that moment. Table 
3 presents a comparison between those two expressions 
(named Eq. 7 and Eq. 8) and others found in a series 
of international codes. 
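The rows of Table 3 are summary statistics of the ratio between tested and predicted failure load. A minimal sketch of how such a summary can be computed is given below; the four test/prediction pairs are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator the authors used.

```python
import statistics


def ratio_summary(v_test, v_pred):
    """Summarise V_test/V_pred with the same rows as Table 3."""
    ratios = [t / p for t, p in zip(v_test, v_pred)]
    mean = statistics.mean(ratios)
    std = statistics.stdev(ratios)  # assumption: sample standard deviation
    return {
        "average": mean,
        "median": statistics.median(ratios),
        "std": std,
        "cov_percent": 100.0 * std / mean,  # coefficient of variation
        "minimum": min(ratios),
        "maximum": max(ratios),
    }


# Hypothetical failure loads (tested) and formula predictions.
v_test = [100.0, 120.0, 90.0, 150.0]
v_pred = [100.0, 100.0, 100.0, 125.0]
summary = ratio_summary(v_test, v_pred)
for row, value in summary.items():
    print(f"{row:12s} {value:.3f}")
```

A ratio above 1 means the tested strength exceeds the prediction, so an average close to 1 combined with a low coefficient of variation and a minimum not far below 1 indicates a formula that is both accurate and consistent.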



CONCLUSION 

The field of structural concrete shows great 
potential for the application of neural networks. 
Successful approaches to optimization, prediction 
of complex physical parameters and design 
formulae development have been presented. 
The network topology used in most cases for 
structural concrete is feed-forward, multilayer with 
backpropagation, typically with one or two hidden 
layers. The most commonly used training algorithms 
are gradient descent with momentum and adaptive 
learning, and Levenberg-Marquardt. 
The biggest potential of ANNs is their capacity 
to generate virtual testing laboratories which can 
precisely substitute for expensive real laboratory 
tests within the proper range of values. A methodical 
"testing" program sheds light on the influence of 
the different variables in complex phenomena at 
reduced cost. 

The field of structural concrete can draw on 
extensive databases, generated through the years, 
that can be analyzed with this technique. An 
effort should be made to compile and homogenize 
these databases to extract the maximum possible 
knowledge, which has great influence on structural 
safety. 
KEY TERMS 

Compression: Stress generated by pressing or 
squeezing. 



Consistency: The relative mobility or ability of 
freshly mixed concrete or mortar to flow; the usual 
measurement for concrete is slump, equal to the 
subsidence measured to the nearest 1/4 in. (6 mm) of 
a molded specimen immediately after removal of the 
slump cone. 

Ductility: That property of a material by virtue of 
which it may undergo large permanent deformation 
without rupture. 

Formwork: Total system of support for freshly 
placed concrete including the mold or sheathing that 
contacts the concrete as well as supporting members, 
hardware, and necessary bracing; sometimes called 
shuttering in the UK. 

Shear Span: Distance between a reaction and the 
nearest load point. 
Structural Safety: Structural response stronger than 
the internal forces produced by external loading. 

Tension: Stress generated by stretching. 
INTRODUCTION 

Among all of the Artificial Intelligence techniques, 
Artificial Neural Networks (ANNs) have proven to be a 
very powerful tool (McCulloch & Pitts, 1943) (Haykin, 
1999). This technique is very versatile and has therefore 
been successfully applied to many different disciplines 
(classification, clustering, regression, modelling, 
etc.) (Rabunal & Dorado, 2005). 

However, one of the greatest problems when using 
ANNs is the great manual effort that their development 
requires. A big myth of ANNs is that they are easy to 
work with and that their development is almost 
automatic. This development process can be 
divided into two parts: architecture development, and 
training and validation. As the network architecture is 
problem-dependent, its design used to be performed 
manually, meaning that the expert had to test different 
architectures and train them until finding the one that 
achieved the best results after the training process. The 
manual nature of this process makes it slow, even though 
the training part is completely automated thanks to the 
existence of several algorithms that perform it. 

With the creation of Evolutionary Computation 
(EC) tools, researchers have worked on the application 
of these techniques to the development of algorithms 
for automatically creating and training ANNs, so that the 
whole process (or, at least, a great part of it) can be 
performed automatically by computers and therefore 
little human effort is required. 



BACKGROUND 

EC is a set of tools based on the imitation of the natural 
behaviour of living beings for solving optimization 
problems. One of the most typical subsets of tools inside 



EC is called Evolutionary Algorithms (EAs), which are 
based on natural evolution and its implementation on 
computers. All of these tools work with the same basis: 
a population of solutions to that particular problem is 
randomly created and an evolutionary process is applied 
to it. From this initial random population, the evolution is 
done by means of selection and combination of the best 
individuals (although the worst ones also have a small 
probability of being chosen) to create new solutions. 
This process is carried out by selection, crossover, and 
mutation operators. These operators mirror the mechanisms 
at work in biological evolution for adaptation and survival. 
After several generations, it is hoped that the population 
contains a good solution to the problem. 
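The generic loop described above (random initial population, selection, crossover, mutation, repeated over generations) can be sketched in a few lines. The example below is a toy illustration on bit strings whose fitness is simply the number of ones; the tournament selection, one-point crossover, elitism and all parameter values are illustrative choices, not prescribed by the text.

```python
import random


def evolve(fitness, n_bits=20, pop_size=30, generations=60, p_mut=0.02, seed=1):
    """Evolve bit strings to maximise `fitness`; returns the best individual."""
    rng = random.Random(seed)
    # Random initial population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select():
        # Binary tournament: the fitter of two random individuals wins, so
        # weak individuals are still chosen occasionally.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(pop, key=fitness)
        new_pop = [best[:]]  # elitism: keep a copy of the current best
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)


best_individual = evolve(sum)  # fitness = number of ones in the string
print(sum(best_individual), "ones out of", len(best_individual))
```

Only the fitness function is problem-specific; the same loop applies whenever candidate solutions can be written as strings of bits.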

The first EA to appear was Genetic Algorithms 
(GAs), in 1975 (Holland, 1975). Working as 
explained above, GAs use a binary codification (i.e., 
each solution is codified into a string of bits). Later, in 
the early 1990s, a new technique appeared, called Genetic 
Programming (GP). This one is based on the evolution 
of trees, i.e., each individual is codified as a tree instead 
of a binary string. This allows its application to a wider 
set of environments. 

Although GAs and GP are the two most used tech- 
niques in EAs, more tools can be classified as part 
of this world, such as Evolutionary Programming or 
Evolution Strategies, all of them with the same basis: 
the evolution of a population following the natural 
evolution rules. 



DEVELOPMENT OF ANNS WITH EC 
TOOLS 

The development of ANNs is a topic that has been 
extensively addressed with very diverse techniques. The 
world of evolutionary algorithms is no exception, 
and proof of that is the great number of works that have 
been published about different techniques in this area 
(Cantu-Paz & Kamath, 2005). These techniques follow 
the general strategy of an evolutionary algorithm: an 
initial population consisting of different genotypes, each 
one of them codifying different parameters (typically, 
the weights of the connections and/or the architecture 
of the network and/or the learning rules), is randomly 
created. This population is evaluated in order to 
determine the fitness of each individual. Afterwards, 
this population is repeatedly made to evolve by means 
of different genetic operators (replication, crossover, 
mutation, etc.) until a given termination criterion 
is fulfilled (for example, a sufficiently good individual 
is obtained, or a predetermined maximum number of 
generations is reached). 

Essentially, the ANN generation process by means 
of evolutionary algorithms is divided into three main 
groups: evolution of the weights, architectures, and 
learning rules. 

Evolution of Weights 

The evolution of the weights begins with a network with 
a predetermined topology. In this case, the problem is to 
establish, by means of training, the values of the network 
connection weights. This is generally conceived as a 
problem of minimization of the network error, taken, 
for example, as the result of the Mean Square Error of 
the network between the desired outputs and the ones 
achieved by the network. Most training algorithms, 
such as the backpropagation algorithm (BP) (Rumelhart, 
Hinton & Williams, 1986), are based on gradient 
minimization. This has several drawbacks (Whitley, 
Starkweather & Bogart, 1990), the most important being 
that quite frequently the algorithm becomes stuck in 
a local minimum of the error function and is unable 
to find the global minimum, especially if the error 
function is multimodal and/or non-differentiable. 
One way of overcoming these problems is to carry out 
the training by means of an Evolutionary Algorithm 
(Whitley, Starkweather & Bogart, 1990); i.e., to formulate 
the training process as the evolution of the weights in 
an environment defined by the network architecture 
and the task to be done (the problem to be solved). 
In these cases, the weights can be represented in the 
individuals' genetic material as a string of binary values 
(Whitley, Starkweather & Bogart, 1990) or a string of 
real numbers (Greenwood, 1997). Traditional genetic 
algorithms (Holland, 1975) use a genotypic codification 
method in the form of binary strings. Accordingly, 
much work has emerged that codifies the values of the 
weights by means of a concatenation of the binary values 
which represent them (Whitley, Starkweather & Bogart, 
1990). The big advantage of these approaches is 
their generality and simplicity, i.e., it is very easy and 
quick to apply the operators of uniform crossover and 
mutation on a binary string. The disadvantage of using 
this type of codification is the permutation problem: 
the order in which the weights are taken in the string 
means that equivalent networks may correspond to 
totally different individuals. This makes the crossover 
operator very inefficient. Logically, the weight value 
codification has also emerged in the form of a 
concatenation of real numbers, each one of them 
associated with a given weight (Greenwood, 1997). By 
means of genetic operators designed to work with this 
type of codification, given that the existing ones for bit 
strings cannot be used here, several studies (Montana & 
Davis, 1989) showed that this type of codification 
produces better results, with more efficiency and 
scalability, than the BP algorithm. 
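The real-number codification can be illustrated with a small sketch in which each individual is the concatenated weight vector of a fixed 2-2-1 network and the fitness is the Mean Square Error on the XOR problem. The truncation selection, blend crossover and Gaussian mutation used here are generic illustrative operators, not those of Montana & Davis, and all parameter values are assumptions.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
N_WEIGHTS = 9  # fixed 2-2-1 topology: 2 x (2 + 1) hidden + (2 + 1) output


def sig(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def predict(w, x):
    """Decode the genotype (flat weight list) into the network and run it."""
    h1 = sig(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = sig(w[3] * x[0] + w[4] * x[1] + w[5])
    return sig(w[6] * h1 + w[7] * h2 + w[8])


def mse(w):
    """Fitness to minimise: Mean Square Error over the training patterns."""
    return sum((predict(w, x) - t) ** 2 for x, t in XOR) / len(XOR)


def evolve_weights(pop_size=40, generations=200, sigma=0.5, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(N_WEIGHTS)] for _ in range(pop_size)]
    initial_best = min(mse(ind) for ind in pop)
    for _ in range(generations):
        pop.sort(key=mse)
        parents = pop[:pop_size // 2]      # truncation selection
        children = [pop[0][:]]             # elitism
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            alpha = rng.random()           # blend (arithmetic) crossover
            child = [alpha * ga + (1 - alpha) * gb for ga, gb in zip(a, b)]
            # Gaussian mutation applied gene by gene.
            child = [g + rng.gauss(0, sigma) if rng.random() < 0.2 else g
                     for g in child]
            children.append(child)
        pop = children
    return min(pop, key=mse), initial_best


best_weights, initial_error = evolve_weights()
print(f"initial best MSE {initial_error:.4f}, evolved MSE {mse(best_weights):.4f}")
```

No gradient is computed anywhere, so the same scheme works with non-differentiable error functions, which is the property the text highlights.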

Evolution of the Architectures 

The evolution of the architectures includes the genera- 
tion of the topological structure; i.e., the topology and 
connectivity of the neurons, and the transfer function 
of each neuron of the network. The architecture of a 
network is of great importance for successfully 
applying ANNs, as it has a very significant impact 
on the processing capacity of the network. On one hand, 
a network with few connections and a linear transfer 
function may not be able to solve a problem that 
another network having other characteristics (a different 
number of neurons, connections or types of functions) 
would be able to solve. On the other hand, a network 
having a high number of non-linear connections and 
nodes could be overfitted and learn the noise present in 
the training data as an inherent part of it, without being 
able to discriminate between them, and in the end not 
have a good generalization capacity. Therefore, the 
design of a network is crucial, and this task is classically 
carried out by human experts using their own experience, 
based on "trial and error", experimenting with different 
sets of architectures. The evolution of architectures has 
been possible thanks to the appearance of constructive 
and destructive algorithms (Sietsma & Dow, 1991). In 
general terms, a constructive algorithm begins with 
a minimum network (with a small number of layers, 
neurons and connections) and successively adds new 
layers, nodes and connections, if they are necessary, 
during the training. A destructive algorithm carries out 
the opposite operation, i.e., it begins with a maximum 
network and eliminates unnecessary nodes and 
connections during the training. However, these methods, 
based on Hill Climbing algorithms, are quite susceptible 
to falling into a local minimum (Angeline, Suders & 
Pollack, 1994). 

In order to develop ANN architectures by means 
of an evolutionary algorithm, it is necessary to decide 
how to codify a network inside the genotype so it can 
be used by the genetic operators. For this, different 
types of network codifications have emerged. 

In the first codification method, direct codification, 
there is a one-to-one correspondence between the genes 
and the phenotypic representation (Miller, Todd & 
Hedge, 1989). The most typical codification method 
consists of a matrix C = (c_ij) of size N×N which 
represents an architecture of N nodes, where c_ij 
indicates the presence or absence of a connection 
between nodes i and j. It is possible to use c_ij = 1 to 
indicate a connection and c_ij = 0 to indicate the absence 
of a connection. In fact, c_ij could take real values 
instead of Booleans to represent the value of the 
connection weight between neurons i and j, and in this 
way architecture and connection weights can be 
developed simultaneously (Alba, Aldana & Troya, 1993). 
The restrictions which are required in the architectures 
can easily be incorporated into this representational 
scheme. For example, a feed-forward network would 
have non-zero coefficients only in the upper right-hand 
triangle of the matrix. These types of codification are 
generally very simple and easy to implement. However, 
they have several disadvantages, such as poor 
scalability, the impossibility of codifying repeated 
structures, and the permutation problem (i.e., 
different networks which are functionally equivalent 
can correspond to different genotypes) (Yao & Liu, 
1998). 
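A small sketch may help to visualise direct codification and the permutation problem. The five-node network below (two inputs, two hidden nodes, one output) is a hypothetical example; swapping the labels of its two hidden nodes gives a functionally equivalent network whose connectivity matrix, and hence genotype, is different.

```python
def is_feed_forward(C):
    """True if all connections lie strictly above the main diagonal."""
    n = len(C)
    return all(C[i][j] == 0 for i in range(n) for j in range(n) if j <= i)


def to_genotype(C):
    """Direct codification: concatenate the rows of the connectivity matrix."""
    return [bit for row in C for bit in row]


def from_genotype(bits, n):
    """Inverse mapping: rebuild the N x N matrix from the flat genotype."""
    return [bits[i * n:(i + 1) * n] for i in range(n)]


def relabel(C, order):
    """Renumber the nodes: new node i is old node order[i]."""
    n = len(C)
    return [[C[order[i]][order[j]] for j in range(n)] for i in range(n)]


# Nodes 0-1: inputs, 2-3: hidden, 4: output. C[i][j] = 1 means i -> j.
C = [
    [0, 0, 1, 0, 0],   # input 0 -> hidden 2
    [0, 0, 1, 1, 0],   # input 1 -> hidden 2 and hidden 3
    [0, 0, 0, 0, 1],   # hidden 2 -> output
    [0, 0, 0, 0, 1],   # hidden 3 -> output
    [0, 0, 0, 0, 0],
]
swapped = relabel(C, [0, 1, 3, 2, 4])  # exchange the two hidden nodes

print(to_genotype(C))
print(to_genotype(swapped))
```

Both matrices describe the same network up to the naming of hidden nodes, yet a crossover between the two genotypes would treat them as unrelated individuals.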

As an alternative to direct codification, there are 
also indirect codification methods. With the objective 
of reducing the length of the genotypes, only some of 
the characteristics of the architecture are codified into 
the chromosome. Within this type of codification, there 
are various types of representation. 



First, the parametric representations have to be 
mentioned. The network can be represented by a set 
of parameters such as the number of hidden layers, 
the number of connections between two layers, etc. 
There are several ways of codifying these parameters 
inside the chromosome (Harp, Samad & Guha, 1989). 
Although the parametric representations can reduce the 
length of the chromosome, the evolutionary algorithm 
then searches only a limited subset of the space of all 
possible architectures. Another type of indirect 
codification is based on a representational system in 
the form of grammatical rules (Yao & Shi, 1995). In this 
system, the network is represented by a set of rules, in 
the form of production rules, which will build a matrix 
that represents the network. 
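The idea of production rules that build a connectivity matrix can be sketched with a simple matrix-rewriting grammar, in which every symbol is replaced by a 2×2 block until only 0s and 1s remain. The rule set below is a hypothetical example chosen for illustration and is not taken from the cited work.

```python
# Hypothetical rule set: each symbol rewrites to a 2 x 2 block of symbols;
# "0" and "1" are terminal connectivity values.
RULES = {
    "S": [["A", "B"], ["C", "A"]],
    "A": [["0", "1"], ["0", "0"]],
    "B": [["1", "1"], ["1", "1"]],
    "C": [["0", "0"], ["0", "0"]],
}


def expand(symbols, rules):
    """Apply one rewriting step: every symbol becomes its 2 x 2 block."""
    result = []
    for row in symbols:
        top, bottom = [], []
        for symbol in row:
            block = rules[symbol]
            top.extend(block[0])
            bottom.extend(block[1])
        result.append(top)
        result.append(bottom)
    return result


def develop(rules, start="S", steps=2):
    """Decode the genotype (the rule set) into a connectivity matrix."""
    matrix = [[start]]
    for _ in range(steps):
        matrix = expand(matrix, rules)
    return [[int(symbol) for symbol in row] for row in matrix]


connectivity = develop(RULES)
for row in connectivity:
    print(row)
```

Here the evolutionary algorithm would operate on the rules rather than on the matrix; because a rule such as A can be reused in several places, repeated structures are expressed once in the genotype.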

Other types of codification, inspired more closely by 
the world of biology, are the ones known as "growing 
methods". With them, the genotype does not codify 
the network any longer, but instead contains a set of 
instructions. The decodification of the genotype 
consists of the execution of these instructions, which 
causes the construction of the phenotype (Husbands, 
Harvey, Cliff & Miller, 1994). These instructions 
usually include neural migrations, neuronal duplication 
or transformation, and neuronal differentiation. 

Finally, and within the indirect codification meth- 
ods, there are other methods which are very different 
from the ones already described. Andersen describes 
a technique in which each individual of a population 
represents a hidden node instead of the architecture 
(Andersen & Tsoi, 1993). Each hidden layer is con- 
structed automatically by means of an evolutionary 
process which uses a genetic algorithm. This method 
has the limitation that only feed-forward networks can 
be constructed, and there is also a tendency for various 
nodes with a similar functionality to emerge, which 
introduces some redundancy into the network that must 
be eliminated. 

One important characteristic is that, in general, 
these methods only develop architectures, which is 
the most common, or else architectures and weights 
together. The transfer function of each architecture 
node is assumed to have been previously determined 
by a human expert, and that it is the same for all of 
the network nodes (at least, for all of the nodes of the 
same layer), although the transfer function has been 
shown to have a great importance on the behaviour of 
the network (Lovell & Tsoi, 1992). Few methods have 
been developed which cause the transfer function to 
evolve, and these have had little repercussion in the 
world of ANNs with EC. 

Evolution of the Learning Rule 

Another interesting approach to the development 
of ANNs by means of EC is the evolution of the learning 
rule. This idea emerges because a training algorithm 
works differently when it is applied to networks with 
different architectures. In fact, given that the expert 
usually has very little a priori knowledge about a 
network, it is preferable to develop an automatic system 
to adapt the learning rule to the architecture and the 
problem to be solved. 

There are several approaches to the evolution 
of the learning rule (Crosher, 1993) (Turney, Whitley 
& Anderson, 1996), although most of them are based 
only on how the learning can modify or guide the 
evolution, and on the relation between the architecture 
and the connection weights. Actually, there are few works 
that focus on the evolution of the learning rule in itself 
(Bengio, Bengio, Cloutier & Gecsei, 1992) (Ribert, 
Stocker, Lecourtier & Ennaji, 1994). 

One of the most common approaches is based on 
setting the parameters of the BP algorithm: learning 
rate and momentum. Some authors propose methods 
in which an evolutionary process is used to find these 
parameters while leaving the architecture constant 
(Kim, Jung, Kim & Park, 1996). Other authors, on 
the other hand, propose codifying these BP algorithm 
parameters together with the network architecture inside 
of the individuals of the population (Harp, Samad & 
Guha, 1989). 
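The first of these approaches, evolving the learning rate and momentum for a fixed architecture, can be sketched as follows. To keep the example short, a single sigmoid neuron trained by gradient descent with momentum on the OR function stands in for a full BP network; the mutation-only evolutionary scheme, the parameter bounds and every numeric value are assumptions made for illustration.

```python
import math
import random

# Training patterns for the OR function (hypothetical toy task).
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def sig(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def train_error(lr, momentum, epochs=50):
    """Train with the given learning rate and momentum; return the final MSE."""
    w = [0.0, 0.0, 0.0]   # two weights and a bias, fixed start for repeatability
    v = [0.0, 0.0, 0.0]   # momentum (velocity) terms
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), t in DATA:
            y = sig(w[0] * x1 + w[1] * x2 + w[2])
            delta = (y - t) * y * (1.0 - y)
            grad[0] += delta * x1
            grad[1] += delta * x2
            grad[2] += delta
        v = [momentum * vi - lr * gi for vi, gi in zip(v, grad)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return sum((sig(w[0] * x1 + w[1] * x2 + w[2]) - t) ** 2
               for (x1, x2), t in DATA) / len(DATA)


def evolve_bp_params(pop_size=12, generations=15, seed=0):
    """Each individual is a (learning rate, momentum) pair."""
    rng = random.Random(seed)

    def clip(value, low, high):
        return max(low, min(high, value))

    def fitness(ind):
        return train_error(ind[0], ind[1])

    pop = [(rng.uniform(0.01, 2.0), rng.uniform(0.0, 0.9)) for _ in range(pop_size)]
    initial_best = min(fitness(ind) for ind in pop)
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[:pop_size // 2]      # the better half is kept unchanged
        children = [(clip(lr + rng.gauss(0, 0.2), 0.01, 2.0),
                     clip(mom + rng.gauss(0, 0.05), 0.0, 0.9))
                    for lr, mom in survivors]  # Gaussian mutation only
        pop = survivors + children
    best = min(pop, key=fitness)
    return best, fitness(best), initial_best


best_params, best_err, initial_best_err = evolve_bp_params()
print(best_params, best_err)
```

Each fitness evaluation requires a complete training run, which is why the cost of this kind of search grows quickly once the architecture is evolved together with the training parameters.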



FUTURE TRENDS 

The evolution of ANNs has been a research topic 
for some decades. The creation of new EC and, 
in general, new AI techniques, and the evolution and 
improvement of the existing ones, allow the development 
of new methods for automatically developing 
ANNs. Although there are methods that (more or less) 
automatically develop ANNs, they are usually not very 
efficient, since evolving architectures, weights and 
learning rules at once leads to a very large search 
space, so this aspect definitely has to be improved. 



CONCLUSION 

The world of EC has provided a set of tools that can 
be applied to optimization problems. In this case, the 
problem is to find an optimal architecture and/or weight 
value set and/or learning rule. Therefore, the develop- 
ment of ANNs was converted into an optimization 
problem. As the described techniques show, the use of 
EC techniques has made possible the development of 
ANNs without human intervention, or, at least, mini- 
mising the participation of the expert in this task. 

As has been explained, these techniques have 
some problems. One of them is the permutation problem 
described above. Another problem is the loss of 
efficiency: the more complicated the structure to evolve 
is (weights, learning rule, architecture), the less efficient 
the system will be, because the search space becomes 
much bigger. If the system has to evolve several things 
at a time (for example, architecture and weights so the 
ANN development is completely automated), this loss 
of efficiency increases. However, these systems still 
work faster than the whole manual process of designing 
and training an ANN several times. 
KEY TERMS 

Artificial Neural Networks: Interconnected set 
of many simple processing units, commonly called 
neurons, that use a mathematical model to represent 
an input/output relation. 

Back-Propagation Algorithm: Supervised learning 
technique used by ANNs that iteratively modifies 
the weights of the connections of the network so that the 
error obtained when comparing the network outputs 
with the desired ones decreases. 

Evolutionary Computation: Set of Artificial 
Intelligence techniques used in optimization problems, 
which are inspired by biological mechanisms such as 
natural evolution. 
Genetic Programming: Machine learning technique 
that uses an evolutionary algorithm in order to 
optimise a population of computer programs according 
to a fitness function which determines the capability 
of a program for performing a given task. 

Genotype: The representation of an individual as 
an entire collection of genes, to which the crossover and 
mutation operators are applied. 

Phenotype: Expression of the properties coded by 
the individual's genotype. 

Population: Pool of individuals exhibiting equal or 
similar genome structures, which allows the application 
of genetic operators. 

Search Space: Set of all possible situations that the 
problem we want to solve could ever be in. 
INTRODUCTION 

A major step in fault diagnosis for high-quality optical 
devices is the detection and characterization of scratch 
and dig defects in products. These kinds of 
aesthetic flaws, formed during different manufacturing 
steps, could have harmful effects on the functional 
specifications of optical devices, as well as on their 
optical performance, by generating undesirable scattered 
light which could seriously degrade the expected optical 
features. A reliable diagnosis of these defects therefore 
becomes a crucial task to ensure the products' nominal 
specification. Moreover, such diagnosis is strongly 
motivated by the need to correct the manufacturing 
process in order to guarantee mass production quality 
while maintaining an acceptable production yield. 

Unfortunately, detecting and measuring such defects 
is still a challenging problem in production conditions 
and the few available automatic control solutions remain 
ineffective. That is why, in most cases, the diagnosis 
is performed by visual inspection of the whole 
production by a human expert. However, 
this conventional solution suffers from several 
acute restrictions related to the human operator's intrinsic 
limitations (reduced sensitivity for very small defects, 
less exhaustive detection as attentiveness declines, and 
the operator's tiredness and weariness due to the 
repetitive nature of fault detection and fault diagnosis 
tasks). 

To construct an effective automatic diagnosis 
system, we propose an approach based on four main 
operations: defect detection, data extraction, 
dimensionality reduction and neural classification. The 
first operation is based on images obtained by Nomarski 
microscopy. These images contain several items 
which have to be detected and then classified in order 
to discriminate between "false" defects (correctable 
defects) and "abiding" (permanent) ones. Indeed, 
because of the industrial environment, a number of 
correctable defects (like dust or cleaning marks) are 
usually present beside the potential "abiding" defects. 
Relevant feature extraction is a key issue to ensure the 
accuracy of the neural classification system; first because 
raw data (images) cannot be exploited directly and, 
moreover, because dealing with high-dimensional data 
could affect the learning performance of the neural 
network. This article presents the automatic diagnosis 
system, describing the operations of the different phases. 
An implementation on real industrial optical devices is 
carried out and an experiment investigates item 
classification based on an MLP artificial neural network. 



BACKGROUND 

Today, the only existing solution to detect and 
classify optical surface defects is a visual one, carried 
out by a human expert. The first original aspect of this 
work is the sensor used: Nomarski microscopy. Three 
main advantages distinguishing Nomarski microscopy 
(also known as "Differential Interference Contrast 
microscopy" (Bouchareine, 1999) (Chatterjee, 2003)) 
from other microscopy techniques have motivated our 
preference for this imaging technique. The first of them 
is the higher sensitivity of this technique 
compared to the other classical microscopy techniques 
(Dark Field, Bright Field) (Flewitt & Wild, 1994). 
Furthermore, DIC microscopy is robust to 
lighting non-homogeneity. Finally, this technology 
provides information relative to depth (the third 
dimension) which could be exploited to characterize 
roughness or defect depth. This last advantage offers 
valuable additional potential for characterizing scratch 
and dig flaws in high-tech optical devices. Therefore, 
Nomarski microscopy seems to be a suitable technique 
to detect surface imperfections. 

On the other hand, since they have shown many 
attractive features in complex pattern recognition and 
classification tasks (Zhang, 2000) (Egmont-Petersen, 
de Ridder, & Handels, 2002), artificial neural network 
based techniques are used to solve difficult problems. 
In our particular case, the problem is related to the 
classification of small defects on a large observation 
surface. These promising techniques could however 
encounter difficulties when dealing with high-dimensional 
data. That is why we are also interested in data 
dimensionality reduction methods. 



DEFECTS' DETECTION AND 
CLASSIFICATION 

The suggested diagnosis process is outlined in the diagram
of Figure 1. Each step is presented in turn: first the
detection and data extraction phases, then the
classification phase coupled with dimensionality
reduction. In a second part, some investigations on real
industrial data are carried out and the results obtained
are presented.

Detection and Data Extraction 



The aim of the defect detection stage is to extract defect
images from the digital image issued by the DIC detector.
The proposed method (Voiry, Houbre, Amarger, & Madani,
2005) includes four phases:

Pre-processing: transformation of the digital image issued
by the DIC detector in order to reduce the influence of
lighting heterogeneity and to enhance the visibility of
the targeted defects.

Adaptive matching: adaptive process to match defects.

Filtering and segmentation: noise removal and
characterization of defect outlines.

Defect image extraction: construction of a correct defect
representation.

Finally, the image associated with a given detected item
gives an isolated (from other items) representation of the
defect (i.e. it depicts the defect in its immediate
environment), as depicted in Figure 2.

However, the information contained in such generated
images is highly redundant, and these images do not
necessarily have the same dimension (typically this
dimension can turn out to be a hundred times as high from
one image to another). That is why this raw data (images)
cannot be processed directly and first has to be
appropriately encoded using some transformations. Such
transformations must naturally be invariant with regard to
geometric transformations (translation, rotation and
scaling) and robust to different perturbations (noise,
luminance variation and background variation). The
Fourier-Mellin transformation is used, as it provides
invariant descriptors which are considered to have good
coding capacity in classification tasks (Choksuriwong,
Laurent, & Emile, 2005) (Derrode, 1999) (Ghorbel, 1994).
Finally, the processed features have to be normalized
using the centring-reducing transformation. Providing a
set of 13 features using such a transform is a first
acceptable compromise between the real-time processing
constraints of the industrial environment and the quality
of the defect image representation (Voiry, Madani,
Amarger, & Houbre, 2006).
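The centring-reducing transformation applied to the Fourier-Mellin features can be sketched as follows. This is a minimal illustration in Python with NumPy, assuming the invariants have already been computed and stacked in a matrix with one row per defect and 13 columns; the function name `centre_reduce`, the `eps` guard and the random stand-in data are hypothetical and not taken from the cited work.

```python
import numpy as np

def centre_reduce(features, eps=1e-12):
    """Centre each column to zero mean and scale it to unit standard deviation."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps guards against division by zero for a constant feature (assumption)
    return (features - mean) / np.maximum(std, eps), mean, std

# Stand-in data: 200 hypothetical defects described by 13 invariants each
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(200, 13))
normalized, mean, std = centre_reduce(raw)
```

In practice the `mean` and `std` estimated on the learning database would be stored and reused to normalize new items, so that training and generalization data share the same scale.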



Figure 1. Block diagram of the proposed defect diagnosis system

Figure 2. Images of characteristic items: (a) scratch; (b) dig; (c) dust; (d) cleaning marks

Dimensionality Reduction

To obtain a correct description of defects, we must
consider a more or less large number of Fourier-Mellin
invariants. But dealing with high-dimensional data poses
problems, known as the "curse of dimensionality"
(Verleysen, 2001). First, the number of samples required
to reach a predefined level of precision in approximation
tasks increases exponentially with dimension. Thus,
intuitively, the number of samples needed to properly
learn a problem quickly becomes much too large to be
collected by real systems as the dimension of the data
increases. Moreover, surprising phenomena appear when
working in high dimension (Demartines, 1994): for example,
the variance of the distances between vectors remains
fixed while their average increases with the space
dimension, and the local properties of the Gaussian kernel
are also lost. These last points explain why the behaviour
of a number of artificial neural network algorithms can be
affected when dealing with high-dimensional data.
Fortunately, most real-world problem data lie on a
manifold of dimension p (the data intrinsic dimension)
much smaller than the raw dimension. Reducing data
dimensionality to this smaller value can therefore lessen
the problems related to high dimension.
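The distance phenomenon mentioned above can be observed numerically. The sketch below is only an illustration under stated assumptions: it draws points from a standard Gaussian distribution (a choice made here, not one taken from the cited references) and measures the mean and standard deviation of all pairwise Euclidean distances for several dimensions. The function name `distance_stats` is hypothetical.

```python
import numpy as np

def distance_stats(dim, n_points=300, seed=0):
    """Mean and standard deviation of pairwise distances of Gaussian points."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, dim))
    sq = (x ** 2).sum(axis=1)
    # Squared distances from the Gram matrix; clip tiny negative rounding errors
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(n_points, k=1)   # each pair counted once
    d = np.sqrt(np.maximum(d2[upper], 0.0))
    return d.mean(), d.std()

stats = {dim: distance_stats(dim) for dim in (2, 13, 100, 1000)}
for dim, (mean, std) in stats.items():
    print(f"dim={dim:5d}  mean={mean:7.3f}  std={std:6.3f}  std/mean={std / mean:.3f}")
```

Under these assumptions the mean distance grows with the dimension while the spread stays of the same order, so the relative contrast between "near" and "far" points shrinks, which is one reason distance-based learning algorithms can degrade in high dimension.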

In order to reduce the problem dimensionality, we use
Curvilinear Distance Analysis (CDA). This technique is
related to Curvilinear Component Analysis (CCA), whose
goal is to reproduce the topology of an n-dimensional
original space in a new p-dimensional space (where p<n)
without fixing any configuration of the topology
(Demartines & Herault, 1993). To do so, a criterion
characterizing the differences between the original and
projected space topologies is computed:



$$E_{CCA} = \sum_{i} \sum_{j \neq i} \left(d_{ij}^{n} - d_{ij}^{p}\right)^{2} F\left(d_{ij}^{p}\right) \qquad (1)$$

where $d_{ij}^{n}$ (respectively $d_{ij}^{p}$) is the
Euclidean distance between vectors $x_i$ and $x_j$ of the
considered distribution in the original space (resp. in
the projected space), and F is a decreasing function which
favours local topology with respect to the global
topology. This energy function is minimized by stochastic
gradient descent (Demartines & Herault, 1995), with an
update rule governed by $\alpha : \mathbb{R}^{+} \to [0;1]$
and $\lambda : \mathbb{R}^{+} \to \mathbb{R}^{+}$, two
decreasing functions representing respectively a learning
parameter and a neighbourhood factor. CCA also provides a
similar method to project, in a continuous way, new points
of the original space onto the projected space, using the
knowledge of already projected vectors.
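To make the criterion of equation (1) concrete, the sketch below evaluates it for a given pair of original and projected point sets. It is a minimal illustration, not the authors' implementation: the weighting function F is assumed here to be a step function equal to 1 for projected distances up to a neighbourhood radius `lam` and 0 beyond (one simple non-increasing choice), and the names `pairwise_distances` and `cca_energy` are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Matrix of Euclidean distances between all pairs of row vectors."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cca_energy(original, projected, lam):
    """Sum over i and j != i of (d_n - d_p)^2 * F(d_p), with a step-function F."""
    d_n = pairwise_distances(np.asarray(original, dtype=float))
    d_p = pairwise_distances(np.asarray(projected, dtype=float))
    weight = (d_p <= lam).astype(float)   # assumed F: 1 inside radius lam, else 0
    np.fill_diagonal(weight, 0.0)         # exclude the j == i terms
    return float(((d_n - d_p) ** 2 * weight).sum())

# A flat 2-D sheet embedded in 3-D: its intrinsic dimension is two
rng = np.random.default_rng(1)
flat = rng.uniform(size=(50, 2))
original = np.column_stack([flat, np.zeros(50)])
squashed = flat * np.array([1.0, 0.0])    # a poor projection onto a line

print(cca_energy(original, flat, lam=10.0))
print(cca_energy(original, squashed, lam=10.0))
```

A projection that preserves all pairwise distances gives a zero energy, whereas collapsing the sheet onto a line distorts distances and yields a strictly positive value; a minimization procedure such as the stochastic gradient descent mentioned in the text would move the projected points so as to lower this quantity.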

But, since CCA encounters difficulties when unfolding very
non-linear manifolds, an evolution called CDA has been
proposed (Lee, Lendasse, Donckers, & Verleysen, 2000). It
involves curvilinear distances (in order to better
approximate geodesic distances on the considered manifold)
instead of Euclidean ones. Curvilinear distances are
computed in two steps. First, a graph is built between
vectors by considering k-NN, ε, or another neighbourhood,
weighted by the Euclidean distance between adjacent nodes.
Then the curvilinear distance between two vectors is
computed as the minimal distance between these vectors in
the graph, using Dijkstra's algorithm. Finally, the
original CCA algorithm is applied using the computed
curvilinear distances. This algorithm can deal with very
non-linear manifolds and is much more robust to the choice
of the α and λ functions.
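The two-step computation of curvilinear distances described above can be sketched as follows. This is a minimal illustration assuming a k-NN neighbourhood and distinct points; the function names `knn_graph`, `dijkstra` and `curvilinear_distances`, the value of `k` and the half-circle test data are hypothetical choices made for the example.

```python
import heapq
import numpy as np

def knn_graph(points, k):
    """Symmetric k-NN graph whose edges are weighted by Euclidean distance."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    graph = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        # Position 0 of the sorted order is the point itself (distance 0)
        for j in np.argsort(dist[i])[1:k + 1]:
            graph[i][int(j)] = float(dist[i, j])
            graph[int(j)][i] = float(dist[i, j])
    return graph

def dijkstra(graph, source):
    """Shortest path length from source to every node of the graph."""
    best = {node: float("inf") for node in graph}
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > best[node]:
            continue                      # stale heap entry
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < best[neighbour]:
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return best

def curvilinear_distances(points, k):
    """All-pairs graph distances approximating geodesics on the manifold."""
    graph = knn_graph(points, k)
    n = len(graph)
    out = np.empty((n, n))
    for i in range(n):
        best = dijkstra(graph, i)
        out[i] = [best[j] for j in range(n)]
    return out

# Points on a half circle of radius 1: a curved one-dimensional manifold
theta = np.linspace(0.0, np.pi, 50)
points = np.column_stack([np.cos(theta), np.sin(theta)])
cd = curvilinear_distances(points, k=2)
euclid = float(np.linalg.norm(points[0] - points[-1]))
print(f"Euclidean: {euclid:.3f}  curvilinear: {cd[0, -1]:.3f}")
```

Between the two ends of the half circle, the Euclidean distance is the chord across the circle, while the graph distance follows the arc and is therefore close to the arc length. Feeding such distances to CCA in place of Euclidean ones is what allows strongly curved manifolds to be unfolded.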

It has been successfully used as a preliminary step before
maximum likelihood classification in (Lennon, Mercier,
Mouchot, & Hubert-Moy, 2001), and we have also shown its
positive impact on the performance of neural network based
classification (Voiry, Madani, Amarger, & Bernier, 2007).
In this last paper, we first demonstrated that a synthetic
problem (nevertheless defined from our real industrial
data) whose intrinsic dimensionality is two is better
treated by an MLP after 2D dimension reduction than in its
raw expression. We also showed that CDA performs better
for this problem than CCA and Self Organizing Map
pre-processing.

Implementation on Industrial Optical 
Devices 

In order to validate the concepts presented above and to
provide an industrial prototype, an automatic control
system has been built. It involves an Olympus B52
microscope combined with a Corvus stage, which allows an
entire optical component to be scanned (presented in
Figure 3). A 50x magnification is used, which leads to
microscopic fields of 1.77 mm x 1.33 mm and pixels of
1.28 µm x 1.28 µm. The proposed image processing method is
applied on-line. Post-processing software collects the
pieces of a defect that are detected in different
microscopic fields (for example pieces of a long scratch)
to form a single defect, and computes an overall
cartography of the checked device (Figure 3). These
facilities were used to acquire a great number of Nomarski
images, from which defect images were extracted using the
aforementioned technique. Two experiments, called A and B,
were carried out using two different optical devices.
Table 1 shows the parameters corresponding to these
experiments. It is important to note that, in order to
avoid learning false classes, item images depicting
microscopic field boundaries or two (or more) different
defects were discarded from the database used.
Furthermore, the optical devices studied were not
specially cleaned, which accounts for the presence of some
dust and cleaning marks. The items of these two databases
were labelled by an expert with two different labels:
"dust" (class 1) and "other defects" (class -1). Table 1
also shows the distribution of items between the two
defined classes.

Figure 3. Automatic control system and cartography of a 100 mm x 65 mm optical device

Table 1. Description of the two databases used for validation experiments

| Database | Optical device | Number of microscopic fields | Corresponding area | Total items number | Class 1 items number | Class -1 items number |
|----------|----------------|------------------------------|--------------------|--------------------|----------------------|-----------------------|
| A        | 1              | 1178                         | 28 cm²             | 3865               | 275                  | 3590                  |
| B        | 2              | 605                          | 14 cm²             | 1910               | 184                  | 1726                  |

Figure 4. Classification performances for different CDA issued data dimensionality. Classification performances
using raw data (13-dimensional) are also depicted as dotted lines. (Horizontal axis: data dimensionality; curves:
class 1, class -1 and global classification performance.)

Using this experimental set-up, a classification
experiment was performed. It involved a multilayer
perceptron with n input neurons, 35 neurons in one hidden
layer, and 2 output neurons (an n-35-2 MLP). First, this
artificial neural network was trained for the
discrimination task between classes 1 and -1, using
database B. This training phase used the BFGS (Broyden,
Fletcher, Goldfarb, and Shanno) algorithm with Bayesian
regularization, and was carried out 5 times. Subsequently,
the generalization ability of the resulting neural network
was evaluated using database A. Since databases A and B
come from different optical devices, such generalization
results are significant. Following this procedure, 14
different experiments were conducted with the aim of
studying the global classification performance and the
impact of CDA dimensionality reduction on this
performance. The first experiment used the original
features issued from the Fourier-Mellin transform
(13-dimensional); the others used the same features after
CDA reduction to an n-dimensional space (with n varying
between 2 and 13). Figure 4 depicts the global
classification performance (calculated by averaging the
percentage of well-classified items over the 5 trainings)
for the 14 different experiments, as well as the class 1
and class -1 classification performances. It shows first
that equivalent performance can be obtained using only
5-dimensional data instead of the unprocessed defect
representations (13-dimensional). As a consequence, neural
architecture complexity, and therefore processing time,
can be saved using CDA dimensionality reduction while
keeping the performance level. Moreover, the scores
obtained are satisfactory: about 70% of "dust" defects are
well recognized (this can be enough for the targeted
application), as well as about 97% of other defects (the
remaining 3% of errors can however pose problems, because
every "permanent" defect has to be reported). Furthermore,
we think that this significant performance difference
between class 1 and class -1 recognition is due to the
fact that class 1 is underrepresented in the learning
database.
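The relation between per-class and global classification rates can be checked with a short calculation. The sketch below uses the item counts of Table 1 and the approximate per-class rates quoted in the text ("about 70%" and "about 97%"); these rounded rates are used here only as illustrative inputs, and the function name `global_rate` is hypothetical.

```python
def global_rate(class_counts, class_rates):
    """Overall fraction of well-classified items from per-class rates."""
    total = sum(class_counts.values())
    correct = sum(class_counts[c] * class_rates[c] for c in class_counts)
    return correct / total

counts_a = {"dust": 275, "other": 3590}        # Table 1, database A (generalization)
counts_b = {"dust": 184, "other": 1726}        # Table 1, database B (training)
approx_rates = {"dust": 0.70, "other": 0.97}   # approximate rates quoted in the text

overall_a = global_rate(counts_a, approx_rates)
dust_share_a = counts_a["dust"] / sum(counts_a.values())
dust_share_b = counts_b["dust"] / sum(counts_b.values())
print(f"dust share: A={dust_share_a:.3f}, B={dust_share_b:.3f}")
print(f"global rate on A with the approximate class rates: {overall_a:.3f}")
```

With these inputs, "dust" items make up under a tenth of either database, so the global rate (roughly 0.95) is dominated by class -1 and says little about how well the minority class is recognized. This is why the per-class curves are worth reading alongside the global one.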



FUTURE TRENDS 

The next phase of this work will deal with classification
tasks involving more classes. We also want to use many
more Fourier-Mellin invariants, because we think that this
would improve classification performance by supplying
additional information. In this case, CDA-based
dimensionality reduction would be an essential step to
keep the complexity and processing time of the
classification system reasonable.



CONCLUSION 

A reliable diagnosis of aesthetic flaws in high-quality
optical devices is a crucial task to ensure the products'
nominal specification and to enhance production quality by
studying the impact of the process on such defects. To
ensure a reliable diagnosis, an automatic system is needed
to detect defects and then to discriminate the "false"
defects (correctable defects) from "abiding" (permanent)
ones. This paper describes a complete framework which
allows all the defects present in a raw Nomarski image to
be detected and pertinent features to be extracted for the
classification of these defects. The satisfactory
performance obtained for the "dust" versus "other" defects
classification task with an MLP neural network has
demonstrated the pertinence of the proposed approach. In
addition, data dimensionality reduction makes it possible
to use a low-complexity classifier (while keeping the
performance level) and therefore to save processing time.

KEY TERMS

Artificial Neural Networks: A network of many simple
processors ("units" or "neurons") that imitates a
biological neural network. The units are connected by
unidirectional communication channels, which carry numeric
data. Neural networks can be trained to find nonlinear
relationships in data, and are used in applications such
as robotics, speech recognition, signal processing or
medical diagnosis.

Backpropagation algorithm: Learning algorithm 
of ANNs, based on minimising the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 

Classification: Assignment of a phenomenon to a predefined
class or category by studying its characteristic features.
In our work it consists in determining the nature of the
surface defects detected on optical devices (for example
"dust" or "other type of defects").

Data Dimensionality Reduction: Data dimension- 
ality reduction is the transformation of high-dimensional 
data into a meaningful representation of reduced dimen- 
sionality. The goal is to find the important relationships 
between parameters and reproduce those relationships 
in a lower dimensionality space. Ideally, the obtained 
representation has a dimensionality that corresponds to 
the intrinsic dimensionality of the data. Dimensional- 
ity reduction is important in many domains, since it 
facilitates classification, visualization, and compression 
of high-dimensional data. In our work it's performed 
using Curvilinear Distance Analysis. 

Data Intrinsic Dimension: When data is described by
vectors (sets of characteristic values), data intrinsic
dimension is the effective number of degrees of freedom of
the set of vectors. Generally, this dimension is smaller
than the data raw dimension because there may exist linear
and/or non-linear relations between the different
components of the vectors.



Data Raw Dimension: When data is described 
by vectors (sets of characteristic values), data raw 
dimension is simply the number of components of 
these vectors. 

Detection: Identification of a phenomenon among 
others from a number of characteristic features or 
"symptoms". In our work, it consists in identifying 
surface irregularities on optical devices. 

MLP (Multi Layer Perceptron): This widely used artificial
neural network employs the perceptron as its simple
processor. In the model of the perceptron proposed by
Rosenblatt, the X_i represent the inputs and Y the output
of the neuron. Each input is multiplied by its weight w_i,
a threshold b is subtracted from the sum of the results,
and finally Y is obtained by applying an activation
function f. The weights of the connections are adjusted
during a learning phase using the backpropagation
algorithm.
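The perceptron computation described in this entry, Y = f(sum of w_i X_i minus b), can be written in a few lines. The sketch below is a minimal illustration; the function names, the default activation and the hand-picked weights of the logical AND example are hypothetical and are not learned by backpropagation.

```python
import math

def perceptron(inputs, weights, threshold, activation=math.tanh):
    """Output Y = f(sum_i w_i * x_i - b) of a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum - threshold)

def step(value):
    """Hard-threshold activation: 1 when the argument is non-negative."""
    return 1.0 if value >= 0 else 0.0

# Hand-picked weights and threshold that realize a logical AND of two inputs
and_gate = [perceptron([a, b], [1.0, 1.0], 1.5, step)
            for a in (0, 1) for b in (0, 1)]
print(and_gate)
```

An MLP stacks layers of such units, usually with a smooth activation such as `tanh` so that the error gradient needed by backpropagation is defined everywhere.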
INTRODUCTION

Governments and institutions are facing the new demands of
a rapidly changing society. Among many significant trends,
some facts should be considered (Silverstein, 2006): (1)
the increase in the number and types of students; and (2)
the limitations imposed by educational costs and course
schedules. Regarding the former, the need for a continuous
update of knowledge and competences in an evolving work
environment requires life-long learning solutions. An
increasing number of young adults are returning to
classrooms in order to finish their graduate degrees or
attend postgraduate programs to achieve a specialization
in a certain domain. Regarding the latter, due to the
emergence of new types of students, budget constraints and
schedule conflicts appear. Workers and immigrants, for
instance, are relevant groups for which educational costs
and schedules incompatible with their jobs could be the
key factor in deciding whether to register for a course or
to give up a program after investing time and effort in
it. In order to meet the needs derived from this social
context, new educational approaches should be proposed:
(1) to improve and extend online learning courses, which
would reduce student costs and make it possible to cover
the educational needs of a larger number of students, and
(2) to automate learning processes, thus reducing teacher
costs and providing a more personalized educational
experience anytime, anywhere.

As a result of this context, the last decade has seen an
increasing interest in applying computer technologies in
the field of Education. In this regard, the paradigms of
the Artificial Intelligence (AI) field are attracting
special attention as a way to solve the issues derived
from the introduction of computers as supporting resources
for different learning strategies. In this paper we review
the state of the art of the application of Artificial
Intelligence techniques in the field of Education,
focusing on (1) the most popular educational tools based
on AI, and (2) the most relevant AI techniques applied to
the development of intelligent educational systems.



EXAMPLES OF EDUCATIONAL TOOLS 
BASED ON AI 

The field of Artificial Intelligence can contribute
interesting solutions to the needs of the educational
domain (Kennedy, 2002). In what follows, the types of
systems that can be built based on AI techniques are
outlined.

Intelligent Tutoring Systems 

Intelligent Tutoring Systems are applications that provide
personalized/adaptive learning without the intervention of
human teachers (VanLehn, 2006). They consist of three main
components: (1) knowledge of the educational contents, (2)
knowledge of the student, and (3) knowledge of the
learning procedures and methodologies. These systems
promise to radically transform our vision of online
learning. As opposed to hypertext-based e-learning
applications, which provide the students with a certain
number of opportunities to search for the correct answer
before showing it, intelligent tutoring systems act like
coaches, not only after the response has been entered, but
also by offering suggestions when the students hesitate or
are blocked during the process of solving the problem. In
this way, the assistance guides the learning process
rather than merely saying what is correct or what is
wrong.

There exist numerous examples of intelligent tutoring
systems, some of them developed at universities as
research projects and others created with business goals.
Among the former, the Andes system (VanLehn, Lynch,
Schulze, Shapiro, Shelby, Taylor, Treacy, Weinstein &
Wintersgill, 2005), developed under the guidance of Kurt
VanLehn of the University of Pittsburgh, is a popular
example. The system is in charge of guiding the students
while they try to solve different sets of problems and
exercises. When the student asks for help in the middle of
an activity, the system either provides hints in order to
step further towards the solution or points out what was
wrong in some earlier step. Andes was successfully
evaluated over 5 years at the Naval Academy of the United
States and can be downloaded for free. Another relevant
system is Cognitive Tutor (Koedinger, Anderson, Hadley &
Mark, 1997), a comprehensive secondary mathematics
curriculum and computer-based tutoring program developed
by John R. Anderson, professor at Carnegie Mellon
University. The Cognitive Tutor is an example of how
research prototypes can evolve into commercial solutions,
as it is nowadays used in 1,500 schools in the United
States. On the business side, Read-On! is presented as a
product that teaches reading comprehension skills to
adults. It analyzes and diagnoses the specific
deficiencies and problems of each student and then adapts
the learning process based on those features (Read On,
2007). It includes an authoring tool that allows course
designers to adapt course contents to different student
profiles in a fast and flexible way.

Automatic Evaluation Systems 

Automatic Evaluation Systems are mainly focused on
evaluating the strengths and weaknesses of students in
different learning activities through assessment tests
(Conejo, Guzman, Millan, Trella, Perez-de-la-Cruz, & Rios,
2004). In this way, these systems not only perform the
automatic correction of the test, but also automatically
derive useful information about the competences and skills
acquired by the students during the educational process.

Among the automatic evaluation systems, we could highlight
ToL (Test On Line) (Tartaglia & Tresso, 2002), which has
been used by Physics students at the Polytechnic
University of Milano. The system is composed of a database
of tests, an algorithm for question selection, and a
mechanism for the automatic evaluation of tests, which can
be additionally configured by the teachers. CELLA
(Comprehensive English Language Learning Assessment)
(Cella, 2007) is another system, which evaluates the
student's competence in using and understanding the
English language. The application shows the progress made
by the students and determines their proficiency and
degree of competence in the use of foreign languages. As
for commercial applications, Intellimetric is a Web-based
system that lets students submit their work online
(Intellimetric, 2007). In a few seconds, the AI-supported
grading engine automatically provides the score of the
work. The company claims a reliability of 99%, meaning
that 99 percent of the time the engine's scores match
those provided by human teachers.
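A reliability figure of this kind is an agreement rate between two sets of scores. The sketch below shows one way such a rate could be computed, assuming that "match" means exactly equal scores; the source does not state the vendor's precise criterion, and the function name and the sample scores are hypothetical.

```python
def agreement_rate(engine_scores, human_scores):
    """Fraction of items on which the engine and the human give the same score."""
    if len(engine_scores) != len(human_scores) or not engine_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for e, h in zip(engine_scores, human_scores) if e == h)
    return matches / len(engine_scores)

# Hypothetical scores for ten essays on a 1-6 scale
engine = [4, 5, 3, 6, 2, 4, 5, 3, 4, 6]
human = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
rate = agreement_rate(engine, human)
print(f"agreement: {rate:.0%}")
```

A looser criterion, for instance counting scores within one point of each other as a match, would give a higher rate on the same data, so the definition of agreement matters when comparing such claims.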

Computer Supported Collaborative 
Learning 

Computer supported collaborative learning environments are
aimed at facilitating the learning process by providing
the students with both the context and the tools to
interact and work collaboratively with their classmates
(Soller, Martinez, Jermann & Muehlenbrock, 2005). In
intelligent systems, the collaboration is usually carried
out with the help of software agents in charge of
mediating and supporting student interaction to achieve
the proposed learning objectives.

Research prototypes are suitable test-beds for proving new
ideas and concepts and for identifying the best
collaborative strategies. The DEGREE system, for instance,
allows the characterization of group behaviours as well as
the individual behaviours of the people constituting them,
on the basis of a set of attributes or tags. The mediator
agent uses those attributes, which are introduced by
students, in order to provide recommendations and
suggestions to improve the interaction inside each group
(Barros & Verdejo, 2000). In the business domain there
exist multiple solutions, although they do not offer
intelligent mediation to facilitate the collaborative
interactions. The DEBBIE system (DePauw Electronic
Blackboard for Interactive Education) is one of the most
popular (Berque, Johnson, Hutcheson, Jovanovic, Moore,
Singer & Slattery, 2000). It was originally developed at
the beginning of the year 2000 at DePauw University, and
later managed by the DyKnow company, which was
specifically created to make a profit with DEBBIE
(Schnitzler, 2004). The technology that DyKnow currently
offers allows both teachers and students to
instantaneously share information and ideas. The final
goal is to support student tasks in the classroom by
eliminating the need to perform simple tasks, such as
backing up the teacher's presentations. The students can
therefore focus more on understanding and analyzing the
concepts presented by the teacher.

Game-Based Learning 

Learning based on serious games, a term coined to
distinguish learning-oriented games used in education from
purely entertainment-oriented games, deals with harnessing
the motivational power and attractiveness of games in the
educational domain in order to improve the satisfaction
and performance of students when acquiring new knowledge
and skills. This type of learning makes it possible to
carry out activities in complex educational environments
that would be impossible to implement with traditional
resources because of budget, time, infrastructure and
security limitations (Michael & Chen, 2005; Corti, 2006).

NetAid is an institution that develops games to teach
concepts of global citizenship and to raise awareness of
the fight against poverty. One of its first games,
released in 2002 and called NetAid World Class, consists
of taking the identity of a real child living in India and
resolving the real problems that confront poor children in
this region (Stokes, 2005). In 2003 the game was used by
40,000 students in different schools across the United
States. In the business and entertainment arena, many
games exist that can be used to reach educational goals.
Among the most popular ones, Brain Training by Nintendo
(Brain Training, 2007) challenges the user to improve her
mental fitness by doing memory, reasoning and mathematical
exercises. The final goal is to reach an optimal cerebral
age after some regular training.



AI TECHNIQUES IN EDUCATION

The intelligent educational systems reviewed above are 
based on a diversity of artificial intelligence techniques 
(Brusilovsky & Peylo, 2003). The most frequently 
used in the field of education are: (1) personalization 
mechanisms based on student and group models, (2) 
intelligent agents and agent-based systems, and (3) 
ontologies and semantic web techniques. 

Personalization Mechanisms 

Personalization techniques, which are the basis of
intelligent tutoring systems, involve the creation and use
of student models. Broadly speaking, these models imply
the construction of a qualitative representation of
student behavior in terms of existing background knowledge
about a domain (McCalla, 1992). These representations can
be further used in intelligent tutoring systems and
intelligent learning environments, and to develop
autonomous intelligent agents that may collaborate with
human students during the learning process. The
introduction of machine learning techniques makes it
easier to update and extend the first versions of student
models in order to adapt to the evolution of each student
as well as to possible changes and modifications of
contents and learning activities (Sison & Shimura, 1998).
The most popular student modeling techniques are (Beck,
Stern, & Haugsjaa, 1996) overlay models and Bayesian
network models. The first method consists of considering
the student model as a subset of the knowledge of an
expert in the domain in which the learning is taking
place. In fact, the degree of learning is measured by
comparing the knowledge acquired and represented in the
student model with the background initially stored in the
expert model. The second method deals with the
representation of the learning process as a network of
knowledge states. Once defined, the model should infer,
from the tutor-student interaction, the probability of the
student being in a certain state.
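The two student modeling techniques can be illustrated with a small sketch. It is a deliberate simplification under stated assumptions: the overlay model is reduced to a set comparison, and the Bayesian network is reduced to a single two-state node ("mastered" or "not mastered") updated with Bayes' rule. The function names, the skill names and the slip and guess probabilities are all hypothetical.

```python
def overlay_degree(student_knowledge, expert_knowledge):
    """Share of the expert's knowledge items already present in the student model."""
    expert = set(expert_knowledge)
    if not expert:
        raise ValueError("expert model is empty")
    return len(set(student_knowledge) & expert) / len(expert)

def update_mastery(prior, answered_correctly, p_slip=0.1, p_guess=0.2):
    """Posterior probability of mastery after observing one answer (Bayes' rule)."""
    if answered_correctly:
        like_mastered, like_not = 1.0 - p_slip, p_guess
    else:
        like_mastered, like_not = p_slip, 1.0 - p_guess
    evidence = prior * like_mastered + (1.0 - prior) * like_not
    return prior * like_mastered / evidence

expert = {"fractions", "decimals", "percentages", "ratios"}
student = {"fractions", "decimals"}
degree = overlay_degree(student, expert)

belief = 0.5                              # initial belief that a skill is mastered
after_correct = update_mastery(belief, True)
after_wrong = update_mastery(belief, False)
print(degree, round(after_correct, 3), round(after_wrong, 3))
```

In the overlay view the student is an incomplete copy of the expert, so progress is a coverage ratio. In the probabilistic view each interaction with the tutor shifts the belief about the student's state; a full Bayesian network would link many such nodes through their dependencies.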

Intelligent Agents and Agent-Based 
Systems 

Software agents are considered software entities, such as
software programs or robots, that present, to different
degrees, three main attributes: autonomy, cooperation and
learning (Nwana, 1996). Autonomy refers to the principle
that an agent can operate on its own (acting and deciding
upon its own representation of the world). Cooperation
refers to the ability to interact with other agents via
some communication language. Finally, learning is
essential to react to or interact with the external
environment. Teams of intelligent agents build up
MultiAgent Systems (MAS). In this type of system, each
agent has either incomplete information or limited
capabilities for solving the problem at hand. Another
important aspect is the lack of centralized global
control; therefore, data is distributed all over the
system and computation is asynchronous (Sycara, 1998).
Many important tasks can be carried out by intelligent
agents in the context of learning and educational systems
(Jafari, 2002; Sanchez, Lama, Amorim, Riera, Vila & Barro,
2003): the monitoring of inputs, outputs, and the activity
outcomes produced by the students; the verification of
deadlines during homework and exercise submission; the
automatic answering of student questions; and the
automatic grading of tests and surveys.



Research in this field is very active and faces am- 
bitious goals. In some decades it could be possible to 
dream about sci-fi environments in which the students 
would have brain interfaces to directly interact with an 
intelligent assistant (Koch, 2006), which would play 
the role of a tutor with a direct connection with learn- 
ing areas of the brain. 



Ontologies and Semantic Web 
Techniques 

Ontologies aim to capture and represent consensual
knowledge in a generic way, so that it may be reused and
shared across software applications (Gomez-Perez,
Fernandez-Lopez & Corcho, 2004). An ontology is composed
of concepts or classes and their attributes, the
relationships between concepts, the properties of these
relationships, and the axioms and rules that explicitly
represent the knowledge of a certain domain. In the
educational domain, several ontologies have been proposed:
(1) to describe the learning contents of technical
documents (Kabel, Wielinga, & de Hoog, 1999), (2) to model
the elements required for the design, analysis, and
evaluation of the interaction between learners in computer
supported cooperative learning (Inaba, Tamura, Ohkubo,
Ikeda, Mizoguchi & Toyoda, 2001), (3) to specify the
knowledge needed to define new collaborative learning
scenarios (Barros, Verdejo, Read & Mizoguchi, 2002), (4)
to formalize the semantics of learning objects that are
based on metadata standards (Brase & Nejdl, 2004), and (5)
to describe the semantics of learning design languages
(Amorim, Lama, Sanchez, Riera & Vila, 2006).
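The components of an ontology listed above (concepts with attributes, relationships, and rules) can be pictured with a toy data structure. The sketch below is only an illustration and does not follow any ontology language or any of the cited ontologies; the class names, the `is_a` relation, the transitivity rule and the example concepts are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)   # (subject, relation, object) triples

    def add_concept(self, name, **attributes):
        self.concepts[name] = Concept(name, dict(attributes))

    def relate(self, subject, relation, obj):
        if subject not in self.concepts or obj not in self.concepts:
            raise KeyError("both concepts must be declared first")
        self.relations.add((subject, relation, obj))

    def ancestors(self, name):
        """Apply a transitivity rule to the 'is_a' relationship."""
        found, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            for subject, relation, obj in self.relations:
                if subject == current and relation == "is_a" and obj not in found:
                    found.add(obj)
                    frontier.append(obj)
        return found

onto = Ontology()
onto.add_concept("LearningObject", has_title=str)
onto.add_concept("Exercise", difficulty=int)
onto.add_concept("Quiz", time_limit_minutes=int)
onto.relate("Exercise", "is_a", "LearningObject")
onto.relate("Quiz", "is_a", "Exercise")
print(onto.ancestors("Quiz"))
```

Because the relationships and the rule are explicit data rather than code buried in one application, another tool could load the same structure and reach the same conclusion that a quiz is a learning object, which is the kind of reuse and sharing the text refers to.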



FUTURE TRENDS 

The next generation of adaptive environments will
integrate pedagogical agents, enriched with data mining
and machine learning techniques, capable of providing a
cognitive diagnosis of the learners that will help to
determine the state of the learning process and then
optimize the selection of personalized learning designs.
Moreover, improved models of learners, facilitators, tasks
and problem-solving processes, combined with the use of
ontologies and reasoning engines, will facilitate the
execution of learning activities on either online
platforms or in traditional classroom settings.



CONCLUSION 

In this paper we have reviewed the state-of-art of the 
application of Artificial Intelligence techniques in the 
field of Education. AI approaches seem promising to 
improve the quality of the learning process and then 
to satisfy the new requirements of a rapidly changing 
society. Current Al-based systems such as intelligent 
tutoring systems, computer supported collaborative 
learning and educational games have already proved 
the possibilities of applying AI techniques. Future 
applications will both facilitate personalized learning 
styles and help the tasks of teachers and students in 
traditional classroom settings. 

KEY TERMS

Automatic Evaluation Systems: Applications 
focused on evaluating the strengths and weaknesses 
of students in different learning activities through as- 
sessment tests. 

Computer Supported Collaborative Learning
(CSCL): A research topic on supporting collaborative
learning methodologies with the help of computers and
collaborative tools.



Game-Based Learning: A new type of learning that
combines educational content and computer games in 
order to improve the satisfaction and performance of 
students when acquiring new knowledge and skills. 

Intelligent Tutoring Systems: Computer programs
that provide personalized/adaptive instruction to students
without the intervention of human beings.

Ontologies: Sets of concepts within a domain
that capture and represent consensual knowledge in a
generic way, and that may be reused and shared
across software applications.

Software Agents: Software entities, such as 
software programs or robots, characterized by their 
autonomy, cooperation and learning capabilities. 

Student Models: Representation of student be- 
havior and degree of competence in terms of existing 
background knowledge about a domain. 

INTRODUCTION

Breakwaters are coastal structures constructed to shelter 
a harbour basin from waves. There are two main types: 
rubble-mound breakwaters, consisting of various layers 
of stones or concrete pieces of different sizes (weights), 
making up a porous mound; and vertical breakwaters, 
impermeable and monolithic, usually composed of
concrete caissons. This article deals with rubble-mound 
breakwaters. 

A typical rubble-mound breakwater consists of an 
armour layer, a filter layer and a core. For the breakwater 
to be stable, the armour layer units (stones or concrete 
pieces) must not be removed by wave action. Stability 
is basically achieved by weight. Certain types of con- 
crete pieces are capable of achieving a high degree of 
interlocking, which contributes to stability by impeding 
the removal of a single unit. 

The forces that an armour unit must withstand un- 
der wave action depend on the hydrodynamics on the 
breakwater slope, which are extremely complex due 
to wave breaking and the porous nature of the struc- 
ture. A detailed description of the flow has not been 
achieved until now, and it is unclear whether it will 
be in the future in view of the turbulent phenomena 
involved. Therefore the instantaneous force exerted 
on an armour unit is not, at least for the time being, 
amenable to determination by means of a numerical 
model of the flow. For this reason, empirical formu- 
lations are used in rubble-mound design, calibrated 
on the basis of laboratory tests of model structures. 
However, these formulations cannot take into account
all the aspects affecting the stability, mainly because
the inherent complexity of the problem does not lend 
itself to a simple treatment. Consequently the empirical 
formulations are used as a predesign tool, and physical 
model tests in a wave flume of the particular design in 
question under the pertinent sea climate conditions are 
de rigueur, except for minor structures. The physical 
model tests naturally integrate all the complexity of the 
problem. Their drawback lies in that they are expensive 
and time consuming. 

In this article, Artificial Neural Networks are trained 
and tested with the results of stability tests carried out 
on a model breakwater. They are shown to reproduce 
very closely the behaviour of the physical model in 
the wave flume. Thus an ANN model, if trained and 
tested with sufficient data, may be used in lieu of the 
physical model tests. A virtual laboratory of this kind 
will save time and money with respect to the conven- 
tional procedure. 



BACKGROUND 

Artificial Neural Networks have been used in civil 
engineering applications for some time, especially in 
Hydrology (Ranjithan et al., 1993; Fernando and Jay- 
awardena, 1998; Govindaraju and Rao, 2000; Maier 
and Dandy, 2000; Dawson and Wilby, 2001; Cigizoglu, 
2004); some Ocean Engineering issues have also been 
tackled (Mase et al., 1995; Tsai et al., 2002; Lee and 
Jeng, 2002; Medina et al., 2003; Kim and Park, 2005; 
Yagci et al., 2005). Rubble-mound breakwater stability
is studied in Mase et al.'s (1995) pioneering work,
focusing on a particular stability formula. Medina et 
al. (2003) train and test an Artificial Neural Network 
with stability data from six laboratories. The inputs 
are the relative wave height, the Iribarren number and 
a variable representing the laboratory. Kim and Park 
(2005) compare different ANN models on an analysis 
revolving around one empirical stability formula, as did
Mase et al. (1995). Yagci et al. (2005) apply different
kinds of neural networks and fuzzy logic, characterising 
the waves by their height, period and steepness. 



PHYSICAL MODEL AND ANN MODEL 

The Artificial Neural Networks were trained and tested 
on the basis of laboratory tests carried out in a wave 
flume of the CITEEC Laboratory, University of La 
Coruña. The flume section is 4 m wide and 0.8 m high,
with a length of 33 m (Figure 1). Waves are generated 
by means of a piston-type paddle, controlled by an 
Active Absorption System (AWACS) which ensures 
that the waves reflected by the model are absorbed at 
the paddle. 

The model represents a typical three-layer rubble- 
mound breakwater in 15 m of water, crowned at +9.00 
m, at a 1:30 scale. Its slopes are 1:1.50 and 1:1.25
on the seaward and leeward sides, respectively. The
armour layer consists in turn of two layers of stones 
with a weight W=69 g ±10%; those in the upper layer 
are painted in blue, red and black following horizontal 
bands, while those in the lower layer are painted in 
white, in order to easily identify after a test the dam- 
aged areas, i.e., the areas where the upper layer has 
been removed. The filter layer is made up of a gravel
with a median size $D_{50} = 15.11$ mm and a thickness of
4 cm. Finally, the core consists of a finer gravel, with
$D_{50} = 6.95$ mm, $D_{15} = 5.45$ mm, and $D_{85} = 8.73$ mm,
and a porosity $n = 42\%$. The density of the stones and
gravel is $\gamma_r = 2700$ kg/m$^3$.

Waves were measured at six different stations along 
the longitudinal, or x-axis, of the flume. With the origin 
of x located at the rest position of the wave paddle, the 
first wave gauge, S1, was located at x=7.98 m. A group
of three sensors, S2, S3 and S4, was used to separate 
the incident and the reflected waves. The central wave 
gauge, S3, was placed at x=12.28 m, while the position 
of the others, S2 and S4, was varied according to the 
wave generation period of each test (Table 1). Another 
wave gauge, S5, was located 25 cm in front of the model 
breakwater toe, at x=13.47 m, and 16 cm to the right
(as seen from the wave paddle) of the flume centreline, 
so as not to interfere with the video recording of the
tests. Finally, a wave gauge (S6) was placed to the lee
of the model breakwater, at x=18.09 m.

Figure 1. Experimental set-up

Table 1. Relative water depth (kh), wave period (T), and separation between sensors S2, S3 and S4 in the stability tests

Both regular and irregular waves were used in the 
stability tests. This article is concerned with the eight 
regular wave tests, carried out with four different wave 
periods. The water depth in the flume was kept constant 
throughout the tests (h=0.5 m). Each test consisted of a
number of wave runs with a constant value of the wave 
period T, related to the wavenumber k by 

$$T = 2\pi \left[ g k \tanh(kh) \right]^{-1/2}$$

where g is the gravitational acceleration. The wave 
periods and relative water depths (kh) of the tests are 
shown in Table 1. 
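
The dispersion relation above is implicit in the wavenumber k, so in practice it has to be solved numerically for a given period. The following is a minimal sketch using Newton iteration; the function name, the tolerance and the example period of 1.5 s are illustrative assumptions (the actual test periods are those of Table 1, which is not reproduced here).

```python
import math

def wavenumber(T, h, g=9.81, tol=1e-12, max_iter=100):
    """Solve (2*pi/T)^2 = g*k*tanh(k*h) for k by Newton iteration."""
    omega2 = (2.0 * math.pi / T) ** 2
    k = omega2 / g  # deep-water wavenumber as a first guess
    for _ in range(max_iter):
        t = math.tanh(k * h)
        f = g * k * t - omega2                    # residual of the dispersion relation
        df = g * t + g * k * h * (1.0 - t * t)    # derivative of the residual w.r.t. k
        step = f / df
        k -= step
        if abs(step) < tol * max(k, 1.0):
            break
    return k

# Hypothetical period; h = 0.5 m is the flume water depth given in the text.
k = wavenumber(T=1.5, h=0.5)
relative_depth = k * 0.5  # the quantity kh used as ANN input
```

The relative water depth kh then follows directly by multiplying the result by the water depth.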

Each wave run consisted of 200 waves. In the first 
run of each test, the generated waves had a model 
height H=6 cm (corresponding to a wave height in the 
prototype H=1.80 m); in the subsequent runs, the wave
height was increased in steps of 1 cm (7 cm, 8 cm, 9 
cm, etc.), so that the model breakwater was subject to 
ever more energetic waves. 

Four damage levels (Losada et al., 1986) were
used to characterize the stability situation of the model 
breakwater after each wave run: 

(0) No damage. No armour units have been moved 
from their positions. 

(1) Initiation of damage. Five or more armour units 
have been displaced. 

(2) Iribarren damage. The displaced units of the 
armour's first (outer) layer have left uncovered 
an area of the second layer large enough for a 
stone to be removed by waves. 

(3) Initiation of destruction. The first unit of the 
armour's second layer has been removed by wave
action. 

As the wave height was increased through a test,
the damage level also increased from the initial 'no
damage' to 'initiation of damage', 'Iribarren damage',
and eventually 'initiation of destruction', at which
point the test was terminated and the model rebuilt for 
the following test. The number of wave runs in a test 
varied from 10 to 14. 

The foregoing damage levels provide a good semi- 
quantitative assessment of the breakwater stability 
condition. However, the following nondimensional
damage parameter is better suited to the Artificial
Neural Network model:



$$S = \frac{n D_{50}}{(1-p)\,b}$$

where $D_{50}$ is the median size of the armour stones, $p$
is the porosity of the armour layer, $b$ is the width of
the model breakwater, and $n$ is the number of units
displaced after each wave run. In this case, $D_{50} = 2.95$
cm, $p = 0.40$, and $b = 50$ cm.
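
As a quick numerical illustration of the damage parameter, the sketch below evaluates it with the values quoted in the text. The function name and the example of five displaced units are assumptions made for illustration only.

```python
def damage_parameter(n_displaced, d50=0.0295, porosity=0.40, width=0.50):
    """Nondimensional damage S = n * D50 / ((1 - p) * b), lengths in metres."""
    return n_displaced * d50 / ((1.0 - porosity) * width)

# Five displaced units corresponds to the 'initiation of damage' threshold.
s_five_units = damage_parameter(5)
```

With these values each displaced armour unit adds roughly 0.1 to S.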

The incident wave height was nondimensionalized 
by means of the zero-damage wave height of the SPM 
(1984) formulation, 



$$H_0 = \left( \frac{W K_D \cot\alpha}{\gamma_r} \right)^{1/3} \left( \frac{\gamma_r}{\gamma_w} - 1 \right)$$

where $K_D = 4$ is the stability coefficient, $\gamma_w = 1000$ kg/m$^3$
is the water density (freshwater used in the laboratory
tests), and $\alpha$ is the breakwater slope. With these values,
$H_0 = 9.1$ cm. The nondimensional incident wave
height is given by

$$H^* = \frac{H}{H_0}$$

where $H$ stands for the incident wave height.
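
The zero-damage wave height can be checked numerically from the model data given above (W = 69 g, seaward slope 1:1.50, stone density 2700 kg/m³). This is a minimal sketch; the function and variable names are illustrative assumptions.

```python
def zero_damage_wave_height(W=0.069, K_D=4.0, cot_alpha=1.5,
                            gamma_r=2700.0, gamma_w=1000.0):
    """Zero-damage wave height in metres; W in kg, densities in kg/m^3."""
    return (W * K_D * cot_alpha / gamma_r) ** (1.0 / 3.0) * (gamma_r / gamma_w - 1.0)

H0 = zero_damage_wave_height()       # about 0.091 m, i.e. 9.1 cm
H_star_first_run = 0.06 / H0         # first run of each test, H = 6 cm
```

The first wave run of each test (H = 6 cm) therefore starts at about two thirds of the zero-damage wave height.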

Most of the previous applications of Artificial 
Neural Networks in Civil Engineering use multilayer 
feedforward networks trained with the backpropagation 
algorithm (Freeman and Skapura, 1991; Johansson et 
al., 1992), which will also be employed in this study; 
their main advantage lies in their generalisation capa- 
bilities. Thus this kind of network may be used, for 
instance, to predict the armour damage that a model 
breakwater will sustain under certain conditions, even 
if these conditions were not exactly part of the data set 
with which the network was trained. However, the parameters
describing the conditions (e.g., wave height
and period) must be within the parameter ranges of the
stability tests with which the ANN was trained. 

In this case, the results from the stability tests of the
model rubble-mound breakwater described above were 
used to train and test the Artificial Neural Network. The 
eight stability tests comprised 96 wave runs. The input 
to the network was the nondimensional wave height 
(H*) and the relative water depth (kh) of a wave run, 
and the output, the resulting nondimensional damage 
parameter (S). Data from 49 wave runs, corresponding 
to the four stability tests T20, T21, T22, and T23, were 
used for training the network; while data from 46 wave 
runs, pertaining to the remaining four tests (T10, T11,
T12, and T13) were used for testing it. This distribution
of data made sure that each of the four wave generation 
periods (Table 1) was present in both the training and 
the testing data sets. 

First, an Artificial Neural Network with 10 sigmoid 
neurons in the hidden layer and a linear output layer was 
trained and tested 10 times. The ANN was trained by 
means of the Bayesian Regularisation method (MacKay, 
1992), known to be effective in avoiding overfitting. 
The average MSE values were 0.2880 considering all 
the data, 0.2224 for the training data set, and 0.3593 
for the testing data set. The standard deviations of
the MSE values were $5.9651\times10^{-10}$, $9.0962\times10^{-10}$, and
$7.7356\times10^{-10}$, for the complete data set, the training
and the testing data sets, respectively. Increasing the
number of neural units in the hidden layer to 15 did 
not produce any significant improvement in the aver- 
age MSE values (0.2879, 0.2222 and 0.3593 for all 
the data, the training data set and the testing data set, 
respectively), so the former Artificial Neural Network, 
with 10 neurons in the hidden layer, was retained. 
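
The architecture described above (two inputs, 10 sigmoid hidden neurons, one linear output) can be sketched in a few lines. The sketch below is only an illustration under explicit assumptions: it trains by plain stochastic gradient descent with L2 weight decay as a simple stand-in for Bayesian Regularisation, and it uses synthetic data, not the laboratory measurements. All names and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_in, n_hidden, rng):
    """One hidden layer of sigmoid units and a single linear output."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    y = sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]
    return hidden, y

def mse(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)

def train(net, data, epochs=500, lr=0.05, decay=1e-4):
    """Backpropagation with L2 weight decay (NOT Bayesian Regularisation)."""
    for _ in range(epochs):
        for x, t in data:
            hidden, y = forward(net, x)
            err = y - t  # derivative of 0.5 * err^2 with respect to y
            for j, h in enumerate(hidden):
                # Error backpropagated to hidden unit j (uses the old W2[j]).
                grad_hidden = err * net["W2"][j] * h * (1.0 - h)
                net["W2"][j] -= lr * (err * h + decay * net["W2"][j])
                for i, xi in enumerate(x):
                    net["W1"][j][i] -= lr * (grad_hidden * xi + decay * net["W1"][j][i])
                net["b1"][j] -= lr * grad_hidden
            net["b2"] -= lr * err
    return net

# Synthetic stand-in data: inputs (H*, kh), target a made-up damage curve.
data = [((hs, kh), 4.0 * max(0.0, hs - 1.0) ** 2)
        for hs in [0.7 + 0.1 * i for i in range(11)]
        for kh in (0.6, 0.9, 1.2, 1.5)]

rng = random.Random(0)
net = init_network(n_in=2, n_hidden=10, rng=rng)
mse_before = mse(net, data)
train(net, data)
mse_after = mse(net, data)
```

Weight decay penalises large weights in a fixed way, whereas Bayesian Regularisation adapts the strength of that penalty during training; the sketch therefore shows the network structure and the backpropagation update, not the exact training procedure of the study.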

The following results correspond to a training and 
testing run of this ANN with a global MSE of 0.2513. 
The linear regression analysis indicates that the ANN 
data fit very well to the experimental data over the whole 
range of the nondimensional damage parameter S. In 
effect, the correlation coefficient is 0.983, and the equation
of the best linear fit, y = 0.938x - 0.00229, is very
close to that of the diagonal line y = x (Figure 2).

The results obtained with the training data set (sta- 
bility tests T20, T21, T22 and T23) show an excellent 
agreement between the ANN model and the physical 
model (Figure 3). In three of the four tests (T20, T22 
and T23) the ANN data mimic the measurements on the 
model breakwater almost to perfection. In test T21, the
physical model experiences an abrupt increase in the
damage level at H* = 1.65, which is slightly smoothed
by the ANN model. The MSE value is 0.1441. 



Figure 2. Regression analysis. Complete data set. 

The testing data set also comprised four stability
tests (T10, T11, T12 and T13). The inherent difficulty
of the problem is apparent in test T11 (Figure 4), in
which the nondimensional damage parameter (S) does 
not increase in the wave run at H* =1.54, but sud- 
denly soars by about 100% in the next wave run, at 
H* =1.65. Such differences from one wave run to the 
next are practically impossible to capture by the ANN 
model, given that the inputs to the ANN model either 
vary only slightly, by less than 7% in this case (the 
nondimensional wave height, H*) or do not vary at all
(the relative water depth, kh). It should be remembered
that, when computing the damage after a given wave 
run, the ANN does not have any information about the 
damage level before that wave run, unlike the physical 
model. Yet the ANN performs well, yielding an MSE 
value of 0.3678 with the testing data set. 



FUTURE TRENDS 

In this study, results from stability tests carried out with 
regular waves were used. Irregular wave tests should 
also be analyzed by means of Artificial Intelligence, 
and it is the authors' intention to do so in the future. 
Breakwater characteristics are another important aspect 
of the problem. The ANN cannot extrapolate beyond 
the ranges of wave and breakwater characteristics on 
which it was trained. The stability tests used for this 
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Figure 3. ANN (U) and physical model results (o) for the stability tests T20, T21, T22 and T23 (training data 
set) 
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Figure 4. ANN (\J) and physical model results (o) for the stability tests T10, Til, T12 and T13 (testing data 
set) 
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study considered one model breakwater; further tests 
involving physical models with other geometries and 
materials should be undertaken. Once the potential of 
Artificial Neural Networks to model the behaviour of 
a rubble-mound breakwater subject to wave action has 
been proven, a virtual laboratory could be constructed 
with the results from these tests. 



CONCLUSION 

This article shows that Artificial Neural Networks are 
capable of modelling the behaviour of a model rubble- 
mound breakwater in the face of energetic waves. This 
is a very complex problem for a number of reasons. 
In the first place, the hydrodynamics of waves break- 
ing on a slope are not well known, so much so that a 
detailed characterization of the motions of the water 
particles is not possible for the time being, and may 
remain so in the future due to the chaotic nature of the 
processes involved. Second, in the case of a rubble- 
mound breakwater the problem is further compounded 
by the porous nature of the structure, which brings about 
a complex wave-structure interaction in which the flux 
of energy carried by the incident wave is distributed 
into the following processes: (i) wave reflection; (ii) 
wave breaking on the slope; (iii) wave transmission 
through the porous medium; and (iv) dissipation. The 
subtle interplay between all these processes means that 
it is not possible to study one of them without taking 
the others into account. Third, the porous medium itself 
is of a stochastic nature: no two rubble-mound break- 
waters can be said to be identical. This complexity has 
precluded up to now the development of a numerical 
model which can reliably analyse the forces acting on 
the armour layer units and hence the stability situation 
of the breakwater. As a consequence, physical model 
tests are a necessity whenever a major rubble-mound 
structure is envisaged. 

Notwithstanding the difficulty of the problem, the 
Artificial Neural Network used in this work has been 
shown to reproduce very closely the physical model 
results. Thus, an Artificial Neural Network can con- 
stitute, once properly trained and validated, a virtual 
laboratory. Testing a breakwater in this virtual laboratory
is much quicker and far less expensive than testing
a physical model of the same structure in a laboratory 
wave flume. 

KEY TERMS

Armour Damage: Extraction of stones or concrete 
units from the armour layer by wave action. 

Armour Layer: Outer layer of a rubble-mound 
breakwater, consisting of heavy stones or concrete 
blocks. 

Artificial Neural Networks: Interconnected set 
of many simple processing units, commonly called 
neurons, that use a mathematical model representing 
an input/output relation. 

Backpropagation Algorithm: Supervised learning
technique used by ANNs that iteratively modifies
the weights of the connections of the network so that the
error obtained by comparing the network's
outputs with the desired ones decreases.

Breakwater: Coastal structure built for sheltering 
an area from waves, usually for loading or unloading 
vessels. 

Reflection: The process by which the energy of the 
incoming waves is returned seaward. 

Significant Wave Height: In wave record analysis, 
the average height of the highest one-third of a selected 
number of waves. 

INTRODUCTION

This article describes the most prominent approaches
to applying artificial intelligence technologies to information
retrieval (IR). Information retrieval is a
key technology for knowledge management. It deals
with the search for information and the representation, 
storage and organization of knowledge. Information 
retrieval is concerned with search processes in which a 
user needs to identify a subset of information which is 
relevant for his information need within a large amount 
of knowledge. The information seeker formulates a 
query trying to describe his information need. The query 
is compared to document representations which were 
extracted during an indexing phase. The representations 
of documents and queries are typically matched by a 
similarity function such as the Cosine. The most similar 
documents are presented to the users who can evaluate 
the relevance with respect to their problem (Belkin, 
2000). The difficulty of properly representing documents
and of matching imprecise representations soon led to
the application of techniques developed within Artificial
Intelligence to information retrieval.



BACKGROUND 

In the early days of computer science, information 
retrieval (IR) and artificial intelligence (AI) developed 
in parallel. In the 1980s, they started to cooperate and 
the term intelligent information retrieval was coined 
for AI applications in IR. In the 1990s, information
retrieval saw a shift from set-based Boolean
retrieval models to ranking systems like the vector 
space model and probabilistic approaches. These 
approximate reasoning systems opened the door for 
more intelligent value added components. The large 
amount of text documents available in professional 
databases and on the internet has led to a demand for 
intelligent methods in text retrieval and to considerable 
research in this area. The need for better preprocessing 
to extract more knowledge from data has become an
important way to improve systems. Off-the-shelf approaches
promise worse results than systems adapted
to users, domain and information needs. Today, most 
techniques developed in AI have been applied to re- 
trieval systems with more or less success. When data 
from users is available, systems often use machine
learning to optimize their results. 

Artificial Intelligence Methods in 
Information Retrieval 

Artificial intelligence methods are employed throughout 
the standard information retrieval process and for 
novel value added services. The first section gives a 
brief overview of information retrieval. The subsequent 
sections are organized along the steps in the retrieval 
process and give examples for applications. 

Information Retrieval 

Information retrieval deals with the storage and 
representation of knowledge and the retrieval of 
information relevant for a specific user problem. The 
information seeker formulates a query trying to de- 
scribe his information need. The query is compared 
to document representations. The representations 
of documents and queries are typically matched by a 
similarity function such as the Cosine or the Dice coef- 
ficient. The most similar documents are presented to 
the users who can evaluate the relevance with respect 
to their problem. 
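
The matching step described above can be sketched as follows, assuming documents and queries are stored as sparse term-weight vectors (Python dictionaries). The weighted form of the Dice coefficient and all example terms and weights are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def dice(a, b):
    """Dice coefficient generalised to weighted vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values())
    return 2.0 * dot / denom if denom else 0.0

query = {"neural": 1.0, "network": 1.0}
docs = {
    "d1": {"neural": 2.0, "network": 1.0, "training": 1.0},
    "d2": {"breakwater": 3.0, "wave": 2.0},
}
# Present the most similar documents first.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Both functions return 0 for documents that share no term with the query, so such documents fall to the bottom of the ranking.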

Indexing usually consists of several phases.
After word segmentation, stopwords are removed. 
These common words like articles or prepositions 
contain little meaning by themselves and are ignored 
in the document representation. Second, word forms 
are transformed into their basic form, the stem. During 
the stemming phase, e.g. houses would be transformed 
into house. For the document representation, different 
word forms are usually not necessary. Words differ in
their importance for a document: some describe its
content better than others. This importance is expressed
as a weight, determined by the frequency of a stem
within the text of a document (Savoy, 2003).
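
The indexing phases just described (word segmentation, stopword removal, stemming, frequency-based weighting) might be sketched as below. The stopword list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins; a production system would rely on a full stopword list and an established stemmer.

```python
import re
from collections import Counter

# Illustrative stopword list, far shorter than any real one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "are"}

def toy_stem(word):
    """Crude plural stripping, e.g. 'houses' -> 'house' (not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_document(text):
    """Return a mapping {stem: term frequency} for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())             # word segmentation
    stems = [toy_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(stems)                                     # raw frequency as weight

weights = index_document("The houses on the hill and the house in the valley")
```

Here the two word forms "houses" and "house" are merged into a single stem with frequency 2, while the stopwords leave no trace in the representation.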

In multimedia retrieval, the context is essential 
for the selection of a form of query and document 
representation. Different media representations may 
be matched against each other or transformations may 
become necessary (e.g. to match terms against pictures 
or spoken language utterances against documents in 
written text). 

As information retrieval needs to deal with 
vague knowledge, exact processing methods are 
not appropriate. Vague retrieval models like the 
probabilistic model are more suitable. Within these 
models, terms are provided with weights corresponding 
to their importance for a document. These weights 
mirror different levels of relevance. 

The results of current information retrieval systems
are usually sorted lists of documents where the top
results are more likely to be relevant according to the
system. In some approaches, the user can judge the
documents returned to him and tell the system which
ones are relevant for him. The system then re-sorts
the result set. Documents which contain many of the 
words present in the relevant documents are ranked 
higher. This relevance feedback process is known to 
greatly improve the performance. Relevance feedback 
is also an interesting application for machine learning. 
Based on human decisions, the optimization step can
be modeled with several approaches, e.g. with rough
sets (Singh & Dey, 2005). In Web environments, a click
is often interpreted as an implicit positive relevance 
judgment (Joachims & Radlinski, 2007). 
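
A very simple form of the relevance feedback step can be sketched as follows: documents are re-sorted by the number of terms they share with the documents the user marked as relevant. This overlap count is an illustrative simplification, not the rough-set or machine learning approaches cited above; all names and data are hypothetical.

```python
def rerank_with_feedback(results, relevant_ids):
    """Re-sort results by term overlap with the documents judged relevant.

    `results` maps a document id to its set of index terms, in the
    order of the original ranking.
    """
    feedback_terms = set()
    for doc_id in relevant_ids:
        feedback_terms |= results[doc_id]

    def overlap(doc_id):
        return len(results[doc_id] & feedback_terms)

    # sorted() is stable, so ties keep their original ranking order.
    return sorted(results, key=overlap, reverse=True)

results = {
    "d1": {"wave", "flume"},
    "d2": {"query", "ranking"},
    "d3": {"ranking", "feedback", "query"},
    "d4": {"query"},
}
new_ranking = rerank_with_feedback(results, relevant_ids=["d3"])
```

In a Web setting, the list of relevant documents could be filled from clicks rather than from explicit judgments.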

Advanced Representation Models 

In order to represent documents in natural language, the 
content of these documents needs to be analyzed. This 
is a hard task for computer systems. Robust semantic 
analysis for large text collections or even multime- 
dia objects has yet to be developed. Therefore, text 
documents are represented by natural language terms 
mostly without syntactic or semantic context. This is 
often referred to as the bag-of-words approach. These 
keywords or terms can only imperfectly represent an 
object because their context and relations to other 
terms are lost. 

However, great progress has been made and systems 
for semantic analysis are getting competitive. Advanced 
syntactic and semantic parsing for robust processing
of mass data has been derived from computational
linguistics (Hartrumpf, 2006). 

For application and domain specific knowledge, 
another approach is taken to improve the representation 
of documents. The representation scheme is enriched 
by exploiting knowledge about concepts of the domain 
(Lin & Demner-Fushman, 2006). 

Match Between Query and Document 

Once the representation has been derived, a crucial 
aspect of an information retrieval system is the 
similarity calculation between query and document 
representation. Most systems use mathematical simi- 
larity functions such as the Cosine. The decision for 
a specific function is based on heuristics or empirical 
evaluations. Several approaches use machine learning 
for long term optimization of the matching between 
term and document. For example, one approach applies a genetic
algorithm to adapt a weighting function to a collection
(Almeida et al., 2007).

Neural networks have been applied widely in IR.
Several network architectures have been applied to
retrieval tasks; the so-called spreading activation
networks are used most often. Spreading activation networks
are simple Hopfield-style networks; however, they do
not use the learning rule of Hopfield networks. They
typically consist of two layers representing terms and 
documents. The weights of connections between the 
layers are bi-directional and initially set according to 
the results of the traditional indexing and weighting 
algorithms (Belkin, 2000). The neurons corresponding 
to the terms of the user's query are activated in the term 
layer and activation spreads along the weights into 
the document layer and back. Activation represents 
relevance or interest and reaches potentially relevant 
terms and documents. The most highly activated documents
are presented to the user as the result. A closer look
at the models reveals that they very much resemble 
the traditional vector space model of Information 
Retrieval (Mandl, 2000). It is not until after the second 
step that the associative nature of the spreading activation
process leads to results different from a vector space 
model. The spreading activation networks successfully 
tested with mass data do not take advantage of this 
associative property. In some systems the process is 
halted after only one step from the term layer into the 
document layer, whereas others make one more step 
back to the term layer to facilitate learning (Kwok &
Grunfeld, 1996). 
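
The behaviour of a two-layer spreading activation network can be illustrated with a minimal sketch. It assumes unnormalised activations and a tiny hypothetical collection; real systems typically normalise or damp the activation between steps.

```python
def spread(weights, query_terms, steps=1):
    """Propagate activation from the term layer into the document layer.

    `weights[doc][term]` is the bidirectional connection weight. For
    steps > 1, activation flows back to the term layer between passes.
    """
    term_act = {t: 1.0 for t in query_terms}
    doc_act = {}
    for step in range(steps):
        if step > 0:
            # Documents -> terms: terms of activated documents become active.
            term_act = {}
            for doc, act in doc_act.items():
                for term, w in weights[doc].items():
                    term_act[term] = term_act.get(term, 0.0) + w * act
        # Terms -> documents: weighted sum, as in a vector space dot product.
        doc_act = {doc: sum(w * term_act.get(t, 0.0) for t, w in terms.items())
                   for doc, terms in weights.items()}
    return doc_act

weights = {
    "d1": {"neural": 1.0, "network": 1.0},
    "d2": {"network": 1.0, "hopfield": 1.0},
    "d3": {"hopfield": 1.0},
}
one_step = spread(weights, ["neural"])            # d2 receives no activation
two_steps = spread(weights, ["neural"], steps=2)  # d2 reached via 'network'
```

After one step the result is just the dot product of query and document vectors; only the second pass lets the document d2, which does not contain the query term, receive activation through the shared term "network".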

Queries in information retrieval systems are 
usually short and contain few words. Longer queries 
have a higher probability of achieving good results. As a
consequence, systems try to add good terms to a query 
entered by a user. Several techniques have been applied. 
Either these terms are taken from top ranked documents 
or terms similar to the original ones are used. Another 
technique is to use terms from documents from the same 
category. For this task, classification algorithms from 
machine learning are used (Sebastiani, 2002). 

Link analysis applies well known measures from 
bibliometric analysis to the Web. The number of links
pointing to a Web page is used as an indicator for its 
quality (Borodin et al., 2005). PageRank assigns an 
authority value to each Web page which is primarily a 
function of its back links. Additionally, it assumes that 
links from pages with high authority should be weighted
higher and should result in a higher authority for the 
receiving page. To account for the different values 
each page has to distribute, the algorithm is carried 
out iteratively until the result converges (Borodin et 
al., 2005). Machine Learning approaches complement 
link analysis. Decisions of humans about the quality 
of Web pages are used to determine design features of 
these pages which are good indicators of their quality. 
Machine learning models are applied to determine the 
quality of pages not yet judged (Mandl, 2006; Marti
& Hearst, 2002).
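
The iterative computation of authority values can be sketched with the usual power-iteration form of PageRank. The damping factor of 0.85 and the treatment of pages without out-links are common conventions assumed here, not details taken from the text; the three-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iterate authority values until they converge.

    `links[p]` lists the pages that p links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page distributes its authority equally over its links.
                new[t] += damping * rank[p] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this example page c, with two back links, ends up with the highest authority, and page a benefits from being linked by the high-authority page c.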

Learning from users has been an important strategy 
to improve systems. In addition to the content, artificial 
intelligence methods have been used to improve the 
user interface. 

Value Added Components for User 
Interfaces 

Several researchers have implemented information
retrieval systems based on the Kohonen self organiz- 
ing map (SOM), a neural network model for unsuper- 
vised classification. They provide an associative user 
interface where neighborhood of documents expresses 
a semantic relation. Implementations for large collec- 
tions can be tested on the internet (Kohonen, 1998). 
The SOM consists of a usually two-dimensional grid 
of neurons, each associated with a weight vector. Input 
documents are classified according to the similarity
between the input pattern and the weight vectors, and
the algorithm adapts the weights of the winning neuron
and its neighbors. In that way, neighboring clusters have
a high similarity.
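
A single adaptation step of a self-organizing map can be sketched as follows. The grid size, learning rate, square neighbourhood and the three-term input vector are illustrative assumptions; practical implementations shrink the learning rate and neighbourhood radius over time.

```python
import random

def best_matching_unit(grid, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return min(cells, key=lambda rc: sum((w - xi) ** 2
                                         for w, xi in zip(grid[rc[0]][rc[1]], x)))

def som_update(grid, x, lr=0.5, radius=1):
    """Move the winning neuron and its grid neighbours towards input x."""
    br, bc = best_matching_unit(grid, x)
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if abs(r - br) <= radius and abs(c - bc) <= radius:
                grid[r][c] = [w + lr * (xi - w) for w, xi in zip(grid[r][c], x)]
    return br, bc

rng = random.Random(1)
# 4 x 4 grid of neurons, each with a random 3-dimensional weight vector.
grid = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
x = [1.0, 0.0, 0.0]  # hypothetical document vector over three terms

winner = best_matching_unit(grid, x)
before = sum((w - xi) ** 2 for w, xi in zip(grid[winner[0]][winner[1]], x))
som_update(grid, x)
after = sum((w - xi) ** 2 for w, xi in zip(grid[winner[0]][winner[1]], x))
```

Because the neighbours are pulled towards the same input as the winner, adjacent neurons end up with similar weight vectors, which is what gives neighbouring clusters their similarity.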

The information retrieval applications of SOMs 
classify documents and assign the dominant term as 
name for the cluster. For real-world large-scale collections,
one two-dimensional grid is not sufficient: either
the grid would be too big or each node would contain
too many documents. Neither would be
helpful for users; therefore, a layered architecture is
adopted. The highest layer consists of nodes which
represent clusters of documents. The documents of 
these nodes are again analyzed by a SOM. For the 
user, the system consists of several two-dimensional 
maps of terms where similar terms are close to each 
other. After choosing one node, he may reach another 
two-dimensional SOM. 

The information retrieval paradigm for the SOM is 
browsing and navigating between layers of maps. The 
SOM seems to be a very natural visualization. However, 
the SOM approach has some serious drawbacks. 

• The interface for interacting with several layers
of maps makes the system difficult to browse.
Users of large text collections primarily need
search mechanisms, which the SOM itself does
not offer.

• The similarity of the document collection is
reduced to two dimensions, omitting many potentially
interesting aspects.

• The SOM unfolds its advantages for human-computer
interaction better for a small number
of documents. A very encouraging application
would be the clustering of the result set. The
neurons would fit on one screen, the number of
terms would be limited and, therefore, the reduction
to two dimensions would not omit so many
aspects.

User Classification and Personalization 

Adaptive information retrieval approaches intend to 
tailor the results of a system to one user and his inter- 
ests and preferences. The most popular representation 
scheme relies on the representation scheme used in 
information retrieval where a document-term-matrix 
stores the importance or weight of each term for each 
document. When a term appears in a document, this
weight should be different from zero. User interest can
also be stored like a document. Then the interest is a
vector of terms. These terms can be ones that a user
has entered or selected in a user interface or which the
system has extracted from documents for which the
user has shown interest by viewing or downloading
them (Agichtein et al., 2006).

An example for such a system is UCAIR which 
can be installed as a browser plugin. UCAIR relies 
on a standard web search engine to obtain a search 
result and a primary ranking. This ranking is then
modified by re-ranking the documents based
on implicit feedback and a stored user interest profile 
(Shen et al., 2005). 
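
The idea of re-ranking a primary result list with a stored term-vector profile can be sketched as below. This is not the UCAIR algorithm; the cosine similarity, the linear mixing parameter `mix`, and all terms and scores are assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rerank(results, profile, mix=0.5):
    """results: list of (doc_id, engine_score, term_vector).

    Returns document ids, best first, after blending the engine score
    with the similarity between the document and the user profile.
    """
    scored = []
    for doc_id, engine_score, terms in results:
        score = (1 - mix) * engine_score + mix * cosine(terms, profile)
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

# Hypothetical user interest profile and primary ranking.
profile = {"jaguar": 1.0, "car": 0.8, "engine": 0.5}
results = [
    ("d1", 0.9, {"jaguar": 1.0, "cat": 0.9, "jungle": 0.7}),
    ("d2", 0.8, {"jaguar": 1.0, "car": 0.9, "dealer": 0.4}),
]
personalized = rerank(results, profile)          # profile favors d2
unpersonalized = rerank(results, profile, mix=0)  # engine order kept
```

With `mix=0` the primary ranking is returned unchanged, so the parameter controls how strongly the stored profile may override the search engine.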

Most systems use this method of storing the user 
interest in a term vector. However, this method has 
several drawbacks. The interest profile may not be stable 
and the user may have a variety of diverging interests 
for work and leisure which are mixed in one profile. 

Advanced individualization techniques personal- 
ize the underlying system functions. The results of 
empirical studies have shown that relevance feedback 
is an effective technique to improve retrieval quality. 
Learning methods for information retrieval need to 
extend the range of relevance feedback effects beyond 
the modification of the query in order to achieve long- 
term adaptation to the subjective point of view of the 
user. The mere change of the query often results in 
improved quality; however, the information is lost after 
the current session. 

Some systems change the document representation 
according to the relevance feedback information. In 
a vector space metaphor, the relevant documents are 
moved toward the query representation. This approach
also has some problems. Because only a fraction
of the documents are affected by the modifications, the 
basic data from the indexing process is changed to a 
somewhat heterogeneous state. The original indexing 
result is not available anymore. 
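
A minimal sketch of this kind of document modification is given below. The update rule (a fixed fraction `rate` of the distance to the query) and the example vectors are assumptions; actual systems may use different formulas.

```python
def move_toward_query(doc_vec, query_vec, rate=0.2):
    """Shift a relevant document's term weights a fraction `rate` toward the query."""
    terms = set(doc_vec) | set(query_vec)
    moved = {}
    for t in terms:
        old = doc_vec.get(t, 0.0)
        moved[t] = old + rate * (query_vec.get(t, 0.0) - old)
    return moved

# Hypothetical index entry and query.
doc = {"neural": 0.8, "network": 0.6}
query = {"neural": 1.0, "retrieval": 1.0}
moved = move_toward_query(doc, query, rate=0.5)
```

After the update, the document carries a weight for "retrieval" that the indexing process never assigned, while documents without feedback keep their original weights. This is the heterogeneous state mentioned above.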

Certainly, this technique is inadequate for fusion 
approaches where several retrieval methods are com- 
bined. In this case, several basic representations would 
need to be changed according to the influence of the 
corresponding methods on the relevant documents. 
The indexes are usually heterogeneous, which is often 
considered an advantage of fusion approaches. A high
computational overhead would be the consequence.

The MIMOR (Multiple Indexing and Method-Object
Relations) approach does not rely on changes to the
document or the query representation when processing
relevance feedback information for personalization.
Instead, it focuses on the central aspect of a retrieval
function, the calculation of the similarity between
document and query. Like other fusion methods, MIMOR
accepts the results of individual retrieval systems as if
from a black box. These results are fused by a linear
combination which is stored across many sessions. The
weights for the systems experience a change through 
learning. They adapt according to relevance feedback 
information provided by users and create a long-term 
model for future use. That way, MIMOR learns which 
systems were successful in the past (Mandl & Womser- 
Hacker, 2004). 
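
A linear fusion with learned weights can be sketched as follows. The text only states that the weights adapt through relevance feedback; the error-driven update rule, the renormalization, and the system names `system_a` and `system_b` are assumptions for illustration and are not taken from the MIMOR publications.

```python
def fuse(system_scores, weights):
    """Linear combination of the scores that each system gave one document."""
    return sum(weights[name] * score for name, score in system_scores.items())

def update_weights(weights, system_scores, relevant, rate=0.1):
    """Adapt the system weights from one relevance judgment (assumed rule).

    Systems whose score agrees with the judgment gain weight; the result
    is clipped at zero and renormalized so the weights sum to one.
    """
    target = 1.0 if relevant else 0.0
    error = target - fuse(system_scores, weights)
    new = {name: max(0.0, weights[name] + rate * error * system_scores[name])
           for name in weights}
    total = sum(new.values())
    if total == 0:
        return dict(weights)
    return {name: w / total for name, w in new.items()}

# Two hypothetical black-box retrieval systems, equally weighted at first.
weights = {"system_a": 0.5, "system_b": 0.5}
scores_for_doc = {"system_a": 0.9, "system_b": 0.2}

# The user marks the document as relevant: system_a was right to rank it high.
learned = update_weights(weights, scores_for_doc, relevant=True)
```

The learned weights would be stored between sessions, so that systems which were successful in the past contribute more to future fused rankings.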

FUTURE TRENDS 

Information retrieval systems are applied in more and 
more complex and diverse environments. Searching
e-mail, social computing collections and other specific
domains poses new challenges which lead to innovative
systems. These retrieval applications require thorough
and user oriented evaluation. New evaluation measures 
and standardized test collections are necessary to 
achieve reliable evaluation results. 

In user adaptation, recommendation systems are an 
important trend for future improvement. Recommenda- 
tion systems need to be seen in the context of social 
computing applications. System developers face the 
growth of user generated content which allows new 
reasoning methods. 

New applications like question answering relying on
more intelligent processing can be expected to gain more
market share in the near future (Hartrumpf, 2006).



CONCLUSION 

Knowledge management is of central importance for
the information society. Documents written in natural
language contain an important share of the
knowledge available. Consequently, retrieval is crucial for
the success of knowledge management systems. AI
technologies have been widely applied in retrieval
systems. Exploiting knowledge more efficiently is a
major research field. In addition, user oriented value
added systems require intelligent processing and
machine learning in many forms.

An important future trend for AI methods in IR will
be the context specific adaptation of retrieval methods.
Machine learning can be applied to find optimized
functions for collections or queries.

KEY TERMS

Adaptation: Adaptation is a process of modification 
based on input or observation. An information system 
should adapt itself to the specific needs of individual 
users in order to produce optimized results. 

Indexing: Indexing means the assignment of terms 
(words) which represent a document in an index. In- 
dexing can be carried out manually or automatically. 
Automatic indexing requires the elimination of stop- 
words and stemming. 

Information Retrieval: Information retrieval is
concerned with the representation of knowledge and
subsequent search for relevant information within these
knowledge sources. Information retrieval provides the 
technology behind search engines. 

Link Analysis: The links between pages on the web
are a large knowledge source which is exploited by link
analysis algorithms for many ends. Many algorithms
similar to PageRank determine a quality or authority
score based on the number of in-coming links of a
page. Furthermore, link analysis is applied to identify
thematically similar pages, web communities and other
social structures.

Recommendation Systems: Actions or content are
suggested to the user based on past experience collected
from other users. Very often, documents are recom- 
mended based on similarity profiles between users. 

Term Expansion: Terms not present in the original 
query to an information retrieval system entered by the 
user are added automatically. The expanded query is 
then sent to the system again. 

Weighting: Weighting determines the importance 
of a term for a document. Weights are calculated using 
many different formulas which consider the frequency 
of each term in a document and in the collection as 
well as the length of the document and the average or 
maximum length of any document in the collection.

INTRODUCTION

Professionals of the medical radiology area depend 
directly on the process of decision making in their 
daily activities. This process is mainly based on the
analysis of a great amount of information obtained from
the evaluation of radiographic images.

Some studies demonstrate the great capacity of
Artificial Neural Networks (ANN) in support systems
for diagnosis, mainly in applications such as pattern
classification.

The objective of this article is to present the
development of an ANN-based system, verifying its
behavior as a feature extraction and dimensionality
reduction tool, for recognition and characterization
of patterns, for subsequent classification into normal and
abnormal patterns.



BACKGROUND 

Computer-aided diagnosis (CAD) is considered one
of the main areas of research in medical imaging
and radiological diagnosis (Doi, 2005).

According to Giger (2002) "In the future, is probable 
that all the medical images have some form of executed 
CAD to benefit to the results and the patient cares". 

The diagnosis of the radiologist is normally based
on qualitative interpretation of the analyzed data, which
can be influenced and impaired by many factors, such as
low image quality, visual fatigue, distraction,
overlapping of structures, amongst others
(Azevedo-Marques, 2001). Moreover, human beings have
limitations in their visual ability, which can impair the
analysis of a medical image, mainly in the detection
of certain patterns (Giger, 2002).

Research demonstrates that when the analysis is
carried out by two radiologists, the diagnostic sensitivity
increases significantly (Thurfjell et al., 1994). In this
sense, CAD can be used as a second specialist,
providing the computer output as a second opinion
(Doi, 2005).

Many studies analyze radiologist performance
when using CAD systems, among which we highlight
the research of Jiang et al. (2001) and Fenton et al.
(2007).

In the development of CAD systems, techniques 
from two computational areas are normally used: 
Computer Vision and Artificial Intelligence. 

From the area of Computer Vision, techniques 
of image processing for enhancement, segmentation 
and feature extraction are used (Azevedo-Marques, 
2001). 

Enhancement aims to improve an image
to make it more appropriate for a specific application
(Gonzalez & Woods, 2001). In applications with digital
medical images, enhancement is important to
facilitate visual analysis by the specialist.

Segmentation is the stage where the image is
subdivided into parts or constituent objects (Gonzalez
& Woods, 2001). The result of the segmentation is
a set of objects that can be analyzed and quantified
individually, each representing a particular characteristic
of the original image.

The final stage involved in image processing is
feature extraction, which basically involves the
quantification of elements that compose segmented objects of
the original image, such as size, contrast and form.

After this first part is concluded, the quantified
attributes are used for the classification of the structures
identified in the image, normally using methods of
Artificial Intelligence. According to Kononenko (2001),
the use of Artificial Intelligence in diagnostic support
is efficient because it allows complex data
analysis in a simple and direct way.

Many methods and techniques of Artificial
Intelligence can be applied in this stage, normally with the
objective of identifying and separating the patterns into
distinct groups (Theodorides & Koutroumbas, 2003),
for example, normal and abnormal patterns. According
to Kahn Jr (1994), the main techniques include
rule-based reasoning, artificial neural networks,
Bayesian networks, and case-based reasoning. To these,
statistical methods, genetic algorithms and
decision trees can be added.

A problem that affects most applications
of pattern recognition is data dimensionality.
Dimensionality is associated with the number of
attributes that represent a pattern, that is, the dimension
of the search space. When this space contains only the
most relevant attributes, the classification process is
faster and consumes fewer processing resources (Jain
et al., 2000), and also allows for greater precision of
the classifier.

In problems of medical image processing, the
importance of dimensionality reduction is
accentuated, because the images to be processed
are normally composed of a very large number of pixels, used
as basic attributes in the classification.

Feature extraction is a common approach to
achieving dimensionality reduction. In general, an
extraction algorithm creates a new set of attributes from
transformations or combinations of the original set.
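
As a simple illustration of creating new attributes from combinations of the original ones, the sketch below averages non-overlapping pixel blocks. Block averaging is used here only because it is easy to follow; it is an assumed stand-in and not the extraction method proposed in this article.

```python
def block_average(image, block):
    """Reduce a 2D list of pixel values by averaging block x block tiles.

    Each new attribute is a linear combination (the mean) of the
    original pixel attributes inside one tile.
    """
    rows, cols = len(image), len(image[0])
    features = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            tile = [image[r + i][c + j]
                    for i in range(block) for j in range(block)]
            features.append(sum(tile) / len(tile))
    return features

# Hypothetical 4 x 4 image: 16 pixel attributes.
image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [5, 5, 20, 20],
         [5, 5, 20, 20]]
features = block_average(image, block=2)  # 4 extracted attributes
```

The classifier would then work on 4 attributes instead of 16, at the cost of discarding detail inside each tile.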

Several methods have been studied to
perform feature extraction and, consequently,
dimensionality reduction, such as statistical methods,
methods based on signal theory, and artificial neural
networks (Verikas & Bacauskiene, 2002).

As examples of the use of artificial neural networks
in support of medical diagnosis, we can cite
the research of Papadopoulos et al. (2005) and Andre
& Rangayan (2006).



Feature Extraction with ANNs 

Feature extraction with Artificial Neural
Networks functions basically as a selection of
characteristics that represent the original data set.

This selection of characteristics is related to a
process in which a data set is transformed into a space
of characteristics that, in theory, accurately describes
the same information as the original space of the data.
However, the transformation is designed in such a way
that the data set is represented by a reduced number of
effective characteristics, keeping most of the intrinsic
information of the data, that is, the original data set
undergoes a significant dimensionality reduction (Haykin, 1999).

The dimensionality reduction is extremely useful 
in applications that involve digital image processing, 
which normally depend on a very high number of data 
points to be manipulated. 

In summary, the feature extraction with ANNs trans- 
forms the original set of pixels into a map, of reduced 
dimensions, that represents the original image without 
a significant loss of information. 

For this function, self-organizing neural networks
are normally used, for example, Kohonen's
Self-Organizing Map (SOM).

The self-organizing map seeks to transform
a given pattern into a two-dimensional map,
following a certain topological order (Haykin, 1999).
The elements that compose the map are distributed in
a single layer, forming a grid (Figure 1).



MAIN FOCUS OF THE ARTICLE 

In this paper, we present a proposal to use
Artificial Intelligence in the stage of feature
extraction as well, replacing the traditional techniques of image
processing.

Traditionally, the feature extraction is carried out on 
the basis of statistical or spectral techniques, which re- 
sult in, for example, texture or geometric attributes. 

After these attributes are obtained, techniques of 
Artificial Intelligence are applied in the pattern clas- 
sification. 

Our proposal is the use of ANN also for feature 
extraction. 



Figure 1. Illustrative representation of a Kohonen
self-organizing map

All the elements of the grid receive the input signal
of all variables, associated with their respective weights.
The output value of each element is calculated
by a given function, on the basis of the weights
of the connections, and it is used to identify the
winning element.

Mathematically, each element of the grid is
represented by a vector composed of the connection
weights, with the same dimension as the input space,
that is, the number of elements that compose the vector
corresponds to the number of input variables of the
problem (Haykin, 1999).

Methodology 

As an application example, a self-organizing neural
network for the feature extraction of images of chest
radiographs was developed, aiming at the
characterization of normal and abnormal patterns.

Each original image was divided into 12 parts,
based on the anatomical division normally used in
the diagnosis of the radiologist. Each part is formed
by approximately 250,000 pixels.

With the use of the proposed self-organizing network,
a reduction to only 240 representative elements was
obtained, with satisfactory results in the final pattern
classification.
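
The size of this reduction can be made concrete with a short calculation. It assumes that the 240 representative elements refer to each part of roughly 250,000 pixels; the text does not state this explicitly.

```python
pixels_per_part = 250_000   # approximate figure quoted in the text
elements_per_part = 240     # representative elements after the SOM (assumed per part)

# How many original pixels each representative element stands for, on average.
reduction_factor = pixels_per_part / elements_per_part

# Fraction of the original dimensionality that is retained.
retained_fraction = elements_per_part / pixels_per_part
```

Under this assumption each representative element replaces on the order of a thousand pixels, and less than 0.1% of the original dimensionality is passed on to the classifier.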

A detailed description of the methodology can be
found in (Ambrosio, 2007; Azevedo-Marques et al.,
2007).



FUTURE TRENDS 

The developed study shows the possibilities of appli- 
cation of the self-organizing networks in the feature 
extraction and dimensionality reduction; however, 
other types of neural networks can also be used for this 
purpose. New studies need to be carried out to compare 
the results and adequacy of the methodology. 



CONCLUSION 

The contribution of Information Technology as a
support tool for medical decision
making is undeniable. Artificial Intelligence presents itself as
a great source of important techniques to be used for
this purpose.



It can be seen that the technique of artificial
neural networks shows great versatility and
robustness, providing quite satisfactory results
when used and implemented well.

The use of an automatic system of image analysis
can assist the radiologist, when used as a tool of 'second
opinion', or second reading, in the analysis of possible
inexact cases.

It is also observed that the use of the proposed
methodology represents a significant gain in the
image processing of chest radiographs, given their peculiar
characteristics.

KEY TERMS

Computer-Aided Diagnosis: Research area that
encompasses the development of computational techniques and
procedures to aid health professionals in the process
of decision making for medical diagnosis.

Dimensionality Reduction: Finding a reduced data
set that is capable of representing a larger set.

Feature Extraction: Finding representative
features of a given problem from samples with
different characteristics.

Medical Images: Images generated by special
equipment, used to aid medical diagnosis, e.g.,
X-ray images, Computed Tomography, Magnetic
Resonance Images.

Pattern Recognition: Research area that encompasses
the development of automated methods and
techniques for identification and classification of samples
into specific groups, in accordance with representative
characteristics.

Radiological Diagnosis: Medical diagnosis based
on the analysis and interpretation of patterns observed in
medical images.

Self-Organizing Maps: Category of algorithms
based on artificial neural networks that seek, by
means of self-organization, to create a map of
characteristics that represents the samples involved in a
given problem.

INTRODUCTION

In their heyday, artificial neural networks promised a 
radically new approach to cognitive modelling. The 
connectionist approach spawned a number of influential, 
and controversial, cognitive models. In this article, we 
consider the main characteristics of the approach, look 
at the factors leading to its enthusiastic adoption, and 
discuss the extent to which it differs from earlier com- 
putational models. Connectionist cognitive models 
have made a significant impact on the study of mind.
However, connectionism is no longer in its prime.
Possible reasons for the diminution in its popularity 
will be identified, together with an attempt to identify 
its likely future. 

The rise of connectionist models dates from the 
publication in 1986 by Rumelhart and McClelland, 
of an edited work containing a collection of connec- 
tionist models of cognition, each trained by exposure 
to samples of the required tasks. These volumes set 
the agenda for connectionist cognitive modellers and 
offered a methodology that subsequently became the 
standard. Connectionist cognitive models have since 
been produced in domains including memory retrieval 
and category formation, and (in language) phoneme 
recognition, word recognition, speech perception, ac- 
quired dyslexia, language acquisition, and (in vision) 
edge detection, object and shape recognition. More 
than twenty years later the impact of this work is still 
apparent. 



BACKGROUND 

Seidenberg and McClelland's (1989) model of word 
pronunciation is a well-known connectionist example. 
They used backpropagation to train a three-layer 
network to map an orthographic representation of 
words and non-words onto a distributed phonological 
representation, and an orthographic output
representation. The model is claimed to provide a good fit to
experimental data from human subjects. Humans can
make rapid decisions about whether a string of letters
is a word or not (in a lexical decision task), and can
readily pronounce both words and non-words. The
time they take to do both is affected by a number of 
factors, including the frequency with which words 
occur in language, and the regularity of their spelling. 
The trained artificial neural network outputs both a 
phonological and an orthographic representation of 
its input. The phonological representation is taken as 
the equivalent to pronouncing the word or non-word. 
The orthographic representation, and the extent to
which it duplicates the original input, is taken to be
the equivalent of the lexical decision task.
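
One way to quantify "the extent to which it duplicates the original input" is a sum of squared differences between the input and the reproduced orthographic pattern. The sketch below assumes this measure together with a hypothetical decision threshold and made-up activation values; it is not the original model's code.

```python
def orthographic_error(input_units, output_units):
    """Sum of squared differences between input and reproduced activations."""
    return sum((i - o) ** 2 for i, o in zip(input_units, output_units))

def lexical_decision(input_units, output_units, threshold=0.5):
    """Answer 'word' (True) when the input is reproduced closely enough.

    The threshold is a hypothetical value chosen for this example.
    """
    return orthographic_error(input_units, output_units) < threshold

# Hypothetical orthographic input pattern and two possible network outputs.
letters_in = [1, 0, 1, 1, 0]
output_for_word = [0.9, 0.1, 0.95, 0.9, 0.05]   # close reproduction
output_for_nonword = [0.6, 0.4, 0.5, 0.7, 0.3]  # poor reproduction

is_word = lexical_decision(letters_in, output_for_word)
is_nonword_accepted = lexical_decision(letters_in, output_for_nonword)
```

A low error indicates a familiar letter string, so the same trained network can support both the pronunciation task and the word/non-word decision.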

The past tense model (McClelland & Rumelhart, 
1986) has also been very influential. The model mirrors 
several aspects of human learning of verb endings. It 
was trained on examples of the root form of the word 
as input, and of the past-tense form as output. Each 
input and output was represented as a set of context- 
sensitive phonological features, coded and decoded by 
means of a fixed encoder/decoder network. A goal of 
the model was to simulate the stage-like sequences of 
past tense learning shown by humans. Young children 
first correctly learn the past tense of a few verbs, both 
regular (e.g. looked) and irregular (e.g. went, or came). 
In stage 2 they often behave as though they have inferred 
a general rule for creating the past tense (adding -ed to
the verb stem). But they often over-generalise this rule,
and add -ed to irregular verbs (e.g. comed). There is a
gradual transition to the final stage in which they learn 
to produce the correct past tense form of both regular 
and exception words. Thus their performance exhibits 
a U-shaped function for irregular verbs (initially cor- 
rect, then often wrong, then correct again). 

The model was trained in stages on 506 English 
verbs. First, it was trained on 10 high frequency verbs 
(regular, and irregular). Then medium frequency verbs 
(mostly regular) were introduced and trained for a 
number of epochs. A dip in performance on the
irregular verbs occurred shortly after the introduction
of the medium frequency verbs - a dip followed by
a gradual improvement that resembled the U-shaped
curve found in human performance.



THE STRENGTHS AND LIMITATIONS 
OF CONNECTIONIST COGNITIVE 
MODELLING 

The models outlined above exhibit five typical features 
of connectionist models of cognition: (i) They provide 
an account that is related to and inspired by the opera- 
tions of the brain; (ii) They can be used both to model 
mental processes, and to simulate the actual behaviour 
involved; (iii) They can provide a 'good fit' to the data 
from psychology experiments; (iv) The model, and its 
fit to the data, is achieved without explicit programming 
and (v) They often provide new accounts of the data. 
We discuss these features in turn. 

First there is the idea that a connectionist cognitive 
model is inspired by, and related to, the way in which 
brains work. Connectionism is based on both the al- 
leged operation of the nervous system and on distrib- 
uted computation. Neuron-like units are connected by 
means of weighted links, in a manner that resembles 
the synaptic connections between neurons in the brain. 
These weighted links capture the knowledge of the 
system; they may be arrived at either analytically or 
by "training" the system with repeated presentations of 
input-output training examples. Much of the interest in 
connectionist models of cognition was that they offered 
a new account of the way in which knowledge was 
represented in the brain. For instance, the behaviour 
of the past tense learning model can be described in 
terms of rule following - but its underlying mechanism 
does not contain any explicit rules. Knowledge about 
the formation of the past tense is distributed across the 
weights in the network. 

Interest in brain-like computing was fuelled by a 
growing dissatisfaction with the classical symbolic 
processing approach to modelling mind and its rela- 
tionship to the brain. Even though theories of symbol 
manipulation could account for many aspects of hu- 
man cognition, there was concern about how such 
symbols might be learnt and represented in the brain. 
Functionalism (Putnam, 1975) explicitly insisted that 
details about how intelligence and reasoning were
actually implemented were irrelevant. Concern about the
manipulation of meaningless, ungrounded symbols is
exemplified by Searle's Chinese Room thought-experiment
(1980). Connectionism, by contrast, offered an
approach that was based on learning, made little use 
of symbols, and was related to the way in which the 
brain worked. Arguably, one of the main contributions 
that connectionism has made to the study and under- 
standing of mind has been the development of a shared 
vocabulary between those interested in cognition, and 
those interested in studying the brain. 

The second and third features relate to the way in 
which artificial neural nets can both provide a model 
of a cognitive process and simulate a task, and provide 
a good fit to the empirical data. In Cognitive Psychol- 
ogy, the emphasis had been on building models that 
could account for the empirical results from human 
subjects, but which did not incorporate simulations of 
experimental tasks. Alternatively, in Artificial Intel- 
ligence, models were developed that performed tasks 
in ways that resembled human behaviour, but which 
took little account of detailed psychological evidence. 
However, as in the two models described here, con- 
nectionist models both simulated the performance of 
the human tasks, and were able to fit the data from 
psychological investigations. 

The fourth feature is that of achieving the model and 
the fit to the data without explicit handwiring. It can 
be favourably contrasted to the symbolic programming 
methodology of Artificial Intelligence, where the model 
is programmed step by step, leaving room for ad hoc 
modifications and kludges. The fifth characteristic is the 
possibility of providing a novel explanation of the data. 
In their model of word pronunciation, Seidenberg and 
McClelland showed that their artificial neural network 
provided an integrated (single mechanism) account of 
data on both regular and exception words where pre- 
viously the old cognitive modelling conventions had 
forced an explanation in terms of a dual route. Similarly, 
the past-tense model was formulated as a challenge to 
rule-based accounts: although children's performance 
can be described in terms of rules, it was claimed that 
the model showed that the same behaviour could be 
accounted for by means of an underlying mechanism 
that does not use explicit rules. 

In its glory days, connectionism's claims about
novel explanations of the data stimulated much debate. There
was also much discussion of the extent to which
connectionism could provide an adequate account of
higher mental processes. Fodor and Pylyshyn (1988)
mounted an attack on the representational adequacy
of connectionism. Connectionists retaliated, and in
papers such as van Gelder's (1990) the argument was 
made that not only could they provide an account of 
the structure sensitive processes underlying human 
language, but that connectionism did so in a novel 
manner: the eliminative connectionist position. 

Now that the dust has subsided, connectionist models
do not seem as radically different to other modelling 
approaches as was once supposed. It was held that 
one of their strengths was their ability to model mental 
processes, simulate behaviour, and provide a good fit 
to data from psychology experiments without being 
explicitly programmed to do so. However, there is 
now greater awareness that decisions about factors such 
as the architecture of the net, the form its representa- 
tions will take, and even the interpretation of its input 
and output, are tantamount to a form of indirect, or 
extensional programming. 

Controlling the content and presentation of the
training sample is an important aspect of extensional
programming. When Pinker and Prince (1988) criticised 
the past tense model, an important element of their 
criticisms was that the experimenters had unrealisti- 
cally tailored the environment to produce the required 
results, and that the results were an artifact of the train- 
ing data. Although the results indicated a U-shaped 
curve in the rate of acquisition, as occurs with children, 
Pinker and Prince argued that this curve occurred only 
because the net was exposed to the verbs in an unreal- 
istically structured order. Further research has largely 
answered these criticisms, but it remains the case that 
selection of the input, and control of the way that it 
is presented to the net, affects what the net learns. A 
similar argument can be made about the selection of 
input representations. 

In summary: there has been debate about the novelty 
of connectionism, and its ability to account for higher 
level cognitive processing. There is however general 
acknowledgement that the approach made a lasting con- 
tribution by indicating how cognitive processes could 
be implemented at the level of neurons. Despite this, 
the connectionist approach to cognitive modelling is 
no longer as popular as it once was. Possible reasons 
are considered below: 

Difficult challenges: A possible reason for the
diminished popularity of artificial neural nets is
that, as Elman (2005) suggests, "we have arrived
at the point where the easy targets have been
identified but the tougher problems remain".
Difficult challenges to be met include the idea of
scaling up models to account for wider ranges of
phenomena, and building models that can account
for more than one behaviour.

Greater understanding: As a result of our greater
understanding of the operation and inherent 
limitations of artificial neural nets, some of their 
attraction has faded with their mystery. They 
have become part of the arsenal of statistical 
methods for pattern recognition, and much recent 
research on artificial neural networks has focused 
more on questions about whether the best level 
of generalisation has been efficiently achieved, 
than on modelling cognition. 

Also, there is greater knowledge of the limitations of
artificial neural nets, such as the problem of the "cata- 
strophic interference" associated with backpropagation. 
Backpropagation performs impressively when all of the 
training data are presented to the net on each training 
cycle, but its results are less impressive when such 
training is carried out sequentially and a net is fully 
trained on one set of items before being trained on a 
new set. The newly learned information often interferes 
with, and overwrites, previously learned information. 
For instance, McCloskey and Cohen (1989) used
backpropagation to train a net on the arithmetic problem of
+1 addition (e.g. 1+1, 2+1, ..., 9+1). They found that
when they proceeded to train the same net to add 2 to
a given number, it "forgot" how to add 1. Sequential
training of this form results in catastrophic interfer- 
ence. Sharkey and Sharkey (1995) demonstrated that 
it is possible to avoid the problem if the training set is 
sufficiently representative of the underlying function, or 
there are enough sequential training sets. In terms of this 
example, if the function to be learned is both + 1 and + 
2, then training sets that incorporate enough examples 
of each could lead to the net learning to add either 1 or 
2 to a given number. However, this is at the expense 
of being able to discriminate between those items that 
have been learned from those that have not. 
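
The effect of sequential training can be illustrated with a deliberately small sketch. It is not the network used by McCloskey and Cohen (1989): as a simplifying assumption, a single linear unit trained by gradient descent stands in for the backpropagation net, and the names predict, train and mean_abs_error are hypothetical. The unit is first trained only on the +1 problems and then only on the +2 problems, and its error on the +1 problems is measured before and after the second phase.

```python
# Toy illustration of interference under sequential training.
# Assumption: one linear unit trained by gradient descent replaces the
# multilayer backpropagation net of the original study.

def predict(w, x, addend):
    # w = [weight on x, weight on the addend, bias]
    return w[0] * x + w[1] * addend + w[2]

def train(w, addend, epochs=5000, lr=0.01):
    """Full-batch gradient descent on the problems 1+addend ... 9+addend."""
    xs = range(1, 10)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x in xs:
            err = predict(w, x, addend) - (x + addend)
            grad[0] += err * x
            grad[1] += err * addend
            grad[2] += err
        # Gradient of the mean squared error with respect to each weight.
        w = [wi - lr * 2 * g / len(xs) for wi, g in zip(w, grad)]
    return w

def mean_abs_error(w, addend):
    return sum(abs(predict(w, x, addend) - (x + addend))
               for x in range(1, 10)) / 9

w = train([0.0, 0.0, 0.0], addend=1)       # phase 1: only the +1 problems
error_plus1_before = mean_abs_error(w, 1)
w = train(w, addend=2)                      # phase 2: only the +2 problems
error_plus1_after = mean_abs_error(w, 1)
error_plus2_after = mean_abs_error(w, 2)
```

In this sketch the weights shared by both tasks are moved by the second phase, so the error on the +1 problems rises even though those problems had been learned. Interleaving both sets in one training phase would avoid this, in line with the observation attributed above to Sharkey and Sharkey (1995).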

This example is related to another limitation of ar- 
tificial neural nets: their inability to extrapolate beyond 
their training set. Although humans can readily grasp 
the idea of adding one to any given number, it is not so 
straightforward to train the net to extrapolate beyond 
the data on which it is trained. It has been argued
(Marcus, 1998) that this inability of artificial neural
nets trained using backpropagation to generalise beyond 
their training space provides a major limitation to the 
power of connectionist nets: an important one, since 
humans can readily generalise universal relationships to 
unfamiliar instances. Clearly there are certain aspects 
of cognition, particularly those to do with higher level 
human abilities, such as their reasoning and planning 
abilities, that are more difficult to capture within con- 
nectionist models. 

Changing zeitgeist: There is now an increased
interest in more detailed modelling of brain function,
and a concomitant dissatisfaction with the
simplicity of cognitive models that often consisted 
of "a small number of neurons connected in three 
rows" (Hawkins, 2004). Similarly, there is greater 
impatience with the emphasis in connectionism 
on the biologically implausible backpropaga- 
tion learning algorithm. At the same time, there 
is greater awareness of the role the body plays 
in cognition, and the relationships between the 
body, the brain, and the environment (e.g. Clark, 
1999). Traditional connectionist models do not 
fit easily with the new emphasis on embodied 
cognition (e.g. Pfeifer and Scheier, 1999). 



used as the basis for robotic controllers (e.g. Nolfi and 
Floreano, 2000). Such changes will ensure a future 
for connectionist modeling, and stimulate a new set of 
questions about the emergence of cognition in response 
to an organism's interaction with the environment. 



CONCLUSION 

In this article we have described two landmark con- 
nectionist cognitive models, and considered their char- 
acteristic features. We outlined the debates over the 
novelty and sufficiency of connectionism for modelling 
cognition, and argued that in some respects the approach 
shares features with the modelling approaches that 
preceded it. Reasons for a gradual waning of interest 
in connectionism were identified, and possible futures 
were discussed. Connectionism has had a strong impact 
on cognitive modelling, and although its relationship to 
the brain is no longer seen as a strong one, it provided 
an indication of the way in which cognitive processes 
could be accounted for in the brain. It is argued here 
that although the approach is no longer ubiquitous, it 
will continue to form an important component of future 
cognitive models, as they take account of the interactions 
between thought, brains and the environment. 
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KEY TERMS 

Chinese Room: In Searle's thought experiment, 
he asks us to imagine a man sitting in a room with a 
number of rule books. A set of symbols is passed into 
the room. The man processes the symbols according 
to the rule books, and passes a new set of symbols 
out of the room. The symbols posted into the room 
correspond to a Chinese question, and the symbols he 
passes out are the answer to the question, in Chinese. 
However, the man following the rules has no knowl- 
edge of Chinese. The example suggests a computer 
program could similarly follow rules in order to answer 
a question without any understanding. 

Classical Symbol Processing: The classical view 
of cognition was that it was analogous to symbolic 
computation in digital computers. Information is repre- 
sented as strings of symbols, and cognitive processing 
involves the manipulation of these strings by means 
of a set of rules. Under this view, the details of how 
such computation is implemented are not considered 
important. 

Connectionism: Connectionism is the term used to 
describe the application of artificial neural networks to 
the study of mind. In connectionist accounts, knowledge 
is represented in the strength of connections between 
a set of artificial neurons. 

Eliminative Connectionism: The eliminative 
connectionist is concerned to provide an account of 
cognition that eschews symbols, and operates at the 
subsymbolic level. For instance, the concept of "dog"
could be captured in a distributed representation as a
number of input features (e.g. four-footed, furry, barks,
etc.) and would then exist in the net in the form of the
weighted links between its neuron-like units.

Generalisation: Artificial neural networks, once 
trained, are able to generalise beyond the items on 
which they were trained and to produce a similar 
output in response to inputs that are similar to those 
encountered in training.

Implementational Connectionism: In this less 
extreme version of connectionism, the goal is to find 
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a means of implementing classical symbol processing 
using artificial networks - and to find a way of account- 
ing for symbol processing at the level of neurons. 

Lexical Decision: Lexical decision tasks are a 
measure devised to look at the processes involved in 
word recognition. A word or pseudoword (a meaning- 
less string of letters, conforming to spelling rules) is 
presented, and the reader is asked to press a button to 
indicate whether the display was a word or not. The 
time taken to make the decision is recorded in mil- 
liseconds. The measure can provide an indication of 
various aspects of word processing - for instance how 
familiar the word is to the reader. 
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INTRODUCTION 



More than 50 years ago, connectionist systems (CSs)
were created with the aim of processing information
in computers the way the human brain does (McCulloch
& Pitts, 1943). Since then these systems have
advanced considerably, and they now allow us to
solve complex problems in many disciplines
(classification, clustering, regression, etc.). But this advance
is not enough: many limitations remain when
these systems are used (Dorado, 1999). Improvements
have mostly been obtained in two different
ways. Many researchers have preferred to construct
artificial neural networks (ANNs) based on
mathematical models whose equations govern
their functioning (Cortes & Vapnik, 1995; Haykin, 1999).
Other researchers have instead tried to make these
systems resemble the human brain as closely as possible
(Rabunal, 1999; Porto, 2004).

The systems included in this article have emerged
from the second line of research. CSs that
attempt to imitate the neuroglial networks of the brain are
introduced. These systems are named Artificial
NeuroGlial Networks (ANGNs) (Porto, 2004). These CSs
are made not only of neurons but also of elements
that imitate the glial cells named astrocytes (Araque,
1999). These systems, which use hybrid training, have
demonstrated efficacy in solving classification
problems with fully connected feed-forward
multilayer networks, without backpropagation or lateral
connections.



BACKGROUND

The ANNs or CSs emulate the biological neural networks
in that they do not require the programming of
tasks but generalise and learn from experience. Current
ANNs are composed of a set of very simple processing
elements (PEs) that emulate the biological neurons and
of a certain number of connections between them.

Until now, researchers who aim to emulate the
brain have tried to represent in ANNs the importance
that neurons have in the Nervous System (NS). However,
during the last decades research has advanced
remarkably in the Neuroscience field, and increasingly
complex neural circuits, as well as the Glial System
(GS), are being observed closely. The importance of
the functions of the GS leads researchers to think that
its participation in the processing of information in the
NS is much more relevant than previously assumed. In
that case, it may be useful to integrate into the artificial
models other elements that are not neurons.

Since the late 1980s, the application of innovative and
carefully developed cellular and physiological
techniques (such as patch-clamp, ion-sensitive fluorescence
imaging, confocal microscopy and molecular biology)
to glial studies has challenged the classic idea that
astrocytes merely provide structural and trophic support
to neurons, and suggests that these elements play more
active roles in the physiology of the Central Nervous
System.

New discoveries are now unveiling that the glia 
is intimately linked to the active control of neural 
activity and takes part in the regulation of synaptic 
neurotransmission (Perea & Araque, 2007). Abundant 
evidence has suggested the existence of bidirectional 
communication between astrocytes and neurons, and 
the important active role of the astrocytes in the NS's
physiology (Araque et al., 2001; Perea & Araque, 2005).
This evidence has led to the proposal of a new concept 
in synaptic physiology, the tripartite synapse, which 
consists of three functional elements: the presynaptic 
and postsynaptic elements and the surrounding astro- 
cytes (Araque et al., 1999). The communication between 
these three elements has highly complex characteristics, 
which seem to reflect more reliably the complexity of 
the information processing between the elements of 
the NS (Martin & Araque, 2005). 

So there is no question about the existence of com- 
munication between astrocytes and neurons (Perea & 
Araque, 2002). In order to understand the motives of 
this reciprocated signalling, we must know the differ- 
ences and similarities that exist between their proper- 
ties. Only a decade ago, it would have been absurd 
to suggest that these two cell types have very similar 
functions; now we realise that the similarities are 
striking from the perspective of chemical signalling. 
Both cell types receive chemical inputs that have an 
impact on the ionotropic and metabotropic receptors. 
Following this integration, both cell types send signals 
to their neighbours through the release of chemical
transmitters. Both the neuron-to-neuron signalling
and the neuron-to-astrocyte signalling show plastic 
properties that depend on the activity (Pasti et al., 
1997). The main difference between astrocytes and 
neurons is that many neurons extend their axons over 
large distances and conduct action potentials of short 
duration at high speed, whereas the astrocytes do not 
exhibit any electric excitability but conduct calcium 
spikes of long duration (tens of seconds) over short 
distances and at low speed. The fast signalling, and the 
input/output functions in the central NS that require 
speed, seem to belong to the neural domain. But what 
happens with slower events, such as the induction of 
memories, and other abstract processes such as thought 
processes? Does the signalling between astrocytes con- 
tribute to their control? As long as there is no answer 
to these questions, research must continue; the present 
work offers new ways to advance through the use of
Artificial Intelligence (AI) techniques.

Therefore, the aim is not only to improve
CSs by incorporating elements that imitate astrocytes, but
also to benefit Neuroscience by studying
brain circuits from another point of view, that of AI.

The most recent works in this area are presented
by Porto et al. (Porto, 2004; Porto et al., 2005;
Porto et al., 2007).



MAIN FOCUS OF THE ARTICLE 

All the design possibilities, for the architecture as well 
as for the training process of an ANN, are basically 
oriented towards minimising the error level or reducing 
the system's learning time. As such, it is in the optimisation
process of a mechanism, in this case the ANN, that
we must find the solution for the many parameters of
the elements and the connections between them.

Considering possible future improvements that 
optimize an ANN with respect to minimal error and 
minimal training time, our models will be the brain 
circuits, in which the participation of elements of the 
GS is crucial to process the information. In order to 
design the integration of these elements into the ANN 
and elaborate a learning method for the resulting 
ANGN that allows us to check whether there is an 
improvement in these systems, we have analysed the 
main existing training methods that will be used for 
the elaboration. We have analysed Non-Supervised 
and Supervised Training methods, and other methods 
that use or combine some of their characteristics and 
complete the analysis: Training by Reinforcement, 
Hybrid Training and Evolutionary Training. 

Observed Limitations 

Several experiments with ANNs have shown the exis- 
tence of conflicts between the functioning of the CS and 
biological neuron networks, due to the use of methods 
that did not reflect reality. For instance, in the case of a
multilayer perceptron, which is a simple CS, the synaptic
connections between the PEs have weights that can be
excitatory or inhibitory, whereas in the natural NS it is
the neurons that seem to represent these functions, not
the connections; recent research (Perea & Araque, 2007)
indicates that the cells of the GS, more specifically the
astrocytes, also play an important role.

Another limitation concerns the learning algorithm
known as "Backpropagation", which implies that
changing the connection values requires the backward
transmission of the error signal in the ANN.
It was traditionally assumed that this behaviour was
impossible in a natural neuron, which, according to
the "dynamic polarisation" theory of Ramon y Cajal
(1911), is unable to efficiently transmit information
inversely through the axon until reaching the cellular
soma; new research, however, has discovered that
neurons can send information to presynaptic neurons
under certain conditions, either by means of existing
mechanisms in the dendrites or else through various
interventions of glial cells such as astrocytes.

If the learning is supervised, it implies the existence
of an "instructor", which in the context of the brain
means a set of neurons that behave differently from
the rest in order to guide the process. At present, the
existence of this type of neuron cannot be demonstrated
biologically, but the GS seems to be strongly implicated in
this guidance and may be the element that acts as
an instructor, a role that had not been considered until now.

It is in this context that the present study analyses 
to what extent the latest discoveries in Neuroscience 
(Araque et al., 2001; Perea & Araque, 2002) contribute 
to these networks: discoveries that proceed from cere- 
bral activity in areas that are believed to be involved 
in the learning and processing of information (Porto 
et al., 2007). 

Artificial Neuroglial Networks 

Many researchers have used the current potential of 
computers and the efficiency of computational models to 
elaborate "biological" computational models and reach 
a better understanding of the structure and behaviour 
of both pyramidal neurons, which are believed to be 
involved in learning and memory processes (LeRay 
et al., 2004, Fernandez et al., 2007), and astrocytes 
(Porto, 2004; Perea & Araque, 2002). These models 
have provided a better understanding of the causes and 
factors that are involved in the specific functioning of 
biological circuits. The present work will use these new
insights to progress in the field of Computer Sciences
and more concretely in AI.



Figure 1. Artificial NeuroGlial network scheme

We present ANGNs (Figure 1) that include both
artificial neurons and processing control elements
that represent the astrocytes, and whose functioning
follows the steps that were successfully applied in the
construction and use of CS: design, training, testing
and execution.

Also, since the computational studies of the learn- 
ing with ANNs are beginning to converge towards 
evolutionary computation methods (Dorado, 1999), we 
will combine the optimisation in the modification of 
the weights (according to the results of the biological 
models) with the use of Genetic Algorithms (GAs) in 
order to find the best solution for a given problem. This 
evolutionary technique was found to be very efficient in 
the training phase of the CS (Rabunal, 1998), because it 
helps to adapt the CS to the optimal solution according 
to the inputs that enter the system and the outputs that 
must be produced by the system. This adaptation phe- 
nomenon takes place in the brain thanks to the plasticity 
of its elements and may be partly controlled by the GS; 
it is for this reason that we consider the GA as a part 
of the "artificial glia". The result of this combination 
is a hybrid learning method (Porto, 2004). 
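
To make the role of a GA in weight search more concrete, the following is a minimal sketch under stated assumptions. It is not the hybrid ANGN method of Porto (2004), whose details are given in the cited works; it only shows how a population of candidate weight vectors can be selected, recombined and mutated. The task (logical AND with a single threshold unit), the function names and all parameter values are hypothetical choices made for illustration.

```python
import random

# Hypothetical toy task (not from the article): learn logical AND with
# one threshold unit whose three weights are searched by a GA.
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def errors(weights):
    """Number of misclassified patterns for weights = [w1, w2, bias]."""
    w1, w2, bias = weights
    return sum(int((w1 * a + w2 * b + bias > 0) != bool(t))
               for (a, b), t in DATA)

def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=errors)                 # fewest errors first
        history.append(errors(pop[0]))
        parents = pop[: pop_size // 2]       # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, 3)        # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [w + rng.gauss(0, 0.3) for w in child]  # mutation
            children.append(child)
        pop = parents + children             # parents survive unchanged
    pop.sort(key=errors)
    history.append(errors(pop[0]))
    return pop[0], history

best, history = evolve()
```

Because the parents are carried over unchanged, the best error count recorded in history can never get worse from one generation to the next. That is a property of this particular selection scheme, not a claim about the method used in the ANGNs.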

The design of the ANGNs is oriented towards clas- 
sification problems that are solved by means of simple 
networks, i.e. multilayer networks, although future 
research may lead to the design of models in more 
complex networks. It seems logical to start
the design of these new models with simple ANNs, and
to direct the latest discoveries on astrocytes and
pyramidal neurons in information processing towards
their use in classification networks, since the control
of the reinforcement or weakening of the connections
in the brain is related to the adaptation or plasticity
of the connections, which leads to the generation of
activation pathways. This process can therefore improve
the classification of the patterns and their recognition
by the ANGN.

A detailed description of the functioning of the 
ANGNs and results with these systems can be found 
in Porto et al (Porto, 2004; Porto et al., 2005; Porto 
et al., 2007). 



FUTURE TRENDS 

We continue to analyse other possible synaptic
modifications based on brain behaviour, in order to apply them
to new CSs that can solve simple problems with
simple architectures.

Moreover, given that it has been shown that the glia
acts upon complex brain circuits, and that the more an
individual's brain has developed, the more glia that
individual has in the nervous system (following what Cajal
said one hundred years ago (Ramon y Cajal, 1911)), we are
applying the observed brain behaviour to more complex
network architectures, particularly after having checked
that a more complex network architecture achieved
better results in the problem presented here.

For the same reason, we intend to analyse how the 
new CSs solve complex problems, for instance time 
processing ones where totally or partially recurrent 
networks would play a role. These networks could 
combine their functioning with this new behaviour. 



CONCLUSION 

This article presents CSs composed of artificial
neurons and artificial glial cells. The design of artificial
models did not aim at obtaining a perfect copy of the
natural model but a series of behaviours whose final
functioning approximates it as closely as possible.
Nevertheless, a close similarity between both is
indispensable to improve the output, and may result in more
"intelligent" behaviours.

The synaptic modifications introduced in the CSs,
which are based on the modelled brain processes, enhance
the training of multilayer architectures.

We must remember that the innovation of existing
ANN models towards the development of new
architectures is conditioned by the need to integrate the
new parameters into the learning algorithms so that they
can adjust their values. New parameters that provide
the process element models of the ANNs with new
functionalities are harder to come by than optimizations
of the most frequently used algorithms that increase the
calculations and basically work on the computational
side of the algorithm. The ANGNs integrate new
elements, and thanks to a hybrid method this approach did
not complicate the training process.

The research with these ANGNs benefits AI because
it can improve information processing capabilities,
which would allow us to deal with a wider range of
problems. Moreover, it indirectly benefits
Neuroscience, since experiments with computational
models that simulate brain circuits pave the way for
difficult experiments carried out in laboratories, as well
as providing new ideas for research.
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KEY TERMS 

Artificial Neural Network: A network of many 
simple processors ("units" or "neurons") that imitates 
a biological neural network. The units are connected 
by unidirectional communication channels, which 
carry numeric data. Neural networks can be trained 
to find nonlinear relationships in data, and are used 
in applications such as robotics, speech recognition, 
signal processing or medical diagnosis. 

Astrocytes: Astrocytes are a sub-type of the glial
cells in the brain. They perform many functions,
including the formation of the blood-brain barrier and the
provision of nutrients to the nervous tissue, and play a
principal role in the repair and scarring process in the
brain. They modulate synaptic transmission, and
recently their crucial role in information processing
was discovered.

Backpropagation Algorithm: A supervised 
learning technique used for training ANNs, based on 
minimising the error obtained from the comparison 
between the outputs that the network gives after the 
application of a set of network inputs and the outputs 
it should give (the desired outputs). 

Evolutionary Computation: Solution approach 
guided by biological evolution, which begins with 
potential solution models, then iteratively applies al- 
gorithms to find the fittest models from the set to serve 
as inputs to the next iteration, ultimately leading to a 
model that best represents the data. 

Genetic Algorithms: Genetic algorithms (GAs) are
adaptive heuristic search algorithms premised on the
evolutionary ideas of natural selection and genetics. The
basic concept of GAs is to simulate the processes
in natural systems that are necessary for evolution, specifically
those that follow the principle of survival of the fittest
first laid down by Charles Darwin. As such they represent
an intelligent exploitation of a random search within a
defined search space to solve a problem.

Glial System: Commonly called glia (Greek for
"glue"), these are non-neuronal cells that provide support
and nutrition, maintain homeostasis, form myelin,
and participate in signal transmission in the nervous
system. In the human brain, glial cells are estimated to
outnumber neurons by about 10 to 1.

Hybrid Training: Learning method that combines
the supervised and unsupervised training of
Connectionist Systems.

Synapse: Specialized junctions through which 
the cells of the nervous system signal to each other 
and to non-neuronal cells such as those in muscles or 
glands. 
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INTRODUCTION 

Data mining is a field encompassing the study of the tools
and techniques that assist humans in intelligently
analyzing (mining) mountains of data. Data mining has found
successful applications in many fields including sales
and marketing, financial crime identification, portfolio
management, medical diagnosis, manufacturing process
management and health care improvement.

Data mining techniques can be classified as either 
descriptive or predictive techniques. Descriptive 
techniques summarize / characterize general proper- 
ties of data, while predictive techniques construct a 
model from the historical data and use it to predict 
some characteristics of the future data. Association 
rule mining, sequence analysis and clustering are key 
descriptive data mining techniques, while classification 
and regression are predictive techniques. 

The objective of this article is to introduce the 
problem of association rule mining and describe some 
approaches to solve the problem. 



BACKGROUND 

Association rule mining, one of the fundamental tech- 
niques of data mining, aims to extract interesting cor- 
relations, frequent patterns or causal structures among 
sets of items in data. 

An association rule is of the form X → Y and
indicates that the presence of items in the antecedent of the rule
(X) implies the presence of items in the consequent of
the rule (Y). For example, the rule {PC, Color Printer} →
{computer table} implies that people who purchase a
PC (personal computer) and a color printer also tend to
purchase a computer table. These associations, however,
are not based on the inherent characteristics of a domain
(as in a functional dependency) but on the co-occurrence
of data items in the dataset. Thus, association
rule mining is a totally data-driven technique.

Association rules have been successfully employed 
in numerous applications, some of which are listed 
below: 

1. Retail market analysis: Discovery of association 
rules in retail data has been applied in departmental 
stores for floor planning, stock planning, focused 
marketing campaigns for product awareness, 
product promotion and customer retention. 

2. Web association analysis: Association rules in
web usage mining have been used to recommend
related pages, discover web pages with common
references, find web pages with a majority of the same
links (mirrors), and support predictive caching. The
knowledge is applied to improve web site design and speed
up searches.

3. Discovery of linked concepts: Words or sentences
that appear frequently together in documents are
called linked concepts. Association rules can be
used to discover linked concepts, which in turn
lead to the discovery of plagiarized text, the
development of ontologies, etc.

The problem of association rule mining (ARM) was
introduced by Agrawal et al. (1993). Large databases of
retail transactions, called market basket databases,
which accumulate in departmental stores, provided
the motivation for ARM. The basket corresponds to a
physical retail transaction in a departmental store and 
consists of the set of items a customer buys. These 
transactions are recorded in a database called the 
transaction database. The goal is to analyze the buying 
habits of customers by finding associations between the 
different items that customers place in their "shopping 
baskets". The discovered association rules can also be
used by management to increase the effectiveness of
advertising, marketing and inventory management, and
to reduce the associated costs.

Figure 1. Boolean database and corresponding transaction database

Agrawal et al. (1993) worked on
a boolean database of transactions. Each record
corresponds to a customer basket and contains a transaction
identifier (TID), transaction details and a list of
items bought in the transaction. The list of items is
represented by a boolean vector, with a one denoting
the presence of the corresponding item in the transaction and
a zero marking its absence. Figure 1 shows a boolean
database of five transactions and the corresponding
transaction database.
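
The relationship between the two representations can be sketched in a few lines. The item names and rows below are hypothetical and are not the contents of Figure 1; the function names to_transactions and to_boolean are likewise assumptions made for illustration.

```python
# Hypothetical item catalogue and boolean rows (not the data of Figure 1).
ITEMS = ["bread", "butter", "milk"]
BOOLEAN_DB = {
    1: [1, 1, 0],   # TID 1 bought bread and butter
    2: [0, 1, 1],
    3: [1, 1, 1],
}

def to_transactions(boolean_db, items):
    """Map each TID to the set of items flagged with a one."""
    return {tid: {item for item, flag in zip(items, row) if flag}
            for tid, row in boolean_db.items()}

def to_boolean(transactions, items):
    """Inverse mapping: one 0/1 entry per item, in catalogue order."""
    return {tid: [int(item in basket) for item in items]
            for tid, basket in transactions.items()}

transactions = to_transactions(BOOLEAN_DB, ITEMS)
```

The set-based form is usually more compact when each basket holds only a few of the available items, which is one plausible reason most ARM algorithms work on it.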

The problem of finding association rules is to find 
the columns with frequently co-occurring ones in the 
boolean database. However, most of the algorithms 
for ARM use the form of transaction database shown 
on the right. We give the mathematical formulation of 
the problem below. 



MATHEMATICAL FORMULATION OF 
THE ARM PROBLEM 

Let $I = \{i_1, i_2, \ldots, i_n\}$ denote a set of items and D
designate a database of N transactions. A transaction $T \in D$
is a subset of I, i.e. $T \subseteq I$, and is associated with a
unique identifier TID.

An itemset is a collection of one or more items. X
is an itemset if $X \subseteq I$. A transaction is said to contain
an itemset X if $X \subseteq T$. A k-itemset is an itemset that
contains k items.

An association rule is of the form $X \rightarrow Y$ [Support,
Confidence] where $X \subset I$, $Y \subset I$, $X \cap Y = \emptyset$, and Support
and Confidence are rule evaluation metrics.

Support of an itemset X is the fraction of transactions
that contain X. It denotes the probability that a
transaction contains X.

$$\mathrm{Support}(X) = P(X) = \frac{\text{No. of transactions containing } X}{\text{Total number of transactions in } D}$$
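
As a minimal sketch of this definition, the support of an itemset can be computed by counting the transactions that contain it. The five transactions below are hypothetical: they were chosen so that the counts agree with the support values listed under Figure 2(c), but the actual rows of Figure 2(a) are not available here. The function names are assumptions.

```python
# Hypothetical transactions over I = {A, B, C, D, E}; chosen to agree with
# the support counts listed under Figure 2(c), not copied from Figure 2(a).
TRANSACTIONS = [
    {"A", "C"},
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"A", "C", "E"},
    {"C", "D", "E"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

sup_a = support({"A"}, TRANSACTIONS)
sup_ac = support({"A", "C"}, TRANSACTIONS)
```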

Support of a rule $X \rightarrow Y$ in D is 's' if s% of transactions
in D contain $X \cup Y$, and is computed as:

$$\mathrm{Support}(X \rightarrow Y) = P(X \cup Y) = \frac{\text{No. of transactions containing } X \cup Y}{\text{Total number of transactions in } D}$$

Support indicates the extent of prevalence of a rule. A
rule with a low support value represents a rare event.

Confidence of a rule measures its strength and
provides an indication of the reliability of the prediction
made by the rule. A rule $X \rightarrow Y$ has a confidence 'c' in
D if c% of transactions in D that contain X also contain
Y. It is computed as the conditional probability that Y
occurs in a transaction, given that X is present in the same
transaction, i.e.

$$\mathrm{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}$$
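
The two rule metrics can be sketched directly from these formulas. The transactions are again hypothetical (chosen to be consistent with the values quoted under Figure 2(c)), and the function names rule_support and confidence are assumptions.

```python
# Hypothetical transactions, consistent with the counts under Figure 2(c).
TRANSACTIONS = [
    {"A", "C"},
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"A", "C", "E"},
    {"C", "D", "E"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_support(x, y, transactions):
    """Support of the rule X -> Y: fraction of transactions containing X and Y."""
    return support(x | y, transactions)

def confidence(x, y, transactions):
    """Confidence of X -> Y: P(X u Y) / P(X). Assumes X occurs at least once."""
    return support(x | y, transactions) / support(x, transactions)

conf_a_c = confidence({"A"}, {"C"}, TRANSACTIONS)   # 4/4
conf_c_a = confidence({"C"}, {"A"}, TRANSACTIONS)   # 4/5
```

Note that the two directions of a rule share the same support but generally differ in confidence, because the denominator changes from P(A) to P(C).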



Example 1: Consider the example database shown
in Figure 2(a). Here, I = {A, B, C, D, E}. Figures 2(b)
and 2(c) show the computation of support and
confidence for a rule.

Figure 2(a). Example of database transactions

Figure 2(b). Itemsets of size one and two

Figure 2(c). Computation of support and confidence

Sup(A): 4 (80%)
Sup(C): 5 (100%)
Sup(AB): 1 (20%)
Sup(AC): 4 (80%)
Sup(ABC): 1 (20%)
Sup(ABCD): 0 (0%)
Sup(ABCDE): 0 (0%)

Confidence (A → C) = 4/4 (100%)
Confidence (C → A) = 4/5 (80%)

With n items in I, the total number of possible
association rules is very large (O(3^n)). However, the
majority of these rules (associations) existing in D are
not interesting for the user. Interestingness measures
are employed to reduce the number of rules discovered
by the algorithms. The foremost criterion of interestingness
is the high prevalence of both the itemsets and rules,
which is specified by the user as a minimum support
value. An itemset (rule) whose support is greater than
or equal to a user-specified minimum support (minsup)
threshold is called a Frequent Itemset (rule).

The second criterion of interestingness is the strength
of the rule. A rule which has confidence greater than the
user-specified minimum confidence threshold (minconf)
is interesting to the user.

Confidence, however, can sometimes be misleading.
For instance, the confidence of a rule can be high even if
the antecedent and consequent of the rule are independent.
Lift (also called Interest) and Conviction of a rule are
other commonly used measures for rule interestingness
(Dunham, 2002). A suitable measure of rule strength
needs to be identified for an application.
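
Lift and conviction, named in this section as alternatives to confidence, are not defined in the text. The sketch below uses their usual definitions, lift(X → Y) = P(X ∪ Y) / (P(X) P(Y)) and conviction(X → Y) = (1 - P(Y)) / (1 - Confidence(X → Y)), which should be read as an assumption about what the cited source intends. The transactions are hypothetical, chosen to agree with the counts under Figure 2(c).

```python
import math

# Hypothetical transactions, consistent with the counts under Figure 2(c).
TRANSACTIONS = [
    {"A", "C"},
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"A", "C", "E"},
    {"C", "D", "E"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence divided by the support of the consequent; 1 means independence."""
    return confidence(x, y, transactions) / support(y, transactions)

def conviction(x, y, transactions):
    """(1 - sup(Y)) / (1 - conf); taken as infinite when the confidence is 1."""
    conf = confidence(x, y, transactions)
    if conf == 1.0:
        return math.inf
    return (1.0 - support(y, transactions)) / (1.0 - conf)

lift_a_c = lift({"A"}, {"C"}, TRANSACTIONS)
lift_d_e = lift({"D"}, {"E"}, TRANSACTIONS)
conviction_d_e = conviction({"D"}, {"E"}, TRANSACTIONS)
```

With this data the rule A → C has 100% confidence but a lift of exactly 1, because C appears in every transaction. This is the situation described above, in which high confidence says nothing about a real association between antecedent and consequent.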



MINING OF ASSOCIATION RULES 

Since an association rule is an implication among 
itemsets, the brute-force approach to association rule 
generation requires examining relationships between 
all possible item-sets. Such a process would typically 
involve counting the co-occurrence of all itemsets in 
D and subsequently generating rules from them. 

For n items, there are 2^n possible itemsets that need
to be counted. This may require not only a prohibitive
amount of memory, but also complex indexing of
counters. Since the user is interested only in frequent
rules, one needs to count only the frequent itemsets
before generating association rules. Thus, the problem
of mining association rules is decomposed into two
phases:

Phase I: Discovery of frequent itemsets 

Phase II: Generation of rules from the frequent itemsets 
discovered in phase I. 

Various algorithms for ARM differ in how they optimize
the time and storage required for counting in the first
phase. Even though the minsup constraint reduces the
search space, frequent itemset generation is still a
computationally expensive process. The anti-monotone
property of support is an important tool for further
reducing the search space and is used in most association
rule algorithms (Agrawal et al., 1994). According to this
property, the support of an itemset never exceeds the
support of any of its subsets, i.e. if X is an itemset,
then for each of its subsets Y, sup(X) <= sup(Y). This
property makes the set of frequent itemsets downward
closed. The task of rule generation is trivial once the
set of frequent itemsets has been discovered.

Figure 3. Generation of association rules from frequent
itemsets (Adapted from Dunham, 2002)

Algorithm: Gen-Rules
Input:   F - set of frequent itemsets
         minconf - minimum confidence threshold
Output:  R - set of strong association rules
Process:
    R = ∅
    For each f ∈ F do
        For each x ⊂ f such that x ≠ ∅ do
            If sup(f)/sup(x) >= minconf
                R = R ∪ {x -> (f - x)}

The Apriori algorithm uses a level-wise approach for
discovering frequent itemsets (Agrawal et al., 1994).
The algorithm makes multiple scans of the data. In
each scan it generates and counts potentially frequent
itemsets (candidates). Candidate itemsets of size k+1
are generated by joining the frequent itemsets of size k,
and are pruned by exploiting the anti-monotone property.
At the end of the scan the set of frequent itemsets is
confirmed. The strategy incurs massive I/O costs and has
motivated a number of variants that reduce the number of
counters (candidates) and the number of scans (Brin et
al., 1997; Lin et al., 1998). The FP-growth algorithm
(Han et al., 2000) uses a pattern growth method to avoid
the costly process of candidate generation and testing,
and is therefore considered a major milestone. The
algorithm uses an in-memory, tree-based data structure
called the FP-Tree to store the database in a compressed
form. Two passes are made over the database to construct
the prefix tree, which is then used to generate frequent
patterns. Lattice-based approaches have also been
proposed for efficient discovery of frequent itemsets
(Zaki et al., 1998).
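
The level-wise search of Apriori can be sketched as follows. This is
a minimal in-memory illustration, not the implementation of Agrawal
et al. (1994): candidates are formed by uniting pairs of frequent
k-itemsets rather than by the prefix-based join of the original
algorithm, and the names apriori and minsup_count are assumptions of
this example.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step (anti-monotone property): a candidate survives
            # only if every one of its k-subsets is frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k)):
                candidates.add(union)

        # One scan of the data counts all candidates of this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        result.update(frequent)
        k += 1
    return result
```

The loop stops at the first level that yields no frequent itemset,
which is justified by downward closure: no larger itemset can then
be frequent.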

The task of generating rules (Phase II) from a given
set of frequent itemsets is rather straightforward
(Figure 3). It boils down to enumerating all non-empty
proper subsets of each frequent itemset and computing the
ratio of the support of the itemset to the support of
each such subset. The subsets for which this ratio is
greater than or equal to minconf yield strong rules,
which are reported to the user.
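
The Gen-Rules procedure of Figure 3 translates almost line by line
into Python. The sketch below assumes that the frequent itemsets are
given as a dictionary mapping each itemset to its support count, and
that the dictionary also contains every subset of every itemset
(which downward closure guarantees for a complete result of Phase I).
The name gen_rules and the tuple layout of the result are assumptions
of this example.

```python
from itertools import combinations

def gen_rules(frequent, minconf):
    """Return a list of (antecedent, consequent, confidence) tuples.

    frequent: dict mapping frozenset -> support count, closed under subsets.
    """
    rules = []
    for f, sup_f in frequent.items():
        if len(f) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        # Enumerate every non-empty proper subset x of f.
        for size in range(1, len(f)):
            for subset in combinations(sorted(f), size):
                x = frozenset(subset)
                conf = sup_f / frequent[x]
                if conf >= minconf:
                    rules.append((x, f - x, conf))
    return rules
```

With the supports of Figure 2(c), sup(A) = 4, sup(C) = 5 and
sup(AC) = 4, a threshold of minconf = 0.9 keeps only A -> C, while
minconf = 0.8 also admits C -> A.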



TYPES OF ASSOCIATION RULES 

The popularity of ARM has led to its application to many
types of data and application domains. Some specialized
kinds of association rules have been reported in the data
mining literature (Han & Kamber, 2006). We describe
here some of the important types:

1. Quantitative Association Rules: Quantitative
association rules introduced the notion of mining
associations between numeric attributes in addition
to categorical ones (Srikant et al., 1996).
Quantitative association rules can be derived
either by mapping the attribute values to a set of
consecutive integers or by partitioning them into
intervals. Example of a quantitative association
rule:

Age(x, "30...39") and salary(x, "42...48K") -> buys(x, "car") [1%, 75%]



2. Multilevel Association Rules: Many applications
have an inherent taxonomy (concept hierarchy)
among items (Figure 4). In such scenarios, association
rules can be generated at different levels of the
taxonomy to capture knowledge at different levels of
abstraction. As we move down the hierarchy (from
generalized towards specialized values), the support
of rules decreases and some rules may become
uninteresting. Conversely, while climbing up the
hierarchy some new rules may become interesting.

This gives rise to multilevel association rules
(Han et al., 1999), which capture linkages between
items or attributes at different levels of abstraction,
i.e. at different levels of the concept hierarchy.
For example, the rule {Brown Bread} -> {Coke}
captures linkages between items at different levels
of the concept hierarchy. Multilevel rules can be
mined using either the same support threshold
or different thresholds at different levels.

Selecting an appropriate level of abstraction can
have a significant impact on the usefulness of the
knowledge generated. Mining at a very high level
of abstraction is likely to generate over-generalized
rules, which may not be interesting, while mining at
too low a level of abstraction may lead to the
generation of highly specific rules.

Figure 4. Concept hierarchy for food




3. Multidimensional Association Rules: Association
rules essentially expose intra-record linkages.
If links exist between different values of the same
attribute (dimension), the association is called a
single-dimensional association rule, while if linkages
span multiple dimensions, the associations
are called multidimensional association rules
(Kamber et al., 1997). For example, the following
single-dimensional rule contains conjuncts in both
antecedent and consequent from the single attribute
'buys':

Buys(X, "milk") and Buys(X, "butter") -> Buys(X, "bread")

Multidimensional associations contain two or
more predicates, for example:

Age(X, "19-25") and Occupation(X, "student") -> Buys(X, "coke")



4. Association Rules in Streaming Data: The
aforementioned types of rules pertain to static
data repositories. Mining rules from data streams,
however, is far more complex (Hidber, 1999).
The field is relatively recent and poses additional
challenges such as the one-look characteristic of the
data, limited main memory, continual updating
of data, and online processing so as to be able to
make decisions on the fly (Cheng et al., 2007).

5. Association Rule Mining with Multiple Minimum
Supports: In large department stores,
where the number of items is very large, using a
single support threshold sometimes does not yield
interesting rules. Since buying trends for different
items often vary vastly, the use of different support
thresholds for different items is recommended
(Liu et al., 1999).

6. Negative Association Rules: Typical association
rules discover correlations between items
that are bought during the transactions and are
called positive association rules. Negative
association rules discover implications among all the
items, irrespective of whether they were bought
or not. Negative association rules are useful in
market-basket analysis to identify products that
conflict with each other or complement each
other. The importance of negative association
rules was highlighted by Brin et al. (1997) and
Wu et al. (2004).
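
The effect of the abstraction level described for multilevel rules
(item 2 above) can be illustrated with a small sketch. Everything
here is hypothetical: the two-level taxonomy PARENT stands in for a
concept hierarchy such as the one in Figure 4, which is not
reproduced in the text, and the baskets are invented for the example.

```python
# Hypothetical child -> parent taxonomy; not taken from Figure 4.
PARENT = {
    "brown bread": "bread",
    "white bread": "bread",
    "coke": "soft drink",
    "lemonade": "soft drink",
}

def generalize(transaction, parent=PARENT):
    """Replace each item by its parent category; items without a
    parent are kept unchanged."""
    return {parent.get(item, item) for item in transaction}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

# Invented market baskets at the lowest level of the hierarchy.
baskets = [
    {"brown bread", "coke"},
    {"white bread", "lemonade"},
    {"white bread", "coke"},
    {"brown bread"},
]
generalized = [generalize(t) for t in baskets]

low = support({"brown bread", "coke"}, baskets)
high = support({"bread", "soft drink"}, generalized)
print(low, high)
```

The specialized itemset {brown bread, coke} occurs in one basket out
of four, whereas its generalization {bread, soft drink} occurs in
three, so a rule that fails a given minsup at the lower level may
pass it one level up.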
FUTURE TRENDS

Association analysis will continue to impact various
scientific and business spheres through healthcare,
financial, spatial, multimedia and time series databases,
among others. Here we examine the role of association
rule mining in some promising applications.

Business Intelligence (BI): BI converts raw data
into personalized intelligence with the goal of
increasing customer satisfaction, loyalty and product
profitability. It integrates data from multiple sources
across the company, analyzes it and acts promptly on the
results, leading to competitive advantage and timely
deployment of solutions.

Association analysis is one of the core tools that
support BI. In the retail sector, association analysis
provides the basis for point-of-sale data analysis,
market basket analysis, and the optimization of resource
and space management. In the banking realm, BI exploits
the study of associations to perform credit risk
analysis, fraud detection, and customer retention.
Credit companies benefit through fraud detection,
monitoring of buying patterns, scoring of customer
reliability and cross-sell analysis.

Stream data mining: In contrast to the data in
traditional static databases, a data stream is an ordered
sequence of items that is continuous and unbounded,
usually arrives at high speed, and has a data
distribution that changes with time. As the number of
applications that mine data streams grows rapidly, there
is an increasing need to perform association rule mining
on stream data.

Association rules are employed in the estimation of 
missing data in streams of data generated by sensors 
and frequency estimation of internet packet streams. 
Association rule mining is also useful for monitoring 
manufacturing flows to predict failure or generate 
reports based on web log streams. 

Bioinformatics: Associations are employed in
bioinformatics databases for identification of
co-occurring gene sequences. They are also used to
detect gene mutations. A gene is a segment of a DNA
molecule that contains all the information required for
the synthesis of a product. Any change in the DNA
sequence of a gene (for example insertion, deletion,
insertion/deletion, complex and multiple substitution)
is termed a gene mutation. The discovery of interesting
association relationships among a huge number of gene
mutations is important because it can help in determining
the cause of mutation in tumours and diseases.



CONCLUSION 

Association rule mining is an important data mining
technique that was introduced to describe the
intra-record links in transactional databases. Since
then, both academia and industry have seen the technology
deliver profitability and success owing to its
simplicity, ease of understanding and wide applicability.
More than a decade later, the technology still promises
to be the driving force for some new, challenging
applications. Association rule mining is regarded as one
of the most significant contributions of the database
community to KDD.

In this article, we introduced the notion of association
rules and their applications, and presented their
mathematical formulation. We also presented the major
milestones in the development of association rules. The
general mining strategy, related problems such as the
vastness of the search space, and the interestingness
measures were also discussed. Finally, different types
of association rules were described.

KEY TERMS

Association Rule: An implication of the form X -> Y
in a transactional database, with parameters support
(s) and confidence (c). X and Y are sets of items, s is
the fraction of transactions containing X ∪ Y, and c%
of the transactions containing X also contain Y.

Classification: Data mining technique that con- 
structs a model (classifier) from historical data (train- 
ing data) and uses it to predict the category of unseen 
tuples. 

Clustering: Data mining technique to partition data 
objects into a set of groups such that the intra group 
similarity is maximized and inter group similarity is 
minimized. 

Data Mining: Extraction of interesting, non-trivial,
implicit, previously unknown and potentially useful
information or patterns from data in large databases.

Data Stream: A continuous flow of data from a 
data source, e.g., a sensor, stock ticker, monitoring 
device, etc. A data stream is characterized by its un- 
bounded size. 

Descriptive Mining Technique: Data mining tech- 
nique that induces a model describing the characteristics 
of data. These techniques are usually unsupervised and 
totally data driven. 

Predictive Mining Technique: Data mining
technique that induces a model from historical data in
a supervised manner, and uses the model to predict some
characteristic of new data.

INTRODUCTION

Classical ciphers are used to encrypt plaintext messages
written in a natural language in such a way that they are
readable by the sender and the intended recipient only.
Many classical ciphers can be broken by brute-force
search through the key-space. One of the pertinent
problems arising in automated cryptanalysis is plaintext
recognition. A computer should be able to decide which
of many possible decrypts are meaningful. This can
be accomplished by means of a text scoring function
based, e.g., on n-grams or other text statistics. A
scoring function can also be used in conjunction with AI
methods to speed up cryptanalysis.



BACKGROUND

Language recognition is a field of artificial
intelligence that studies how to employ computers to
recognize the language of a text. This is a simple task
when we have a sufficient amount of text with accents,
since they characterize the language used with very high
accuracy. Nowadays there are plenty of toolkits that
automatically check, and often correct, both spelling
and grammatical errors. In this connection we also
recall the NIST Language Recognition Evaluation (LRE-05,
LRE-07), part of an ongoing series of evaluations of
language recognition technology. McMahon & Smith (1998)
present an overview of natural language processing
techniques based on statistical models.

We recall some basic notions from cryptography (see
the article Automated Cryptanalysis of Classical Ciphers
for more details). There is a reversible encryption rule
(algorithm) that transforms plaintext to ciphertext,
and vice versa. These algorithms depend on a secret
parameter K called the key. The set of possible keys 𝒦
is called the key-space. The input and output of these
algorithms are strings of letters from the plaintext
and ciphertext alphabets, respectively. Both sender and
receiver use the same secret key and the same encryption
and decryption algorithms.

Cryptanalysis is a process of key recovery, or of
plaintext recovery without knowledge of the key. In both
cases we need a plaintext recognition subroutine that
evaluates (with some probability) whether each candidate
substring is a valid plaintext or not. Such automated
text recognition requires an adequate model of the
language used.

PLAINTEXT RECOGNITION FOR
AUTOMATED CRYPTANALYSIS



In the process of automated cryptanalysis we decrypt
the ciphertext with many possible keys to obtain
candidate plaintexts. Most of the candidates are
incorrect, having no meaning in a natural language. On
the other hand, even the correct plaintext can be hard to
recognize, and a poorly chosen recognition routine can
miss it altogether.

The basic type of algorithm suitable for automated
cryptanalysis is a brute-force attack. This attack is
only feasible when the key-space can be searched with the
computational resources available to the attacker. The
average time needed to verify a candidate strongly
influences the size of the searchable key-space. Thus,
plaintext recognition is the most critical part of the
algorithm from the performance point of view. On the
other hand, only the most complex algorithms achieve
really high accuracy of plaintext recognition. Thus the
complexity and accuracy of plaintext recognition
algorithms must be carefully balanced.

A generic brute-force algorithm with plaintext
recognition can be described by the pseudo-code in
Exhibit A.

Exhibit A.

INPUT: ciphertext string Y = y0 y1 y2 ... yn
OUTPUT: ordered sequence S of possible plaintexts with their scores

1. Let S = ∅.
2. For each key K ∈ 𝒦, do
   2.1. Let X = d_K(Y) be a candidate plaintext.
   2.2. Compute negative test predicate filter(X). If predicate is true, continue with step 2.
   2.3. Compute fast scoring function fastScore(X). If fastScore(X) < LIMITF, continue with step 2.
   2.4. Compute precise scoring function score(X). If score(X) < LIMIT, continue with step 2.
   2.5. Let S = S ∪ {<score(X), X>}.
3. Sort S by key score(X) descending.
4. Return S.



Table 1. Performance of the three-layer decryption of a table-transposition cipher using a brute-force search.
The first filter was negative predicate-based, removing all decrypts whose first 4 letters did not form a valid
n-gram (about 90% of texts were removed). The score was then computed as the count of valid tetragrams in the
whole text. If this count was lower than the given threshold (12), the text was removed by the score-based
filter. Finally, the remaining texts were scored using dictionary words.

Key-space size   Negative filter   Score-based filter   Remaining texts   Total time [s]
9!               89.11%            10.82%               254               1.2
10!              89.15%            10.78%               2903              5.8
11!              88.08%            8.92%                239501            341
12!              90.10%            9.85%                1193512           746



The algorithm integrates the three layers of plaintext
recognition, namely the negative test predicate, the fast
scoring function and the precise scoring function, as a
three-layer filter. The final scoring function is also
used to sort the outputs. The first filter should be very
fast, with a very low error probability. The fast score
should be easy to compute, but it is not required to
identify the correct plaintext precisely. Correct
plaintext recognition is the role of the precise scoring
function. In the algorithm, the best score is the highest
one. If the score is computed in the opposite sense, the
algorithm must be rewritten accordingly.
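
The three-layer filter of Exhibit A can be sketched in Python as
follows. The function brute_force follows the numbered steps of the
exhibit; everything else is a toy assumption made only so that the
sketch runs: a Caesar shift stands in for the decryption algorithm,
and the negative test, the digram list, the word list and both
thresholds are hypothetical and far too small for real use.

```python
import string

ALPHABET = string.ascii_uppercase

def brute_force(ciphertext, keys, decrypt, reject, fast_score, score,
                limit_fast, limit):
    """Three-layer plaintext recognition following the steps of Exhibit A."""
    results = []
    for key in keys:
        x = decrypt(key, ciphertext)        # step 2.1: candidate plaintext
        if reject(x):                       # step 2.2: negative test predicate
            continue
        if fast_score(x) < limit_fast:      # step 2.3: fast scoring function
            continue
        s = score(x)                        # step 2.4: precise scoring function
        if s < limit:
            continue
        results.append((s, key, x))         # step 2.5
    results.sort(key=lambda r: r[0], reverse=True)  # step 3: best score first
    return results                          # step 4

# --- Toy components, all hypothetical --------------------------------
def caesar_decrypt(key, text):
    """Shift every letter back by `key` positions (stand-in for d_K)."""
    return "".join(ALPHABET[(ALPHABET.index(c) - key) % 26] for c in text)

COMMON_DIGRAMS = {"TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT", "EN", "ND"}
WORDS = {"ATTACK", "DAWN", "THE", "AT"}

def no_vowel_in_prefix(x):
    """Negative test: reject texts whose first four letters hold no vowel."""
    return not any(c in "AEIOUY" for c in x[:4])

def digram_count(x):
    """Fast score: number of positions holding a common digram."""
    return sum(1 for i in range(len(x) - 1) if x[i:i + 2] in COMMON_DIGRAMS)

def word_letters(x):
    """Precise score: total length of dictionary words found in the text."""
    return sum(len(w) * x.count(w) for w in WORDS)

ciphertext = caesar_decrypt(-3, "ATTACKATDAWN")   # encrypt with key 3
found = brute_force(ciphertext, range(26), caesar_decrypt,
                    no_vowel_in_prefix, digram_count, word_letters,
                    limit_fast=1, limit=6)
print(found)
```

Passing the three tests as parameters keeps the search loop
independent of the cipher, so the same loop can be reused when the
filters are tuned to a particular cipher type.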

In some cases, we can integrate the fast scoring
function within the negative test or with the precise
scoring, leading to two-layer filters, as in (Zajac,
2006a). It is also possible to use even more steps of
predicate-based and score-based filtering. However,
experiments show that the proposed three-layer
architecture is the most flexible, and more layers can
even lead to a performance decrease. Experimental results
are shown in Table 1.

Negative Filtering 

The goal of the negative test predicate is to identify
candidate texts that are NOT plaintext (with very high
probability, ideally with certainty). People can clearly
recognize a wrong text just by looking at it.
Implementing this ability in computers belongs to the
area of artificial intelligence. However, most current
AI methods (e.g. neural networks) seem too slow to be
applicable at this stage of a brute-force algorithm, as
every text must be evaluated with this predicate.

Most of the methods for fast negative text filtering
are based on prohibited n-grams. By an n-gram we mean
only a sequence of n consecutive letters. If the
alphabet size is N, then there are N^n possible n-grams.
For higher n, only a small fraction of them can appear
in valid text in a given language (Zajac, 2006b). By
using a lexical tree or lookup table, it is easy (and
fast) to verify whether a given n-gram is valid or not.
Thus a natural test is to check whether every n-gram in
the text is valid. Two basic problems arise with this
approach: the real plaintext can contain (intentionally)
misspelled, uncommon or foreign words, and our n-gram
database can be incomplete. We can instead limit our
test to some specific patterns, e.g. overly long runs of
consecutive vowels or consonants. These patterns can be
checked in time dependent on the length of the plaintext
candidate. A filter can also be based on checking only a
few n-grams at fixed or random positions in the text,
e.g. the first four letters.
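
A negative filter of the kind just described might look as follows.
This is a sketch under stated assumptions: the set of valid n-grams
is collected from whatever reference corpus is supplied (a Python set
plays the role of the lookup table), only the first n letters of the
candidate are checked, and the limit of five consecutive consonants
is an arbitrary value chosen for the example. The function names are
hypothetical.

```python
def build_valid_ngrams(corpus, n=4):
    """Collect every n-gram that occurs in a reference corpus,
    ignoring case and non-letter characters."""
    text = "".join(c for c in corpus.upper() if c.isalpha())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def longest_run(text, letters):
    """Length of the longest run of consecutive characters from `letters`."""
    best = run = 0
    for c in text:
        run = run + 1 if c in letters else 0
        best = max(best, run)
    return best

def reject(candidate, valid_ngrams, n=4, max_consonants=5, vowels="AEIOUY"):
    """Negative test: True means the text is almost certainly not plaintext."""
    # Layer 1: the leading n-gram must be known from the corpus.
    if candidate[:n] not in valid_ngrams:
        return True
    # Layer 2: reject overly long runs of consonants.
    consonants = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") - set(vowels)
    return longest_run(candidate, consonants) > max_consonants
```

Both checks run in a single pass over at most the whole candidate, so
the cost of the filter grows linearly with the text length.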

The rule for rejecting texts should be based on
the exact type of the cipher we are trying to decipher.
For example, if the first four letters of the decrypted
plaintext do not depend on some part of the key, a
filter based only on their validity would not be
effective. An interesting question is whether it is
possible to create a system that can effectively learn
its filter rules from existing decrypted texts, even in
the process of decryption.

Scoring Functions 

With the negative filter step we can eliminate around
90% of candidate texts or more. The number of texts to
be verified is still very large, and thus we need to
apply more precise methods of plaintext recognition. We
use a scoring function that assigns a quantity, the
score, to every text that has survived elimination in
the previous steps. Here a higher score means a higher
likelihood that a given text is a valid plaintext. To
each scoring function we can assign a threshold, and it
should be very improbable that a valid plaintext has a
score under this threshold. The actual threshold value
can either be found experimentally (by evaluating a large
number of real texts) or be based on a statistical
analysis. The speed of the scoring function can be
determined by using classical algorithm complexity
estimates. The precision of the scoring can be defined
by means of the separation of valid and invalid
plaintexts. There is a trade-off involved in scoring, as
faster scoring functions are less precise and vice versa.
Thus we apply two scoring functions: one that is fast but
less precise, with a lower threshold value, and one that
is very precise but harder to compute.

An example of scoring function distributions can be
found in Figures 1 and 2. The scoring function in
Figure 1 is much more precise than that in Figure 2, but
the computational time required for its evaluation is
doubled. Moreover, the scoring function in Figure 1 was
created from a reduced dictionary fitted to a given
ciphertext. Evaluation based on a complete dictionary is
slower, more difficult to implement, and can even be
less precise.

Scoring functions can be based on dictionary words,
n-gram statistics, or other specific statistics. It is
difficult to provide a one-size-fits-all scoring
function, as the decryption process for different cipher
types has an impact on the actual scoring function
results. For example, when trying to decrypt a
transposition cipher, we already know which letters
appear with which frequency, and thus statistics based
on letter frequency do not play any role in scoring. On
the other hand, they are quite significant for
substitution ciphers. The most common universal scoring
functions are (see also Ganesan & Sherman, 1993):

1. Number of dictionary words in the text / fraction
of meaningful text

Scoring based on dictionary words is very precise
if we have a large enough dictionary. Even if not
every word in the hidden message is in our dictionary,
it is very improbable that an incorrect decryption
contains a larger fraction of text composed of
dictionary words. Removing short words from the
dictionary can increase the precision. Another
possibility is to use weights based on word length, as
in (Russell, Clark & Stepney, 2003). Dictionary words
can be found using lexical trees. In some languages, we
should use a dictionary of word stems instead of whole
words. The speed of evaluation depends on the length of
the text and the average length of words in the
dictionary.

2. Rank distance of n-grams

The rank of an n-gram is its position in an ordering
based on n-gram frequencies (merged for all n up to a
given bound dependent on the language). This method is
used in fast automated language recognition (Cavnar &
Trenkle, 1994): compute the ranks of the n-grams of a
given text and compare them with the ranks of
significant n-grams obtained from large corpora of
different languages. The correct language should have
the smallest distance. Even if this method can be
adapted for plaintext recognition, e.g. by creating a
"random" corpus or an encrypted corpus, it does not
seem practical.

Figure 1. Distribution of score among 9! possible decrypts (table transposition cipher with key given by a
permutation of 9 columns); ciphertext size is 90 characters. The score was computed as a weighted sum of
the lengths of (reduced) dictionary words found in the text. The single highest score belongs to the
correct plaintext.

3. Frequency distance of n-gram statistics

The score of the text can be estimated from the
difference between the measured frequencies of n-grams
and the frequencies estimated from a large corpus (Clark,
1998; Spillman, Janssen, Nelson & Kepner, 1993).
We suppose that the correct plaintext has the
smallest distance from the corpus statistics. However,
due to the statistical properties of text, this is not
always true for short or specific texts. Thus the
precision of this scoring function is higher for
longer texts. The speed of evaluation depends on
the text size and on n.

4. Scoring tables for n-grams

If we consider the statistics of all n-grams, we
see that most n-grams contribute a very small value
to the final score. We can therefore consider the
contribution of only the most common n-grams.
For a given language (and a given ciphertext) we
can prepare a table of n-gram scores with a fixed
number of entries. The score is evaluated as the sum
of the scores assigned to the n-grams in a given
candidate text (Clark & Dawson, 1998). We can assign
both positive scores for common valid n-grams and
negative scores for common invalid or supposedly
rare n-grams. However, the precision of this method
is very low, especially for transposition ciphers. In
our experiments the scores have a normal distribution,
and usually the correct plaintext does not have the
highest possible value. On the other hand, this scoring
function is easy to evaluate and can be customized for
a given ciphertext. Thus it can be used as the fast
scoring function fastScore(X).
5. Index of coincidence (and other similar statistics)

The index of coincidence (Friedman, 1920), denoted
by I_c, is a suitable and fast statistic applicable
to ciphers that modify the letter frequency. The
notion comes from the probability of the same letter
occurring in two different texts at the same position.
Encrypted texts are considered random, and have a
(normalized) index of coincidence near 1.0. On the other
hand, a plaintext has a much higher I_c, near the value
expected for a given language and alphabet. For English
the expected value is 1.73. As with all scoring
functions based on statistics, its precision is
influenced by the length of the text. The index of
coincidence is most suitable for polyalphabetic ciphers,
where encryption depends on the position of the letter
in the text (e.g. the Vigenère cipher). It can be adapted
to other cipher types (Bagnall, McKeown & Rayward-Smith,
1997), e.g. for transposition ciphers by considering an
alphabet formed by all possible n-grams for some n > 1.
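
As an illustration of the last item, the normalized index of
coincidence can be computed directly from letter counts. The sketch
below assumes the usual normalized form, I_c = N * sum f_i (f_i - 1)
/ (n (n - 1)), where N is the alphabet size, n the number of letters
in the text and f_i the count of the i-th letter. The function name
and the default of 26 letters are assumptions of this example.

```python
from collections import Counter

def index_of_coincidence(text, alphabet_size=26):
    """Normalized index of coincidence of the letters in `text`.

    Values near 1.0 are expected for uniformly random letters;
    natural-language text gives noticeably higher values.
    """
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Number of ordered pairs of positions holding the same letter.
    coincidences = sum(f * (f - 1) for f in counts.values())
    return alphabet_size * coincidences / (n * (n - 1))
```

The statistic is cheap (one pass to count letters), but as noted
above it is only reliable on texts long enough for the letter counts
to stabilize.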
Figure 2. Distribution of score among 9! possible decrypts (table transposition cipher with key given by a
permutation of 9 columns); ciphertext size is 90 characters. The score is the count of digrams with frequency
higher than 1% (in the Slovak language) in a given text. The correct plaintext scored 73.

FUTURE TRENDS

Language processing and recognition have applications
in various areas outside cryptanalysis (OCR, automatic
translation, etc.). Some cryptanalytic techniques can be
generalized for these fields. For example, some letters
or groups of letters are often replaced by others in
scanned documents. Thus correcting these documents is
similar to cryptanalysis of randomized substitution
ciphers. Artificial Intelligence research can yield new
insights into the structure of natural language that can
further help in cryptanalysis. Cryptanalysis is also
strongly related to automatic translation efforts.

Some open problems that need to be addressed
by language recognition suitable for cryptanalysis of
classical ciphers are the following:

How text recognition should be integrated
with the decryption process to give feedback, e.g.
on partially decrypted words, to estimate a new
key, etc. This is especially relevant if we use a more
advanced search heuristic than brute-force search
through the key-space. This can also be viewed as
a generalization of the results of Peleg & Rosenfeld
(1979).

How the syntax and semantics of the language
can help in text recognition and key search,
respectively.

How various encodings and writing systems
influence cryptanalysis. Specific issues arise when
dealing with different writing systems (Atkinson,
1985; August, 1989 and 1990).

How to correctly recognize text with intentional
misspellings and special code words.

Another set of problems arises when different natural
languages are used, such as language recognition,
specific alphabets, the impact of diacritical marks, etc.
Our research shows that the language of a message
encrypted by a substitution cipher can be recognized even
without decryption (Zajac, 2006b). It is even possible to
use a dictionary of a different (although similar)
language in the decryption process. It is an interesting
research question whether it is possible to create a
completely general language recognition function (or one
restricted to some family of languages) usable for
cryptanalysis.

Plaintext recognition in cryptanalysis can also be
seen as a specific information retrieval problem
(Manning, Raghavan & Schütze, 2008). Multilanguage
information retrieval targets problems similar to those
presented above (see e.g. McNamee, 2006). Research in
these areas can clearly influence each other in the
future.

CONCLUSION

This article summarizes the usage and restrictions of
language processing in the context of cryptanalysis of
classical ciphers. Its application usually differs
according to the character of the analyzed cipher
systems, although we have presented some common
techniques that can be easily adapted to a specific
situation. Most cryptanalytic attacks require very fast
language recognition, but on the other hand, great speed
often causes inaccurate results, up to the point of
unrecognizable decrypts. The role of Artificial
Intelligence research is to find faster and more precise
language predicates and to combine them into a useful
plaintext recognition system.

KEY TERMS

Brute-Force Attack: Exhaustive cryptanalytic
technique that searches the whole key-space to find
the correct key.

Candidate Text: The text obtained by applying the decryption
algorithm to the ciphertext using some key $k \in \mathcal{K}$.
If $k$ is the correct key $K$ (or a key equivalent to $K$), then the
candidate text is a valid plaintext $x$; otherwise it is the text
obtained by the composition $d_k(e_K(x))$.

Ciphertext: The encrypted text, a string of letters from the
alphabet $\mathcal{C}$ of a given cryptosystem, produced with a
given key $K \in \mathcal{K}$.

Classical Cipher: A classical cipher system is a five-tuple
$(\mathcal{P}, \mathcal{C}, \mathcal{K}, \mathcal{E}, \mathcal{D})$,
where $\mathcal{P}$ and $\mathcal{C}$ define the plaintext and
ciphertext alphabets, $\mathcal{K}$ is the set of possible keys, and
for each $K \in \mathcal{K}$ there exists an encryption algorithm
$e_K \in \mathcal{E}$ and a corresponding decryption algorithm
$d_K \in \mathcal{D}$ such that $d_K(e_K(x)) = x$ for every input
$x \in \mathcal{P}$ and $K \in \mathcal{K}$.

Cryptanalysis: The process of trying to decrypt a given
ciphertext and/or find the key without, or with only partial,
knowledge of the key. It is also the research area studying
techniques of cryptanalysis.

Key-Space: Set of all possible keys for a given ciphertext.
The key-space can be limited to a subspace of the whole
$\mathcal{K}$ by some prior knowledge.

Plaintext: The unencrypted text, a string of letters from the
alphabet $\mathcal{P}$ of a given cryptosystem.

Plaintexts Filter: An algorithm, or predicate, used to determine
which texts are not valid plaintexts. An ideal plaintexts filter
never produces the answer INVALID for a correct plaintext.

Scoring Function: A function used to evaluate the fitness of a
candidate text for a key $k \in \mathcal{K}$. An ideal scoring
function has its global extreme at the correct plaintext, i.e.
when $k = K$.

INTRODUCTION

Classical ciphers are used to encrypt plaintext messages written
in a natural language in such a way that they are readable only by
the sender or the intended recipient. Many classical ciphers can be
broken by a brute-force search through the key-space. Methods of
artificial intelligence, such as optimization heuristics, can be
used to narrow the search space and to speed up text processing
and text recognition in the cryptanalytic process. Here we present
a broad overview of different AI techniques usable in cryptanalysis
of classical ciphers. Specific methods to effectively recognize the
correctly decrypted text among many possible decrypts are discussed
in the next part, Automated cryptanalysis - Language processing.



BACKGROUND 

Cryptanalysis can be seen as an effort to translate a ciphertext
(an encrypted text) into a human language. Cryptanalysis can thus
be related to computational linguistics. This area originated with
efforts in the United States in the 1950s to have computers
automatically translate texts from foreign languages into English,
particularly Russian scientific journals. Nowadays it is a field of
study devoted to developing algorithms and software for
intelligently processing language data. Systematic (public) efforts
to automate cryptanalysis using computers can be traced to the
first papers written in the late 1970s (see e.g. Schatz, 1977).
However, the research area still has many open problems, closely
connected to the area of Artificial Intelligence. It can be
concluded from the current state of the art that, although
computers are very useful in many cryptanalytic tasks, human
intelligence is still essential in complete cryptanalysis.



For the convenience of the reader we recall some basic notions
from cryptography. A very thorough survey of classical ciphers is
given by Kahn (1974). A message to be encrypted (plaintext) is
written in the lowercase alphabet
$\mathcal{P} = \{a, b, c, \ldots, x, y, z\}$. The encrypted message
(ciphertext) is written in the uppercase alphabet
$\mathcal{C} = \{A, B, C, \ldots, X, Y, Z\}$. Different alphabets
are used in order to better distinguish plaintext from ciphertext.
In fact these alphabets are the same.

There is a reversible encryption rule (algorithm) for transforming
the plaintext into the ciphertext, and vice versa. These algorithms
depend on a secret parameter $K$ called the key. The set of possible
keys $\mathcal{K}$ is called the key-space. The input and output of
these algorithms are strings of letters from the respective
alphabets, i.e. elements of $\mathcal{P}^*$ and $\mathcal{C}^*$.
Both the sender and the receiver use the same secret key and the
same encryption and decryption algorithms.

There are three basic classical systems for encrypting a message,
namely substitution, transposition, and running key. In a
substitution cipher a string of letters is replaced by another
string of letters using a prescribed substitution of single
letters, e.g. leaving 'a' as 'A', replacing letter 'b' by letter
'N', letter 'c' by letter 'G', etc. A transposition cipher
rearranges the order of letters according to a secret key $K$.
Unlike in substitution ciphers, the frequency of letters in the
plaintext and in the ciphertext remains the same. This
characteristic is used in recognizing that a text was encrypted by
some transposition cipher. A typical running key cipher derives
from a main key $K$ the running key $K_0 K_1 K_2 \ldots K_n$. If
$\mathcal{P} = \mathcal{C} = \mathcal{K}$ is a group, then simply
$y_i = e_{K_i}(x_i) = x_i + K_i$.

Thus it is convenient to define a ciphering algorithm 
for classical ciphers as follows: 

Definition 1: A classical cipher system is a five-tuple
$(\mathcal{P}, \mathcal{C}, \mathcal{K}, \mathcal{E}, \mathcal{D})$,
where the following conditions are satisfied:

1. $\mathcal{P}$ is a finite set, the plaintext alphabet, and
$\mathcal{P}^*$ is the set of all finite strings of symbols from
$\mathcal{P}$.

2. $\mathcal{C}$ is a finite set, the ciphertext alphabet, and
$\mathcal{C}^*$ is the set of all finite strings of symbols from
$\mathcal{C}$.

3. $\mathcal{K}$ is a finite set of possible keys.

4. For each $K \in \mathcal{K}$, there is an encryption algorithm
$e_K \in \mathcal{E}$ and a corresponding decryption algorithm
$d_K \in \mathcal{D}$ such that $d_K(e_K(x)) = x$ for every input
$x \in \mathcal{P}$ and $K \in \mathcal{K}$.

5. The ciphering algorithm assigns to any finite string
$x_0 x_1 x_2 \ldots x_n$ from $\mathcal{P}^*$ the resulting
ciphertext string $y_0 y_1 y_2 \ldots y_n$ from $\mathcal{C}^*$,
where $y_i = e_K(x_i)$. The actual key may or may not depend on
the index $i$.

Another typical case for $\mathcal{P}$ and $\mathcal{C}$ is
$r$-tuples of the Latin alphabet. For transposition ciphers, the
key is periodically repeated for $r$-tuples. For substitution
ciphers of $r$-tuples, the key is an $r$-tuple of keys. In the case
of running keys, there is another key stream generator
$g: \mathcal{K} \times \mathcal{P}^* \to \mathcal{K}$ which
generates from the initial key $K$, and possibly from the plaintext
$x_0 x_1 x_2 \ldots x_{n-1}$, the actual key $K_n$.

For classical ciphers, there are two typical situations 
when we try to recover the plaintext: 

1. Let the input to the decryption algorithm $d_K \in \mathcal{D}$
with unknown key $K$ be a ciphertext string
$y_0 y_1 y_2 \ldots y_n$ from $\mathcal{C}^*$, where
$y_i = e_K(x_i)$. Our aim is to find the plaintext string
$x_0 x_1 x_2 \ldots x_n$ from $\mathcal{P}^*$. Thus in each
execution an algorithm searches through the key-space
$\mathcal{K}$.

2. The decryption algorithm $d_K \in \mathcal{D}$ and the key $K$
are both unknown. Our aim is to find, for the ciphertext string
$y_0 y_1 y_2 \ldots y_n$ from $\mathcal{C}^*$, where
$y_i = e_K(x_i)$, the plaintext string $x_0 x_1 x_2 \ldots x_n$
from $\mathcal{P}^*$. This requires a different algorithm than the
actual $d_K \in \mathcal{D}$, as well as some additional
information. Usually another ciphertext is available, say
$z_0 z_1 z_2 \ldots z_n$ from $\mathcal{C}^*$. Thus in each
execution an algorithm searches through the possible substitutions
that are suitable for both ciphertexts.

In both cases we need a plaintext recognition subroutine which
evaluates a candidate substring of length $v$ for a possible
plaintext, say
$c_{1+t} c_{2+t} \ldots c_{v+t} := x_{1+t} x_{2+t} \ldots x_{v+t}$.
Such automated text recognition needs an adequate model of the
language used.



AUTOMATED CRYPTANALYSIS 

There are two straightforward methods for automated cryptanalysis.
Unfortunately, neither of them is applicable in practice to longer
strings. The first one is for transposition ciphers. When no other
information about the cipher is known, we can use a general method,
called anagramming, to decipher the message. In this method we try
to assemble a meaningful string (anagram) from the ciphertext. This
is accomplished by arranging the letters into words from the
dictionary. When we find a meaningful word, we process the rest of
the message in the same way. When we are not able to create more
meaningful words, we retrace our steps and try other possible words
until the whole meaningful anagram is found.
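The retracing described above is a backtracking search. The following Python sketch is one possible reading of it, assuming a small word list supplied by the caller; the function name `anagram` is hypothetical, and the sketch only recovers a set of words that uses up all letters, not their order in the message.

```python
from collections import Counter


def anagram(letters, dictionary):
    """Return one list of dictionary words that uses exactly the given
    multiset of letters, or None if no such list exists."""
    words = [(w, Counter(w)) for w in sorted(dictionary)]

    def search(remaining, chosen):
        if not remaining:                     # every letter has been used
            return chosen
        for word, need in words:
            if all(remaining[c] >= n for c, n in need.items()):
                found = search(remaining - need, chosen + [word])
                if found is not None:
                    return found
            # otherwise retrace the step and try the next word
        return None

    return search(Counter(letters.lower()), [])
```

Even on a short message the search can return a different word set than the intended one, which is one reason the method needs an additional language model and does not scale to longer strings.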

The second, very similar, method is for substitution ciphers. Here
we try to assemble a meaningful string from the ciphertext by
searching through all possible substitutions of letters to get
words from a dictionary of the language used. Although the size of
the key-space is large, automated cryptanalysis uses many other
methods based, e.g., on the frequency distribution of letters.
Automated cryptanalysis of simple substitution ciphers can decrypt
most messages both with known word boundaries (Carrol & Martin,
1986) and without this information (Ramesh, Athithan &
Thiruvengadam, 1993; Jakobsen, 1995). There are other classical
ciphers, where the transposition or substitution depends not only
on the actual key, but also on the position within a block of
letters of the string.

For effective automated cryptanalysis at least two layers of
plaintext candidate processing, filtering and scoring, are
required. Better results are achieved by additional filtering
layers. This of course increases computational complexity. Below
we give an overview of these filtering layers.

Automated Brute Force Attacks 

The basic type of algorithm suitable for automated cryptanalysis
is a brute force attack. As we have to search the whole key-space,
this attack is only feasible when the key-space is "not too large".
Exact quantification of the searchable key-space depends on the
computational resources available to an attacker and on the average
time needed to verify a candidate for decrypted text. Thus,
plaintext recognition is the most critical part of the algorithm
from the performance point of view. On the other hand, only the
most complex algorithms achieve really high accuracy of plaintext
recognition. Thus a careful balance between the complexity of the
plaintext recognition algorithms and their accuracy is required. It
is unlikely that automated cryptanalysis produces only one possible
result, but it is possible to limit the set of possible decrypts to
a manageable size. Reported results should be sorted according to
their probability of being the true plaintext.

A generic brute force algorithm with plaintext 
recognition can be described by the pseudo-code in 
Exhibit A. 

We have identified three layers of plaintext recognition, namely a
negative test predicate, a fast scoring function and a precise
scoring function. All three functions are used as a three-layer
filter, and the final scoring function is also used to sort the
outputs. The first filter should be very fast and should have a
very low error probability. The fast score should be easy to
compute, but it is not required to identify the correct plaintext
precisely. Correct plaintext recognition is the role of the precise
scoring function. In the algorithm, the best score is the highest
one. If the score is computed in the opposite sense (lower is
better), the algorithm must be rewritten accordingly.
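As an illustration only, the generic brute-force loop with its three layers could be sketched in Python for a shift cipher with 26 keys. The toy word list, the letter set, the thresholds `limit_fast` and `limit`, and all function names are hypothetical stand-ins chosen for this sketch, not the predicates or scores used by the authors.

```python
COMMON = set("etaoinshr")                        # assumed frequent English letters
WORDS = {"the", "and", "attack", "at", "dawn"}   # hypothetical toy dictionary


def decrypt(ciphertext, key):
    """d_K for a shift cipher: move every letter back by `key` places."""
    return "".join(chr((ord(c) - 65 - key) % 26 + 97) for c in ciphertext)


def negative_filter(text):
    """True means 'certainly not plaintext' (here: the text has no vowel)."""
    return not any(c in "aeiouy" for c in text)


def fast_score(text):
    """Cheap score: fraction of letters that are common in English."""
    return sum(c in COMMON for c in text) / len(text)


def precise_score(text):
    """Costlier score: fraction of the text covered by dictionary words,
    matched greedily from left to right, longest word first."""
    covered, i = 0, 0
    while i < len(text):
        for n in range(min(8, len(text) - i), 1, -1):
            if text[i:i + n] in WORDS:
                covered += n
                i += n
                break
        else:
            i += 1                                # no word starts here
    return covered / len(text)


def brute_force(ciphertext, limit_fast=0.4, limit=0.5):
    """Try every key; keep candidates that pass all three layers."""
    results = []
    for key in range(26):
        x = decrypt(ciphertext, key)
        if negative_filter(x):
            continue
        if fast_score(x) < limit_fast:
            continue
        s = precise_score(x)
        if s < limit:
            continue
        results.append((s, key, x))
    return sorted(results, reverse=True)          # highest score first
```

Each layer is more expensive than the previous one, so most wrong keys are rejected before the dictionary-based score is ever computed.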

In some cases, we can integrate the fast scoring function with the
negative test or with the precise scoring, leading to two-layer
filters, as in (Zajac, 2006a). It is also possible to use even more
steps of predicate-based and score-based filtering. However,
experiments show that the proposed three-layer architecture is the
most flexible, and more layers can even lead to a decrease in
performance. Scoring and filtering are described in depth in the
article Automated cryptanalysis - Language processing.

Applications of Artificial Intelligence Methods

Artificial Intelligence (AI) methods can be used in four main
areas of automated cryptanalysis:

1. Plaintext recognition: The goal of the AI is to supply negative
predicates that filter out wrong decrypts, and scoring functions
that assess the text's likeness to natural language.

2. Key-search heuristics: The goal of the AI is to provide
heuristics to speed up the decryption process either by
constraining the key-space, or by guiding the selection of the next
keys to be tried in the decryption. This area is the most often
researched, as it can provide clear experimental results and
meaningful evaluation.

3. Plaintext estimation: The goal of the AI is to estimate the
meaning of the plaintext from a partial decryption, or to estimate
some parts of the plaintext based on external data (e.g. the sender
of a ciphertext, historical and geographic context, specific
grammatical rules etc.). Estimated parts of the plaintext can then
lead to a much easier complete decryption. This area of research is
mainly unexplored, and plaintext estimation is done by the
cryptanalyst.

4. Automatic security evaluation: The goal of cryptanalysis is not
only to break ciphers and to learn secrets; it is also used when
creating new ciphers to evaluate their security. Although most
classical ciphers are already "outdated", their cryptanalysis is
still important, e.g. in teaching modern computer security
principles. When teaching classical ciphers, it is useful to have
an AI tool (e.g. an expert system) that can automate the evaluation
of cipher security (at least under some weaker assumptions).
Although much work is done in automatic evaluation of modern
security protocols, we are unaware of any tools to evaluate
"classical" cipher designs.

Exhibit A.

INPUT: ciphertext string $Y = y_0 y_1 y_2 \ldots y_n$
OUTPUT: ordered sequence S of possible plaintexts with their scores

1. Let S = {}
2. For each key $K \in \mathcal{K}$ do
   2.1. Let $X = d_K(Y)$ be a candidate plaintext.
   2.2. Compute negative test predicate filter(X). If predicate is
        true, continue with step 2.
   2.3. Compute fast scoring function fastScore(X). If
        fastScore(X) < LIMITF, continue with step 2.
   2.4. Compute precise scoring function score(X). If
        score(X) < LIMIT, continue with step 2.
   2.5. Let $S = S \cup \{\langle score(X), X \rangle\}$
3. Sort S by key score(X) descending.
4. Return S.

The best-researched area is that of key-search heuristics. This
follows immediately from the fact that a brute force search through
the whole key-space is a very crude method of decryption. Most
classical ciphers were not designed with careful consideration of
text statistics. We can assign to each key in the key-space a score
that is correlated with the probability that the text decrypted by
the given key is the plaintext. The score, when considered over the
key-space, certainly has some local maxima, which can lead either
immediately to a meaningful plaintext or to a text from which the
plaintext is easily guessed. Thus it can be useful to consider
various relaxation techniques to search through the key-space with
the goal of maximizing the scoring function. Some of the earliest
demonstrations of relaxation techniques for breaking substitution
ciphers were presented by Peleg & Rosenfeld (1979) and Hunter &
McKenzie (1983). Successful attacks applicable to many classical
ciphers can be implemented using methods ranging from basic hill
climbing, through tabu search and simulated annealing, to
applications of genetic/evolutionary algorithms (Clark & Dawson,
1998). Genetic algorithms have achieved many successes in breaking
classical ciphers, as demonstrated by Mathews (1993) or Clark
(1994), and can even break a rotor machine (Bagnall, McKeown &
Rayward-Smith, 1997). Russell, Clark & Stepney (1998) present an
anagramming attack using a solver based on an ant colony
optimisation algorithm.
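Basic hill climbing, the simplest of the techniques listed, can be sketched as follows for a simple substitution cipher. This is a minimal illustration under stated assumptions: the scoring function is passed in by the caller, the neighbourhood is "swap two entries of the key", and the names `decrypt` and `hill_climb` are hypothetical. It is not the procedure of any of the cited papers.

```python
import string
from itertools import combinations

UPPER = string.ascii_uppercase


def decrypt(ciphertext, key):
    """key[i] is the plaintext letter assigned to ciphertext letter UPPER[i]."""
    return ciphertext.translate(str.maketrans(UPPER, key))


def hill_climb(ciphertext, score, start=string.ascii_lowercase):
    """Steepest-ascent hill climbing over substitution keys.

    A neighbour of a key differs from it by one swap of two entries.
    The search stops at the first key that no single swap can improve,
    i.e. at a local maximum of `score`.
    """
    key = list(start)
    best = score(decrypt(ciphertext, "".join(key)))
    while True:
        best_swap = None
        for i, j in combinations(range(26), 2):
            key[i], key[j] = key[j], key[i]            # trial swap
            s = score(decrypt(ciphertext, "".join(key)))
            key[i], key[j] = key[j], key[i]            # undo the trial swap
            if s > best:
                best, best_swap = s, (i, j)
        if best_swap is None:
            return "".join(key), best
        i, j = best_swap
        key[i], key[j] = key[j], key[i]                # move to best neighbour
```

In a real attack `score` would be a statistical language score, such as an n-gram statistic, and the climb may stop at a local maximum that is only a partial decryption; tabu search and simulated annealing are ways of escaping such maxima.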

These types of attack try to converge to the correct key by small
changes of the actual key. The success rate of the attacks is
usually measured by the fraction of the key and/or text that is
reconstructed. Relaxation methods can find the keys, or
approximations of the plaintext, with high probability even if it
is not feasible to search the whole key-space. The success mainly
depends on the ciphertext size, since the scoring is usually
statistics-based. One of the unexplored challenges is to consider
the application of multiple relaxation techniques. A first
heuristic can be used to shrink the key-space, and then either a
brute-force search or another heuristic is used with more precision
to finish the decryption.



FUTURE TRENDS 

The results obtained strongly depend on the size of the ciphertext,
and decryptions are usually only partial. Techniques of automated
cryptanalysis also need to be fitted to a given problem. For
example, attacks on substitution ciphers can use individual letter
statistics, but for attacks on transposition ciphers these
statistics are invariant, so using them makes no sense. Automated
cryptanalysis is usually studied only in the context of these two
main types of ciphers, but there is a broad area of unexplored
problems concerning different classical cipher types, such as
running-key ciphers. Specific uses of AI techniques can fail for
some cryptosystems, as pointed out by Wagner, Affenzeller & Schragl
(2004). Cryptanalysis also depends on the language (Zajac, 2006b),
although there are some notable exceptions when considering similar
languages.

As computational power increases, even recently used ciphers, like
the Data Encryption Standard (DES), are becoming subject to
automated cryptanalysis (e.g. Nalini & Raghavendra Rao, 2007).
Besides the application of heuristics to cryptanalysis, a lot of
further research is required in the areas of plaintext estimation
and automatic security evaluation. An expert system that would
cover these areas and connect them with AI for plaintext
recognition and search heuristics could be a strong tool for
teaching computer security or for helping forensic analysis or
historical studies involving encrypted materials.



CONCLUSION 

This article is concerned with automated cryptanalysis of classical
ciphers, where classical ciphers are understood as ciphers from
before World War II, or pencil-and-paper ciphers. Optimization
heuristics are quite successful in attacks targeted at these
ciphers, but they usually cannot be made fully automatic. Their
application usually differs according to the character of the
analysed cipher system. An important research direction is
extending the techniques from classical cryptanalysis to automated
decryption of modern digital cryptosystems. Another important
problem is to create a set of fully automatic cryptanalytic tools
or a complete expert system that can be adapted to various types of
ciphers and languages.

KEY TERMS

Brute-Force Attack: Exhaustive cryptanalytic 
technique that searches the whole key-space to find 
the correct key. 

Ciphertext: The encrypted text, a string of letters from the
alphabet $\mathcal{C}$ of a given cryptosystem, produced with a
given key $K \in \mathcal{K}$.

Classical Cipher: A classical cipher system is a five-tuple
$(\mathcal{P}, \mathcal{C}, \mathcal{K}, \mathcal{E}, \mathcal{D})$,
where $\mathcal{P}$ and $\mathcal{C}$ define the plaintext and
ciphertext alphabets, $\mathcal{K}$ is the set of possible keys, and
for each $K \in \mathcal{K}$ there exists an encryption algorithm
$e_K \in \mathcal{E}$ and a corresponding decryption algorithm
$d_K \in \mathcal{D}$ such that $d_K(e_K(x)) = x$ for every input
$x \in \mathcal{P}$ and $K \in \mathcal{K}$.

Cryptanalysis: The process of trying to decrypt a given
ciphertext and/or find the key without, or with only partial,
knowledge of the key. It is also the research area studying
techniques of cryptanalysis.

Key-Space: Set of all possible keys for a given ciphertext.
The key-space can be limited to a subspace of the whole
$\mathcal{K}$ by some prior knowledge.

Plaintext: The unencrypted text, a string of letters from the
alphabet $\mathcal{P}$ of a given cryptosystem.

Relaxation Attack: Cryptanalytic technique that searches the
key-space by incremental updates of the candidate key(s). It
usually applies the knowledge of previous trial decryption(s) to
change some parts of the key.

INTRODUCTION

We investigate the application of artificial neural networks (ANNs)
to the classification of spectra from impact-echo signals. This
paper provides analyses of simulated signals; the second part
details the results of laboratory experiments.

The data set for this research consists of sonic and ultrasonic
impact-echo signal spectra obtained from 100 3D finite element
models. These spectra, along with a categorization of the materials
into homogeneous and defective classes depending on the kind of
material defect, were used to develop supervised neural network
classifiers. Four levels of complexity were proposed for the
classification of materials: material condition, kind of defect,
defect orientation and defect dimension. Results from Multilayer
Perceptron (MLP) and Radial Basis Function (RBF) neural networks
are compared with those of Linear Discriminant Analysis (LDA) and
k-Nearest Neighbours (kNN) algorithms (Duda, Hart, & Stork, 2000),
(Bishop, 2004). Suitable results were obtained for LDA and RBF.

The impact-echo is a technique for non-destructive 
evaluation based on monitoring the surface motion re- 
sulting from a short-duration mechanical impact. It has 
been widely used in applications of concrete structures 
in civil engineering. Cross-sectional resonant modes in 
impact-echo signals have been analyzed in elements of 
different shapes, such as, circular and square beams, 
beams with empty ducts or cement fillings, etc. In ad- 
dition, frequency analyses of the displacement of the 
fundamental frequency to lower values for detection of 
cracks have been studied (Sansalone & Street, 1997), 
(Carino, 2001). 

The impact-echo wave propagation can be analyzed in terms of
transient and stationary behaviour. The excitation signal (the
impact) produces a short transient stage in which the first P
(normal stress), S (shear stress) and Rayleigh (superficial) waves
arrive at the sensors; afterward the wave propagation phenomenon
becomes stationary, and a manifold of different mixtures of waves,
including various changes from S-wave to P-wave propagation mode
and vice versa, arrives at the sensors. Patterns of waveform
displacements in this latter stage are known as the resonant modes
of the material. The spectra of impact-echo signals provide
information for classifying the inspected materials based on their
resonant modes. The classification tree addressed in this paper has
four levels, from global to detailed classes, with up to 12 classes
in the lowest level. The levels are: (i) Material condition:
homogeneous, one defect, multiple defects, (ii) Kind of defect:
homogeneous, hole, crack, multiple defects, (iii) Defect
orientation: homogeneous, hole in axis X or axis Y, crack in planes
XY, ZY, or XZ, multiple defects, and (iv) Defect dimension:
homogeneous, passing-through and half-passing-through types of the
holes and cracks of level iii, multiple defects. Some examples of
defective models are shown in Figure 1.



BACKGROUND 

Applications of neural networks in impact-echo testing include:
detection of flaws in concrete slabs, combining spectra of
numerical simulations and real signals for network training (Pratt
& Sansalone, 1992); identification of unilaterally working sublayer
cracks using numerically generated waveforms as network inputs
(Stavroulakis, 1999); classification of concrete slabs as solid or
defective (containing a void or delamination), using training
features extracted from many repetitions of impact-echo experiments
on three specimens to be classified into three classes (Xiang &
Tso, 2002); and prediction of shallow crack depths in asphalt
pavements using features from an extensive real-signal dataset
(Mei, 2004). All these studies used multilayer perceptron neural
networks and monosensor impact-echo systems.

In a recent work, we classified impact-echo data by neural networks
using temporal and frequency features extracted from the signals,
finding that the better features were the frequency features
(Salazar, Unio, Serrano, & Gosalbez, 2007). The present work
therefore focuses on exploiting only the spectral information of
the impact-echo signals. These spectra contain a large amount of
redundant information. We applied Principal Component Analysis
(PCA) to the spectra for compression and noise removal. The
proposed classification problem and the use of PCA components of
the spectra as classification features are a new proposal in the
application of neural networks to impact-echo testing.

There is evidence that the first PCA components retain essentially
all of the useful information, and that this compression optimally
removes noise and can be used to identify unusual spectra
(Bailer-Jones, 1996), (Bailer-Jones, Irwin, & Hippel, 1998), (Xu et
al., 2004). The principal components represent sources of variance
in the data. The projection of the $p$-th spectrum onto the $k$-th
principal component is known as the admixture coefficient
$a_{k,p}$. The most significant principal components contain those
features which are most strongly correlated in many of the spectra.
It follows that noise (which is uncorrelated with any other
features by definition) will be represented in the less significant
components. Thus by retaining only the more significant components
to represent the spectra we achieve a data compression that
preferentially removes noise.
The reduced reconstruction $y_p$ of the $p$-th spectrum $x_p$ is
obtained by using only the first $r$ principal components to
reconstruct the spectrum, i.e.

$$y_p = \bar{x} + \sum_{k=1}^{r} a_{k,p} u_k, \qquad r < N, \tag{1}$$

where $\bar{x}$ is the mean spectrum, which is subtracted from the
spectra before the eigenvectors are calculated, and $u_k$ is the
$k$-th principal component. $\bar{x}$ can be considered as the
zeroth eigenvector, although the degree of variance it explains
depends on the specific data set and may be much less than that
explained by the first eigenvectors.

Let $\epsilon_p$ be the error incurred in using this reduced
reconstruction. By definition $x_p = y_p + \epsilon_p$, so

$$\epsilon_p = \sum_{k=r+1}^{N} a_{k,p} u_k. \tag{2}$$
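Equations (1) and (2) can be checked numerically with a short Python sketch. It assumes that the mean spectrum and an orthonormal set of principal components have already been computed elsewhere; the function names and the plain-list representation are illustrative choices only.

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def reconstruct(x, mean, components, r):
    """Return (y, eps) for one spectrum x.

    y   : reduced reconstruction from the first r components, as in (1)
    eps : residual carried by the remaining components, as in (2)
    `components` is assumed to be a list of orthonormal vectors u_k,
    sorted from most to least significant.
    """
    centred = [xi - mi for xi, mi in zip(x, mean)]
    a = [dot(centred, u) for u in components]     # admixture coefficients
    y = list(mean)
    for k in range(r):
        y = [yi + a[k] * ui for yi, ui in zip(y, components[k])]
    eps = [0.0] * len(x)
    for k in range(r, len(components)):
        eps = [ei + a[k] * ui for ei, ui in zip(eps, components[k])]
    return y, eps
```

When the components span the whole space, adding `y` and `eps` element by element gives back the original spectrum, which is the identity $x_p = y_p + \epsilon_p$ stated above.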



RECOGNITION OF DEFECT PATTERNS
IN IMPACT-ECHO SPECTRA - SIMULATIONS

Impact-Echo Signals 

Simulated signals came from a full transient dynamic analysis of
100 3D finite element models of a simulated parallelepiped-shaped
material of 0.07 x 0.05 x 0.22 m (width, height and length),
supported at one third and two thirds of the block length
(direction z). Figure 1 shows different examples of the models of
defective pieces. From the transient analysis, the dynamic response
of the material structure (time-varying displacements in the
structure) under the action of a transient load is estimated. The
transient load, i.e. the hammer impact, was simulated by applying a
force-time history of a half sine wave with a period of 64 µs as a
uniform pressure load on two elements at the centre of the model
front face. The elastic material constants for the simulated
material were: density 2700 kg/m^3, elasticity modulus 69500 MPa
and Poisson's ratio 0.22.

Elements with dimensions of about 0.01 m were used in the models.
These elements can accurately capture the frequency response up to
40 kHz. Surface displacement waveforms were taken from the
simulation results at 7 nodes in different locations on the
material surface, see Figure 1a. Signals consisted of 5000 samples
recorded at a sampling frequency of 100 kHz. To make it possible to
compare simulations with experiments, the second derivative of the
displacement was calculated in order to work with accelerations,
since the sensors available for experiments were mono-axial
accelerometers. These accelerations were measured in the direction
normal to the plane of the material surface, according to the
configuration of the sensors in Figure 1a.

Feature Extraction and Selection 

We investigate whether the changes in the spectra, particularly in
the zones of the fundamental frequencies, are related to the shape,
orientation and dimension of the defects. The information of the
spectra for each channel consists of n/2 values, i.e. half of the
number of points used to calculate the Fast Fourier Transform
(FFT). Due to the 7-channel impact-echo system setup applied, the
number of data available for each impact-echo test was 7*n/2; e.g.
for an FFT calculated with 256 points, 896 values would be
available as entries for the classifiers. This high number of
entries could be unsuitable for the training stage of neural
networks. Considering the redundancy of impact-echo signal spectra,
PCA was applied in two steps. In the first step, PCA was applied to
the spectra of each channel as a feature extraction method. In the
second step, PCA was applied to the component set (compressed
spectra) obtained in the first step for all the channels and
records, as a dimensionality reduction and feature selection
method. Thus, a compressed and representative pattern of the
spectra for the multichannel impact-echo inspection was obtained.

Figure 1. Finite element models with different defects and 7-sensor
configuration, showing the sensors (accelerometers) and the impact
excitation. 1a. Half-passing-through crack oriented in plane ZY.
1b. Passing-through hole oriented in axis Y.

The size of the FFT employed was 1024 points, since with fewer
points the resolution was not good enough for classification. Once
the spectra were estimated for all the models, they were grouped
and normalized by the maximum per channel. Three options were
considered for establishing the number of components at the first
PCA step: selecting a number of components that explains a minimum
of the variance in the data, selecting a number of components such
that the variance increment is minimal, or using a fixed number of
components. The first two options could yield a variable number of
components per channel, and they could select more components for
the channels with the 'worst' signals, i.e. signals with a low
signal-to-noise ratio (SNR) due to measurement problems (e.g. bad
contact at the interface between sensor and material). Thus we
selected a fixed number of 20 components per channel, which
explained more than 95% of the data variance for each of the
channels, so the total number of components was 7*20=140 for one
model.
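The first of the three options, choosing the number of components from an explained-variance threshold, can be sketched as below. The function name and the idea of passing the PCA eigenvalues as a plain list are assumptions for illustration; the study itself settled on a fixed number of 20 components per channel.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose eigenvalues sum to at
    least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold * total:
            return count
    return len(ordered)
```

Applied per channel, such a rule returns a different count for each channel, which is the variability the fixed choice of 20 components avoids.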

The initial entries for the classification stage were then 140
features (spectra components) for each of the 100 simulation
models. For the simulations, 20 replicates of each model were
added, corresponding to the repetitions performed in the
experiments. The replicates were generated using random Gaussian
noise with a standard deviation of 0.1 times that of the original
signals; the total number of records for the simulations was thus
2000, each with 140 spectra components.
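A minimal sketch of how such replicates might be generated is given below. It assumes that the noise is additive, zero-mean, and scaled to 0.1 times the standard deviation of the signal it is added to; the function name, the seed handling and the list representation are hypothetical.

```python
import random
import statistics


def make_replicates(signal, n_replicates=20, noise_fraction=0.1, seed=0):
    """Return noisy copies of `signal` with additive Gaussian noise whose
    standard deviation is noise_fraction times that of the signal."""
    rng = random.Random(seed)                 # seeded for reproducibility
    sigma = noise_fraction * statistics.pstdev(signal)
    return [[v + rng.gauss(0.0, sigma) for v in signal]
            for _ in range(n_replicates)]
```

With 100 models and 20 replicates each, this yields the 2000 records mentioned in the text.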

PC A was applied again to reduce the dimensionality 
of the classification space and to select the best spectra 
features for classification. After some preliminary tests, 
5 was set as a number of components for classification. 
Using this number of components, the explained vari- 
ance was 98%. With the 50 sorted components obtained, 
an iterative process of classification varying the number 
of components was applied using LDA and kNN as 
classifiers. The curve described by the set of classifica- 
tion error and number of components (5,10,15,. . .,50) 



194 



Automatic Classification of Impact-Echo Spectra I 



values has an inflection point at which the information
provided by the components gives the best classification.
Following this feature selection process, a
reduced set of features (the 'better' spectra components)
was obtained. Those features were used as inputs for the
ANNs, which improved classification performance
compared with using all the spectra components. The
number of selected components for ANN classification
varied from 20 to 30, depending on the classification level
(material condition, kind of defect, defect orientation,
defect dimension).
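
The sweep over the number of components can be illustrated with a small sketch. It is not the authors' implementation: it uses a plain 1-nearest-neighbour classifier only (the study used LDA and kNN), ordinary leave-one-out without the replicate exclusion described later, and randomly generated toy data in which only the first 10 of 50 components carry class information. All names are hypothetical.

```python
import random

def nearest_neighbour_error(records, labels, n_components):
    """Leave-one-out 1-NN error using only the first `n_components` features."""
    errors = 0
    for i, x in enumerate(records):
        best_dist, best_label = float("inf"), None
        for j, y in enumerate(records):
            if i == j:
                continue
            dist = sum((a - b) ** 2
                       for a, b in zip(x[:n_components], y[:n_components]))
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        errors += best_label != labels[i]
    return errors / len(records)

# Toy data (hypothetical): 2 classes, 50 sorted "components" per record.
rng = random.Random(1)
records, labels = [], []
for label in (0, 1):
    for _ in range(20):
        informative = [rng.gauss(1.0 * label, 1.0) for _ in range(10)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(40)]
        records.append(informative + noise)
        labels.append(label)

candidates = range(5, 55, 5)  # 5, 10, ..., 50
error_curve = {k: nearest_neighbour_error(records, labels, k)
               for k in candidates}
best_k = min(error_curve, key=error_curve.get)  # lowest-error component count
```

Plotting `error_curve` against the number of components gives the kind of curve from which the working number of components is read off.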

The classification was carried out with the Leave-One-Out
method, ensuring that no replicates or
repetitions of a test piece were in the training set for
that piece, so that generalization of pattern learning was
forced. Thus some of the records used in training and
testing corresponded to models or specimens with the
same kind of defect but located in different positions,
and the rest of the records corresponded to other kinds of
defective pieces. The results presented in the next sections
refer to the mean error in the testing stage.
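
One way to realise this restriction is to hold out all records of a piece at once, so that none of its replicates can appear in training. The sketch below assumes that reading of the procedure; the function name `leave_one_piece_out` and the piece identifiers are hypothetical.

```python
def leave_one_piece_out(piece_ids):
    """Yield (train_indices, test_indices), holding out every record of one piece."""
    for piece in sorted(set(piece_ids)):
        test = [i for i, p in enumerate(piece_ids) if p == piece]
        train = [i for i, p in enumerate(piece_ids) if p != piece]
        yield train, test

# Hypothetical example: 3 pieces with 2 repetitions each.
piece_ids = ["A", "A", "B", "B", "C", "C"]
splits = list(leave_one_piece_out(piece_ids))
```

Each split trains on the remaining pieces only, which forces the classifier to generalize from pieces of the same class at other defect positions.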

Simulation Results 

Figure 2a shows the results of classification by kNN and
LDA with linear, Mahalanobis, and quadratic distances
for simulations at level 4 of the classification tree. The
best percentage of classification success (75.9%) is obtained
by LDA-quadratic and LDA-Mahalanobis with
25 components. Those components were selected and
used as inputs for the input layer of the networks. One
hidden layer was used (different numbers of neurons
were tried to obtain the best configuration for
this layer), and the number of neurons
at the output layer was set to the number of classes,
which depends on the classification level. A validation
stage and the resilient propagation training method were
used in classifications with MLP. The spread parameter
was tuned for RBF. Figure 2b shows how the spread
affects the classification results at the "defect dimension"
level; in this case the minimum error (0.31)
is obtained for a spread value of 1.6.
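
To give an intuition for why the spread matters, the sketch below evaluates a Gaussian radial basis function at a fixed distance from its centre for several spread values. The Gaussian form and the parameterisation are assumptions made for illustration; the text does not specify the exact basis function or toolbox used.

```python
import math

def gaussian_rbf(x, centre, spread):
    """Gaussian basis function; assumed form exp(-d^2 / (2 * spread^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-d2 / (2.0 * spread ** 2))

centre = [0.0, 0.0]
point = [1.0, 0.0]  # a sample at distance 1 from the centre
activations = {s: gaussian_rbf(point, centre, s) for s in (0.4, 1.6, 6.4)}
```

A small spread makes each neuron respond only to samples very close to its centre, while a large spread makes neighbouring neurons respond almost identically, so an intermediate value usually has to be found by tuning.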

Summarised general results for the different classification
methods on the simulations are shown in Table 1.
The best classification performance is obtained by LDA
with quadratic distance, but the results of RBF are fairly
comparable. Since the classes are not equally probable
at each level, the general results are weighted by class
probability, see Figure 3. The homogeneous class was
completely distinguishable and the multiple-defect class
was the worst classified at every classification level.
The percentage of success could be much higher
if classification success for the multiple-defect
class were increased. This occurred because the multiple-defect
models consisted of models with various cracks, which
caused confusion between the crack and multiple-defect
classes. The percentage of success decreases for more
complex classifications, with the lowest RBF performance
being 69% for 12 classes.
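
The weighting of the general result by class probability can be written out as a short calculation. The per-class success values and record counts below are hypothetical placeholders chosen only to show the arithmetic; they are not the figures of the study.

```python
def weighted_success(success_by_class, count_by_class):
    """Overall success (%) weighted by the relative frequency of each class."""
    total = sum(count_by_class.values())
    return sum(success_by_class[c] * count_by_class[c] / total
               for c in success_by_class)

# Hypothetical per-class success (%) and record counts.
success = {"homogeneous": 100.0, "one-defect": 95.0, "multiple-defect": 25.0}
counts = {"homogeneous": 10, "one-defect": 70, "multiple-defect": 20}

overall = weighted_success(success, counts)
unweighted = sum(success.values()) / len(success)
```

In this toy case the weighted result is higher than the plain average because the poorly classified class is also a small one, which is why the two figures should not be confused.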



FUTURE TRENDS 

The proposed methodology was tested with a particular
kind of material, set of defects, and
multichannel testing configuration. It could be tested using models
and specimens of different materials, sizes, sensor
configurations, and signal processing parameters.

There are several classification techniques and algorithms
that can be explored for the proposed
problem. Recently a model of independent component
analysis (ICA) was proposed for impact-echo (Salazar,
Vergara, Igual, Gosalbez, & Miralles, 2004), and
new classifiers based on mixtures of ICAs have been
proposed (Salazar, Vergara, Igual, & Gosalbez, 2005),
(Salazar, Vergara, Igual, & Serrano, 2007) that include
issues such as semi-supervision in the training stage. The use
of prior knowledge in the training stage is critical in
order to obtain suitable models for different kinds of
classification. Those kinds of techniques could give
more understanding of how labelled and unlabelled data
change the model learned by the classifier. In addition,
more research is needed on the shape of the classification
space (impact-echo signal spectra), outlier
probability, and the decision regions of the classes for the
proposed problem.



CONCLUSION 

We demonstrate the feasibility of using neural networks
to extract patterns of different kinds of defects from
impact-echo signal spectra in simulations. The methodology
used was very restrictive because there was
only one piece for a defect at a given location in the
bulk and it was not in the training stage, so the classifier
had to assign the right class using the patterns of pieces
of the same class at other locations. Results could






Figure 2. LDA, kNN results and tuning of RBF parameter at Simulations, level 4 of classification 
Table 1. Summarised classification results for simulations

Error (%)    Level 1       Level 2       Level 3       Level 4
             (3 classes)   (4 classes)   (7 classes)   (12 classes)
LDA-L        6             13            30            29
LDA-Q        8             9             19            24.1
LDA-M        11.6          9             19            24.1
kNN          8             14            25            29
MLP          9             16            31            39
RBF          8             17            26            31


be used to implement the proposed method in real applications
of material quality evaluation; in those
applications the database collected over a reasonable
time could contain samples similar to the tested piece,
making the classification process easier.
Figure 3. Percentages of success in classifications by RBF

KEY TERMS

Artificial Neural Network (ANN): A mathematical
model inspired by biological neural networks. Its units,
called neurons, are connected in input, hidden
and output layers. For a specific stimulus (numerical
data at the input layer) some neurons are activated
according to an activation function and produce numerical
output. The ANN is trained in this way, storing the learned
model in the weight matrices of the neurons. This kind
of processing has proved suitable for finding
nonlinear relationships in data, being more flexible
in some applications than models extracted by linear
decomposition techniques.

Finite Element Method (FEM): A numerical
analysis technique for obtaining solutions to the differential
equations that describe, or approximately describe, a
wide variety of problems. The underlying premise of
FEM is that a complicated domain can be subdivided
into a series of smaller regions (the finite elements)
in which the differential equations are approximately
solved. By assembling the set of equations for each
region, the behavior over the entire problem domain
is determined.

Impact-Echo Testing: A non-destructive evalua- 
tion procedure based on monitoring the surface motion 
resulting from a short-duration mechanical impact. 
From analyses of the vibrations measured by sensors, a 
diagnosis of the material condition can be obtained. 

Non-Destructive Evaluation (NDE): NDE, ND
Testing or ND Inspection techniques are used in quality
control of materials. These techniques do not destroy
the test object and extract information on the internal
structure of the object. To detect defects such
as cracking and corrosion, different testing methods
are available, such as X-ray (where cracks show
up on the film), ultrasound (where cracks show up as
an echo blip on the screen) and impact-echo (where cracks
are detected by changes in the resonance modes of
the object).

Pattern Recognition: An important area of research
concerned with automatically discovering or identifying figures,
characters, shapes, forms, and patterns without active
human participation in the decision process. It is also
related to classifying data into categories. Classification
consists of learning a model for separating the data
categories; this kind of machine learning can be approached
using statistical (parametric or non-parametric
models) or heuristic techniques. If some prior information
is given in the learning process, it is called supervised
or semi-supervised; otherwise it is called unsupervised.

Principal Component Analysis (PCA): A method
for achieving dimensionality reduction. It represents
a set of N-dimensional data by means of their projections
onto a set of r optimally defined axes (principal
components). As these axes form an orthogonal set,
PCA yields a linear transformation of the data. Principal
components represent sources of variance in the data.
Thus the most significant principal components show
those data features which vary the most.
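
As a small illustration of this definition, the sketch below finds the first principal component of toy two-dimensional data by power iteration on the covariance matrix. Power iteration is only one of several ways to obtain the principal axes and is used here to keep the example dependency-free; the data and the function name are hypothetical.

```python
import math
import random

def first_principal_component(data, n_iter=200):
    """Return (unit axis, explained-variance fraction) of the first component."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centred = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(dim))
                     for a in range(dim))
    total_variance = sum(cov[a][a] for a in range(dim))
    return v, eigenvalue / total_variance

# Toy data: points scattered along the diagonal y = x with small noise.
rng = random.Random(0)
data = []
for _ in range(200):
    t = rng.gauss(0.0, 3.0)
    data.append([t + rng.gauss(0.0, 0.1), t + rng.gauss(0.0, 0.1)])

axis, explained = first_principal_component(data)
```

For this toy data the first axis lies close to the diagonal and explains nearly all of the variance, so one component is enough to represent the two original variables.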

Signal Spectra: The set of frequency components
into which an original time-domain signal is decomposed.
Several techniques exist for mapping a function from the time
domain to the frequency domain, such as the Fourier and wavelet
transforms, whose inverse transforms allow the original
signal to be reconstructed.
INTRODUCTION

We study the application of artificial neural networks 
(ANNs) to the classification of spectra from impact-echo 
signals. In this paper we focus on analyses from experi- 
ments. Simulation results are covered in paper I. 

Impact-echo is a Non-Destructive
Evaluation procedure in which a material is excited by a hammer
impact that produces a response from the material
microstructure. This response is sensed by a set of
transducers located on the material surface. The measured
signals contain backscattering from the grain microstructure
and information on flaws in the inspected material
(Sansalone & Street, 1997). The physical phenomenon
of impact-echo corresponds to wave propagation in
solids. When a disturbance (stress or displacement) is
applied suddenly at a point on the surface of a solid, such
as by impact, the disturbance propagates through the
solid as three different types of stress waves: a P-wave,
an S-wave, and an R-wave. The P-wave is associated
with the propagation of normal stress and the S-wave
is associated with shear stress; both propagate
into the solid along spherical wave fronts. In addition,
a surface wave, or Rayleigh wave (R-wave), travels
along a circular wave front over the material
surface (Carino, 2001).

After a transient period in which the first waves arrive,
wave propagation becomes stationary in resonant
modes of the material that vary depending on the defects
inside the material. In defective materials the propagated
waves have to travel around the defects, so their energy
decreases, and multiple reflections and diffraction at
the defect borders give rise to reflected waves (Sansalone,
Carino, & Hsu, 1998). Depending on the observation
time and the sampling frequency used in the experiments,
we may be interested in analyzing the transient
or the stationary stage of the wave propagation in impact-echo
tests. Usually, with high resolution in time,
analyses of wave propagation velocity can give useful
information, for instance, to build a tomography of a
material inspected from different locations. Considering
the sampling frequency that we used in the experiments
(100 kHz), a feature extracted from the signal such as the
wave propagation velocity is not accurate enough to
discern between homogeneous and different kinds of
defective materials.

The data set for this research consists of sonic and
ultrasonic impact-echo signal (1-27 kHz) spectra obtained
from 84 parallelepiped-shaped (7x5x22 cm width,
height and length) lab specimens of aluminium alloy
series 2000. These spectra, along with a categorization
of the quality of the materials into homogeneous,
one-defect and multiple-defect classes, were used to
develop supervised neural network classifiers. We
show that neural networks yield good classifications
(<15% error) of the materials at four levels of classification
detail: material condition, kind of defect,
defect orientation and defect dimension. Results for
Multilayer Perceptron (MLP) and Radial Basis Function
(RBF) neural networks, Linear Discriminant Analysis
(LDA), and k-Nearest Neighbours (kNN) algorithms
(Duda, Hart, & Stork, 2000), (Bishop, 2004) are
presented. Figure 1 shows the scheme of categories
proposed as a hierarchical layout with different levels
of knowledge on the material defects (the percentage
of success in classification is explained in the Experimental
Results section).
Figure 1. Classification tree with percentages of success in classification by RBF network. Numbers in brackets are results
for simulations (paper I). General results are weighted by class probability since classes are not equally probable.

BACKGROUND

The phenomenon of volumetric wave propagation in
impact-echo can be modelled by means of the following
two equations (Cheeke, 2002):



$$\frac{\partial T_{ij}}{\partial x_j} = \rho_0 \frac{\partial^2 u_i}{\partial t^2} \qquad (1)$$

$$T_{ij} = c_{ijkl} S_{kl} \qquad (2)$$

where

$\rho_0$: Material density.
$u_i$: Length elongation with respect to the starting point in the force direction.
$\partial T_{ij} / \partial x_j$: Force variation in the i direction due to deformations in the j directions.
$c_{ijkl}$: Elastic constant tensor (Hooke's law).
$S_{kl}$: Strain or relative volume change under deformation in face l in direction k in the unitary cube that represents a material element.




Thus the force variation in direction i due to face
stresses in the j directions of the elementary material cube
is equal to the mass per unit volume (density) times the
acceleration of the elongation (Newton's second law in tensorial form).
Deriving an analytical solution to problems that involve
stress wave propagation in bounded solids is very
complicated, so the literature on this subject is not very
extensive. Numerical models such as the Finite Element
Method (FEM) can be used to obtain an approximation
to the theoretical response of the material (Abraham,
Leonard, Cote, & Piwakowski, 2000).

Several studies have used impact-echo
signals in the frequency domain to detect the existence
of defects in materials (Sansalone et al., 1997), (Hill,
McHung, & Turner, 2000), (Sansalone, Lin, & Street,
1998). It has been demonstrated that a sequence of
tones and harmonics appears in the spectra. They are
fundamental modes of propagation that travel inside
the (block-shaped) material, and their frequencies
depend on the shape and size of the material inspected
by impact-echo. Depending on the block face where a
sensor is located, different fundamental modes
are captured. However, other tones are formed by the
reflections of the waves from the defects in the material,
and their frequencies are related to the depth of
the flaws. In addition, the presence of defects causes
shifting of the fundamental mode frequencies due to
diffraction.

MLP neural networks have been applied to impact-echo
in mono-sensor configurations (using only one
accelerometer) to detect flaws in concrete slabs (Pratt
& Sansalone, 1992), to identify unilaterally working
sublayer cracks (Stavroulakis, 1999), and to classify
concrete slabs as solid or defective (Xiang & Tso,
2002). Those applications used a small number of experiments
with many repetitions, or combined simulated
with experimental signals, so their results may need to be verified
because of probable overfitting. Another application is
the prediction of shallow crack depths in asphalt pavements
using features from an extensive real signal dataset
(Mei, 2004). Recently, we applied
MLP, RBF and LVQ to the classification tree proposed
here using temporal and frequency features extracted
from the signals, finding that the best features were
frequency features (Salazar, Unio, Serrano, & Gosalbez,
2007).

In this paper we demonstrate the suitability of PCA 
application on the impact-echo signal spectra to obtain 
complex classifications in real experiments. The first 
components of PCA retain essentially all of the useful 
information and this compression optimally removes 
noise. The principal components represent sources of 
variance in the data. Thus the most significant spectra 
principal components show those features which vary 
the most between the spectra: it is important to realise 
that the principal components do not simply represent 
strong features. The principal components are eigenvec- 
tors of a symmetric matrix; they are simple rotations 
in the N-dimensional data space of the original axes
on which the spectra are defined, thus they resemble 
the spectra (Bailer-Jones, 1996), (Bailer-Jones, Irwin, 
& Hippel, 1998), (Xu et al., 2004). 



RECOGNITION OF DEFECT 
PATTERNS IN IMPACT-ECHO 
SPECTRA-EXPERIMENTS 

Impact-Echo Signals 

The equipment used in the experiments was composed
of an instrumented hammer 084A14 PCB, 7 mono-axial
accelerometers 353B17 PCB, a data acquisition
module NI 6067E, an ICP signal conditioner F482A18
and a notebook for signal processing and control. The
sampling frequency in signal acquisition was 100
kHz, and the observation time recorded was 30 ms. Figure
2a shows a photograph of the equipment employed in the
experiments; note that a 7x5x22 cm specimen with
sensors positioned is being tested. Figure 2b shows
a layout of the sensor locations on the surface of the
piece (1 sensor at the back face, 4 sensors at the side
faces, and 2 sensors at the top face), the supports, and the
place of the impact. Sensors S4, S6, S8 are located at
one third and S3, S5, S7 are located at two thirds of
the piece length along axis Z. S2 is in the middle of
the face opposite to the impact.

The defects consisted of holes in the form of 10 mm
cylinders, and cracks in the form of 5 mm parallelepipeds
with different orientations along the axes
(X, Y) and planes (XY, ZY, XZ) of the material block.
The defects had two dimensions: passing through and
half-passing through. Figure 2b shows a diagram of a
defect of the class "half-passing through crack oriented
in plane ZY". The complete set of defective materials
analyzed is depicted in Figure 1.



Figure 2. Experimental setup and sensor configuration 
Feature Extraction and Selection

The methodology followed for feature extraction,
feature selection, dimensionality reduction and
classification of the impact-echo signal spectra was
the same as that applied in paper I. After signal acquisition a four-stage
procedure was followed: feature extraction,
dimensionality reduction, feature selection, and classification
with ANNs.

In the feature extraction stage, a 1024-point
FFT was applied to the measured signals and these
spectra were compressed by PCA, selecting the first
20 components for each channel. Thus the inputs for the
dimensionality reduction stage were 140 components
(7 channels x 20) for the 84 lab specimens. Around 22
repetitions were performed for each experiment (specimen),
so the total number of records was 1881 for the experiments,
each with 140 spectra components. In the
dimensionality reduction stage PCA reduced the 140
spectra components to 50 spectra components with
92% explained variance. This matrix of 50 selected
components by 1881 records was the input for a feature
selection process whose objective was to find the
"best" number of components for classification. Then
various classification tests using LDA and kNN, varying
the number of components from 5 to 50 in increments
of 5, were applied. The components corresponding
to the best percentage of success in classification with
kNN and LDA were selected as inputs for the stage
of classification with MLP and RBF. The number of
spectra components varied from 10 to 30 depending
on the classification level. Parameters such as the spread for
RBF and the number of neurons in the hidden layer
for MLP were tuned to obtain the best classification
percentage of success of the ANNs.
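
The acquisition and feature-extraction figures quoted in this section imply a few derived quantities, worked out in the short calculation below. The variable names are hypothetical, and the text does not state how the 1024-point FFT was positioned within each recorded signal, so only the basic arithmetic is shown.

```python
sampling_rate_hz = 100_000   # 100 kHz sampling frequency
observation_time_s = 0.030   # 30 ms observation time
fft_points = 1024            # FFT length
n_channels = 7               # one channel per accelerometer
components_per_channel = 20  # PCA components kept per channel

samples_per_record = round(sampling_rate_hz * observation_time_s)
bin_width_hz = sampling_rate_hz / fft_points   # frequency resolution of the FFT
nyquist_hz = sampling_rate_hz / 2              # highest representable frequency
features_per_record = n_channels * components_per_channel
```

Each record therefore holds 3000 time samples per channel, the FFT bins are about 97.7 Hz apart, the 1-27 kHz band of interest lies well below the 50 kHz Nyquist limit, and the feature vector has 140 entries before the second PCA stage.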



All the classifications used the Leave-One-Out method.
Repetitions of the piece under test were not used in its
training stage, to prevent the classifier from memorizing the
pieces instead of generalizing patterns. Table 1 shows
summarised results for all the classifiers applied at the
different levels of classification; these results refer to the
mean error in the testing stage.

Experimental Results 

The general classification results for the experiments in
Table 1 show RBF as the best classifier, improving
its performance by nearly 20% with regard to the simulation
results in paper I at the most complex level of classification
(12 classes). The percentage of classification
success improved for every class at each level, particularly
for the multiple-defect class, from 25% up to 92.6%
at the first level and 89.1% at the fourth level, see Figure 1.
In the experiments, specimens with multiple defects were
prepared by combining cracks and holes, so there was not
much confusion between the multiple-defect and one-defect
classes.

Real impact-echo experiments involve random
variables in their execution, such as the force applied in the
impact excitation and the position of the sensors, which
can vary from piece to piece because they are manually
controlled. Those variables yield repetitions of the
experiments whose signal spectra
separate the class regions better than the Gaussian noise used
to obtain replicates of the simulated model signals.

The results of the experiment classifications confirm the
feasibility of using neural networks for pattern recognition
of defects in impact-echo signals. Table 2 contains
the confusion matrix at the "defect orientation" level. The
homogeneous class is perfectly classified, and all the rest



Table 1. Summarised classification results for experiments

Error (%)    Level 1       Level 2       Level 3       Level 4
             (3 classes)   (4 classes)   (7 classes)   (12 classes)
LDA-L        7.8           18.3          28.5          39.6
LDA-Q        6.7           15.3          20.8          26.3
LDA-M        5.1           21            21.8          28.9
kNN          3.3           14.2          20.5          23.2
MLP          5.6           19.7          30.2          40.6
RBF          1.75          10.88         10.99         11.93
Table 2. Confusion matrix obtained by RBF at experiments, level 3 of classification

                   Homogeneous  Hole X  Hole Y  Crack XY  Crack ZY  Crack XZ  Multiple defects
Homogeneous        1.000        0.000   0.000   0.000     0.000     0.000     0.000
Hole X             0.000        0.914   0.000   0.000     0.048     0.000     0.038
Hole Y             0.000        0.000   0.488   0.107     0.298     0.107     0.000
Crack XY           0.000        0.000   0.022   0.909     0.009     0.025     0.034
Crack ZY           0.000        0.009   0.049   0.003     0.930     0.009     0.000
Crack XZ           0.000        0.000   0.005   0.003     0.044     0.949     0.000
Multiple defects   0.000        0.065   0.000   0.034     0.000     0.013     0.888



of the six classes are well classified, except the "hole Y"
class (48.8% success). This class is frequently confused
with all the classes of cracks; this could be because the defect
geometry does not produce a discernible wave
pattern from the wave propagation phenomena. In addition,
the multiple-defect class is sometimes confused with cracks
and hole X. This is because particular patterns of one of
the defects inside some multiple-defect specimens are
more dominant in the spectra, causing the multiple-defect
spectra to resemble crack or hole X spectra.
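
The reading of Table 2 given above can be reproduced with a few lines of code. The matrix values are copied from Table 2; treating each row as the true class is an assumption, supported by the fact that every row sums to approximately one. Variable names are hypothetical.

```python
classes = ["Homogeneous", "Hole X", "Hole Y", "Crack XY",
           "Crack ZY", "Crack XZ", "Multiple defects"]

# Values from Table 2 (assumed: rows = true class, columns = assigned class).
confusion = [
    [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.914, 0.000, 0.000, 0.048, 0.000, 0.038],
    [0.000, 0.000, 0.488, 0.107, 0.298, 0.107, 0.000],
    [0.000, 0.000, 0.022, 0.909, 0.009, 0.025, 0.034],
    [0.000, 0.009, 0.049, 0.003, 0.930, 0.009, 0.000],
    [0.000, 0.000, 0.005, 0.003, 0.044, 0.949, 0.000],
    [0.000, 0.065, 0.000, 0.034, 0.000, 0.013, 0.888],
]

# Per-class success is the diagonal of the matrix.
success = {name: confusion[i][i] for i, name in enumerate(classes)}
worst_class = min(success, key=success.get)

# Largest off-diagonal entry in the row of the worst class.
row = confusion[classes.index(worst_class)]
off_diagonal = {classes[j]: v for j, v in enumerate(row)
                if classes[j] != worst_class}
main_confusion = max(off_diagonal, key=off_diagonal.get)
```

The worst class is "Hole Y", and its largest confusion is with "Crack ZY", in line with the discussion of the hole Y class above.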



FUTURE TRENDS

The problem of material evaluation was defined at different
levels of classification in a hierarchical outline with different
kinds of insight into the quality of the tested material.
The problem could be restated to classify
defects by ranges of defect size, independently of their
shape or orientation; this kind of classification is very
useful in industries such as marble factories. The applicability
of the proposed methodology has to be confirmed
by application to different materials.

The RBF neural network yielded good results for all
levels of classification, but more algorithms have to be
tested, taking into account the feasibility of their implementation
in a real-time application and the improvement
of the classification percentage of success. For
instance, new classification algorithms exploit linear
dependencies in the data and allow semi-supervised
learning (Salazar, Vergara, Igual, Gosalbez, & Miralles,
2004), (Salazar, Vergara, Igual, & Gosalbez, 2005),
(Salazar, Vergara, Igual, & Serrano, 2007). That kind
of modelling and learning procedure could be suitable
for the classification of materials tested by impact-echo.
The training stage and the percentage of supervision are
critical in order to develop a suitable model
from the data for classification. Thus, depending on the
kind of defective materials used in training, a better
adapted model for a specific classification would be
defined. A decision fusion made by various classifiers
could then be more suitable than the decision made
by one classifier.



CONCLUSION 

We demonstrate the feasibility of using neural networks
to extract patterns of different kinds of defects from
impact-echo signal spectra in lab experiments. The general
results of the applied neural networks show RBF as the
most suitable technique for the impact-echo problem,
even at complex levels of classification, discerning
up to 12 classes of homogeneous, one-defect and
multiple-defect materials.

The proposed methodology has yielded encouraging
results with controlled lab experiments (same dimensions
of the specimens, good wave-propagation material,
and well-defined defects). The procedure has to
be tested on real industrial materials with
a range of different dimensions, kinds of defects and
microstructures for which the impact-echo signal spectra
define fuzzy regions for classification.
KEY TERMS

Accelerometer: A device that measures accelera- 
tion which is converted into an electrical signal that 
is transmitted to signal acquisition equipment. In 
impact-echo testing, the measured acceleration refers 
to vibration displacements caused by the excitation of 
the short impact. 

Dimensionality Reduction: A process to reduce
the number of variables of a problem. The dimension of a
problem is given by the number of variables (features
or parameters) that represent the data. After signal feature
extraction (which reduces the original signal sample
space), the dimensionality may be reduced further by
feature selection methods.

Fast Fourier Transform (FFT): A class of algo- 
rithms used in digital signal processing to compute the 
Discrete Fourier Transform (DFT) and its inverse. It has 
the capability of taking functions from the time domain 
to the frequency domain. The frequency components 
obtained are the spectra of the signal. 
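
A direct evaluation of the Discrete Fourier Transform illustrates what the FFT computes. The sketch below is a naive O(N^2) DFT, not an FFT algorithm; it produces the same magnitudes an FFT would, only more slowly. The test tone and the function name are hypothetical.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of the first N/2 DFT bins, computed directly from the definition."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n // 2)]

sampling_rate_hz = 100_000
n = 64
tone_hz = 8 * sampling_rate_hz / n  # a tone that falls exactly on bin 8
signal = [math.sin(2 * math.pi * tone_hz * t / sampling_rate_hz)
          for t in range(n)]

spectrum = dft_magnitudes(signal)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sampling_rate_hz / n
```

The peak of the spectrum falls on the bin whose frequency equals the tone frequency, which is the basic mechanism by which resonance modes show up as peaks in impact-echo spectra.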

Feature Extraction (FE): A process to map a
multidimensional space into a space of fewer dimensions.
In signal processing, instead of processing raw
signals with thousands of samples, it is more efficient
to process features extracted from the signals, such
as signal power, principal frequency, and attenuation
coefficient.

Feature Selection (FS): A technique that selects
a subset of features from a given set of features that
represent the relevant properties of the data. FS may also
be defined as the task of choosing a small subset of
features that is sufficient to predict the target labels
well, which is crucial for efficient learning. There are several
FS methods based on margins (e.g., relief, simba) or
information theory (e.g., infogain). Supervised FS
methods use a priori knowledge of a classification
variable to select variables highly correlated with the
known variable.

Leave-One-Out: A method used in classification 
with the following steps: i.) Label the database cases 
with the known classes, ii.) Select a case of the da- 
tabase, iii.) Estimate the class for selected case by a 
classifier using the remaining cases as training data, 
iv.) Repeat steps ii and iii until the end of the cases, 
v.) Calculate the mean percentage of success for clas- 
sification results. 

Signal Conditioner (SC): A device that converts 
one type of electronic signal into another type of sig- 
nal. Its primary use is to convert a signal that may be 
difficult to read by conventional instrumentation into 
a more easily read format. Typical SC functions are 
amplification, electrical isolation, and linearization. 
INTRODUCTION

Defect detection in manufactured products is of utmost
importance in the optimization of industrial processes
(Garcia 2005). In fact, the industrial inspection of
engineering materials and products aims at the detection,
localization and classification of flaws as quickly
and as accurately as possible in order to improve
production quality. A relevant area in this field is
visual inspection. Nowadays, this task
is often carried out by a human expert. Nevertheless,
this kind of inspection can prove time-consuming
and suffer from low repeatability because the judgment
criteria can differ from operator to operator. Furthermore,
visual fatigue or loss of concentration inevitably
leads to missed defects (Han, Yue & Yu 1999, Kwak,
Ventura & Tofang-Sazi 2000, YA. Karayiannis, R.
Stojanovic, P. Mitropoulos, C. Koulamas, T. Stouraitis,
S. Koubias & G. Papadopoulos 1999, Patil, Biradar &
Jadhav 2005).

In order to reduce the burden on human testers and
improve the detection of faulty products, many
researchers have recently been engaged in developing systems
for the Automated Visual Inspection (AVI) of manufactured products
(Chang, Lin & Jeng 2005, Lei 2004, Yang, Pang &
Yung 2004). These systems prove reliable from a
technical point of view and mimic the experts in the
evaluation of defects appropriately (Bahlmann,
Heidemann & Ritter 1999), even if defect detection in
visual inspection can become a hard task. In fact, in
industrial processes a large amount of data has to be
handled and flaws belong to a great number of classes
with dynamic defect populations, because defects
can present similar characteristics among different
classes and different interclass features (R. Stojanovic,
P. Mitropulos, C. Koullamas, Y. Karayiannis, S. Koubias
& G. Papadopoulos 2001). Therefore, visual inspection
systems need to be able to adapt to dynamic
operating conditions. To this end, soft computing
techniques based on Artificial Neural Networks
(ANNs) have already been proposed in several
different areas of industrial production. In fact, neural
networks are often exploited for their ability to recognize
a wide range of different defects (Kumar 2003,
Chang, Lin & Jeng 2005, Garcia 2005, Graham, Maas,
Donaldson & Carr 2004, Acciani, Brunetti & Fornarelli
2006). Although adequate in many instances, in other
cases neural networks may not be the most suitable
solution. In fact, the design of ANNs often requires
the extraction of parameters and features, during a
preprocessing stage, from a suitable data set in which
as many defects as possible are recognized (Bahlmann,
Heidemann & Ritter 1999, Karras 2003, Rimac-Drlje,
Keller & Hocenski 2005). Therefore, methods based
on neural networks can be time-consuming for in-line
applications because of such preliminary steps and
can prove complex (Kumar 2003, Kwak, Ventura
& Tofang-Sazi 2000, Patil, Biradar & Jadhav 2005, R.
Stojanovic, P. Mitropulos, C. Koullamas, Y. Karayiannis,
S. Koubias & G. Papadopoulos 2001). For this
reason, when time constraints play an important role
in an industrial process, a hardware implementation of the
abovementioned methods can be proposed (R. Stojanovic,
P. Mitropulos, C. Koullamas, Y. Karayiannis,
S. Koubias & G. Papadopoulos 2001), but this kind
of solution implies a further design effort, which can
be avoided by considering Cellular Neural Networks
(CNNs) (Chua & Roska 2002).

Cellular Neural Networks have good potential to
overcome this problem; in fact, their hardware implementation
and massive parallelism can satisfy the strict
time constraints of some industrial processes, allowing
the diagnosis to be included inside the production
process. In this way the defect detection method can
work in real time according to the specific
industrial process.
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BACKGROUND 

Cellular Neural Networks consist of processing units 
C(i, j), which are arranged in an MxN grid, as shown 
in Figure 1. 

The generic basic unit C(i, j) is called a cell: it
corresponds to a first-order nonlinear circuit,
electrically connected to the cells that belong to the
set $S_r(i, j)$, named the sphere of influence of radius
r of C(i, j). The set $S_r(i, j)$ is defined as:



$$
S_r(i,j) = \left\{ C(k,l) \;\middle|\; \max\left(|k-i|,\,|l-j|\right) \le r,\quad 1 \le k \le M,\ 1 \le l \le N \right\}
$$



An MxN Cellular Neural Network is defined by an MxN
rectangular array of cells C(i, j) located at site
(i, j), i = 1, 2, ..., M, j = 1, 2, ..., N. Each cell
C(i, j) is defined mathematically by the following state
and output equations:



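$$
\frac{dx_{ij}}{dt} = -x_{ij} + \sum_{C(k,l)\in S_r(i,j)} A(i,j;k,l)\,y_{kl} + \sum_{C(k,l)\in S_r(i,j)} B(i,j;k,l)\,u_{kl} + z_{ij}
$$

$$
y_{ij} = \frac{1}{2}\left( \left|x_{ij}+1\right| - \left|x_{ij}-1\right| \right)
$$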



where $x_{ij} \in \mathbb{R}$, $y_{ij} \in \mathbb{R}$ and
$z_{ij} \in \mathbb{R}$ are the state, output and threshold
of cell C(i, j), and $y_{kl} \in \mathbb{R}$ and
$u_{kl} \in \mathbb{R}$ are the output and input of cell
C(k, l), respectively. A(i, j; k, l) and B(i, j; k, l) are
called the feedback and the input synaptic operators and
uniquely identify the network.

Figure 1. Standard CNN architecture
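
The state and output equations of the cell can be explored numerically. The following is only a minimal sketch, assuming a space-invariant network with radius r = 1 (so that A and B reduce to 3x3 templates), zero-valued cells outside the grid, and forward Euler integration; the names `simulate_cnn`, `neighbourhood_sum`, `dt` and `steps` are illustrative and do not come from the chapter.

```python
import numpy as np


def cnn_output(x):
    """Piecewise-linear output y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def neighbourhood_sum(template, field):
    """Weighted sum of `field` over the 3x3 sphere of influence (r = 1).

    Cells outside the grid contribute zero (assumed boundary condition).
    """
    rows, cols = field.shape
    padded = np.pad(field, 1, mode="constant")
    total = np.zeros((rows, cols), dtype=float)
    for dk in range(3):
        for dl in range(3):
            total += template[dk, dl] * padded[dk:dk + rows, dl:dl + cols]
    return total


def simulate_cnn(u, x0, A, B, z, dt=0.05, steps=400):
    """Integrate dx/dt = -x + A*y + B*u + z with the forward Euler method."""
    x = np.array(x0, dtype=float)
    feedforward = neighbourhood_sum(B, u) + z  # constant during the transient
    for _ in range(steps):
        y = cnn_output(x)
        x = x + dt * (-x + neighbourhood_sum(A, y) + feedforward)
    return cnn_output(x)
```

With a feedback template whose only non-zero entry is a central value greater than 1, and with B and z set to zero, each cell should saturate to +1 or -1 according to the sign of its initial state, which gives a quick sanity check of the dynamics.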

The reported circuit model constitutes a hardware
paradigm which allows fast processing of signals. For
this reason, CNNs have long been considered a useful
framework for defect detection in industrial
applications (Roska 1992). Subsequently, different
CNN-based contributions working in real time and aimed at
defect detection in the industrial field have been
proposed (Bertucco, Fargione, Nunnari & Risitano 2000),
(Occhipinti, Spoto, Branciforte & Doddo 2001), (Guinea,
Gordaliza, Vicente & Garcia-Alegre 2000), (Perfetti &
Terzoli 2000). In (Bertucco, Fargione, Nunnari & Risitano
2000) and (Occhipinti, Spoto, Branciforte & Doddo 2001)
non-destructive testing of mechanical parts in
aeronautical production is carried out by defining an
algorithm which is implemented entirely by means of CNNs.
These methods prove effective, but a complex acquisition
system is required to provide information about the
defects. In (Guinea, Gordaliza, Vicente & Garcia-Alegre
2000) CNNs constitute the core processors of a system
which performs automatic inspection of metal laminates,
whereas in (Perfetti & Terzoli 2000) two CNN-based
algorithms are proposed in order to detect stains and
irregularities in a textile application. In both works
real-time operation is guaranteed, but in (Guinea,
Gordaliza, Vicente & Garcia-Alegre 2000) the synthesis
criteria for the CNN circuit parameters could prove
difficult to satisfy, whereas in (Perfetti & Terzoli
2000) such criteria are not defined.

In the following section a CNN-based method is proposed
which overcomes most of the drawbacks that arise in the
reported approaches.



AUTOMATIC DEFECT DETECTION 
METHOD 

In this section an automatic method for the visual
inspection of surface flaws on manufactured products is
proposed. This method is realized by means of a CNN-based
architecture, which is described in detail in the
companion chapter (Fornarelli & Giaquinto 2007).

The suggested approach consists of three steps. The first
one is a preprocessing stage which makes it possible to
identify possible defective areas; in the second stage
the matching between the pre-processed image and a
reference image is performed; finally, in the third step
an output binary image, in which only defects are
represented, is produced.

The proposed solution needs neither a complex acquisition
system nor feature extraction: the image is processed
directly, and the synthesis parameters of the system are
evaluated automatically from the statistical properties
of the image. Furthermore, the proposed system is well
suited for single-board implementation.

The scheme that represents the proposed method 
is shown in Figure 2. 

As can be observed, it is formed by three modules: a
Preprocessing module, an Image Matching module and a
Defect Detection module. The input images, named O and R,
are acquired by means of a camera, which yields
256-gray-level images of dimensions m x n. The image O
represents the product under test or a part of it. This
image contains the Region of Interest (ROI), that is, the
specific region of an object in which defects are to be
detected.

The image R constitutes a reference image, in which a
product without defects (or a part of it) is depicted.
This image is acquired off-line during the system
calibration phase and stored in a memory. It is used to
detect possible variations caused by the presence of
dents, scratches or breaks on an observed surface. In
order to allow a good match between the reference image
and the image under test, the preprocessing blocks
perform a contrast enhancement, providing images $O_F$
and $R_F$, which constitute the inputs for the subsequent
Image Matching module. The target of this block is
finding the minimum difference between the two images
$O_F$ and $R_F$. In fact, during the production process,
the acquisition system could give images in which the
product is shifted along the four cardinal directions.
This implies that the difference between $O_F$ and $R_F$
could lead to the detection of false defects. The Image
Matching module minimizes such effects, looking for the
best match between the image to be processed and the
reference one. The difference image D then feeds the
Defect Detection module. This part aims at detecting the
presence of flaws on the product under test and gives an
output image containing only the defects. The output
image can be used to trigger alarm systems that signal
the presence of flaws, making this industrial task
easier; it could also support experts in their diagnoses.
The detailed implementation of each module will be
illustrated in the second part of this contribution.

Figure 2. Block diagram of the proposed CNN-based method



FUTURE TRENDS 

In order to provide as much information as possible about
the defects detected by the proposed approach in
industrial processes, the features of flaws should be
identified. For this reason, future work will be devoted
to the evaluation of different characteristics such as
the dimension of defects, the kind of damage and its
degree. Moreover, the advantages of applying the proposed
method in various industrial fields will be investigated,
and techniques for minimizing possible misclassifications
in particular applications will be developed.


CONCLUSION

In this chapter a CNN-based method for the visual
inspection of surface flaws on manufactured products has
been proposed. The approach consists of three modules. A
Preprocessing Module provides images in which contrast is
enhanced. An Image Matching Module compensates for
possible misalignment between the product under test and
the acquisition system. Finally, the Defect Detection
Module extracts images which contain the defects of the
product.

The suggested method offers attractive advantages. It is
general, and can therefore be introduced in different
industrial fields in which the identification of surface
anomalies such as dents, corrosion or spots on
manufactured products is a fundamental task.

Moreover, the suggested method is aimed at implementation
by means of an architecture formed entirely by Cellular
Neural Networks, exploiting the potential that this kind
of network offers in processing signals. Therefore, the
proposed approach makes it possible to automate in-line
diagnosis processes, reducing the operators' burden in
identifying production defects.


KEY TERMS

Artificial Neural Networks: A set of basic processing
units which communicate with each other through weighted
connections. These units give rise to parallel processing
with particular properties such as the ability to adapt
or learn, to generalise, to cluster or organise data, and
to approximate non-linear functions. Each unit receives
an input from neighbours or external sources and uses it
to compute an output signal. This signal is propagated to
other units or is a component of the network output. In
order to map an input set into an output one, a neural
network is trained with teaching patterns, changing its
weights according to proper learning rules.



Automated Visual Inspection: An automatic form 
of quality control normally achieved using one or more 
cameras connected to a processing unit. Automated 
Visual Inspection has been applied to a wide range of 
products. Its target consists of minimizing the effects 
of visual fatigue of human operators who perform the 
defect detection in a production line environment. 

Cellular Neural Networks: A particular circuit 
architecture which possesses some key features of 
Artificial Neural Networks. Its processing units are 
arranged in an MxN grid. The basic unit of Cellular 
Neural Networks is called cell and contains linear and 
non linear circuit elements. Each cell is connected only 
to its neighbour cells. The adjacent cells can interact 
directly with each other, whereas cells not directly 
connected together may affect each other indirectly 
because of the propagation effects of the continuous 
time dynamics. 

Defect Detection: Extraction of information about the
presence of an instance in which a requirement is not
satisfied in industrial processes. The aim of Defect
Detection is to highlight manufactured products which are
incorrect or which lack required functionality or
specifications.

Image Matching: Establishment of the corre- 
spondence between each pair of visible homologous 
image points on a given pair of images, aiming at the 
evaluation of novelties. 

Industrial Inspection: Analysis pursuing the 
prevention of unsatisfactory industrial products from 
reaching the customer, particularly in situations where 
failed manufactures can cause injury or even endanger 
life. 

Region of Interest: A selected subset of samples 
within a dataset identified for a particular purpose. In 
image processing, the Region of Interest is identified by 
the boundaries of an object. The encoding of a Region 
of Interest can be achieved by basing its choice on: 
(a) a value that may or may not be outside the normal 
range of occurring values; (b) purely separated graphic 
information, like drawing elements; (c) separated 
semantic information, such as a set of spatial and/or 
temporal coordinates.


INTRODUCTION

Automatic visual inspection plays an important role in
defect detection in industrial production. In this field
a fundamental role is played by methods for detecting
surface anomalies on manufactured products.

In particular, several systems have been proposed in
order to reduce the burden on human operators, avoiding
the drawbacks due to the subjectivity of judgement
criteria (Kwak, Ventura & Tofang-Sazi 2000, Patil,
Biradar & Jadhav 2005).

Proposed solutions must be able to handle and process a
large amount of data. For this reason, neural
network-based methods have been suggested for their
ability to deal with a wide range of data (Kumar 2003,
Chang, Lin & Jeng 2005, Garcia 2005, Graham, Maas,
Donaldson & Carr 2004, Acciani, Brunetti & Fornarelli
2006). Moreover, in many cases these methods must satisfy
the time constraints of industrial processes, because the
diagnosis needs to be included in the production process.

To this purpose, architectures based on Cellular Neural
Networks (CNNs) have proved successful in the field of
real-time defect detection, because these networks allow
a hardware implementation and massive parallelism
(Bertucco, Fargione, Nunnari & Risitano 2000),
(Occhipinti, Spoto, Branciforte & Doddo 2001), (Perfetti
& Terzoli 2000). On the basis of these considerations, a
method to identify surface damage and anomalies in
manufactured products has been given in (Fornarelli &
Giaquinto 2007). This method is aimed at implementation
by means of an architecture formed entirely by Cellular
Neural Networks, whose synthesis is illustrated in the
present work. The suggested solution proves effective for
the detection of defects, as shown by two test cases
carried out on an injection pump and a textile sample.


BACKGROUND

In the companion paper an approach for the detection of
surface flaws on manufactured products is proposed: this
approach can be divided into three modules, named the
Preprocessing module, the Image Matching module and the
Defect Detection module, respectively. The first one is a
pre-processing stage which makes it possible to identify
possible defective areas; in the second stage the
matching between the pre-processed image and a reference
one is performed; finally, in the third step an output
binary image, in which only defects are represented, is
produced.

The proposed solution needs neither a complex acquisition
system nor feature extraction: the image is processed
directly, and the synthesis parameters of the networks
are evaluated automatically from the statistical
properties of the image. Furthermore, the proposed system
is well suited for a single-board implementation.



CNN-BASED DIAGNOSIS ARCHITECTURE

The detailed implementation of each module is illustrated
in the following. The results obtained by testing the
suggested architecture on two real cases are then shown,
and the numerical outcomes are discussed.

Preprocessing Module 

The Preprocessing is realized by a Fuzzy Contrast
Enhancement block. This block consists of a Fuzzy
Associative Memory (FAM), developed as the preprocessing
stage of the CNN-based system considered in (Carnimeo &
Giaquinto 2002). The proposed circuit transforms
256-gray-level images into fuzzified ones, whose contrast
is enhanced because their histograms are stretched. To
this purpose a proper fuzzification procedure is
developed to define two fuzzy subsets adequate to
describe the semantic content of patterns such as images
of industrial objects, which can be classified as
belonging to the Object/Background class.

In an analogous way, the domain of output values has been
characterized by means of two output fuzzy subsets
defined as Dark and Light. In particular, the fuzzy rules
which provide the mapping from original images (O/R) into
fuzzified ones ($O_F$/$R_F$) can be expressed as:

IF $O(i, j) \in$ Object THEN $O_F(i, j) \in$ Dark

IF $O(i, j) \in$ Background THEN $O_F(i, j) \in$ Light

where $O(i, j)$ and $O_F(i, j)$ denote the gray level
value of the (i, j)-th pixel in the original image and in
the fuzzified one, respectively. As shown in (Carnimeo &
Giaquinto 2002), the reported fuzzy rules can be encoded
into a single FAM.

Then, a Cellular Neural Network is synthesized to behave
as the codified FAM by adopting the synthesis procedure
developed in (Carnimeo & Giaquinto 2002), where the
synthesis of a CNN-based memory which contains the
above-mentioned fuzzification rules is formulated in
detail.

Contrast-enhanced images present a stretched histogram.
This implies that the operation minimizes the effects of
image noise caused by environmental problems such as dust
or dirt on camera lenses. Moreover, it reduces the
undesired information due to the combination of
non-uniform illumination in the image and the texture of
the product (Jamil, Bakar, Mohd, & Sembok 2004).
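
The effect of contrast enhancement on the histogram can be illustrated with a plain percentile-based histogram stretch. This is only a minimal sketch under stated assumptions: it is not the fuzzy associative memory of (Carnimeo & Giaquinto 2002), and the function name `stretch_histogram` and the percentile limits are hypothetical choices made for illustration.

```python
import numpy as np


def stretch_histogram(gray, low_pct=2.0, high_pct=98.0):
    """Linearly stretch an 8-bit image so its histogram spans 0..255.

    Gray levels below the `low_pct` percentile saturate to 0 and those
    above the `high_pct` percentile saturate to 255 (assumed limits).
    """
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to stretch.
        return gray.copy()
    scaled = (gray.astype(float) - lo) / (hi - lo)
    return np.clip(np.rint(scaled * 255.0), 0, 255).astype(np.uint8)
```

Note that the clipping at both ends is a simple instance of the saturation that accompanies strong contrast enhancement: distinct gray levels beyond the limits are mapped to the same extreme value.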

Image Matching Module 

In Figure 1 the block diagram corresponding to the Image
Matching module is reported. The target of this module
consists of finding the best match between the images
yielded by processing the acquired image and the
reference one. To this purpose the image $O_F$ is shifted
by one pixel in the four cardinal directions (NORTH,
SOUTH, EAST and WEST), using four space-invariant CNNs
(T. Roska, L. Kek, L. Nemes, A. Zarandy & P. Szolgay
1999) and obtaining the images $O_{FN}$, $O_{FS}$,
$O_{FE}$ and $O_{FW}$. Subsequently the switch $S_1$
changes its position, excluding the image $O_F$. The
reference image $R_F$ is subtracted from the images
$O_{FN}$, $O_{FS}$, $O_{FE}$, $O_{FW}$ and $O_F$; then
the numbers $b_N$, $b_S$, $b_E$, $b_W$ and $b_0$ of black
pixels in the resulting images $D_N$, $D_S$, $D_E$, $D_W$
and $D_0$ are computed. The image which best matches the
reference one presents the maximum number of black
pixels. Therefore, this value drives the switch $S_2$,
which feeds back the image that best matches the
reference one. In this way the image which presents the
minimum difference becomes the input for a successive
computational step. The processing is repeated until
$D_0$ presents the best match. When this condition is
satisfied, the difference image D between the
best-matching image and $R_F$ is computed. As can be
noticed, the operations needed for each directional shift
can be carried out simultaneously, reducing the
computational time at each step.
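
The iterative shift-and-compare procedure can be sketched in software as follows. This is a hedged illustration rather than the CNN circuit itself: the shifts are done with array slicing instead of space-invariant CNNs, the count of black pixels is approximated by the number of pixels whose absolute difference from the reference is within a tolerance `tol`, and the names `shift`, `count_matching` and `match_to_reference` are assumptions introduced here.

```python
import numpy as np

# (dy, dx) offsets for one-pixel NORTH, SOUTH, EAST and WEST shifts.
SHIFTS = ((-1, 0), (1, 0), (0, 1), (0, -1))


def shift(img, dy, dx, fill=0):
    """Shift img by (dy, dx) pixels, filling uncovered borders with `fill`."""
    out = np.full_like(img, fill)
    h, w = img.shape
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out


def count_matching(a, b, tol=0):
    """Number of pixels where a and b differ by at most `tol` gray levels."""
    return int(np.sum(np.abs(a.astype(int) - b.astype(int)) <= tol))


def match_to_reference(o_f, r_f, tol=0, max_iter=50):
    """Greedy one-pixel alignment of o_f to r_f; returns (aligned, difference)."""
    current = o_f
    for _ in range(max_iter):
        candidates = [current] + [shift(current, dy, dx) for dy, dx in SHIFTS]
        scores = [count_matching(c, r_f, tol) for c in candidates]
        best = int(np.argmax(scores))  # ties favour the unshifted image
        if best == 0:
            break  # the unshifted image already gives the best match
        current = candidates[best]
    difference = np.abs(current.astype(int) - r_f.astype(int))
    return current, difference
```

Because a shifted candidate is accepted only when its score is strictly higher than that of the unshifted image, the score increases at every iteration and the loop terminates. In the hardware architecture the four shifts are evaluated in parallel, whereas this sketch evaluates them sequentially.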

Defect Detection Module 

The third part of the suggested architecture is a Defect
Detection module. The subsystem is synthesized with the
aim of computing the output binary image F, in which only
the defects are present. This module is composed of the
sequence of a Major Voting circuit, a CNN associative
memory for contrast enhancement and a Threshold circuit.
The corresponding CNN-based implementation is obtained by
considering space-invariant networks.

In detail, the Major Voting circuit minimizes the number
of false detections caused by the presence of noise,
highlighting the dents or those flaws which lead to a
change in the reflectance of light in the original image.
The output of the Major Voting block, $D_M$, feeds a CNN
working as the associative memory described in the
previous Preprocessing Module subsection. This operation
provides an output image $D_{MF}$ whose histogram is
bimodal. In this kind of image the selection of a
threshold which highlights the defects is feasible: a
proper value is given by the mean of the modes of the
histogram. Then, this image is segmented by means of the
corresponding space-invariant CNN (T. Roska, L. Kek, L.
Nemes, A. Zarandy & P. Szolgay 1999), obtaining the
corresponding binary image F. In this way errors
corresponding to incorrect identification of defects are
minimized, because only flaws are visible after the
segmentation.
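
Two of the building blocks named above, the voting step and the threshold chosen as the mean of the histogram modes, can be sketched as follows. This is an illustrative software analogue under stated assumptions, not the CNN templates used in the chapter: the vote is taken over a 3x3 window of a 0/1 image, the two modes are located by taking the most frequent gray level in each half of the 8-bit range, and the names `majority_vote` and `bimodal_threshold` are hypothetical. The contrast-enhancement step between them is not reproduced.

```python
import numpy as np


def majority_vote(binary):
    """3x3 majority filter on a 0/1 image (nine votes, so no ties occur)."""
    h, w = binary.shape
    padded = np.pad(binary.astype(int), 1, mode="edge")
    votes = np.zeros((h, w), dtype=int)
    for dy in range(3):
        for dx in range(3):
            votes += padded[dy:dy + h, dx:dx + w]
    return (votes >= 5).astype(np.uint8)


def bimodal_threshold(gray):
    """Mean of the two histogram modes of an 8-bit image.

    Assumption: one mode lies in the dark half (0..127) and the other
    in the light half (128..255) of the gray-level range.
    """
    hist = np.bincount(gray.ravel(), minlength=256)
    dark_mode = int(np.argmax(hist[:128]))
    light_mode = 128 + int(np.argmax(hist[128:]))
    return (dark_mode + light_mode) / 2.0
```

A binary image can then be obtained with a comparison such as `gray > bimodal_threshold(gray)`. The fixed split at level 128 is the weakest assumption here and would need revisiting for histograms whose modes fall in the same half of the range.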

Figure 1. Block diagram corresponding to the Image Matching module

Numerical Examples

The capabilities of the designed CNN-based architecture
have been investigated on images representing the central
part of injection pumps containing the Region of Interest
(ROI), that is, the flange, such as the one reported in
Figure 2(a), whose histogram is shown in Figure 2(b). As
can be observed, this image presents two dents, on the
left and at the bottom of the observed region. Dents are
due to the collisions that can occur when pumps are moved
among the different production locations during the
various stages of assembly. This image and the reference
one are first processed by a circuit based on two
(4x4)-cell CNNs described in the previous subsection. In
Figures 3(a-b) the corresponding output image yielded by
the synthesized CNNs and its histogram are shown. It can
be noticed that contrast is highly enhanced and the
histogram is stretched.

In Figure 4(a) the output D of the Image Matching module
is reported: impulsive-like noise due to the shifting,
the imperfect lighting of the image or the reflection due
to dirt is still present at this step.

Finally, D feeds the Defect Detection module: in Figures
4(b) and 4(c) the output of the Major Voting block,
$D_M$, and the final image F are shown, respectively. As
can be observed in image $D_M$, the effects of irregular
lighting or changes in reflections due to dust or dirt
are minimized. The results are encouraging: the designed
cellular system provides an output image (see Figure
4(c)) in which the areas of the product with defects are
clearly visible and marked by white pixels.

The performance of the proposed system has been tested by
means of a second experiment carried out on a textile
sample. This industrial field has been investigated
because the time constraints of an automated defect
detection system in the textile industry are of crucial
importance (R. Stojanovic, P. Mitropulos, C. Koullamas,
Y. Karayiannis, S. Koubias & G. Papadopoulos 2001).

In Figure 5(a) the acquired image of the textile is
reported. In this case the whole image of the product
coincides with the ROI. It can be noted that a bright,
thin vertical bar appears in the middle of the image. It
corresponds to a missing stamp. In the reported example
the identification of defects constitutes a non-trivial
problem. In fact, the stamped areas have variegated
geometric shapes, which can be depicted with a great
number of different gray levels. The previously reported
method has been applied to detect this kind of defect,
yielding an image in which only the defect (the thin bar)
is represented, similarly to the test case reporting
dents on injection pumps. In Figures 5(b), 5(c) and 5(d)
the corresponding outputs of the Preprocessing module,
the Image Matching module and the Defect Detection module
are shown, respectively. It can be noticed that the
central defect has been isolated effectively, even if a
percentage of the areas to be identified is missed. This
is due to the fact that, when details need to be
detected, the contrast must be maximum. Nevertheless, as
the contrast is increased, the histogram of the resulting
image is pushed toward the extreme values of the gray
levels with respect to the acquired image. This implies
that, due to a saturation phenomenon, a loss of
information about details takes place (Brendel & Roska
2002).

Figure 2. (a) Acquired image containing the ROI (the flange of an injection pump) with two dents; (b) gray-scale
histogram of the image in Figure 2(a)

Figure 3. (a) Output image of the Fuzzy Contrast Enhancement module fed by the image in Figure 2(a); (b)
gray-scale histogram of the image in Figure 3(a)

Finally, in the output image small white areas are
misclassified as defects. This problem arises from the
shift of the product with respect to the acquisition
system. The Image Matching module minimizes the effects
of this problem, but it cannot remove them completely
when mechanical deformations of the product occur, as in
the textile field. As shown in Figure 5(d), this implies
the presence of false positives, which will be
investigated subsequently.

AVI of Surface Flaws on Manufactures II

Figure 4. (a) Output of the image matching module; (b) output of the major voting block in the defect detection
module; (c) output image containing the detected defects, represented by white pixels

Figure 5. (a) Acquired image of a textile containing a thin bar; (b) corresponding output of the fuzzy contrast
enhancement module; (c) output of the image matching module; (d) final output image



FUTURE TRENDS 

As can be inferred from the numerical results obtained,
future work will be devoted to a more detailed analysis
of misclassifications. In particular, false positives
could be analyzed by means of further techniques which
relate the characteristics of the possibly defective
zones to those of the zones containing actual defects,
according to the constraints of the application. For
instance, in the reported numerical examples the false
positives have geometric sizes which are negligible
compared with the areas of possible flaws. Therefore, a
check on area dimensions could make it possible to
discriminate between the two kinds of regions.
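
One simple way to realize such an area check in software is to label the connected white regions of the binary output and discard those below a minimum size. The following is a minimal sketch, assuming 4-connectivity and a hypothetical threshold `min_area`; neither is specified in the chapter, and the function name `remove_small_regions` is introduced here for illustration.

```python
from collections import deque


def remove_small_regions(mask, min_area):
    """Keep only 4-connected regions of 1s with at least `min_area` pixels.

    `mask` is a list of rows of 0/1 values; a new mask is returned.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill collecting one connected region.
            region, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_area:
                for y, x in region:
                    out[y][x] = 1
    return out
```

The value of `min_area` would have to be chosen from the constraints of the application, for example from the smallest flaw that must still be reported.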


CONCLUSION

In this paper a CNN-based architecture for the visual
inspection of surface flaws on manufactured products has
been proposed. The architecture consists of modules which
are entirely realized by well-established circuit
networks. The reported design approach offers some
interesting advantages. The proposed solution needs
neither a complex acquisition system nor feature
extraction: images are processed directly, and the
synthesis parameters, such as thresholds for image
segmentation, are evaluated automatically from the
statistical properties of the image. Furthermore, thanks
to the possible hardware implementation of CNNs, the
resulting system can satisfy the tight time constraints
of in-line detection in some industrial production
processes, allowing the diagnosis to be included in the
production steps.


KEY TERMS

Automated Visual Inspection: An automatic form 
of quality control normally achieved using one or more 
cameras connected to a processing unit. Automated 
Visual Inspection has been applied to a wide range of 
products. Its target consists of minimizing the effects 
of visual fatigue of human operators who perform the 
defect detection in a production line environment. 

Cellular Neural Networks: A particular circuit 
architecture which possesses some key features of 
Artificial Neural Networks. Its processing units are 
arranged in an MxN grid. The basic unit of Cellular 
Neural Networks is called cell and contains linear and 
non linear circuit elements. Each cell is connected only 
to its neighbour cells. The adjacent cells can interact 
directly with each other, whereas cells not directly 
connected together may affect each other indirectly 
because of the propagation effects of the continuous 
time dynamics. 

Fuzzy Associative Memory: A kind of content-addressable
memory in which the recall occurs correctly if input data
fall within a specified window consisting of an upper
bound and a lower bound of the stored patterns. A Fuzzy
Associative Memory is identified by a matrix of fuzzy
values. It maps an input fuzzy set into an output fuzzy
one.

Histogram Stretching: A point process that involves the
application of an appropriate transformation function to
every pixel of a digital image in order to redistribute
the information of the histogram toward the extremes of a
grey level range. The target of this operation consists
of enhancing the contrast of digital images.

Image Matching: Establishment of the corre- 
spondence between each pair of visible homologous 
image points on a given pair of images, aiming at the 
evaluation of novelties. 

Major Voting: An operation aiming at deciding 
whether the neighbourhood of a pixel in a digital 
image contains more black or white pixels, or their 
number is equal. This effect is realized in two steps. 
The first one gives rise to an image, where the sign 
of the rightmost pixel corresponds to the dominant 
colour. During the second step the grey levels of the 
rightmost pixels are driven into black or white values, 
depending on the dominant colour, or they are left 
unchanged otherwise. 

Real Time System: A system that must satisfy 
explicit bounded response time constraints to avoid 
failure. Equivalently, a real-time system is one whose 
logical correctness is based both on the correctness of 
the outputs and its timeliness. The timeliness constraints 
or deadlines are generally a reflection of the underlying 
physical process being controlled. 

Region of Interest: A selected subset of samples 
within a dataset identified for a particular purpose. In 
image processing, the Region of Interest is identified by 
the boundaries of an object. The encoding of a Region 
of Interest can be achieved by basing its choice on: 
(a) a value that may or may not be outside the normal 
range of occurring values; (b) purely separated graphic 
information, like drawing elements; (c) separated 
semantic information, such as a set of spatial and/or 
temporal coordinates.


INTRODUCTION

Since its seminal publication in 1988, the Cellular
Neural Network (CNN) (Chua & Yang, 1988) paradigm has
attracted the research community's attention, mainly
because of its ability to integrate complex computing
processes into compact, real-time programmable analogic
VLSI circuits (Rodriguez et al., 2004).

Unlike cellular automata, the CNN model hosts 
nonlinear processors which, from analogic array 
inputs, in continuous time, generate analogic array 
outputs using a simple, repetitive scheme controlled 
by just a few real-valued parameters. CNN is the core 
of the revolutionary Analogic Cellular Computer, a 
programmable system whose structure is the so-called 
CNN Universal Machine (CNN-UM) (Roska & Chua, 
1993). Analogic CNN computers mimic the anatomy 
and physiology of many sensory and processing organs 
with the additional capability of data and program stor- 
ing (Chua & Roska, 2002). 

This article reviews the main features of this Artifi- 
cial Neural Network (ANN) model and focuses on its 
outstanding and more exploited engineering applica- 
tion: Digital Image Processing (DIP). 



BACKGROUND 

In the following paragraphs, the parameters and structure
of the CNN are defined in order to clarify the practical
usage of the model in DIP.

The standard CNN architecture consists of an M x N
rectangular array of cells C(i, j) with Cartesian
coordinates (i, j), i = 1, 2, ..., M, j = 1, 2, ..., N.
Each cell or neuron C(i, j) is bound to a connected
neighbourhood or sphere of influence $S_r(i, j)$ of
positive integer radius r, which is the set of all
neighbouring cells satisfying the following property:



$$
S_r(i,j) = \left\{ C(k,l) \;\middle|\; \max\left(|k-i|,\,|l-j|\right) \le r,\quad 1 \le k \le M,\ 1 \le l \le N \right\} \tag{1}
$$



This set is sometimes referred to as a (2r + 1) x (2r + 1)
neighbourhood; e.g., for a 3 x 3 neighbourhood, r is 1.
Thus, the parameter r controls the connectivity of a
cell, i.e. the number of active synapses that connect the
cell with its immediate neighbours.
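
As a small illustration of how r determines the neighbourhood, the sketch below enumerates the sphere of influence of a cell using the 1-based coordinates of the text. The function name `sphere_of_influence` is an assumption made for this example.

```python
def sphere_of_influence(i, j, r, M, N):
    """Return the (k, l) coordinates of all cells in S_r(i, j).

    Coordinates are 1-based, as in the text: 1 <= k <= M, 1 <= l <= N.
    The cell (i, j) itself belongs to its own sphere of influence.
    """
    return [
        (k, l)
        for k in range(max(1, i - r), min(M, i + r) + 1)
        for l in range(max(1, j - r), min(N, j + r) + 1)
    ]
```

For an interior cell the list has (2r + 1) x (2r + 1) entries, while cells near the border of the array have fewer neighbours because the sphere is clipped to the grid.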

When r > N/2 and M = N, a fully connected CNN is
obtained, where every neuron is connected to every other
cell in the network and $S_r(i, j)$ is the entire array.
This extreme case corresponds to the classic Hopfield ANN
model (Chua & Roska, 2002).

The state equation of any cell C(i,j) in the M x N 
array structure of the standard CNN may be described 
mathematically by: 



$$
C\,\frac{dz_{ij}(t)}{dt} = -\frac{1}{R}\,z_{ij}(t) + \sum_{C(k,l)\in S_r(i,j)} \left[ A(i,j;k,l)\,y_{kl}(t) + B(i,j;k,l)\,x_{kl} \right] + I \tag{2}
$$



where C and R are values that control the transient
response of the neuron circuit (just like an RC filter,
typically set to unity for the sake of simplicity), I is
generally a constant value that biases or thresholds the
state matrix $Z = \{z_{ij}\}$, and $S_r$ is the local
neighbourhood of cell C(i, j) defined in (1), which
controls the influence of the input data
$X = \{x_{ij}\}$ and the network output $Y = \{y_{ij}\}$
for time t.

This means that both input and output planes interact
with the state of a cell through the definition of a set
of real-valued weights, A(i, j; k, l) and B(i, j; k, l),
whose size is determined by the neighbourhood radius r.
The matrices or cloning templates A and B are called the
feedback and feed-forward (or control) operators,
respectively.

A standard CNN is typically defined with constant
values for r, I, A and B, thus implying that for a fixed
input image X, a neuron C(i,j) is provided for each
pixel (i,j), with constant weighted circuits defined by
the feedback template A, which connects the cell with the
output plane Y, and by the control template B, which
connects the neuron to the neighbouring pixels of
input $x_{kl} \in X$. The value of the neuron state $z_{ij}$ is then
adjusted with the bias parameter I, and passed as input
to a piecewise-linear function in order to determine the
output value $y_{ij}$. This function may be expressed as

$$y_{ij}(t) = \frac{1}{2}\left( |z_{ij}(t) + 1| - |z_{ij}(t) - 1| \right) \qquad (3)$$



In the Image Processing context, a grey-scale
input image X can be represented pixel-wise using a
linear map between a pixel value (e.g. an 8-bit integer
luminance matrix with 256 grey-scale levels) and the
CNN input interval [-1, +1], where the lower limit
represents full luminance (i.e. white) and the
upper limit represents black pixels (Chua & Yang, 1988).
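The linear map and the output function (3) can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes the usual 8-bit luminance convention (0 = black, 255 = white), which the text does not state explicitly.

```python
def pixel_to_cnn(p):
    """Map an 8-bit luminance p (assumed 0 = black, 255 = white) to [-1, +1].

    White (255) maps to -1 and black (0) maps to +1, as described in the text.
    """
    return 1.0 - 2.0 * p / 255.0

def cnn_to_pixel(x):
    """Inverse map from the CNN interval [-1, +1] back to an 8-bit luminance."""
    return round((1.0 - x) * 255.0 / 2.0)

def output(z):
    """Piecewise-linear output function (3): identity on [-1, 1], saturated outside."""
    return 0.5 * (abs(z + 1.0) - abs(z - 1.0))

print(pixel_to_cnn(255), pixel_to_cnn(0))   # white and black limits
print(output(0.3), output(5.0), output(-2.0))
```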



BASIC CNN IMAGE PROCESSING 

The main application of the CNN model, due to its
convolution-like scheme, has been DIP modelling and
design. In the next subsections a number of basic DIP
approaches are introduced, underlining the importance
of the network parameters by giving illustrative examples
of application. Starting from the standard model
described in the previous section, the definition of the
standard isotropic CNN follows. Then, an example of
application in logic DIP is given in order to introduce
the nonlinear effects implied by the use of a non-zero
feedback template.

The Isotropic CNN Model 

For a still image, X will be invariant with time, and 
for video, X = X(t). In the most general case, r, A, B 
and I may vary with position and time, and the cloning 
templates are defined as nonlinear, with the possibility 
of integrating inhibitory signals for the state matrix 
and even nonlinear templates that interact with mixed 
input-output-state data (Chua & Roska, 2002). 

These possible extensions motivate the definition of a
special (and simpler) class of CNN, called isotropic
or space-invariant, in which r, A, B and I are fixed for
the whole network and where linear synaptic operators
are utilized.



In other words,

$$\sum_{C(k,l) \in S_r(i,j)} A(i,j;k,l)\,y_{kl} = \sum_{|k-i| \le r}\ \sum_{|l-j| \le r} A(i-k,\,j-l)\,y_{kl}$$

$$\sum_{C(k,l) \in S_r(i,j)} B(i,j;k,l)\,x_{kl} = \sum_{|k-i| \le r}\ \sum_{|l-j| \le r} B(i-k,\,j-l)\,x_{kl}$$

$$\text{and } I_{ij} = I. \qquad (4)$$



The vast majority of the templates defined in the 
template compendium of (Chua & Roska, 2002) for the 
CNN-UM are based on this isotropic scheme, using r 
= 1, and binary images in the input plane. 

If no feedback (i.e. A = 0) is used, then the CNN 
behaves as a convolution network, using B as a spatial 
filter, I as a threshold and the piecewise linear output 
(3) as a limiter or saturated output filter. In this way, 
virtually any spatial filter from DIP theory (Jain, 1989) 
can be implemented on such a feed-forward driven 
CNN, which ensures its output stability. 

For instance, the EDGE template defined by

$$A = 0, \quad B_{EDGE} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}, \quad I = -1 \qquad (5)$$



is designed to work correctly for binary inputs, giving
black (+1) output pixels in the input locations where a
black edge pixel exists (i.e. where a black pixel has at
least one white neighbour), and white (-1) pixels elsewhere.
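The behaviour of the feed-forward EDGE template can be checked with a short Python sketch. It relies on the fact that, for A = 0 and R = C = 1, the state equation (2) settles at z = (B applied to the input) + I, so the output is simply (3) applied to that value. The function names and the choice of a white (-1) boundary outside the image are assumptions of this example, not part of the original template definition.

```python
B_EDGE = [[-1, -1, -1],
          [-1,  8, -1],
          [-1, -1, -1]]
I_EDGE = -1.0

def sat(z):
    """Piecewise-linear output function (3)."""
    return 0.5 * (abs(z + 1.0) - abs(z - 1.0))

def feedforward_cnn(x, B, bias, boundary=-1.0):
    """Steady-state output of a CNN with A = 0 (assumed white cells outside the image)."""
    M, N = len(x), len(x[0])
    y = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            z = bias
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    inside = 0 <= k < M and 0 <= l < N
                    # EDGE is symmetric, so the index orientation of B does not matter here.
                    z += B[di + 1][dj + 1] * (x[k][l] if inside else boundary)
            y[i][j] = sat(z)
    return y

# 5 x 5 binary image: a 3 x 3 black (+1) square on a white (-1) background.
image = [[1.0 if 1 <= i <= 3 and 1 <= j <= 3 else -1.0 for j in range(5)]
         for i in range(5)]
edges = feedforward_cnn(image, B_EDGE, I_EDGE)
for row in edges:
    print(" ".join("#" if v > 0 else "." for v in row))
```

In this example only the centre of the square has no white neighbour, so it is the only black input pixel that turns white in the output.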

However, when a grey-scale input image is fed to 
this CNN, the output may not be a binary image. To 
solve this potential problem, the following modification 
is performed over the EDGE CNN: 



$$A = 2, \quad B = B_{EDGE}, \quad I = -0.5 \qquad (6)$$



The definition of a centre feedback absolute value
greater than 1 in (6) ensures a binary output and thus
output network stability. The B template used in these
CNNs is of the Laplacian type, having the important
property that all surrounding input synaptic weights are
inhibitory (i.e. negative) and identical, but the centre
synaptic weight is excitatory, and the average of all
input synaptic weights is zero.

Apart from edges, convex corners (i.e. black pixels
with at least five white neighbours) may also be detected
with the following modification of its parameters:



A = 2,  B = B_EDGE,  I = -…    (7)



This example illustrates the important role played
by the threshold parameter I. This parameter may be
viewed as a bias index that reallocates the origin $z_0$ of
the output function (3) (Fernandez et al., 2006).

Basic Logic Operators 

In order to perform pixel-wise logic operations between
two binary images $X_1$ and $X_2$, the initial state Z(0) of the
network is also utilized as a variable (Chua & Roska,
2002). In a standard feed-forward driven CNN, this variable
Z(0) is usually set to zero, but it can also be used
in order to obtain results valid for other applications,
such as motion detection and estimation (Torralba &
Herault, 1999).

For example, for a binary set union (logic OR), the 
following templates are defined: 

$$X = X_1, \quad Z(0) = X_2, \quad A = 3, \quad B = 3, \quad I = 2 \qquad (8)$$

whereas for set intersection (logic AND), these variables
are defined as

$$X = X_1, \quad Z(0) = X_2, \quad A = 1.5, \quad B = 1.5, \quad I = -1.5 \qquad (9)$$

Once again, the usage of excitatory feedback ensures 
output stability through the saturation output function 
(3), and the threshold properly biases the final result. 
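Since the templates in (8) and (9) use only the centre element, each cell evolves independently and the logic behaviour can be verified with a single-cell simulation. The sketch below integrates the state equation (2) with R = C = 1 by forward Euler; the function names, the step size and the number of steps are assumptions chosen for illustration, with black (+1) read as logical TRUE.

```python
def sat(z):
    """Piecewise-linear output function (3)."""
    return 0.5 * (abs(z + 1.0) - abs(z - 1.0))

def run_cell(x, z0, a, b, bias, dt=0.01, steps=2000):
    """Forward-Euler integration of dz/dt = -z + a*y(z) + b*x + bias, with z(0) = z0."""
    z = z0
    for _ in range(steps):
        z += dt * (-z + a * sat(z) + b * x + bias)
    return sat(z)

def cnn_or(x1, x2):
    """Template (8): input X = X1, initial state Z(0) = X2."""
    return run_cell(x1, x2, a=3.0, b=3.0, bias=2.0)

def cnn_and(x1, x2):
    """Template (9): input X = X1, initial state Z(0) = X2."""
    return run_cell(x1, x2, a=1.5, b=1.5, bias=-1.5)

for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        print(x1, x2, "OR:", cnn_or(x1, x2), "AND:", cnn_and(x1, x2))
```

In every case the cell ends in one of the two saturation regions, which is the effect of choosing a centre feedback value greater than 1.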

Feedback-Driven Standard CNN 

The feedback templates used in all the previously exem- 
plified CNN utilize (if any) only the central element of 
the template. A standard CNN with off-centre nonzero 
feedback elements is a special class that exhibits more 
complex dynamics than those treated so far (Chua & 
Roska, 1993). 

The use of a centre element in A, $a_{00} > 1$, means
that the output will be binary, i.e. the network output will
never be stable in the linear region of the saturation
function (3) (Chua & Roska, 2002). With this restriction,
if another element is set in the feedback template,
then two possible situations may occur: the activation
of cells in the opposite part of only one of the saturation
regions (partial inversion), or wave propagating
cell inversions in both binary states.

The first kind of these feedback-driven CNNs is
said to have the mono-activation property if cells in
only one saturated region can enter the linear region.
Thus, if cells can enter the linear region from the positive
saturation region, then those cells saturated in the
negative part must fulfil that the overall contribution
of A, B and I in their sphere of influence $S_r$ is less
than -1. That is,

$$w_{ij}(t) = \sum_{S_r(i,j)} \left[ a_{kl}\,y_{kl}(t) + b_{kl}\,x_{kl} \right] + I < -1 \qquad (10)$$

On the other hand, if cells enter the linear region only
from the negative saturation region, then the contribution
for positive stable cells must be $w_{ij}(t) > 1$. It can
be demonstrated that in a mono-activated CNN with
positive A coefficients, with $a_{00} > 1$ and saturated initial
values, all the cells that enter the linear region change
their state monotonically from (only) one saturated
area to the other, and therefore it is a stable nonlinear
network (Chua & Roska, 2002).

If, for instance, one element in A is negative,
the transient will not be monotonic, which does not
necessarily imply network instability. An example of
a non-monotonic but stable CNN is the Connected
Component Detector (CCD) (Matsumoto et al., 1990a),
whose templates (for the horizontal case) are the
following:

$$A_{CCD} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 2 & -1 \\ 0 & 0 & 0 \end{bmatrix}, \quad B = 0, \quad I = 0 \qquad (11)$$



For designing a unidirectional wave propagating
mono-activated CNN, a binary activation pattern is
defined, which will trigger the transient until output
stability is reached (Chua & Roska, 2002). An example
of this type of stable feedback-driven CNN is
the (horizontal) Shadow Detector (Matsumoto et al.,
1990b), whose parameters are:

A_shadow = [ 0 … 0 ; … 1 2 … ; … ],  B = 0,  I = …    (12)



FUTURE TRENDS 

There is a continuous quest by engineers and specialists:
to compete with and imitate nature, especially some
"smart" animals. Vision is one particular area in which
computer engineers are interested. In this context, the
so-called Bionic Eye (Werblin et al., 1995) embedded
in the CNN-UM architecture is ideal for implementing
many spatio-temporal neuromorphic models.

With its powerful image processing toolbox and
a compact VLSI implementation (Rodriguez et al.,
2004), the CNN-UM can be used to program or mimic
different models of retinas and even combinations of
them (Lazar et al., 2004). Moreover, it can combine
biologically based models, biologically inspired models,
and analogic artificial image processing algorithms.
This combination will surely bring a broader range of
applications and developments.



CONCLUSION 

A number of other advances in the definition and
characterization of CNN have been researched in the past
decade. These include the definition of methods for
designing and implementing larger than 3 x 3 neighbourhoods
in the CNN-UM (Kek & Zarandy, 1998),
the efficient implementation of halftoning techniques
(Crounse et al., 1993), the CNN implementation of some
image compression techniques (Venetianer et al., 1995)
and the design of a CNN-based Fast Fourier Transform
algorithm over analogic signals (Perko et al., 1998),
among many others. Some of them are also
described in this book in the article entitled Advanced
Cellular Neural Networks Image Processing.

In this article, a general review of the main properties
and features of the Cellular Neural Network model
has been presented, focusing on its DIP capabilities
from a basic viewpoint. The CNN is now a fundamental
and powerful toolkit for real-time nonlinear image
processing tasks, mainly due to its versatile programmability,
which has powered its hardware development
for visual sensing applications.

KEY TERMS

Artificial Neural Network (ANN) : A system made 
up of interconnecting artificial neurons or nodes (usually 
simplified neurons) which may share some properties 
of biological neural networks. They may either be used 
to gain an understanding of biological neural networks, 
or for solving traditional artificial intelligence tasks 
without necessarily attempting to model a real biological 
system. Well known examples of ANN are the Hopfield, 
Kohonen and Cellular (CNN) models. 

Feedback: The signal that is looped back to control 
a system within itself. When the output of the system 
is fed back as a part of the system input, it is called 
a feedback loop. A simple electronic device which 
is based on feedback is the electronic oscillator. The 
Phase-Locked Loop (PLL) is an example of complex 
feedback system. 



Neuromorphic: A term coined by Carver Mead
in the late 1980s to describe VLSI systems containing
electronic analogue circuits that mimic neuro-biological
architectures present in the nervous system. More
recently, its definition has been extended to include
analogue, digital and mixed-mode A/D VLSI systems
that implement models of neural systems, as well as
software algorithms.

Piecewise Linear Function: A function f(x) that
can be split into a number of linear segments, each of
which is defined for a non-overlapping interval of x.

Spatial Convolution: A term used to identify the 
linear combination of a series of discrete 2D data (a 
digital image) with a few coefficients or weights. In 
the Fourier theory, a convolution in space is equivalent 
to (spatial) frequency filtering. 

Template: Also known as kernel, or convolution 
kernel, is the set of coefficients used to perform a spa- 
tial filter operation over a digital image via the spatial 
convolution operator. 

Transient: In electronics, a transient is a
short-lived oscillation in a system caused by a sudden
change of voltage, current, or load. Transients are mostly
found as the result of the operation of switches. The
signal produced by the transient process is called the
transient signal or simply the transient. Also, the transient
of a dynamic system can be viewed as its path to
a stable final output.

VLSI: Acronym that stands for Very Large Scale
Integration. It is the process of creating integrated circuits
by combining thousands (nowadays hundreds of
millions) of transistor-based circuits into a single chip.
A typical VLSI device is the microprocessor.

INTRODUCTION

Numerical methods employed to convert
experimental data into interpretable images and spectra
commonly rely on straightforward transforms, such as
the Fourier transform (FT), or on more elaborate emerging
classes of transforms, like wavelets (Meyer, 1993;
Mallat, 2000), wedgelets (Donoho, 1996), ridgelets
(Candes, 1998), and so forth. Yet experimental data are
incomplete and noisy due to the limiting constraints of
digital data recording and the finite acquisition time.
The pitfall of most transforms is that imperfections in
the data are directly transferred into the transform domain
along with the signals of interest. The traditional approach to
data processing in the transform domain is to ignore any
imperfections in data, set to zero any unmeasured data
points, and then proceed as if data were perfect.

In contrast, the maximum entropy (ME) principle
needs to proceed from the frequency domain to the space
(time) domain. The ME techniques are used in data analysis
mostly to reconstruct positive distributions, such as images
and spectra, from blurred, noisy, and/or corrupted
data. The ME methods may be developed on axiomatic
foundations based on the probability calculus, which has a
special status as the only internally consistent language
of inference (Skilling, 1989; Daniell, 1994). Within its
framework, positive distributions ought to be assigned
probabilities derived from their entropy.

Bayesian statistics provides a unifying and self- 
consistent framework for data modeling. Bayesian 
modeling deals naturally with uncertainty in data 
explained by marginalization in predictions of other 
variables. Data overfitting and poor generalization are
alleviated by incorporating the principle of Occam's
razor, which controls model complexity and sets the
preference for simple models (MacKay, 1992). Bayesian
inference satisfies the likelihood principle (Berger,
1985) in the sense that inferences depend only on the 
probabilities assigned to data that were measured and 
not on the properties of some admissible data that had 
never been acquired. 



Artificial neural networks (ANNs) can be concep- 
tualized as highly flexible multivariate regression and 
multiclass classification non-linear models. However, 
over-flexible ANNs may discover non-existent correlations
in data. Bayesian decision theory provides a means
to infer how flexible a model the data warrant and
suppresses the tendency to infer spurious structure in
data. Any probabilistic treatment of images depends on
the knowledge of the point spread function (PSF) of 
the imaging equipment, and the assumptions on noise, 
image statistics, and prior knowledge. Contrarily, the 
neural approach only requires relevant training exam- 
ples where true scenes are known, irrespective of our 
inability or bias to express prior distributions. Trained
ANNs are a much faster means of image restoration,
especially in the case of strong implicit priors in the data,
nonlinearity, and nonstationarity. The most remarkable
work in Bayesian neural modeling was carried out by
MacKay (1992, 2003) and Neal (1994, 1996), who
set up the theoretical framework of Bayesian learning
for adaptive models.



BACKGROUND 

The Bayesian approach to image restoration is based on the
assumption that all of the relevant image information
may be stated in probabilistic terms and that prior
probabilities are known. The ME principle optimally sets
prior probabilities for positive additive distributions.
Yet Bayes' theorem and the ME principle share one
common feature: the updating of a state of knowledge. In
some cases, running Bayes' theorem in one hypothesis
space and applying the ME principle in another lead
to similar calculations.

Neuromorphic and Bayesian modeling may appar- 
ently look like extremes of the data modeling spectrum. 
ANNs are non-linear parallel computational devices 
endowed with gradient descent algorithms trained by 
example to solve prediction and classification problems. 
In contrast, Bayesian statistics is based on coherent
inference and clear axioms. Yet both approaches aim to
create models in agreement with data. Bayesian deci- 
sion theory provides intrinsic means to model ranking. 
Bayesian inference for ANNs can be implemented nu- 
merically by deterministic methods involving Gaussian 
approximations (MacKay, 1992), or by Monte-Carlo 
methods (Neal, 1996). Two features distinguish the 
Bayesian approach to learning models from data. First, 
beliefs derived from background knowledge are used to 
select a prior probability distribution for model param- 
eters. Secondly, predictions of future observations are 
performed by integrating the model's predictions with 
respect to the posterior parameter distribution obtained 
by updating this prior with new data. Both aspects are 
difficult in neural modeling: the prior over network 
parameters has no obvious relation to prior knowledge, 
and integration over the posterior is computationally 
demanding. The properties of priors can be elucidated 
by defining classes of prior distributions for net param- 
eters that reach sensible limits as the net size goes to 
infinity (Neal, 1994). The problem of integrating over 
the posterior can be solved using Markov chain Monte 
Carlo (Neal 1996). 

Bayesian Image Modeling 

The fundamental concept of Bayesian analysis is that
the plausibility of alternative hypotheses $\{H_i\}$ is
represented by probabilities $\{P_i\}_{i \in \mathbb{N}}$, and inference is
performed by evaluating these probabilities. Inference
may operate on various propositions related in neural
modeling to different paradigms. Bayes' theorem makes
no reference to any sample or hypothesis space, nor
does it determine the numerical value of any probability
directly from available information. As a prerequisite
to applying Bayes' theorem, a principle to cast available
information into numerical values is needed.

In statistical restoration of gray-level digital images,
the basic assumption is that there exists a scene
adequately represented by an orderly array of N pixels.
The task is to infer reliable statistical descriptions of images,
which are gray-scale digitized pictures stored
as an array of integers representing the intensity of gray
level in each pixel. Then the shape of any positive, additive
image can be directly identified with a probability
distribution. The image is conceived as an outcome of
a random vector $f = \{f_1, f_2, \ldots, f_N\}$, given in the form
of a positive, additive probability density function.
Likewise, the measured data $g = \{g_1, g_2, \ldots, g_M\}$ are
expressed in the form of a probability distribution (Fig.
1). A further assumption is that the image data are a linear
function of physical intensity, and that the errors (noise)
b are data independent, additive, and Gaussian with zero
mean and known standard deviation $\sigma_m$, m = 1, 2, ..., M
in each pixel. The concept of image entropy and the
alternative entropy expressions used in image restoration
are discussed by Gull and Skilling (1985). A brief
review of different approaches based on the ME principle,
as well as a full Bayesian approach for solving inverse
problems, is due to Djafari (1995).

Image models are derived on the basis of intuitive
ideas and observations of real images, and have
to comply with certain criteria of invariance, that is,
operations on images should not affect their likelihood.
Each model comprises a hypothesis H with some free
parameters $w = (\alpha, \beta, \ldots)$ that assign a probability
density $P(f \mid w, H)$ over the entire image space,
normalized to integrate to unity. Prior beliefs about the
validity of H before data acquisition are embedded in
P(H). Only extreme choices for P(H) may outweigh the
evidence $P(f \mid H)$; thus the plausibility $P(H \mid f)$ of
H is given essentially by the evidence $P(f \mid H)$ of the
image f. Consequently, objective means for comparing
various hypotheses exist.

Initially, the free parameters w are either unknown
or they are assigned very wide prior distributions. The
task is to search for the best-fit parameter set $w_{MP}$, which
has the largest likelihood given the image. Following
Bayes' theorem:

$$P(w \mid f, H) = \frac{P(f \mid w, H)\,P(w \mid H)}{P(f \mid H)} \qquad (1)$$



where $P(f \mid w, H)$ is the likelihood of the image f
given w, $P(w \mid H)$ is the prior distribution of w, and
$P(f \mid H)$ is the evidence for H. A prior $P(w \mid H)$
has to be assigned quite subjectively based on our
beliefs about images. Since $P(w \mid f, H)$ is normalized
to 1, the denominator in (1) ought to satisfy
$P(f \mid H) = \int P(f \mid w, H)\,P(w \mid H)\,dw$. The integrand
is often dominated by the likelihood at $w_{MP}$, so
that the evidence of H is approximated by the best-fit
likelihood $P(f \mid w_{MP}, H)$ times the Occam factor
(MacKay, 1992):

$$P(f \mid H) \approx P(f \mid w_{MP}, H)\,P(w_{MP} \mid H)\,\Delta w \qquad (2)$$

Figure 1. Flowchart summarizing the forward and inverse problems
[Figure labels: Experiments; Forward problem (theory of experiment); Inverse problem (data processing); Theoretical models]
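The approximation (2) can be checked numerically in one dimension. The sketch below is only illustrative: it assumes a single parameter w, a likelihood that is Gaussian in w with peak value 1 at a hypothetical best-fit value, a uniform prior on a hypothetical interval, and the usual Gaussian width $\Delta w = \sqrt{2\pi}\,s$ for the posterior accessible volume (an assumption taken from the standard Laplace approximation, not stated in the text).

```python
import math

W_MP, S = 1.0, 0.1      # hypothetical best-fit parameter and likelihood width
LO, HI = -5.0, 5.0      # hypothetical support of the uniform prior P(w | H)

def likelihood(w):
    """P(f | w, H) as a function of w, scaled so that its peak value is 1."""
    return math.exp(-0.5 * ((w - W_MP) / S) ** 2)

def evidence_exact(n=100_000):
    """Midpoint-rule integral of P(f | w, H) P(w | H) over the prior support."""
    prior = 1.0 / (HI - LO)
    h = (HI - LO) / n
    return sum(likelihood(LO + (k + 0.5) * h) for k in range(n)) * prior * h

def evidence_occam():
    """Right-hand side of (2): best-fit likelihood times prior density times delta_w."""
    delta_w = math.sqrt(2.0 * math.pi) * S
    return likelihood(W_MP) * (1.0 / (HI - LO)) * delta_w

print(evidence_exact(), evidence_occam())
```

With a likelihood this narrow compared with the prior range, the two values agree closely; widening the prior interval lowers both, which is the penalty on overly flexible models that the Occam factor expresses.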

Assuming uniform prior parameter distributions
$P(w \mid H)$ over all admissible parameter sets, then
$P(w_{MP} \mid H) = 1/\Delta^0 w$, and the evidence becomes:

$$P(f \mid H) \approx P(f \mid w_{MP}, H)\,\frac{\Delta w}{\Delta^0 w} \qquad (3)$$

The ratio $\Delta w / \Delta^0 w$ between the posterior accessible
volume of the model's parameter space and the prior accessible
volume prevents data overfitting by favoring simpler
models. Further, Bayes' theorem gives the probability
of H up to a constant:

$$P(H \mid f) \propto P(f \mid H)\,P(H) \qquad (4)$$

Maximum Entropy Methods

Applying the ME principle amounts to assigning a
distribution $\{P_1, P_2, \ldots, P_n\}$ on some hypothesis space
by the criterion that it shall maximize some form of
entropy subject to constraints that express properties
we wish the distribution to have, but are not sufficient
to determine it. The ME methods require specifying in
advance a definite hypothesis space which sets down
the possibilities to be taken into consideration. They
come out with a probability distribution, rather than a
probability. The ME probability of a single hypothesis
H that is not embedded in a space of alternative hypotheses
does not make any sense. The ME methods
do not require for input the numerical values of any
probabilities on that space, rather they assign numerical
values to available information as expressed by the
choice of hypothesis space and constraints.



LINEAR IMAGING EXPERIMENTS 

In the widely spread linear case, where the N-dimensional
image vector f consists of the pixel values of an
unobserved image, and the M-dimensional data vector
g is made of the pixel values of an observed image
supposed to be a degraded version of f, and assuming
zero-mean Gaussian additive errors:

$$g = Rf + b \qquad (5)$$



where the M x N matrix R stands for the PSF (transfer
function or instrumental response) of the imaging system;
then the likelihood of data is:

$$P(g \mid f, C, H) = \frac{1}{(2\pi)^{M/2}\,\det^{1/2} C}\,\exp\left[ -\frac{1}{2}\,(g - Rf)^T C^{-1} (g - Rf) \right] \qquad (6)$$



where C is the covariance matrix of the error vector
b. If there is no correlation among the pixels and each
pixel has the standard deviation $\sigma_m$, m = 1, 2, ..., M, then
the symmetric full rank covariance matrix becomes
diagonal with the elements $C_{mm} = \sigma_m^2$, m = 1, 2, ..., M.
Hence the probability of the data g given the image f
may be written as:

$$P(g \mid f, C, H) = \frac{1}{(2\pi)^{M/2} \prod_{m=1}^{M} \sigma_m}\,\exp\left[ -\frac{1}{2} \sum_{m=1}^{M} \frac{1}{\sigma_m^2} \left( g_m - \sum_{n=1}^{N} R_{mn} f_n \right)^2 \right] \qquad (7)$$
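In practice the logarithm of (7) is usually evaluated rather than the probability itself, to avoid numerical underflow for large M. A minimal pure-Python sketch follows; the function name `log_likelihood` and the list-based data layout are assumptions made for this example.

```python
import math

def log_likelihood(g, R, f, sigma):
    """log P(g | f, C, H) from (7), for a diagonal C with C_mm = sigma[m] ** 2.

    g: M measured values, R: M x N nested list (PSF matrix),
    f: N image values, sigma: M per-pixel noise standard deviations.
    """
    M = len(g)
    total = -0.5 * M * math.log(2.0 * math.pi)
    for m in range(M):
        predicted = sum(R[m][n] * f[n] for n in range(len(f)))
        total -= math.log(sigma[m])
        total -= 0.5 * ((g[m] - predicted) / sigma[m]) ** 2
    return total

# Noise-free data reproduced exactly by the image: only the normalization remains.
R_identity = [[1.0, 0.0], [0.0, 1.0]]
print(log_likelihood([1.0, 2.0], R_identity, [1.0, 2.0], [1.0, 1.0]))
```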



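The full joint posterior $P(f, \theta \mid g, H)$ of the image f
and the unknown PSF parameters denoted generically
by $\theta$ should be evaluated. Then the required inference
about the posterior probability $P(f \mid g, H)$ is obtained
as a marginal integral of this joint posterior over the
uncertainties in the PSF:

$$P(f \mid g, H) = \int P(f, \theta \mid g, H)\,d\theta = \int P(f \mid \theta, g, H)\,P(\theta \mid g, H)\,d\theta \qquad (8)$$

Now applying Bayes' theorem for the parameters
$\theta$:

$$P(\theta \mid g, H) = \frac{P(g \mid \theta, H)\,P(\theta \mid H)}{P(g \mid H)} \qquad (9)$$

and substituting in (8):

$$\int P(f, \theta \mid g, H)\,d\theta \propto \int P(f \mid \theta, g, H)\,P(g \mid \theta, H)\,P(\theta \mid H)\,d\theta \qquad (10)$$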



If the evidence $P(g \mid \theta, H)$ is sharply peaked around
some value $\hat{\theta}$ and the prior $P(\theta \mid H)$ is fairly flat in that
region, then $P(f \mid g, H) \approx P(f \mid \hat{\theta}, g, H)$. Otherwise, if
the marginal integrand is not well approximated at the
modal value of the evidence, then misleadingly narrow
posterior probability densities may result.

If the errors have uniform standard deviation $\sigma_b$, then
the symmetric covariance matrix has full rank M with
$C = \sigma_b^2 I$, and the probability of data (7) becomes:

$$P(g \mid f, \beta, H) = \frac{1}{Z_b(\beta)}\,\exp\left( -\beta\,E_b(g \mid f, H) \right) \qquad (11)$$

where $\beta = 1/\sigma_b^2$ is a measure of the noise in each
pixel,

$$E_b(g \mid f, H) = \frac{1}{2} \sum_{m=1}^{M} \left( g_m - \sum_{n=1}^{N} R_{mn} f_n \right)^2$$

is the error function, and $Z_b(\beta) = (2\pi/\beta)^{M/2}$ is the noise
partition function.

More complex models use the intrinsic correlation
function $C = [G G^T]^{-1}$, where G is a convolution from
an imaginary hidden image, which is uncorrelated, to
the real correlated image.

If the prior probability of the image f is also Gaussian:

$$P(f \mid F, H) = \frac{1}{(2\pi)^{N/2}\,\det^{1/2} F}\,\exp\left( -\frac{1}{2}\,f^T F^{-1} f \right) \qquad (12)$$

where F is the prior covariance matrix of f, and assuming
a uniform standard deviation $\sigma_f$ of the image, then
its prior probability distribution becomes:

$$P(f \mid \alpha, H) = \frac{1}{Z_f(\alpha)}\,\exp\left( -\alpha\,E_f(f \mid F) \right) \qquad (13)$$

where the parameter $\alpha = 1/\sigma_f^2$ measures the expected
smoothness of f, $Z_f(\alpha) = (2\pi/\alpha)^{N/2}$ is the partition
function of f, and

$$E_f(f \mid F) = \frac{1}{2}\,f^T F^{-1} f.$$

The posterior probability of image f given data g is 
derived from Bayes' theorem: 



$$P(f \mid g, \alpha, \beta, H) = \frac{P(g \mid f, \beta, H)\,P(f \mid \alpha, H)}{P(g \mid \alpha, \beta, H)} \qquad (14)$$



where the evidence $P(g \mid \alpha, \beta, H)$ is the normalizing
factor. Since the numerator in (14) is a product of
Gaussian functions of f, we may rewrite:

$$P(f \mid g, \alpha, \beta, H) = \frac{\exp(-\alpha E_f - \beta E_b)}{Z_M(\alpha, \beta)} = \frac{\exp(-M(f))}{Z_M(\alpha, \beta)} \qquad (15)$$

where

$$M(f) = \alpha E_f + \beta E_b \quad \text{and} \quad Z_M(\alpha, \beta) = \int \exp(-M(f))\,df$$

with the integral covering the space of all admissible images
in the partition function. Therefore, minimizing the
objective function M(f) corresponds to finding the most
probable image $f_{MP}$, which is the mean value of the
Gaussian posterior distribution. Its covariance matrix
$A^{-1}$, which defines the joint error bars on f, can be obtained
from the Hessian matrix $A = -\nabla\nabla \log P(f \mid g, \alpha, \beta, H)$
evaluated at $f_{MP}$. The image $f_{MP}$ is obtained by
differentiating $\log P(f \mid g, \alpha, \beta, H)$ and solving for the
derivative being zero:

$$f_{MP} = \left( R^T R + \frac{\alpha}{\beta}\,F^{-1} \right)^{-1} R^T g \qquad (16)$$

The term $\frac{\alpha}{\beta} F^{-1} = \frac{\sigma_b^2}{\sigma_f^2} F^{-1}$ regularizes the
ill-conditioned inversion. When this term is negligible, the
optimal linear filter $\left( R^T R + \frac{\alpha}{\beta} F^{-1} \right)^{-1} R^T$
equates to the pseudoinverse $R^{-1} = (R^T R)^{-1} R^T$.
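Setting the gradient of $M(f) = \alpha E_f + \beta E_b$ to zero gives $\alpha F^{-1} f - \beta R^T (g - Rf) = 0$, which is the linear system solved in the NumPy sketch below. Everything specific in it is an assumption chosen for illustration: the function name `restore_map`, the choice F = I, the 1-D test signal, the three-tap circulant blur and the noise level.

```python
import numpy as np

def restore_map(g, R, alpha, beta, F_inv=None):
    """Most probable image: solve (R^T R + (alpha / beta) F^{-1}) f = R^T g."""
    n_pixels = R.shape[1]
    if F_inv is None:
        F_inv = np.eye(n_pixels)          # assumed prior covariance F = I
    lhs = R.T @ R + (alpha / beta) * F_inv
    return np.linalg.solve(lhs, R.T @ g)

rng = np.random.default_rng(0)
N = 32
f_true = np.zeros(N)
f_true[10:20] = 1.0                       # hypothetical 1-D "scene"

# Hypothetical PSF: circulant three-tap blur with weights 0.2, 0.6, 0.2.
R = np.zeros((N, N))
for m in range(N):
    for shift, weight in zip((-1, 0, 1), (0.2, 0.6, 0.2)):
        R[m, (m + shift) % N] = weight

sigma_b = 0.05
g = R @ f_true + sigma_b * rng.normal(size=N)   # data model g = R f + b

alpha, beta = 1.0, 1.0 / sigma_b**2
f_mp = restore_map(g, R, alpha, beta)

# The gradient of M(f) should vanish at f_mp.
gradient = alpha * f_mp - beta * R.T @ (g - R @ f_mp)
print(np.max(np.abs(gradient)))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly. As the ratio alpha / beta tends to zero, the result approaches the pseudoinverse solution, which can be compared against `np.linalg.pinv(R) @ g`.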

Entropic Prior of Images 

Invoking the ME principle requires the prior
knowledge to be stated as a set of constraints on f,
though affecting the amount by which the image reconstruction
is offset from reality. The prior information
about f may be expressed as a probability distribution
(Djafari, 1995):

$$P(f \mid \alpha, H) = \frac{1}{Z(\alpha)}\,\exp\left( -\alpha\,\Phi(f) \right) \qquad (17)$$

where $\alpha$ is generally a positive parameter and $Z(\alpha)$ is
the normalizing factor. The entropic prior in the discrete
case may correspond to potential functions like:



$$\Phi(f) = \sum_{n=1}^{N} \frac{f_n}{U} \ln \frac{f_n}{U} \qquad (18)$$



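where U is the total number of quanta in the image f
(Mutihac et al., 1997).

The posterior probability of an image f drawn from
some measured data g is given by Bayes' theorem:

$$P(f \mid g, \alpha, C, H) \propto \exp\left( -\alpha\,\Phi(f) \right) \exp\left[ -\frac{1}{2} \sum_{m=1}^{M} \frac{1}{\sigma_m^2} \left( g_m - \sum_{n=1}^{N} R_{mn} f_n \right)^2 \right] \qquad (19)$$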

An estimation rule, such as posterior mean or
maximum a posteriori (MAP), is needed in order to
choose an optimal, unique, and stable solution $\hat{f}$
for the estimated image. The posterior probability is
assumed to summarize the full state of knowledge on
a given scene. Producing a single image as the best
restoration naturally leads to the most likely one, which
maximizes the posterior probability $P(f \mid g, \alpha, C, H)$,
along with some statement of reliability derived from
the spread of all admissible images.

In variational problems with linear constraints,
Agmon et al. (1979) showed that the potential function
associated with a positive, additive image is always
concave for any set of Lagrange multipliers, and it
possesses a unique minimum which coincides with
the solution of the nonlinear system of constraints. As a
prerequisite, the linear independence of the constraints
is checked and then the necessary and sufficient conditions
for a feasible solution are formulated. Wilczek
and Drapatz (1985) suggested the Newton-Raphson
iteration method as offering high accuracy results.
Ortega and Rheinboldt (1970) adopted a continuation
technique for the very few cases where Newton's
method fails to converge. These techniques are nevertheless
successful in practice for relatively small data
sets only and assume a symmetric positive definite
Hessian matrix of the potential function.

Quality Assessment of Image 
Restoration 

In all digital imaging systems, quality degradation is
inevitable due to various sources like photon shot noise,
finite acquisition time, readout noise, dark current noise,
and quantization noise. Some noise sources can be
effectively suppressed, yet some cannot. The combined
effect of these degradation sources is often modeled by
Gaussian additive noise (Pham et al., 2005).

In order to quantitatively estimate the restoration
quality in the case of similar size (M = N) for both the
measured g and the restored image $\hat{f}$, the mean energy
of restoration error:

$$D = \frac{1}{N} \sum_{n=1}^{N} \left[ g_n - \hat{f}_n \right]^2 \qquad (20)$$

may be used as a merit factor. Yet too high a value
for D may set the restored image quite away from
the original scene and raise questions on introducing
spurious features for which there is no clear evidence
in measurements and complicating the subsequent
inference and plausibility.

A more realistic degradation measure of image blurring
by additive noise is referred to in terms of a metric
called blurred signal-to-noise ratio, redefined here by
using the noise variance in each pixel such as:

$$BSNR = 10 \lg \left\{ \frac{1}{N} \sum_{n=1}^{N} \frac{\left[ y_n - \bar{y}_n \right]^2}{\sigma_n^2} \right\} \qquad (21)$$

where y = g - b is the difference between the measured
data g and the noise b.

In simulations, where the original image f of the
measured data g is available, the objectivity of testing
the performance of image restoration algorithms may
be assessed by the improvement of signal-to-noise ratio
metric defined as:

$$ISNR = 10 \lg \left\{ \frac{\sum_{n=1}^{N} \left[ f_n - g_n \right]^2}{\sum_{n=1}^{N} \left[ f_n - \hat{f}_n \right]^2} \right\} \qquad (22)$$

where $\hat{f}$ is the best statistical estimation of the correct
solution f.

While mean squared error metrics like ISNR do not
always reflect the perceptual properties of the human
visual system, they may provide an objective standard
by which to compare different image processing
techniques. Nevertheless, it is of major significance
that the behavior of the various algorithms be analyzed from
the point of view of ringing and noise amplification,
which can be a key indicator of improvement in quality
for subjective comparisons of restoration algorithms
(Banham and Katsaggelos, 1997).

FUTURE TRENDS


A practical Bayesian framework for neural-inspired
modeling aims to develop probabilistic models that fit
data and perform optimal predictions. The link between
Bayesian inference and neural models gives new
perspectives on the assumptions and approximations made
in ANNs when used as associative memories. Evolutionary
optimization algorithms capable of discovering the
global minimum (maximum) of a function are needed.

A statistically biased redefinition of the concept
of pattern existence, used in a quantitative manner to
assess the overall quality of digital images with
domain-specific relevance, would increase the accuracy
of ranking the image restoration methods.

An efficient MAP procedure has to be implemented
in a recursive neural net trained with supervision in
order to restore (reconstruct) the best image in compliance
with the existing constraints and with the measurement
and modeling errors.



CONCLUSION 



A major intrinsic difficulty in Bayesian image restoration
lies in determining a prior law for images.
The ME principle solves this problem in a self-consistent
way. The ME model for image deconvolution forces
the restored image to be positive. Spurious negative
areas and the complementary spurious positive areas are
removed, and the dynamic range of the restored image
is substantially enhanced.

Image restoration based on image entropy is effective
even in the presence of significant noise and of missing
or corrupted data. This is due to the appropriate
regularization of the inverse problem of image restoration,
introduced in a coherent way by the ME principle. It
satisfies all consistency requirements when combining
the prior knowledge and the information contained in
experimental data. A major result is that no artifacts
are added, since no structure is enforced by entropic
priors.

The Bayesian ME approach is a statistical method that
operates directly in the spatial domain, thus eliminating
the inherent errors arising from numerical forward and
inverse Fourier transforms and from the
truncation of signals.

KEY TERMS

Artificial Neural Networks (ANNs): Highly 
parallel nets of interconnected simple computational 
elements, which perform elementary operations like 
summing the incoming inputs (afferent signals) and 
amplifying/thresholding the sum. 

Bayesian Inference: An approach to statistics in 
which all forms of uncertainty are expressed in terms 
of probability. 



Deconvolution: An algorithmic method for elimi- 
nating noise and improving the resolution of digital 
data by reversing the effects of convolution on recorded 
data. 

Digital Image: A representation of a 2D/3D image 
as a finite set of digital values called pixels/voxels 
typically stored in computer memory as a raster image 
or raster map. 

Entropy: A measure of the uncertainty associated 
with a random variable. Entropy quantifies information 
in a piece of data. 

Image Restoration: A blurred image can be significantly
improved by deconvolving it with its PSF in such a way
that the result is a sharper and more detailed image.

Point Spread Function (PSF): The output of the 
imaging system for an input point source. 

Probabilistic Inference: An effective approach to
approximate reasoning and empirical learning in AI.

INTRODUCTION

The field of off-line optical character recognition (OCR)
has been a topic of intensive research for many years
(Bozinovic, 1989; Bunke, 2003; Plamondon, 2000;
Toselli, 2004). One of the first steps in the classical
architecture of a text recognizer is preprocessing,
where noise reduction and normalization take place.
Many systems do not require a binarization step, so the
images are maintained in gray-level quality. Document
enhancement not only influences the overall performance
of OCR systems, but it can also significantly
improve document readability for human readers. In
many cases, the noise of document images is heterogeneous,
and a technique fitted for one type of noise
may not be valid for the overall set of documents. One
possible solution to this problem is to use several filters
or techniques and to provide a classifier to select the
appropriate one.

Neural networks have been used for document
enhancement (see Egmont-Petersen, 2002, for a review
of image processing with neural networks). One
advantage of neural network filters for image enhancement
and denoising is that a different neural filter can
be automatically trained for each type of noise.

This work proposes the clustering of neural network
filters to avoid having to label training data and to
reduce the number of filters needed by the enhancement
system. An agglomerative hierarchical clustering
algorithm of supervised classifiers is proposed to do
this. The technique has been applied to filter out
typical office background noise (coffee stains and
footprints on documents, folded sheets with degraded
printed text, etc.).



BACKGROUND

Multilayer Perceptrons (MLPs) have been used in previ- 
ous works for image restoration: the input to the MLP 
is the pixels in a moving window, and the output is the 
restored value of the current pixel (Egmont-Petersen, 
2000; Hidalgo, 2005; Stubberud, 1995; Suzuki, 2003). 
We have also used neural network filters to estimate 
the gray level of one pixel at a time (Hidalgo, 2005): 
the input to the MLP consisted of a square of pixels 
that was centered at the pixel to be cleaned, and there 
were four output units to gain resolution (see Figure 
1). Given a set of noisy images and their corresponding 
clean counterparts, a neural network was trained. With 
the trained network, the entire image was cleaned by 
scanning all the pixels with the MLP. The MLP, therefore,
functions like a nonlinear convolution kernel. The
universal approximation property of an MLP guarantees
the capability of the neural network to approximate any
continuous mapping (Bishop, 1996).
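
The idea of an MLP acting as a nonlinear convolution kernel can be sketched as follows. This is a simplified illustration, not the filter of the cited works: the weights are random rather than trained, there is a single output unit instead of four, borders are handled by clamping coordinates, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_mlp(n_inputs, n_hidden, rng):
    """Random, untrained weights; the last entry of each row is the bias."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
          for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w1, w2


def mlp_forward(mlp, inputs):
    """One hidden layer of sigmoid units followed by one sigmoid output."""
    w1, w2 = mlp
    hidden = [sigmoid(row[-1] + sum(w * x for w, x in zip(row, inputs)))
              for row in w1]
    return sigmoid(w2[-1] + sum(w * h for w, h in zip(w2, hidden)))


def filter_image(image, mlp, radius=1):
    """Scan every pixel; the MLP sees the square window centred on it."""
    height, width = len(image), len(image[0])
    cleaned = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = []
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = min(max(r + dr, 0), height - 1)  # clamp at borders
                    cc = min(max(c + dc, 0), width - 1)
                    window.append(image[rr][cc])
            cleaned[r][c] = mlp_forward(mlp, window)
    return cleaned


noisy = [[0.0, 1.0, 0.0],
         [1.0, 1.0, 0.0]]
mlp = make_mlp(n_inputs=9, n_hidden=4, rng=random.Random(0))
cleaned = filter_image(noisy, mlp)
```

In a real system the weights would be obtained by training on pairs of noisy and clean images, as described above.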

This approach clearly outperforms classic
spatial filters for reducing or eliminating noise from
images (the mean filter, the median filter, and the
closing/opening filter (Gonzalez, 1993)) when applied to
enhance and clean images with homogeneous background
noise (Hidalgo, 2005).



Figure 1. An example of document enhancement with an artificial neural network. A cleaned image (right) is
obtained by scanning the entire noisy image (left) with the neural network.



BEHAVIOUR-BASED CLUSTERING OF
NEURAL NETWORKS

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is considered to
be a more convenient approach than other clustering
algorithms, mainly because it makes very few assumptions
about the data (Jain, 1999; Mollineda, 2000). Instead
of looking for a single partition (based on finding
a local minimum), this clustering algorithm constructs
a hierarchical structure by iteratively merging clusters
according to a certain dissimilarity measure, starting from
singletons until no further merging is possible (one
general cluster). The hierarchical clustering process
can be illustrated with a tree called a dendrogram,
which shows how the samples are merged and the
degree of dissimilarity of each union (see Figure 2).
The dendrogram can easily be cut at a given level
to obtain clusters of the desired cardinality or with a
specific dissimilarity measure. A general hierarchical
clustering algorithm can be informally described as
follows:

1. Initialization: M singletons as M clusters.

2. Compute the dissimilarity distances between
every pair of clusters.

3. Iterative process:

a) Determine the closest pair of clusters i and j.

b) Merge the two closest clusters into a new cluster
i+j.

c) Update the dissimilarity distances from the new
cluster i+j to all the other clusters.

d) If more than one cluster remains, go to step a).

4. Select the number N of clusters for a given criterion.
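
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the points are plain numbers, the dissimilarity between clusters is single linkage (one common choice, not prescribed by the text), distances are recomputed at every iteration instead of being updated incrementally, and merging stops as soon as N clusters remain, which gives the same result as building the full hierarchy and cutting it at N. The function names are hypothetical.

```python
def single_link(cluster_a, cluster_b):
    """Dissimilarity between clusters: distance of their closest members."""
    return min(abs(x - y) for x in cluster_a for y in cluster_b)


def agglomerative(points, dissimilarity, n_clusters):
    clusters = [[p] for p in points]            # step 1: M singletons
    merges = []                                 # record of the hierarchy
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):          # steps 2 and 3a: closest pair
            for j in range(i + 1, len(clusters)):
                d = dissimilarity(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]      # step 3b: merge i and j
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                 # step 3c: distances to the new
                                                # cluster are recomputed next loop
    return clusters, merges


clusters, merges = agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], single_link, 3)
print(sorted(sorted(c) for c in clusters))
```

The list `merges` holds the information a dendrogram would display: which clusters were joined and at what dissimilarity.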

Behaviour-Based Clustering of 
Supervised Classifiers 

When the points of the set to be clustered are supervised
classifiers, both a dissimilarity distance and the way to
merge two classifiers must be defined (see Figure 2):

1. The dissimilarity distance between two clusters
can be based on the behaviour of the classifiers
with respect to a validation dataset. The more
similar the output of two classifiers is, the closer
they are.

2. To merge the closest pair of clusters, a new classifier
is trained with the associated training data
of both clusters. Another possibility is to build
an ensemble of the two classifiers.

Figure 2. Behaviour-based clustering of supervised classifiers. An example of the dendrogram obtained for M = 5
points: A, B, C, D, E. If N = 3, three clusters are selected: A+B, C, D+E. In this work, to merge two clusters, a new
classifier is trained. For example, cluster D+E is trained with the data used to train the classifiers D and E.
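
A behaviour-based dissimilarity of this kind can be sketched as follows. The text does not specify the exact measure, so the mean squared difference between outputs used here is an assumption, and the three threshold "classifiers" are toy stand-ins for trained MLPs; all names are hypothetical.

```python
def behaviour_distance(classifier_a, classifier_b, validation_inputs):
    """Mean squared difference between the outputs of two classifiers
    on the same validation inputs (assumed measure, for illustration)."""
    total = 0.0
    for x in validation_inputs:
        total += (classifier_a(x) - classifier_b(x)) ** 2
    return total / len(validation_inputs)


def make_threshold_filter(threshold):
    """Toy stand-in for a trained filter: binarizes a gray level."""
    return lambda gray: 1.0 if gray > threshold else 0.0


filter_a = make_threshold_filter(0.50)
filter_b = make_threshold_filter(0.55)
filter_c = make_threshold_filter(0.90)

validation = [i / 10 for i in range(11)]   # gray levels 0.0, 0.1, ..., 1.0

d_ab = behaviour_distance(filter_a, filter_b, validation)
d_ac = behaviour_distance(filter_a, filter_c, validation)
print(d_ab, d_ac)
```

Classifiers that behave alike on the validation data obtain a small distance and are therefore merged first by the hierarchical clustering.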

An Application of Behaviour-based Clustering of 
MLPs to Document Enhancement 

In this work, MLPs are used as supervised classi- 
fiers. When two clusters are merged, a new MLP is 
trained with the associated training data of the two 
merged MLPs. 

This behaviour-based clustering algorithm has been
applied to enhance printed documents with typical
office noise (folded sheets, wrinkled sheets,
coffee stains, etc.). Figure 1 shows an example of a noisy
printed document (wrinkled sheet) from the corpus.

A set of MLPs is trained as neural filters for different
types of noise and then clustered into groups
to obtain a reduced set of neural clustered filters. In
order to automatically determine which clustered filter
is the most suitable to clean and enhance a real noisy
image, an image classifier is also trained using MLPs.
Experiments with this enhancement system
show excellent results in cleaning noisy documents
(Zamora-Martinez, 2007).



FUTURE TRENDS

Document enhancement is becoming more and more
relevant due to the huge number of scanned documents.
Besides, it not only influences the overall performance
of OCR systems, but it can also significantly improve
document readability for human readers.

The method proposed in this work can be improved
in two ways: by using ensembles of MLPs when two MLPs
are merged, and by improving the method to select
the neural clustered filter that is the most suitable to
enhance a given noisy image.



CONCLUSION

An agglomerative hierarchical clustering of
supervised-learning classifiers that uses a measure of similarity
among classifiers based on their behaviour on a
validation dataset has been proposed. As an application
of this clustering procedure, we have designed an
enhancement system for document images using neural
network filters. Both objective and subjective
evaluations of the cleaning method show excellent results
in cleaning noisy documents. This method could also
be used to clean and restore other types of images,
such as noisy backgrounds in scanned documents,
stained paper of historical documents, images for
vehicle license recognition, etc.

KEY TERMS

Artificial Neural Network: An artificial neural 
network (ANN), often just called a "neural network" 
(NN), is an interconnected group of artificial neurons 
that uses a mathematical model or computational model 
for information processing based on a connectionist 
approach to computation. 

Backpropagation Algorithm: A supervised 
learning technique used for training artificial neural 
networks. It was first described by Paul Werbos in 
1974, and further developed by David E. Rumelhart, 
Geoffrey E. Hinton and Ronald J. Williams in 1986. 
It is most useful for feed-forward networks (networks
that have no feedback, or simply, that have no
connections that loop).

Clustering: The classification of objects into dif- 
ferent groups, or more precisely, the partitioning of 
a data set into subsets (clusters), so that the data in 
each subset (ideally) share some common trait - often 
proximity according to some defined distance measure. 
Data clustering is a common technique for statistical 
data analysis, which is used in many fields, including 
machine learning, data mining, pattern recognition, 
image analysis and bioinformatics. 

Document Enhancement: Accentuation of certain 
desired features, which may facilitate later processing 
steps such as segmentation or object recognition. 

Hierarchical Agglomerative Clustering: Hierar- 
chical Clustering algorithms find successive clusters 
using previously established clusters. Agglomerative 
algorithms begin with each element as a separate cluster 
and merge them into successively larger clusters. 

Multilayer Perceptron (MLP): This class of
artificial neural networks consists of multiple layers of
computational units, usually interconnected in a
feed-forward way. Each neuron in one layer has directed
connections to the neurons of the subsequent layer. In
many applications the units of these networks apply a
sigmoid function as an activation function.

Optical Character Recognition (OCR): A type
of computer software designed to translate images of
handwritten or typewritten text (usually captured by
a scanner) into machine-editable text, or to translate
pictures of characters into a standard encoding scheme
representing them (e.g. ASCII or Unicode). OCR began
as a field of research in pattern recognition, artificial
intelligence and machine vision.

Supervised Learning: A machine learning tech- 
nique for creating a function from training data. The 
training data consist of pairs of input objects (typically 
vectors), and desired outputs. The output of the func- 
tion can be a continuous value (called regression), or 
can predict a class label of the input object (called 
classification). The task of the supervised learner is
to predict the value of the function for any valid input
object after having seen a number of training examples
(i.e. pairs of input and target output).

INTRODUCTION

Large worldwide projects like the Human Genome
Project, which in 2003 successfully concluded the
sequencing of the human genome, and the recently
completed HapMap Project, have opened new perspectives
in the study of complex multigene illnesses: they
have provided us with new information to tackle the
complex mechanisms and relationships between genes
and environmental factors that give rise to complex
illnesses (Lopez, 2004; Dominguez, 2006).

Thanks to these new genomic and proteomic data,
it becomes increasingly possible to develop new medicines
and therapies, establish early diagnoses, and even
discover new solutions for old problems. These tasks,
however, inevitably require the analysis, filtering, and
comparison of a large amount of data generated in a
laboratory with an enormous amount of data stored in
public databases, such as those of the NCBI and the EBI.

Computer science equips biomedicine with an
environment that simplifies our understanding of the
biological processes that take place at each and every
organizational level of living matter (molecular level,
genetic level, cell, tissue, organ, individual, and
population) and the intrinsic relationships between them.

Bioinformatics can be described as the application
of computational methods to biological discoveries
(Baldi, 1998). It is a multidisciplinary area that includes
computer science, biology, chemistry, mathematics,
and statistics. The three main tasks of bioinformatics
are the following: develop algorithms and mathematical
models to test the relationships between the members
of large biological datasets, analyze and interpret
heterogeneous data types, and implement tools that allow
the storage, retrieval, and management of large amounts
of biological data.



BACKGROUND

The following section describes some of the problems 
that are most commonly found in bioinformatics. 

Interpretation of Gene Expression 

Gene expression is the process by which the
information encoded in a gene is transformed into the
proteins necessary for the development and functioning
of the cell. In the course of this process, small
RNA sequences, called messenger RNA (mRNA), are
formed by transcription and subsequently translated
into proteins.

The amount of expressed mRNA can be measured
with various methods, such as gel electrophoresis, but
large numbers of simultaneous expression analyses are
usually carried out with microarrays (Quackenbush,
2001), which make it possible to measure the expression
of tens of thousands of genes simultaneously; such an
amount of data can only be analyzed with the help of
computational tools.

Among the most common tasks in this type of
analysis are finding the differences between, for
instance, a patient and a control, and determining whether
a gene is expressed or not. These tasks can be divided
into classical problems of classification and clustering.
In microarray experiments, clustering not only
identifies groups of genes with similar
expression, but also suggests functional relationships
between the members of a cluster.

Alignment of DNA, RNA, and Protein
Sequences

Sequence alignment consists in superposing two or
more sequences of either nucleotides (DNA and RNA)
or amino acids (proteins) in order to compare them and
analyze the sequence parts that are alike and unalike.
The optimal alignment is the one that shows the most
correspondences between the nucleotides or amino acids
and is therefore said to have the highest score. This
alignment may or may not have a biological meaning.
There are two types of alignment: the global alignment,
which maximizes the number of coincidences in the
entire sequence, and the local alignment, which looks
for similar regions in large sequences that are normally
highly divergent. The most commonly used technique
to implement alignments is dynamic programming
by means of the Smith-Waterman algorithm (Smith,
1981), which explores all the possible comparisons in
the sequences.
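
The dynamic programming idea behind local alignment can be sketched as follows. This minimal version only computes the best local alignment score (no traceback), uses a linear gap penalty, and its scoring values (match = 2, mismatch = -1, gap = -1) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                               # start a new local alignment
                H[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,               # gap in b
                H[i][j - 1] + gap,               # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Because every cell is clamped at zero, a poorly matching prefix never penalizes a well-matching region that follows it, which is what makes the alignment local rather than global.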

Another problem in sequence alignment is multiple
alignment (Wallace, 2005), which consists in aligning
three or more sequences of DNA, RNA, or proteins, and
is generally used to search for evolutionary relationships
between these sequences. The problem is equivalent
to that of pairwise sequence alignment, but takes into
consideration the n sequences that are to be compared.
The complexity of the algorithm increases exponentially
with the number of sequences to compare.

Identification of the Gene Regulatory 
Network 

All the information of a living organism's genome is
stored in each and every one of its cells. Whereas the
genome is used to synthesize information on all the
body cells, the regulatory network is in charge of guiding
the expression of a given set of genes in one cell
rather than another so as to form certain types of cells
(cellular differentiation) or carry out specific functions
related to spatial and temporal localization; in other
words, it makes the genes express themselves when
and where necessary. The role of a gene regulatory
network therefore consists in integrating the dynamic
behaviour of the cell and the external signals with the
environment of the cell, and in guiding the interaction
of all the cells so as to control the process of cellular
differentiation (Geard, 2004). Inferring this regulatory
network from the cellular expression data is considered
to be one of the most complex problems in bioinformatics
(Akutsu, 1999).

Construction of Phylogenetic Trees 

A phylogenetic tree (Setubal, 1999) is a tree that shows
the evolutionary relationships between various species
or individuals that are believed to have a common
ancestry. Whereas traditionally morphological
characteristics are used to carry out such analyses, in
the present case we will study molecular phylogenetic
trees, which use sequences of nucleotides or amino
acids for classification. The construction of these trees
is initially based on algorithms for multiple sequence
alignment, which allow us to classify the evolutionary
relationships between homologous genes present in
various species. In a second phase, we must calculate
the genetic distance between each pair of sequences in
order to represent them correctly in the tree.
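
The second phase, computing a distance for every pair of aligned sequences, can be sketched as follows. The text does not say which genetic distance is used; the proportion of differing positions (often called the p-distance) is assumed here as one simple option, and the sequences and function names are made up for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distances between aligned sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(sequences[i], sequences[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


aligned = ["ACGTACGT", "ACGTACGA", "TCGAACGA"]   # hypothetical aligned genes
matrix = distance_matrix(aligned)
for row in matrix:
    print(row)
```

A tree-building method would then take this matrix as input and place sequences with small distances close together in the tree.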

Gene Finding and Mapping 

Gene finding (Fickett, 1996) basically consists in
identifying genes in a DNA chain by recognizing the
sequence that initiates the coding of the gene, that is,
the gene promoter. When the protein that will interpret the
gene finds the sequence of that promoter, we know that
the next step is the recognition of the gene.

Gene mapping (Setubal, 1999) consists in creating
a genetic map by assigning genes to a position inside
the chromosome and by indicating the relative distance
between them. There are two types of mapping. Physical
or cytogenetic mapping, on the one hand, consists in
dividing the chromosome into small labelled fragments.
Once divided, they must be ordered and situated in their
correct position in the chromosome. Linkage mapping, on
the other hand, shows the position of some genes with
respect to others. The latter mapping type has two
drawbacks: it does not provide the distance between
the genes, and it is unable to provide the correct order
if the genes are very close to each other.

Prediction of DNA, RNA, and Protein 
Structure 

DNA and RNA sequences fold into a three-dimensional
structure that is determined by the order of
the nucleotides within the sequence. Under the same
environmental conditions, sequences with different
three-dimensional structures behave differently. Since
the secondary structure of the nucleic acids is a factor
that affects the binding of both DNA molecules and RNA
molecules, it is essential to know these structures in
order to analyze a sequence.

The prediction of the folds that determine the RNA
structure is an important factor in the understanding of
many biological processes, such as translation of
messenger RNA, replication of RNA chains in viruses,
and the function of structural RNA and RNA/protein
complexes.

The three-dimensional structure of proteins is extremely
diverse, ranging from completely fibrous to globular.
Predicting the folds of proteins is important, because a
protein's structure is closely related to its function. The
experimental determination of the protein structure
as such helps us to find the protein's function and
allows us to design synthetic proteins that can be used
as medicines.



BIO-INSPIRED ALGORITHMS 

The basic principle of bio-inspired algorithms is to 
use analogies with natural systems in order to solve 
problems. By simulating the behaviour of natural 
systems, these algorithms design heuristic, non-deter- 
ministic methods for searching, learning, behaviour, 
etc. (Forbes, 2004). 

Artificial Neural Networks 

Artificial neural networks (ANNs) (McCulloch, 1943;
Hertz, 1991; Bishop, 1995; Rumelhart, 1986) are
computational models inspired by the behaviour of
the nervous system. Even though their development is
based on the modelling of biological processes in the
brain, there are considerable differences between the
processing elements of ANNs and actual neurons.

ANNs consist of networks of interconnected units,
organized in layers, that evolve in the course
of time. The main features of these systems are the
following: self-organization and adaptability, which allow
robust and adaptive processing, adaptive training,
and self-organizing networks; non-linear processing,
which increases the network's capacity to approximate,
classify, and be immune to noise; and parallel processing,
which uses a large number of processing units with a
high level of interconnectivity.

ANNs can be classified according to their learning
type. In supervised learning neural networks, the
network learns relationships between the input and
output data. The input data are passed on to the input
layer and propagate through the network architecture
until they reach the output layer. The output obtained
in this output layer is compared to the expected output,
and subsequently the weights of the interconnections
are modified so as to minimize the error between the
obtained and the expected output. In unsupervised
learning networks, no expected output is passed on to
the network; instead,
the network itself searches for the differences between
the inputs and separates the data accordingly.
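
The supervised learning loop described above (compute the output, compare it with the expected output, adjust the weights to reduce the error) can be sketched for the simplest possible network: a single sigmoid unit trained on the logical OR function. The update rule, learning rate, and number of epochs are assumptions chosen for illustration, not values taken from the text.

```python
import math


def predict(weights, bias, inputs):
    """Output of a single sigmoid unit."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-activation))


def train_neuron(samples, epochs=2000, learning_rate=0.5):
    """Adjust weights after each sample in proportion to the output error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in samples:
            error = expected - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Training pairs (input, expected output) for the logical OR function.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(samples)
print([round(predict(weights, bias, x)) for x, _ in samples])
```

Multilayer networks follow the same principle, with the error propagated backwards through the layers to update every weight.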

Evolutionary Computation 

Evolutionary computation (Rechenberg, 1971; Holland,
1975) is a technique inspired by biological
evolution: genetic algorithms, for example,
use the biological mechanisms of cross-over, mutation, and
selection to solve search and optimization problems.
Each of these operators acts on one or more
chromosomes, i.e. possible solutions to the problem,
and generates another series of chromosomes, i.e. the
following generation of solutions. The algorithm is
executed iteratively and as such takes the population
through the generations until it finds an optimal
solution. Another strategy of evolutionary computation
is genetic programming (Koza, 1990), which uses the
same operators as the genetic algorithms to develop
the optimal program to solve a problem.
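
The interplay of selection, cross-over, and mutation can be sketched with a small genetic algorithm. Everything specific here is an assumption made for illustration: the toy problem (maximize the number of 1 bits in a chromosome), tournament selection, one-point cross-over, keeping the best individual unchanged (elitism), and a fixed number of generations instead of a convergence test.

```python
import random


def fitness(chromosome):
    """Toy objective: the number of 1 bits in the chromosome."""
    return sum(chromosome)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []                                   # best fitness per generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_generation = [population[0][:]]       # elitism: keep the best
        while len(next_generation) < pop_size:
            # Selection: the fittest of three random individuals is a parent.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Cross-over: splice the parents at a random cut point.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual is always carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next.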

Swarm Intelligence 

Swarm intelligence (Beni, 1989; Bonabeau, 2001;
Engelbrecht, 2005) is a recent family of bio-inspired
techniques based on the social or collective behaviour
of groups of insects such as ants and bees, which have
very limited capacities as individuals, but form groups
to carry out complex tasks.

Artificial Immune System

The artificial immune system (Farmer, 1986; Dasgupta,
1999) is a new computational paradigm that has
appeared in recent years and is based on the immune
system of vertebrates. The biological immune system
is a parallel and distributed adaptive system that uses
learning, memory, and associative retrieval to
solve problems of recognition and classification. In
particular, it learns to recognize patterns, remember
them, and use their combinations to build efficient
pattern detectors. From the point of view of information
processing, these interesting features are used
in the artificial immune system to successfully solve
complex problems.



CONCLUSION 

This article describes the main problems that are
presently found in the field of bioinformatics. It also
presents some of the bio-inspired computation techniques
that provide solutions for problems related to
classification, clustering, minimization, modelling, etc.
The following article will describe a series of techniques
that allow researchers to solve the above problems with
bio-inspired models.

KEY TERMS

Amino Acid: One of the 20 chemical building 
blocks that are joined by amide (peptide) linkages to 
form a polypeptide chain of a protein. 

Artificial Immune System: Biologically inspired 
computer algorithms that can be applied to various 
domains, including fault detection, function optimi- 
zation, and intrusion detection. Also called computer 
immune system. 



Electrophoresis: The use of an external electric
field to separate large biomolecules on the basis of
their charge by running them through acrylamide or
agarose gel.

Messenger RNA: The complementary copy of DNA 
formed from a single-stranded DNA template during 
the transcription that migrates from the nucleus to the 
cytoplasm where it is processed into a sequence carrying 
the information to code for a polypeptide domain. 

Microarray: A 2D array, typically on a glass, filter, 
or silicon wafer, upon which genes or gene fragments 
are deposited or synthesized in a predetermined spatial 
order allowing them to be made available as probes in 
a high-throughput, parallel manner. 

Nucleotide: A nucleic acid unit composed of a
five-carbon sugar joined to a phosphate group and a
nitrogenous base.

Swarm Intelligence: An artificial intelligence 
technique based on the study of collective behaviour 
in decentralised, self-organised systems. 

Transcription: The assembly of complementary 
single-stranded RNA on a DNA template. 

Translation: The process of converting RNA to
protein by the assembly of a polypeptide chain from
an mRNA molecule at the ribosome.

INTRODUCTION

Our previous article presented several computational
models inspired by biological models, such as neural
networks, evolutionary computation, swarm intelligence,
and the artificial immune system. It also explained
the most common problems in bioinformatics
to which these models can be applied.

The present article describes a series of approaches
to bioinformatics tasks that were developed by means
of artificial intelligence techniques, and focuses on
bio-inspired algorithms such as artificial neural networks
and evolutionary computation.



BACKGROUND 

Previous publications have focused on the use of
bio-inspired and other artificial intelligence techniques.
Keedwell (2005) has summarized the foundations of
molecular biology, the main problems in bioinformatics,
and the existing solutions based on artificial
intelligence. Baldi (2001) also describes various
techniques for problem-solving in bioinformatics.
Other general surveys of this subject can be found
in (Larranaga, 2006), whereas more specialized works
focus on solutions based on evolutionary computation
(Pal, 2006) or artificial life (Das, 2007).

Bio-Inspired Techniques 

The following section describes how the techniques that 
were mentioned in our article Bio-inspired Algorithms 
in Bioinformatics I have been used to solve the main 
problems in bioinformatics. 

Gene Expression 

We start by describing how artificial intelligence
techniques have contributed to the interpretation of
gene expression. Artificial neural networks (ANNs)
have been applied extensively to the classification of
genetic data. One of the most commonly used architectures
for the classification of this type of data is the
multilayer perceptron. Many works use this architecture
for diagnosis (Wang, 2006) (Wei, 2005) (Narayanan,
2004) and obtain very good results; most of these
approaches use artificial neural networks to discover
and classify interactions between variables (gene
expression values).

Statnikov (2005) and Lee (2005) compare several
classification techniques, such as ANNs using backpropagation,
probabilistic ANNs, Support Vector
Machines (SVM), K-Nearest Neighbour (KNN), and
other statistical methods, for the classification of data
obtained from microarray expression experiments. For this type
of gene expression data classification, we can also
find combinations of ANNs and genetic programming:
Ritchie (2004) encodes the architecture and weights
of the network in each individual of the genetic
algorithm (GA), so that the genetic programming optimizes
the network to minimize the error between the
output layer and the expected output. Other examples
are the hybrids of ANNs and genetic algorithms
of Kim (2004) and Keedwell (2005).

Genetic programming (GP) as such has also been 
used (Gilbert, 2000; Hong, 2004; Langdon, 2004; Hong, 
2006) to classify the results of an expression analysis. 
The advantage of GP is that it classifies the genes while
selecting the relevant ones (Muni, 2006). The training
set, which contains the expression data of patients and
controls, is the input for the GP algorithm, which evaluates
whether or not each example is a control. The result is one
classification rule or a set of them. The advantage of using GP
instead of other techniques such as SVM is that it is transparent:
the mechanism used to classify the examples of
the patients can be inspected (Driscoll, 2003).

Whereas the above studies all classify by means
of supervised learning, other expression analysis
methods perform clustering by means of
unsupervised learning. This type of analysis is
very useful to discover gene groups that are potentially
related to or associated with the illness. A comparison between
the most commonly applied methods, using both
real and simulated data, can be found in the works of
Thalamuthu (2006), Handl (2005), and Sheng (2005).
Even though these methods have provided good results
in certain cases (Spellman, 1998; Tamayo, 1999; Mavroudi,
2002), some of their inherent problems, such as
the identification of the number of clusters, the clustering
of the "outliers", and the complexity associated with
the large amount of data being analysed, often
complicate their use for expression analysis (Sherlock,
2001). These deficiencies were tackled in a series of
second-generation clustering algorithms, among them
self-organising trees (Herrero, 2001; Hsu, 2003).
Another interesting approach for expression analysis
is the use of the artificial immune system, which can
be observed in the work of Ando (2003), who
applies immune recognition to classification by making
the system select the most significant genes and
optimize their weights in order to obtain classification
rules. Finally, de Sousa, de Castro, and Bezerra apply
this technique to clustering (de Sousa, 2004)(de Castro,
2001)(Bezerra, 2003).

Sequence Alignment 

Solutions based on genetic algorithms, such as
SAGA (Notredame, 1996), RAGA, PRAGA
(Notredame, 1997, 2002), and others (O'Sullivan,
2004; Nguyen, 2002; Yokohama, 2001), have been applied
to sequence alignment since the very beginning.
The most common method consists of encoding the
alignments as individuals inside the genetic algorithm.
There are also hybrid solutions that use not only GAs
but also dynamic programming (Zhang, 1997, 1998);
and finally, there is the application of artificial life algorithms,
in particular the ant colony algorithm (Chen,
2006; Moss, 2003).
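
As a rough illustration of what encoding an alignment as an individual can mean, the sketch below represents a candidate alignment as a list of equal-length gapped strings and scores it with a simple sum-of-pairs function that a genetic algorithm could use as fitness. The scoring values and the name `sum_of_pairs` are assumptions made for this example; they are not taken from SAGA or from any of the other cited systems.

```python
from itertools import combinations

# Hypothetical scoring scheme, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def sum_of_pairs(alignment):
    """Score an alignment given as equal-length strings with '-' for gaps."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    score = 0
    for a, b in combinations(alignment, 2):      # every pair of sequences
        for ca, cb in zip(a, b):                 # every column of that pair
            if ca == "-" and cb == "-":
                continue                         # gap against gap is ignored
            if ca == "-" or cb == "-":
                score += GAP
            elif ca == cb:
                score += MATCH
            else:
                score += MISMATCH
    return score


# Two candidate individuals for the same pair of sequences (ACT and ACGT).
individual_a = ["AC-T", "ACGT"]
individual_b = ["ACT-", "ACGT"]
print(sum_of_pairs(individual_a), sum_of_pairs(individual_b))
```

Under this scoring, a genetic algorithm would prefer `individual_a`, since its gap placement leaves three matching columns instead of two.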

Genetic Networks 

In order to solve the problem of inferring genetic
networks, the structure of the regulatory network and
the interactions between the participating genes must
be predicted. The expression of the genes is regulated
by state transitions in which the expression levels
of the involved genes are updated simultaneously.



ANNs have been used to model these networks. 
Examples of such approaches can be found in the 
works of Krishna, Keedwell, and Narayanan (Keedwell, 
2003)(Krishna, 2005). 

Genetic algorithms (Ando, 2001)(Tominaga, 2001)
and hybrid ANN-genetic approaches (Keedwell, 2005)
have also been used for the same purpose.

Phylogenetic Trees 

Normally, exhaustive search techniques for the creation
of phylogenetic trees are computationally infeasible
for more than 10 comparisons, because the number
of possible solutions increases exponentially with the
number of objects in the comparisons. In order to optimize
these searches, researchers have used heuristics
based on genetic algorithms (Skourikhine, 2000)(Katoh, 
2001)(Lemmon, 2002) that allow the reconstruction of 
the optimal trees with less computational load. Other 
techniques, such as the ant colony algorithm, have also 
been used to reconstruct phylogenetic trees (Ando, 
2002)(Kummorkaew, 2004) (Perretto, 2005). 
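
To give a sense of the size of this search space, the sketch below counts the distinct unrooted binary trees on n labelled leaves using the standard combinatorial result (2n - 5)!!. Treating each compared object as one leaf of an unrooted binary tree is an assumption of this example; the function name is hypothetical.

```python
def unrooted_binary_trees(n_leaves):
    """Number of distinct unrooted binary trees with n labelled leaves: (2n - 5)!!"""
    if n_leaves < 3:
        return 1
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


for n in (5, 10, 20):
    print(n, unrooted_binary_trees(n))
# 10 leaves already give 2,027,025 candidate trees
```

The count grows faster than any fixed exponential, which is why heuristic search is preferred over exhaustive enumeration.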

Gene Finding and Mapping 

Gene mapping has been approached by methods that use
only genetic algorithms (Fickett, 1996)(Murao, 2002)
as well as by hybrid methods that combine genetic
algorithms and statistical techniques (Gaspin, 1997).
The problem of gene finding, and in particular
promoter finding, has been approached by means of
neural networks (Liu, 2006), neural networks optimized
with genetic algorithms (Knudsen, 1999), conventional
genetic algorithms (Kel, 1998)(Levitsky, 2003), and
fuzzy genetic algorithms (Jacob, 2005).

Structure Prediction 

The three-dimensional structure of DNA was predicted
with genetic algorithms (Beckers, 1997) by encoding
the torsional angles between the atoms of the DNA
molecule as solutions of the genetic algorithm. Another
approach was the development of hybrid strategies of
ANNs and GAs (Parbhane, 2000), in which the network
approximates the non-linear relations between the inputs
and outputs of the data set, and the genetic algorithm
searches the network input space to optimize
the output. In order to predict the secondary structure
of RNA, the system calculates the minimum free
energy of the structure for all the different combinations
of hydrogen bonds. There are approaches that use
genetic algorithms (Shapiro, 2001)(Wiese, 2003) and
artificial neural networks (Steeg, 1997).

Artificial neural networks have been applied to the
prediction of protein structures (Qian, 1988)(Sasagawa,
1992), and so have genetic algorithms. A compilation of
the applications of evolutionary computation to protein
structure prediction can be found in (Schulze-Kremer,
2000). Swarm intelligence, and ant colony optimization
in particular, has been applied to structure
prediction (Shmygelska, 2005)(Chu, 2005), as have artificial
immune systems (Nicosia, 2004)(Cutello, 2007).



CONCLUSION 

This article presents a compendium of the most recent 
references on the application of bio-inspired solutions 
such as evolutionary computation, artificial neural net- 
works, swarm intelligence, and artificial immune system 
to the most common problems in bioinformatics. 

KEY TERMS

Bioinformatics: The use of applied mathematics, 
informatics, statistics, and computer science to study 
biological systems. 

Gene Expression: The conversion of information 
from gene to protein via transcription and transla- 
tion. 

Gene Mapping: Any method used for determining
the location of genes and the relative distance between
them on a chromosome.

Gene Regulatory Network: Genes that regulate or
circumscribe the activity of other genes; specifically,
genes that code for proteins (repressors or activators)
that regulate the genetic transcription of the structural
genes and/or regulatory genes.

Phylogeny: The evolutionary relationships among 
organisms. The patterns of lineage branching produced 
by the true evolutionary history of the organism that 
is being considered. 

Sequence Alignment: The result of comparing two 
or more gene or protein sequences in order to determine 
their degree of base or amino acid similarity. Sequence
alignments are used to determine the similarity, homol- 
ogy, function, or other degrees of relatedness between 
two or more genes or gene products. 

Structure Prediction: Algorithms that predict the 
2D or 3D structure of proteins or DNA molecules from
their sequences. 

INTRODUCTION

An associative memory (AM) is a special kind of neural
network that allows recalling one output pattern given 
an input pattern as a key that might be altered by 
some kind of noise (additive, subtractive or mixed). 
Most of these models have several constraints that 
limit their applicability in complex problems such 
as face recognition (FR) and 3D object recognition 
(3DOR). 

Despite the power of these approaches, they cannot
reach their full potential without applying new mechanisms
based on current and future study of biological neural
networks. In this direction, we present
a brief summary of a new associative model
based on some neurobiological aspects of the human brain.
In addition, we describe how this dynamic
associative memory (DAM), combined with some
aspects of the infant vision system, could be applied to
solve some of the most important problems of pattern
recognition: FR and 3DOR.



BACKGROUND 

Humans possess several capabilities such as learning,
recognition and memorization. In the last 60 years,
scientists of different communities have been trying
to implement these capabilities in a computer. Over
these years, several approaches have emerged; one
common example is neural networks (McCulloch &
Pitts, 1943) (Hebb, 1949) (Rosenblatt, 1958). Since the
rebirth of neural networks, several models inspired by
neurobiological processes have emerged. Among these
models, perhaps the most popular is the feed-forward
multilayer perceptron trained with the back-propagation
algorithm (Rumelhart & McClelland, 1986). Other
neural models are associative memories, for example
(Anderson, 1972) (Hopfield, 1982) (Sussner, 2003)
(Sossa, Barron & Vazquez, 2004). On the other hand,
the brain is not a huge fixed neural network as had been
previously thought, but a dynamic, changing neural
network. In this direction, several models have emerged,
for example (Grossberg, 1967) (Hopfield, 1982).

In most of these classical neural network
approaches, synapses are only adjusted during the
training phase. After this phase, synapses are no longer
adjusted. Modern brain theory uses continuous-time
models based on the current study of biological neural
networks (Hecht-Nielsen, 2003). In this direction, the
next section describes a new dynamic model based on
some aspects of biological neural networks.

Dynamic Associative Memories (DAMs) 

The dynamic associative model is not an iterative model
like Hopfield's model. It emerges as an improvement of
the model and results presented in (Sossa, Barron &
Vazquez, 2007).

Let $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$ be an input and an output pattern,
respectively. An association between input pattern $\mathbf{x}$
and output pattern $\mathbf{y}$ is denoted as $(\mathbf{x}^k, \mathbf{y}^k)$, where $k$ is
the index of the corresponding association. An associative memory
$\mathbf{W}$ is represented by a matrix whose components $w_{ij}$
can be seen as the synapses of the neural network.

If $\mathbf{x}^k = \mathbf{y}^k \ \forall k = 1, \dots, p$ then $\mathbf{W}$ is auto-associative,
otherwise it is hetero-associative. A distorted version
of a pattern $\mathbf{x}$ to be recalled will be denoted as $\tilde{\mathbf{x}}$. If an
associative memory $\mathbf{W}$ is fed with a distorted version
of $\mathbf{x}^k$ and the output obtained is exactly $\mathbf{y}^k$, we say that
recalling is robust.

Because several regions of the brain interact
in the process of learning and recognition
(Laughlin & Sejnowski, 2003), several interacting areas
are defined in the dynamic model; the model also
integrates the capability to adjust synapses in response
to an input stimulus. Before the brain processes an input
pattern, it is hypothesized that the pattern is transformed
and codified by the brain. This process is simulated
using the procedure introduced in (Sossa, Barron &
Vazquez, 2004).

This procedure allows computing coded patterns
and de-coding patterns from input and output patterns
allocated in different interacting areas of the model.
In addition, a simplified version of $\mathbf{x}^k$, denoted by $s_k$, is
obtained as:

$$s_k = s(\mathbf{x}^k) = \operatorname{mid}\, \mathbf{x}^k \qquad (1)$$

where the mid operator is defined as $\operatorname{mid}\, \mathbf{x} = x_{(n+1)/2}$.

When the brain is stimulated by an input pattern,
some regions of the brain (interacting areas) are
stimulated and synapses belonging to these regions are
modified. In this model, the most excited interacting
area is called the active region (AR) and can be estimated
as follows:

$$ar = r(\mathbf{x}) = \arg\min_{i=1,\dots,p} \left| s(\mathbf{x}) - s_i \right| \qquad (2)$$



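Once the coded patterns, the de-coding patterns
and $s_k$ have been computed, we can build the associative memory.

Let $\{(\mathbf{x}^k, \mathbf{y}^k) \mid k = 1, \dots, p\}$, $\mathbf{x}^k \in \mathbb{R}^n$, $\mathbf{y}^k \in \mathbb{R}^m$, be a
fundamental set of associations (coded patterns).
Synapses of associative memory $\mathbf{W}$ are defined as:

$$w_{ij} = y_i - x_j \qquad (3)$$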



In short, building of the associative memory can be
performed in three stages:

1. Transform the fundamental set of associations into
coded and de-coding patterns.

2. Compute simplified versions of input patterns by 
using equation 1. 

3. Build W in terms of coded patterns by using 
equation 3. 
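
Stages 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration that assumes the patterns have already been coded (stage 1 is not reproduced here) and that, for an even-length pattern, the mid operator takes the component at position floor((n+1)/2); the function names and the sample values are hypothetical.

```python
def mid(x):
    """Middle component of a pattern: x_{(n+1)/2} in 1-based indexing (equation 1)."""
    return x[(len(x) + 1) // 2 - 1]


def build_memory(x, y):
    """Synapses w_ij = y_i - x_j (equation 3) for one coded association."""
    return [[yi - xj for xj in x] for yi in y]


x = [2, 7, 4, 9, 1]        # hypothetical coded input pattern, n = 5
y = [10, 20, 30]           # hypothetical coded output pattern, m = 3

s = mid(x)                 # simplified version of x (stage 2)
W = build_memory(x, y)     # m x n matrix of synapses (stage 3)

# Every entry of row i satisfies w_ij + x_j = y_i, so taking the middle
# component of that sum recovers y_i exactly.
recalled = [mid([w + xj for w, xj in zip(row, x)]) for row in W]
print(s, W[0], recalled)
```

The last line checks the property that makes recall possible: for the stored key, each row of the memory plus the key yields a constant vector equal to the corresponding output component.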

Some synapses can be drastically modified
without altering the behavior of the associative
memory. In contrast, other synapses can
only be slightly modified without altering the behavior
of the associative memory; we call this set of synapses
the kernel of the associative memory and denote it
by $\mathbf{K_W}$.

Let $\mathbf{K_W} \in \mathbb{R}^n$ be the kernel of an associative memory
$\mathbf{W}$. A component of vector $\mathbf{K_W}$ is defined as:

$$kw_i = \operatorname{mid}(w_{ij}), \quad j = 1, \dots, m \qquad (4)$$



Synapses that belong to $\mathbf{K_W}$ are modified as a
response to an input stimulus. Input patterns stimulate
some ARs, interact with these regions and then,
according to those interactions, the corresponding
synapses are modified. An adjusting factor denoted by
$\Delta w$ can be computed as:

$$\Delta w = \Delta(\mathbf{x}) = s(\mathbf{x}^{ar}) - s(\mathbf{x}) \qquad (5)$$

where $ar$ is the index of the AR.

Finally, synapses belonging to $\mathbf{K_W}$ are modified
as:

$$\mathbf{K_W} = \mathbf{K_W} \oplus (\Delta w - \Delta w_{old}) \qquad (6)$$

where operator $\oplus$ is defined as
$\mathbf{x} \oplus e = x_i + e \ \forall i = 1, \dots, m$.

Once synapses of the associative memory have
been modified in response to an input pattern, every
component of vector $\mathbf{y}$ can be recalled by using its
corresponding input vector $\mathbf{x}$ as:

$$y_i = \operatorname{mid}(w_{ij} + x_j), \quad j = 1, \dots, n \qquad (7)$$



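In short, pattern $\mathbf{y}^k$ can be recalled by using its
corresponding key vector $\mathbf{x}^k$ or its distorted version $\tilde{\mathbf{x}}^k$ in six stages:

1. Obtain the index of the active region $ar$ by using
equation 2.

2. Transform $\mathbf{x}^k$ using de-coding pattern $\hat{\mathbf{x}}^{ar}$ by applying
the following transformation: $\breve{\mathbf{x}}^k = \mathbf{x}^k + \hat{\mathbf{x}}^{ar}$.

3. Compute the adjusting factor $\Delta w = \Delta(\mathbf{x})$ by using
equation 5.

4. Modify the synapses of associative memory $\mathbf{W}$ that
belong to $\mathbf{K_W}$ by using equation 6.

5. Recall pattern $\breve{\mathbf{y}}^k$ by using equation 7.

6. Obtain $\mathbf{y}^k$ by transforming $\breve{\mathbf{y}}^k$ using de-coding pattern
$\hat{\mathbf{y}}^{ar}$ by applying the transformation: $\mathbf{y}^k = \breve{\mathbf{y}}^k - \hat{\mathbf{y}}^{ar}$.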
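
The six stages can be traced with a small Python sketch. Several points are assumptions of this example rather than details given in the text: the coding procedure of (Sossa, Barron & Vazquez, 2004) is not reproduced, so the coded patterns are simply supplied as constant shifts of one another; de-coding patterns are taken as coded minus original patterns; and the kernel is read as the middle column of W. The class name and all sample values are hypothetical.

```python
def mid_index(n):
    """0-based position of the middle component x_{(n+1)/2}."""
    return (n + 1) // 2 - 1


def mid(x):
    return x[mid_index(len(x))]


class DynamicAssociativeMemory:
    def __init__(self, X, Y, X_coded, Y_coded):
        self.X = X
        # ASSUMPTION: de-coding pattern = coded pattern - original pattern.
        self.X_dec = [[c - o for c, o in zip(xc, x)] for xc, x in zip(X_coded, X)]
        self.Y_dec = [[c - o for c, o in zip(yc, y)] for yc, y in zip(Y_coded, Y)]
        self.simplified = [mid(x) for x in X]                      # equation 1
        # Equation 3, built from the first coded association.
        self.W = [[yi - xj for xj in X_coded[0]] for yi in Y_coded[0]]
        self.dw_old = 0

    def recall(self, x):
        s = mid(x)
        ar = min(range(len(self.simplified)),
                 key=lambda i: abs(s - self.simplified[i]))        # stage 1, eq. 2
        x_t = [xi + di for xi, di in zip(x, self.X_dec[ar])]       # stage 2
        dw = mid(self.X[ar]) - s                                   # stage 3, eq. 5
        c = mid_index(len(x))
        for row in self.W:                                         # stage 4, eq. 6
            row[c] += dw - self.dw_old                             # kernel = middle column
        self.dw_old = dw
        y_t = [mid([w + xj for w, xj in zip(row, x_t)])
               for row in self.W]                                  # stage 5, eq. 7
        return [yi - di for yi, di in zip(y_t, self.Y_dec[ar])]    # stage 6


X = [[1, 2, 3], [6, 4, 8]]                 # original input patterns
Y = [[5, 5], [9, 7]]                       # original output patterns
X_coded = [[1, 2, 3], [11, 12, 13]]        # hypothetical coded inputs (shift of 10)
Y_coded = [[5, 5], [15, 15]]               # hypothetical coded outputs (same shift)

dam = DynamicAssociativeMemory(X, Y, X_coded, Y_coded)
print(dam.recall([6, 4, 8]))               # stored key
print(dam.recall([7, 5, 8]))               # key altered by additive noise
```

In this toy setting the noisy key is still mapped to the second stored output, because the active region is chosen correctly and the adjusting factor compensates for the change in the middle component.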

The formal set of propositions that support the
correct functioning of this dynamic model, the main 
advantages against other classical models and some 
interesting applications of this model are described 
in (Vazquez, Sossa & Garro, 2006) and (Vazquez & 
Sossa, 2007). 

In general, we distinguish two main parts in this
model: a part concerned with the determination of the
AR (PAR) and a part concerned with pattern recall (PPR).
PAR (the first step of the recall procedure) sends a signal to
PPR (the remaining steps of the recall procedure) that indicates
the region activated by the input pattern.



FACE AND 3D OBJECT RECOGNITION 
USING SOME ASPECTS OF THE 
INFANT VISION SYSTEM AND DAMS 

Several computationally expensive statistical techniques
(dimension reduction techniques), such as principal
component analysis (PCA) and factor analysis, have been
proposed for solving the FR and 3DOR problems.

Instead of using the complete version of the
describing pattern X of any face or object, a simplified
version of describing pattern X could be used to
recognize a face or an object. In many papers, authors
have used PCA to perform FR and other tasks; see,
for example, (Turk & Pentland, 1991).

During early developmental stages, there are
communication pathways between the visual and
other sensory areas of the cortex, showing how the
biological network is self-organizing. Within a few
months of birth, the baby is able to differentiate one
face or object (toy) from others. Barlow hypothesized
that for a neural system one possible way of capturing
the statistical structure was to remove the redundancy
in the sensory outputs (Barlow, 2001).

By taking into account the theory of Barlow, we 
propose a novel method for FR and 3DOR based on 
some biological aspects of infant vision. The biological 
hypotheses of this proposal are based on the role of 
the response to low frequencies at early stages, and 
some conjectures concerning how an infant detects 
subtle features (stimulating points (SP)) in a face 
or object (Mondloch et al., 1999; Acerra, Burnod, & 
Schonen, 2002). 

The proposal consists of several DAMs used to
recognize different images of faces and objects. As
infant vision responds to low frequencies of the signal,
a low-pass filter is first used to remove high frequency
components from the image. After that, we divide the
image into different parts (sub-patterns). Then, over each
sub-pattern, we detect subtle features by means of a
random selection of SPs. Preprocessing the images
to remove high frequencies and selecting SPs at random
contribute to eliminating redundant information
and help the DAMs to learn the faces or
the objects efficiently. Finally, each DAM is fed with these
sub-patterns for training and recognition.

Response to Low Frequencies 

Instead of using a filter that exactly simulates the infant
vision system behavior at any stage, we use a low-pass
filter to remove high frequencies. This kind of filter could
be seen as a rough approximation of the infant vision
system because it eliminates high frequency components
from the pattern, see Figure 1.
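
A minimal sketch of such a low-pass step, assuming an average (box) filter with an odd mask size and border pixels replicated at the image edges. The function name `average_filter` and the border handling are choices made for this illustration, not details given in the text.

```python
def average_filter(image, size=3):
    """Replace each pixel by the mean of the size x size window around it."""
    if size % 2 == 0:
        raise ValueError("mask size must be odd")
    h, w = len(image), len(image[0])
    r = size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii = min(max(i + di, 0), h - 1)   # replicate border rows
                    jj = min(max(j + dj, 0), w - 1)   # replicate border columns
                    total += image[ii][jj]
            out[i][j] = total / (size * size)
    return out


# A single bright pixel (a high-frequency detail) is spread out by the filter.
impulse = [[0, 0, 0],
           [0, 9, 0],
           [0, 0, 0]]
smoothed = average_filter(impulse, size=3)
print(smoothed[1][1])
```

Larger masks remove more detail, which matches the idea of associating different mask sizes with different stages of infant vision.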

Random Selection 

In the DAM model, the simplified version of an input
pattern is the middle value of the input pattern. In order
to simulate the random selection of the infant vision
system we have substituted the mid operator with the rand
operator, defined as follows:

$$\operatorname{rand}\, \mathbf{x} = x_{sp} \qquad (8)$$

where $sp = \operatorname{random}(n)$ is a random number between
zero and the length of the input pattern. $sp$ is a constant
value computed at the beginning of the building phase
and represents an SP. During the recalling phase $sp$ takes
the same value.

The rand operator uses a uniform random generator to
select a component over each part of the pattern. We
adopt this operator based on the hypothesis that
infants are interested in sets of features, where each
set is different, with some intersection among them. By
selecting features at random, we conjecture that we select
at least one feature belonging to these sets.
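
The rand operator can be sketched as follows. The important point is that the stimulating point is drawn once and then reused, so that building and recalling look at the same component. The function names, the seed and the sample pattern are assumptions of this example.

```python
import random


def choose_stimulating_point(n, rng):
    """Pick one position sp in [0, n) uniformly at random (done once, at build time)."""
    return rng.randrange(n)


def rand_operator(x, sp):
    """Equation 8: the simplified version of x is its component at position sp."""
    return x[sp]


rng = random.Random(0)                  # fixed seed only so the run is repeatable
pattern = [12, 40, 7, 33, 21, 5]
sp = choose_stimulating_point(len(pattern), rng)

s_build = rand_operator(pattern, sp)    # value used while building the memory
s_recall = rand_operator(pattern, sp)   # the same sp is reused during recall
print(sp, s_build, s_recall)
```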

Implementation of the Proposal 

Figure 1. Images filtered with masks of different size. Each group could be associated with different stages of
the infant vision system.

During recalling, each DAM recovers a part of the image
based on the AR of each DAM. However, a part
of the image could be wrongly recalled because its
corresponding AR could be wrongly determined, since
some patterns do not satisfy the propositions that guarantee
perfect recall. To avoid this, we use an integrator.
Each DAM determines an AR and sends the index of the AR
to the integrator; the integrator determines which
was the most voted region and sends to the DAMs the
index of the most voted region (the new AR).
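
The voting step performed by the integrator can be sketched as a simple majority count. How ties are resolved is not stated in the text; in this sketch the region that was proposed first wins a tie, which is an assumption of the example.

```python
from collections import Counter


def integrate(active_regions):
    """Return the AR index proposed by the largest number of DAMs."""
    votes = Counter(active_regions)
    return votes.most_common(1)[0][0]


# Five DAMs propose an active region; one of them is wrong.
proposed = [2, 2, 0, 2, 2]
new_ar = integrate(proposed)
print(new_ar)   # every DAM then continues the recall with this index
```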

Let $\mathbf{I}_x^k$ and $\mathbf{I}_y^k$ be an association of images
and $r$ be the number of DAMs. Building of the DAMs
is done as follows:

1. Select filter size and apply it to the images.

2. Transform the images into vectors $(\mathbf{x}^k, \mathbf{y}^k)$ by
means of the standard image scan method, where the
vectors are of size $a \times b$ and $c \times d$ respectively.

3. Decompose $\mathbf{x}^k$ and $\mathbf{y}^k$ into $r$ sub-patterns of the
same size.

4. Take each sub-pattern (from the first one to the last
one ($r$)), then take at random an SP $sp_i$, $i = 1, \dots, r$,
and extract the value at that position.

5. Train $r$ DAMs as in the building procedure, taking
each sub-pattern (from the first one to the last one
($r$)) and using the rand operator.
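
The scanning, decomposition and SP selection steps can be sketched as follows, assuming that the standard image scan reads the image row by row and that the vector length is divisible by r; both are assumptions of this example, as are the function names and the sample image.

```python
import random


def scan(image):
    """Row-by-row scan of a 2-D image (list of rows) into a flat vector."""
    return [pixel for row in image for pixel in row]


def decompose(vector, r):
    """Split a vector into r consecutive sub-patterns of equal size."""
    if len(vector) % r != 0:
        raise ValueError("vector length must be divisible by r")
    size = len(vector) // r
    return [vector[i * size:(i + 1) * size] for i in range(r)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
r = 3
sub_patterns = decompose(scan(image), r)

rng = random.Random(1)                                   # seed fixed for repeatability
sps = [rng.randrange(len(sub)) for sub in sub_patterns]  # one SP per sub-pattern
values = [sub[sp] for sub, sp in zip(sub_patterns, sps)]
print(sub_patterns, sps, values)
```

Each of the r sub-patterns, together with its stored SP, would then be handed to its own DAM for training.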



Pattern $\mathbf{I}_y^k$ can be recalled by using its corresponding
key image $\mathbf{I}_x^k$ or its distorted version $\tilde{\mathbf{I}}_x^k$ as follows:

1. Select filter size and apply it to the images.

2. Transform the images into a vector by means of
the standard image scan method and decompose
$\mathbf{x}^k$ into $r$ sub-patterns of the same size.

3. Use the SPs $sp_i$, $i = 1, \dots, r$, computed during the
building phase and extract the value of each sub-pattern.

4. Determine the most voted active region using the 
integrator. 

5. Substitute mid with rand operator in recalling 
procedure and apply steps from two to six as 
described in recalling procedure on each DAM. 

6. Finally, put together recalled sub-patterns to form 
the output pattern. 

A schematic representation of the building and 
recalling phases is shown in Figure 2. 

Some Experimental Results 

To test the accuracy of the proposal, we performed two
experiments. In experiment 1, we used a benchmark
(Spacek, 1996) of faces of 15 different people. In
experiment 2, we used a benchmark (Nene, 1996) of 100
objects. During the training process in both experiments,
the DAM performed with 100% accuracy using only
one image of each person and object. During testing,
the DAM performed on average with 99% accuracy
for the remaining 285 images of faces (experiment 1)
and 95% accuracy for the remaining 1900 images of
objects (experiment 2) by using filters of different sizes
and different SPs.

Figure 2. (a) Schematic representation of the building phase. (b) Schematic representation of the recalling phase.

Through several experiments we have tested the
accuracy and stability of the proposal using different
numbers of stimulating points, see Figure 3 and Figure
4. Because the SPs (pixels) were randomly selected, we
decided to test the stability of the proposal by running the
same configuration 20 times.

An extra experiment was performed with partially
occluded images. On average, the accuracy of the
proposal diminished to 80%.

While PCA dimension reduction techniques require
the covariance matrix to build an eigenspace and then
project patterns onto this space to eliminate redundant
information, our proposal only requires removing high
frequencies by using a filter and a random selection of
stimulating points.

This approach contributes to eliminating redundant
information; it is less computationally expensive than
PCA, and helps the DAMs or other classification tools
to learn the faces or objects efficiently.



FUTURE TRENDS 

Preprocessing the images to remove high frequencies
and selecting SPs at random contribute to eliminating
unnecessary information and help the DAM to learn
faces and objects efficiently. Now we need to study
new mechanisms based on evolutionary techniques in
order to select the most important SPs. In addition, we
need to test different types of filters that really simulate
the behavior of the infant vision system.

In the near future, we intend to use this proposal
as a biological model to explain the learning process
in the infant's brain for FR and 3DOR. One step in this
direction can be found in (Vazquez & Sossa, 2007).



CONCLUSION 

In this paper, we have proposed a novel method for FR
and 3DOR based on some biological aspects of infant
vision. We have shown that by applying some aspects
of the infant vision system it is possible to enhance
the performance of an associative memory (or other
distance classifiers) and make possible its application
to complex problems such as FR and 3DOR.

Figure 3. Accuracy of the proposal using different filter sizes. The accuracy of the proposal
diminishes after applying a filter of size greater than 25.

Figure 4. Average accuracy of the proposal. Maximum, average and minimum accuracy are sketched.

In order to recognize different images of faces or
objects we have used several DAMs. As infant vision
responds to low frequencies of the signal, a low-pass filter
is first used to remove high frequency components
from the image. Then we detected subtle features in the
image by means of a random selection of SPs. Finally,
each DAM was fed with this information for training
and recognition.



Through several experiments, we have shown the
accuracy and the stability of the proposal even under
occlusions. On average, the accuracy of the proposal
oscillates between 95% and 99%.

The results obtained with the proposal were
comparable with those obtained by means of a PCA-based
method (99%). Although PCA is a powerful
technique, it consumes a lot of time to reduce the
dimensionality of the data. Our proposal, because of
its simplicity in operations, is not a computationally
expensive technique and the results obtained are
comparable to those provided by PCA.

KEY TERMS

Associative Memory: Mathematical device spe- 
cially designed to recall output patterns from input 
patterns that might be altered by noise. 

Dynamic Associative Memory: A special type of
associative memory composed of dynamical synapses.
This memory adjusts the values of its synapses during
the recalling phase in response to input stimuli.

Dynamical Synapses: Synapses that modify their
values in response to an input stimulus, also during
recalling phases.

Low-Pass Filter: Filter which removes high frequencies
from an image or signal. This type of filter
is used to simulate the infant vision system at early
stages. Examples of these filters are the average filter
or the median filter.

PCA: Principal component analysis is a technique
used to reduce multidimensional data sets to lower
dimensions for analysis. PCA involves the computation
of the eigenvalue decomposition of the covariance matrix
of a data set, usually after mean centering the data for
each attribute.



Random Selection: Selection of one or more components
of a vector at random. Random selection
techniques are used to reduce multidimensional
data sets to lower dimensions for analysis.

Stimulating Points: Characteristic points of an 
object in an image used during learning and recog- 
nition, which capture the attention of a child. These 
stimulating points are used to train the dynamic as- 
sociative memory. 

INTRODUCTION

The knowledge about higher brain centres in insects
and how they affect the insect's behaviour has increased
significantly in recent years through theoretical
and experimental investigations. Nowadays, a large
body of evidence suggests that higher brain centres of
insects are important for learning and for short-term and
long-term memory, and that they play an important role in
context generalisation (Bazhenof et al., 2001). Related to these
subjects, one of the most interesting goals to achieve
would be to understand the relationship between sequential
memory encoding processes and the higher
brain centres in insects in order to develop a general
"insect-brain" control architecture to be implemented
on simple robots. This contribution reviews
the most important recent results related
to spatio-temporal coding and suggests the possibility
of using continuous recurrent neural networks
(CRNNs) (which can be used to model non-linear systems,
in particular Lotka-Volterra systems) in order to find
a way to model simple cognitive systems from
an abstract viewpoint. After showing the typical and
interesting behaviors that emerge in appropriate Lotka-Volterra
systems (in particular, winnerless competition
processes), the next sections briefly discuss
intelligent systems inspired by studies coming
from biology.



BACKGROUND 

What do we call "computation"? Let us say a system
shows the capability to compute if it has memory
(or some form of internal plasticity) and it is able to
determine the appropriate decision (or behavior, or
action) given a criterion, making calculations using
what it senses from the outside world. Some biological
systems, like several insects, have brains that show a
type of computation that may be described functionally
by a specific type of non-linear dynamical system
called Lotka-Volterra systems (Rabinovich et al., 2000).
According to our objectives, one of the first interests
is how an artificial recurrent neural network
could model a non-linear system, in particular a Lotka-Volterra
system (Afraimovich et al., 2004), and what
the typical processes that emerge in Lotka-Volterra
systems are (Rabinovich et al., 2000). If this could be
understood, then the relationships
between sequential memory encoding processes and
the higher brain centres in insects would become clearer.

Regarding higher brain centers (and how they affect an
insect's behaviour), it is possible to stop the functioning
of particular neurons under investigation during phases
of experiments and gradually reestablish the functioning
of the neural circuit (Gerber et al., 2004). At present,
it is known that higher brain centers in insects are
related to autonomous navigation, multi-modal sensory
integration, and an insect's behavioral complexity
in general; evidence also suggests an important role
in context generalization and in short-term and long-term
memory (McGuire et al., 2001). For a long time, insects
have inspired robotic research in a qualitative way, but
insect nervous systems have been under-exploited as
a source for potential robot control architectures. In
particular, it often seems to be assumed that insects only
perform 'reactive' behavior, and that more complex control
will need to be modeled on 'higher' animals.

SPATIO-TEMPORAL NEURAL CODING
GENERATOR 

The ability to process sequential information has long
been seen as one of the most important functions of
"intelligent" systems (Huerta et al., 2004). As will be
shown below, the winnerless competition principle
appears as a major mechanism of sequential memory
processing. The underlying concept is that sequential
memory can be encoded in a (multidimensional) dynamical
system by means of heteroclinic trajectories connecting
several saddle points. Each of the saddle points is
assumed to be remembered for further action (Afraimovich
et al., 2004).

Computation over Neural Networks 

Digital computers are considered universal in the sense
that they can implement any symbolic algorithm. If
artificial neural networks, which have a great influence
on the field of computation, are considered as a
paradigm of computation, one may ask how neural networks
relate to the classical computing paradigm. To answer
this question it is necessary to consider, on the one
hand, discrete (digital) computation and, on the other
hand, non-discrete (analog) computation. For the former,
the traditional paradigm is the Turing machine with the
Von Neumann architecture. A decade ago it was shown that
artificial neural networks of analog neurons and
rational weights are computationally equivalent to
Turing machines. For analog computation, it was also
shown that three-layer feedforward nets can approximate
any smooth function with arbitrary precision (Hornik et
al., 1990). This result was extended to show how
continuous recurrent neural nets (CRNN) can approximate
an arbitrary dynamical system given by a system of n
coupled first-order differential equations (Tsung, 1994;
Chow and Li, 2000).

Neural Network Computation from a 
Dynamical-System Viewpoint 

Modern dynamical systems theory is concerned with the
qualitative understanding of the asymptotic behaviors of
systems that evolve in time. With complex non-linear
systems, defined by coupled differential, difference or
functional equations, it is often impossible to obtain
closed-form (or asymptotically closed-form) solutions.
Even if such solutions are obtained, their functional
forms are usually too complicated to give an
understanding of the overall behavior of the system. In
such situations, qualitative analysis of the limit sets
(fixed points, cycles or chaos) of the system can often
offer better insights. Qualitative means that this type
of analysis is not concerned with quantitative changes
but rather with what the limiting behavior will be
(Tsung, 1994).

Spatio-Temporal Neural Coding and 
Winnerless Competition Networks 

It is important to understand how information is
processed by computation from a dynamical viewpoint (in
terms of steady states, limit cycles and strange
attractors), because it gives us the possibility of
managing sequential processes (Freeman, 1990). This
section presents a new direction in information
dynamics, namely Winnerless Competition (WLC) behavior.
The main point of this principle is the transformation
of the incoming spatial inputs into identity-temporal
output based on the intrinsic switching dynamics of a
dynamical system. In the presence of stimuli, the
sequence of the switching, whose geometrical image in
the phase space is a heteroclinic contour, depends
uniquely on the incoming information.

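Consider the generalized Lotka-Volterra system
(N=3):

$$
\begin{aligned}
\dot{a}_1 &= a_1 \left[ 1 - \left( a_1 + \rho_{12} a_2 + \rho_{13} a_3 \right) \right] \\
\dot{a}_2 &= a_2 \left[ 1 - \left( a_2 + \rho_{21} a_1 + \rho_{23} a_3 \right) \right] \\
\dot{a}_3 &= a_3 \left[ 1 - \left( a_3 + \rho_{31} a_1 + \rho_{32} a_2 \right) \right]
\end{aligned}
$$

If the following matrix and parameter conditions
are satisfied,

$$
(\rho_{ij}) =
\begin{pmatrix}
1 & \alpha_1 & \beta_1 \\
\beta_2 & 1 & \alpha_2 \\
\alpha_3 & \beta_3 & 1
\end{pmatrix},
\qquad 0 < \alpha_i < 1 < \beta_i ,
$$

then, when the coefficients fulfill
$\alpha_1 = \alpha_2 = \alpha_3 < 1$ and
$\beta_1 = \beta_2 = \beta_3 > 1$, we have three cases: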

1. Stable equilibrium with all three components
simultaneously present/working.

2. Three equilibria (1,0,0), (0,1,0) and (0,0,1), all
stable, each one attainable depending on initial
conditions.

3. Neither equilibrium points nor periodic solutions
are asymptotically stable, and we have wandering
trajectories defining Winnerless Competition
(WLC) behaviour.

Figure 1. Topology of the behavior in the phase space of a WLC neuron net in 3D. The axes represent the addresses of the oscillatory connections between the processing units (α = 0.5, β = 1.8, x(0) = 1.01, y(0) = 1.01, z(0) = 0.01).
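
The third case can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it integrates the N=3 system above with a hand-written fourth-order Runge-Kutta step, using the parameter values and initial conditions quoted in the caption of Figure 1. The cyclic arrangement of the two coupling strengths in the connection matrix, the mapping of x, y, z to the three units, the step size, the integration time and all function names are assumptions made for illustration.

```python
# Minimal sketch of winnerless competition in a 3-unit generalized
# Lotka-Volterra system (pure Python, no third-party packages).

def glv_rhs(a, rho):
    """Right-hand side da_i/dt = a_i * (1 - sum_j rho[i][j] * a[j])."""
    n = len(a)
    return [a[i] * (1.0 - sum(rho[i][j] * a[j] for j in range(n)))
            for i in range(n)]


def rk4_step(a, rho, dt):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    k1 = glv_rhs(a, rho)
    k2 = glv_rhs([x + 0.5 * dt * k for x, k in zip(a, k1)], rho)
    k3 = glv_rhs([x + 0.5 * dt * k for x, k in zip(a, k2)], rho)
    k4 = glv_rhs([x + dt * k for x, k in zip(a, k3)], rho)
    return [x + dt / 6.0 * (p + 2.0 * q + 2.0 * r + s)
            for x, p, q, r, s in zip(a, k1, k2, k3, k4)]


def simulate(alpha=0.5, beta=1.8, a0=(1.01, 1.01, 0.01),
             dt=0.01, t_end=100.0):
    """Integrate the system and return the list of visited states."""
    # Assumed cyclic connection matrix: diagonal 1, alpha < 1 < beta.
    rho = [[1.0, alpha, beta],
           [beta, 1.0, alpha],
           [alpha, beta, 1.0]]
    state = list(a0)
    trajectory = [state]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, rho, dt)
        trajectory.append(state)
    return trajectory


def winner_sequence(trajectory):
    """Return the order in which units become the most active one."""
    sequence = []
    for state in trajectory:
        winner = max(range(len(state)), key=lambda i: state[i])
        if not sequence or sequence[-1] != winner:
            sequence.append(winner)
    return sequence


if __name__ == "__main__":
    # Indices are 0-based: 0, 1, 2 stand for units 1, 2, 3.
    print(winner_sequence(simulate()))
```

Under these assumptions no unit wins permanently: the most active unit keeps changing in a fixed cyclic order as the trajectory passes from the neighbourhood of one saddle point to the next, which is the switching behaviour that the text calls winnerless competition.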

The advantages of dealing with Lotka-Volterra systems
are important. It has been shown above how a winnerless
competition process can emerge in a generalized
Lotka-Volterra system. It is also known that this type
of process is generalizable to any dynamical system and
that any dynamical system can be represented by using
recurrent neural networks (Hornik et al., 1990). From
this point of view, winnerless competition processes can
be obtained whenever one boundary condition is met: the
Lotka-Volterra system must have a dimension n of at
least three for winnerless competition behavior to
appear. In the following, it is assumed that
Lotka-Volterra systems approximate arbitrarily closely
the dynamics of any finite-dimensional dynamical system
for any finite time, and we concentrate on presenting
them as a type of neural net of great interest for
applications (Hopfield, 2001). Various attempts at
modeling the complex dynamics in insect brains have been
made (Nowotny et al., 2004; Rabinovich et al., 2001),
and it is suggested that simple CRNN (continuous
recurrent neural network) systems could be an
alternative framework to implement competing processes
between neurons that generate spatio-temporal patterns
to codify memory in a similar way to how the simplest
living systems do (Rabinovich et al., 2006). Recurrent
neural networks of competing neurons (inspired by how
higher brain centres in insects work) would make it
possible to explore how sequential memory is built and
might suggest control architectures for insect-inspired
robotic systems.

Winnerless Competition Systems 
Generate Adaptive Behavior 

Some features of winnerless competition systems make
them very promising for modelling the activity and the
design of intelligent artefacts. We focus on some
results of previous theoretical studies on systems of n
elements coupled through excitation-inhibition relations
(Rabinovich et al., 2001). These systems show:

Large Capacity: A heteroclinic (spatiotemporal)
representation provides greatly increased capacity
to the system. Because sequences of activity are
combinatorial across elements and time, overlap
between representations can be reduced, and the
distance in phase space between orbits can be
increased.

Sensitivity (to similar stimuli) and, simultaneously,
capacity for categorization: This is because the
heteroclinic linking of a specific set of saddle
points is always unique. Two similar stimuli,
activating greatly overlapping subsets of a network,
may become easily separated because small initial
differences become amplified in time.

Robustness: The attractor of a perturbed system
remains in a small neighborhood of the "unperturbed"
attractor (robustness as topological similarity of
the perturbed pattern).

All these important features emerge from the dynamics
of the Lotka-Volterra system. There are more examples:
Nepomnyashchikh et al. (2003) describe a simple chaotic
system of coupled oscillators that shows complex and
fruitful adaptive behaviour; the interaction between
the activity of the elements in the model and external
inputs gives rise to searching rules that emerge from
basic properties of nonlinear systems (rules which have
not been pre-programmed explicitly) and have obvious
adaptive value. In more detail: the adaptive rules are
autonomous (the system selects an appropriate rule with
no instructions from outside), and they result from the
interaction between the intrinsic dynamics of the system
and the dynamics of the environment. These rules emerge
spontaneously because of the non-linearity of the simple
system.

Winnerless Competition for Computing 
and Interests in Robotics 

The suggestion of using heteroclinic trajectories for
computing purposes has advantages for robotics. It is
known that very simple dynamical systems are equivalent
to Turing machines, and also that computing with
heteroclinic orbits adds to classical computing the
feature of high sensitivity to initial conditions. If
we consider artefacts whose computation processes are
ordered by winnerless competition behaviour, the
artefacts will have great ability to process, manage
and store sequential information. In spite of the
history of studies of sequential learning and memory,
little is known about the dynamical principles by which
neural networks store and remember multiple events and
their temporal order. The winnerless competition
principle can be a very useful mechanism to explore and
model sequential and scheduled processes in industrial
and robotic problems.



FUTURE TRENDS

Computation by heteroclinic orbits provides new
perspectives on traditional computing. Because of its
features, it could be interesting to build bio-inspired
systems based on winnerless competition processes.
Evolution has chosen nonlinear dynamical phenomena as
the basis of the adaptive behaviour patterns of living
organisms, and these systems show the coexistence of
sensitivity (the ability to distinguish distinct, albeit
similar, inputs) and robustness (the ability to classify
similar received signals as the same one). If we are
able to reproduce the same characteristics in artificial
intelligent architectures, it will be easier to go
beyond the current limitations of intelligent systems
applied to real problems.



CONCLUSION

It has been summarized how a system architecture whose
stimulus-dependent dynamics reproduces spatio-temporal
features could code and build a memory inspired by the
higher brain centres of insects (Nowotny et al., 2004).
Beyond the biological observations which suggested these
investigations, recurrent neural networks in which
winnerless competition processes can emerge provide an
attractive model for computation because of their large
capacity as well as their robustness to noise
contamination. An interesting tool has been shown (using
control and synchronization of spatio-temporal patterns)
to transfer and process information between different
neural assemblies for classification problems in,
eventually, several industrial environments. For
example, winnerless competition processes could solve
the fundamental contradiction between sensitivity and
generalization of the recognition, and between
multistability and robustness to noise in real processes
(Rabinovich et al., 2000). For classification tasks, it
is useful to have models that are reproducible. In the
language of non-linearity, this is possible only if the
system is strongly dissipative (in other words, if it
can rapidly forget its initial state). On the other
hand, a useful classifier system should be sensitive to
small variations in the inputs, so that fine
discriminations between similar but not identical
stimuli are possible. The winnerless competition
principle shows both features.

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KEY TERMS 

Adaptive Behaviour: Type of behavior that allows an
individual to replace a disruptive behavior with
something more constructive and to adapt to a given
situation.

Bio-Inspired Techniques: Bio-inspired systems 
and tools are able to bring together results from differ- 
ent areas of knowledge, including biology, engineering 
and other physical sciences, interested in studying and 
using models and techniques inspired from or applied 
to biological systems. 

Computational System: Computation is a general
term for any type of information processing that can be
represented mathematically. This includes phenomena
ranging from simple calculations to human thinking.
A device able to make computations is called a
computational system.

Dynamical Recurrent Networks: Complex non- 
linear dynamic system described by a set of nonlinear 
differential or difference equations with extensive 
connection weights. 

Heteroclinic Orbits: In the phase portrait of a 
dynamical system, a heteroclinic orbit (sometimes 
called a heteroclinic connection) is a path in phase 
space which joins two different equilibrium points. If 
the equilibrium points at the start and end of the orbit 
are the same, the orbit is a homoclinic orbit. 

Stability-Plasticity Dilemma: It explores how a 
learning system remains adaptive (plastic) in response 
to significant input, yet remains stable in response to 
irrelevant input. 
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Winnerless Competition Process: Dynamical 
process whose main point is the transformation of the 
incoming identity or spatial inputs into identity-tem- 
poral output based on the intrinsic switching dynamics 
of the neural system. 







Biometric Security Technology 

Marcos Faundez-Zanuy 

Escola Universitaria Politecnica de Mataro, Spain 



INTRODUCTION 

The word biometrics comes from the Greek words 
"bios" (life) and "metrikos" (measure). Strictly speak- 
ing, it refers to a science involving the statistical analysis 
of biological characteristics. Thus, we should refer 
to biometric recognition of people, as those security 
applications that analyze human characteristics for 
identity verification or identification. However, we will 
use the short term "biometrics" to refer to "biometric 
recognition of people". 

Biometric recognition offers a promising approach
for security applications, with some advantages over
the classical methods, which depend on something
you have (key, card, etc.) or something you know
(password, PIN, etc.). A nice property of biometric
traits is that they are based on something you are or
something you do, so you need neither to remember
anything nor to hold any token.



Authentication methods by means of biometrics are 
a particular portion of security systems, with a good 
number of advantages over classical methods. However, 
there are also drawbacks (see Table 1). 

Depending on the application, one of the previous 
methods, or a combination of them, will be the most 
appropriate. This article describes the main issues to 
be known for decision making, when trying to adopt 
a biometric security technology solution. 



MAIN FOCUS OF THE ARTICLE 

This article presents an overview of the main topics 
related to biometric security technology, with the central 
purpose to provide a primer on this subject. 

Biometrics can offer greater security and convenience
than traditional methods for people recognition. Even if
we do not want to replace a classic method (password or
handheld token) with a biometric one, we are certainly
potential users of these systems, which will even be
mandatory for new passport models. For this reason, it
is useful to be familiar with the possibilities of
biometric security technology.



Table 1. Advantages and drawbacks of the three main authentication method approaches

Authentication method: Handheld tokens (card, ID, passport, etc.)

  Advantages:
  ■ A new one can be issued.
  ■ It is quite standard, even when moving to a different country, facility, etc.

  Drawbacks:
  ■ It can be stolen.
  ■ A fake one can be issued.
  ■ It can be shared.
  ■ One person can be registered with different identities.

Authentication method: Knowledge based (password, PIN, etc.)

  Advantages:
  ■ It is a simple and economical method.
  ■ If there are problems, it can be replaced by a new one quite easily.

  Drawbacks:
  ■ It can be guessed or cracked.
  ■ Good passwords are difficult to remember.
  ■ It can be shared.
  ■ One person can be registered with different identities.

Authentication method: Biometrics

  Advantages:
  ■ It cannot be lost, forgotten, guessed, stolen, shared, etc.
  ■ It is quite easy to check if one person has several identities.
  ■ It can provide a greater degree of security than the other ones.

  Drawbacks:
  ■ In some cases a fake one can be issued.
  ■ It is neither replaceable nor secret.
  ■ If a person's biometric data is stolen, it is not possible to replace it.



BIOMETRIC TRAITS 

The first question is: which characteristics can be used
for biometric recognition? As common sense suggests, a
good biometric trait must satisfy a set of properties.
Mainly they are (Clarke, 1994), (Mansfield & Wayman,
2002):

Universality: Every person should have the
characteristic.

Distinctiveness: Any two persons should be different
enough to be distinguished from each other based on
this characteristic.

Permanence: The characteristic should be stable
enough (with respect to the matching criterion)
over time, under different environmental conditions,
etc.

Collectability: The characteristic should be
acquirable and quantitatively measurable.

Acceptability: People should be willing to accept
the biometric system, and should not feel that it is
annoying, invasive, etc.

Performance: The identification accuracy and
required time for a successful recognition must
be reasonably good.

Circumvention: The ability of fraudulent people
and techniques to fool the biometric system should
be negligible.



Biometric traits can be split into two main categories:

Physiological biometrics: It is based on direct
measurements of a part of the human body.
Fingerprint (Maltoni et al., 2003), face, iris and
hand-scan (Faundez-Zanuy & Navarro-Merida,
2005) recognition belong to this group.

Behavioral biometrics: It is based on measurements
and data derived from an action performed by the
user, and thus indirectly measures some
characteristics of the human body. Signature
(Faundez-Zanuy, 2005c), gait, gesture and keystroke
recognition belong to this group.

However, this classification is quite artificial. For 
instance, the speech signal (Faundez-Zanuy and Monte, 
2005) depends on behavioral traits such as semantics, 
diction, pronunciation, idiosyncrasy, etc. (related to 
socio-economic status, education, place of birth, etc.) 
(Furui, 1989). However, it also depends on the speaker's
physiology, such as the shape of the vocal tract. On 
the other hand, physiological traits are also influenced 
by user behavior, such as the manner in which a user 
presents a finger, looks at a camera, etc. 

Verification and Identification 

Biometric systems can be operated in two modes, 
named identification and verification. We will refer to 
recognition for the general case, when we do not want 
to differentiate between them. However, some authors 
consider recognition and identification synonymous. 

Identification: In this approach no identity is
claimed by the user. The automatic system must
determine who the user is. If he/she belongs to a
predefined set of known users, it is referred to as
closed-set identification. However, the set of users
known (learnt) by the system is certainly much
smaller than the potential number of people who
can attempt to enter. The more general situation,
where the system has to deal with users who may
not be modeled in the database, is referred to as
open-set identification. Adding a
"none-of-the-above" option to closed-set
identification gives open-set identification. The
system performance can be evaluated using an
identification rate.

Verification: In this approach the goal of the system
is to determine whether the person is who he/she
claims to be. This implies that the user must
provide an identity, and the system just accepts
or rejects the user according to a successful or
unsuccessful verification. Sometimes this operation
mode is named authentication or detection.
The system performance can be evaluated using
the False Acceptance Rate (FAR, those situations
where an impostor is accepted) and the False
Rejection Rate (FRR, those situations where a user
is incorrectly rejected), also known in detection
theory as False Alarm and Miss, respectively.
There is a trade-off between both errors, which
usually has to be established by adjusting a decision
threshold. The performance can be plotted in a
ROC (Receiver Operating Characteristic) or in a
DET (Detection Error Trade-off) plot (Martin et
al., 1989). The DET curve gives uniform treatment
to both types of error and uses a logarithmic scale
for both axes, which spreads out the plot, better
distinguishes different well-performing systems,
and usually produces plots that are close to linear.
Note also that the ROC curve has symmetry with
respect to the DET, i.e. it plots the hit rate instead
of the miss probability. The DET plot uses a
logarithmic scale that expands the extreme parts of
the curve, which are the parts that give the most
information about the system performance. Figure 1
shows an example of a DET plot on the left and a
classical ROC plot on the right.

Figure 1. On the left: example of a DET plot for a user verification system (dotted line). The Equal Error Rate (EER) line shows the situation where False Alarm equals Miss Probability (balanced performance). Of course, one of the two error rates can be more important (high-security applications versus those where we do not want to annoy the user with a high rejection/miss rate). If the system curve is moved towards the origin, smaller error rates are achieved (better performance). If the decision threshold is reduced, higher False Acceptance/Alarm rates are achieved. On the right: example of a ROC plot for a user verification system (dotted line). The Equal Error Rate (EER) line shows the situation where False Alarm equals Miss Probability (balanced performance).
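
The relation between closed-set and open-set identification described above can be sketched in a few lines. This is an illustrative sketch only: the similarity scores, the user names, the rejection threshold and the function name are hypothetical, and a real system would obtain the scores from its matching stage.

```python
def identify(scores, reject_threshold=None):
    """Return the best-matching enrolled user, or None if rejected.

    `scores` maps each enrolled user to a similarity score (higher
    means more similar). With reject_threshold=None the function
    performs closed-set identification; with a threshold it adds the
    "none-of-the-above" option, i.e. open-set identification.
    """
    best_user = max(scores, key=scores.get)
    if reject_threshold is not None and scores[best_user] < reject_threshold:
        return None  # "none of the above": probably not an enrolled user
    return best_user


# Hypothetical scores of one probe sample against three enrolled users.
probe_scores = {"alice": 0.42, "bob": 0.35, "carol": 0.18}

closed = identify(probe_scores)                         # always names a user
opened = identify(probe_scores, reject_threshold=0.6)   # may reject the probe
```

With these invented scores the closed-set call names the closest enrolled user even though the match is poor, while the open-set call rejects the probe.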

For systems working in verification mode, the evolution
of FAR and FRR versus the threshold setting is an
interesting plot. Using a high threshold, no impostor
can fool the system, but a lot of genuine users will be
rejected. Conversely, using a low threshold, there would
be no inconvenience for genuine users, but it would be
reasonably easy for a hacker to crack the system.
Depending on the security requirements, one of the two
rates will be more important than the other.
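
The behaviour of FAR and FRR as the threshold moves can be reproduced with a small numerical sketch. The scores below are invented for illustration, a sample is assumed to be accepted when its similarity score is greater than or equal to the threshold, and the function names are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when samples scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Scan the observed scores as thresholds; return (threshold, EER).

    The EER is approximated at the threshold where |FAR - FRR| is
    smallest, taking the mean of the two rates at that point.
    """
    def gap(threshold):
        far, frr = far_frr(genuine, impostor, threshold)
        return abs(far - frr)

    best = min(sorted(set(genuine) | set(impostor)), key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, (far + frr) / 2.0


# Hypothetical similarity scores (higher means a better match).
genuine_scores = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40]
impostor_scores = [0.10, 0.20, 0.30, 0.45, 0.50, 0.65]

if __name__ == "__main__":
    for t in (0.05, 0.55, 0.95):
        print(t, far_frr(genuine_scores, impostor_scores, t))
    print(equal_error_rate(genuine_scores, impostor_scores))
```

Sweeping the threshold over a finer grid and plotting the two rates gives the FAR/FRR-versus-threshold plot mentioned above; plotting one rate against the other gives the curve from which ROC and DET plots are drawn.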

Is Identification Mode More Appropriate 
Than Verification Mode? 

Certain applications lend themselves to verification, 
such as PC and network security, where, for instance, 
you replace your password by your fingerprint, but 
you still use your login. However, in forensic applica- 
tions it is mandatory to use identification, because, for 
instance, latent prints lifted from crime scenes never 
provide their "claimed identity". 

In some cases, such as room access (Faundez-Zanuy,
2004c), (Faundez-Zanuy & Fabregas, 2005), it can be
more convenient for the user to operate in identification
mode. However, verification systems are faster because
they require just a one-to-one comparison (identification
requires one-to-N comparisons, where N is the number of
users in the database). In addition, verification systems
also provide higher accuracy. For instance, a hacker has
almost N times (Maltoni et al., 2003) more chance to fool
an identification system than a verification one, because
in identification he/she just needs to match one of the
N genuine users. For this reason, commercial applications
operating in identification mode are restricted to a
small scale (at most, a few hundred users). Forensic
systems (Faundez-Zanuy, 2005a), (Faundez-Zanuy,
2005b) operate in a different mode, because they
provide a list of candidates, and a human supervisor
checks the automatic result provided by the machine.
This is related to the following classification, which is
also associated with the application.

Figure 2. General scheme of a biometric recognition system
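
The "almost N times" figure quoted above for identification can be checked with a short calculation. The sketch assumes, for illustration only, that each one-to-one comparison falsely accepts an impostor with the same probability and that the N comparisons are independent; the function name and the numerical values are hypothetical.

```python
def identification_far(far, n_users):
    """Chance that an impostor matches at least one of n_users templates.

    Assumes every one-to-one comparison falsely accepts with the same
    probability `far` and that the comparisons are independent.
    """
    return 1.0 - (1.0 - far) ** n_users


if __name__ == "__main__":
    single = 0.001  # hypothetical FAR of one verification attempt
    for n in (1, 10, 100):
        # Exact value under the assumptions, next to the N * FAR estimate.
        print(n, identification_far(single, n), n * single)
```

For a small FAR the result stays slightly below N times the single-comparison FAR, which is consistent with the word "almost" in the text.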



BIOMETRIC TECHNOLOGIES 

Several biometric traits have been proven useful for 
biometric recognition. Nevertheless, the general scheme 
of a biometric recognition system is similar, in all the 
cases, to that shown in figure 2. 

The scheme shown in figure 2 is also interesting
for vulnerability studies (Faundez-Zanuy, 2004b) and
for improvements by means of data fusion analysis
(Faundez-Zanuy, 2004d). In this paper, we restrict
ourselves to block number one, the other ones being
related to signal processing and pattern recognition.
Although common sense suggests that good acquisition is
enough for performing good recognition, at least for
humans, this is not true. It must be taken into account
that the next blocks, numbered 2 to 4 in figure 2, are
indeed fundamental. A good image or audio recording is
not enough. Even for human beings, a rare disorder named
agnosia exists. Individuals suffering from agnosia are
unable to recognize and identify objects or persons
despite having knowledge of the characteristics of the
objects or persons. People with agnosia may have
difficulty recognizing the geometric features of an
object or face, or may be able to perceive the geometric
features but not know what the object is used for or
whether a face is familiar or not. Agnosia can be limited
to one sensory modality, such as vision or hearing. A
particular case is named face blindness or prosopagnosia
(http://www.faceblind.org/research). Prosopagnosics
often have difficulty recognizing family members,
close friends, and even themselves. Agnosia can result
from strokes, dementia, or other neurological disorders.
More information about agnosia can be obtained from
the National Organization for Rare Disorders (NORD,
http://www.rarediseases.org).

Table 2 summarizes some possibilities for the
acquisition of different biometric traits. Strictly
speaking, some sensors require a digitizer connected
to their output, which is beyond the scope of this paper.
We will consider that block number one produces a
digital signal which can be processed by a Digital Signal
Processor (DSP) or Personal Computer (PC).

Figure 3 shows some biometric traits and their cor- 
responding biometric scanners. 

Table 2. Some biometric traits and possible acquisition sensors

Biometric trait: Fingerprint

  Sensor: Ink + paper + scanner
  Comments: The classical method is becoming old-fashioned, because the ink is annoying. However, it can acquire from nail to nail borders, whereas the other methods provide a limited portion of the fingerprint.

  Sensor: Optical
  Comments: It is the most widely used and easy-to-operate technology. It can acquire larger surfaces than the capacitive ones.

  Sensor: Capacitive
  Comments: They are easy to integrate into small, low-power and low-cost devices. However, they are more difficult to operate than the optical ones (wet and/or warm fingers).

  Sensor: Ultrasound
  Comments: They are not ready for mass-market applications yet. However, they are more capable of penetrating dirt than the other ones, and are not subject to some of the image-dissolution problems found in larger optical devices.

Biometric trait: Face

  Sensor: Photo-camera
  Comments: Nowadays almost all mobile phones have a photo-camera, enabling face recognition applications.

  Sensor: Video-camera
  Comments: A sequence of images alleviates some problems, such as face detection, and offers more possibilities.

Biometric trait: Speech

  Sensor: Microphone
  Comments: The telephone system provides a ubiquitous network of sensors for acquiring speech signals.

Biometric trait: Iris

  Sensor: Kiosk-based systems
  Comments: The camera searches for the eye position. They are the most expensive ones and the easiest to operate.

  Sensor: Physical access devices
  Comments: The device requires some user effort: a camera is mounted behind a mirror. The user must locate the image of his eye within a 1-inch by 1-inch square surface on the mirror.

  Sensor: Desktop cameras
  Comments: The user must look into a hole and look at a ring illuminated inside.

Biometric trait: Retina

  Sensor: Retina-scanner
  Comments: A relatively large and specialized device is required. It must be specifically designed for retina imaging. Image acquisition is not a trivial matter.

Biometric trait: Signature

  Sensor: Ball pen + paper + scanner/camera
  Comments: The system recognizes the signature by analyzing its shape. This kind of recognition is known as "off-line", while the other ones are "on-line".

  Sensor: Graphics tablet
  Comments: It acquires the signature in real time. Some devices can acquire position in the x and y axes, pressure applied by the pen, and azimuth and altitude angles of the pen with respect to the tablet.

  Sensor: PDA
  Comments: Stylus-operated PDAs are also possible. They are becoming more popular, so there are some potential applications.

Biometric trait: Hand-geometry

  Sensor: Hand-scanning device
  Comments: Commercial devices consist of a covered metal surface, with some pegs for ensuring the correct hand position. A series of cameras acquires three-dimensional (3D) images of the back and the sides of the hand.

  Sensor: Conventional scanner
  Comments: Some research groups at universities have developed systems based on images acquired by a conventional document scanner. Thus, the cost is reduced, but the acquisition time is at least 15 seconds per image.

  Sensor: Conventional camera
  Comments: Some research groups at universities have developed systems based on images acquired by conventional cameras. Using some "bricolage" it is possible to obtain a 3D image.

Biometric trait: Palm-print

  Sensor: Document scanner
  Comments: Although there are no commercial applications, some research groups at universities have developed systems based on images acquired by a conventional document scanner.

Biometric trait: Keystroke

  Sensor: Keyboard
  Comments: Although not used habitually, standard keyboards can measure how long keys are held down and the duration between key instances, which is enough for recognition.

Figure 3. Examples of biometric traits and examples of commercial scanners

Security and Privacy

A nice property of biometric security systems is that
the security level is almost equal for all users in a
system. This is not true for other security
technologies. For instance, in an access control system
based on passwords, a hacker needs to break only one
password among those of all employees to gain access.
In this case, a weak password compromises the overall
security of every system that user has access to. Thus,
the entire system's security is only as good as the
weakest password (Prabhakar, Pankanti & Jain, 2003).
This is especially important because good passwords are
nonsense combinations of characters and letters, which
are difficult to remember (for instance, "Jh2pz6R+").
Unfortunately, some users still use passwords such as
"password", "Homer Simpson" or their own name.

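The gap between a nonsense password and a guessable one can be made concrete with a rough estimate of the brute-force search space. The sketch below is an illustration under stated assumptions: the character classes and their sizes (26 lowercase letters, 26 uppercase letters, 10 digits, 32 ASCII punctuation symbols) and the function name are choices made here, and the estimate ignores dictionary attacks, which make a word such as "password" far weaker than the figure suggests.

```python
import math
import string


def search_space_bits(password):
    """Upper-bound estimate of brute-force effort, in bits.

    The alphabet size is the sum of the sizes of the character classes
    that actually occur in the password (an assumption of this sketch).
    """
    alphabet = 0
    if any(c in string.ascii_lowercase for c in password):
        alphabet += 26
    if any(c in string.ascii_uppercase for c in password):
        alphabet += 26
    if any(c in string.digits for c in password):
        alphabet += 10
    if any(c in string.punctuation for c in password):
        alphabet += len(string.punctuation)  # 32 ASCII symbols
    if alphabet == 0:
        return 0.0
    return len(password) * math.log2(alphabet)


if __name__ == "__main__":
    for candidate in ("Jh2pz6R+", "password"):
        print(candidate, round(search_space_bits(candidate), 1))
```

Both example passwords have eight characters, so the difference in the estimate comes only from the size of the alphabet they draw on.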


FUTURE TRENDS 

Although biometrics offers a good set of advantages, it
has not been massively adopted yet (Faundez-Zanuy,
2005d). One of its main drawbacks is that biometric
data is not secret and cannot be replaced after being
compromised by a third party. For those applications
with a human supervisor (such as border entrance
control), this can be a minor problem, because the
operator can check whether the presented biometric trait
is original or fake. However, for remote applications
such as the internet, some kind of liveness detection
and anti-replay attack mechanisms should be provided.
This is an emerging research topic. As a general rule in
security matters, constant updating is necessary in
order to remain protected. A system suitable for the
present time can become obsolete if it is not
periodically improved. For this reason, nobody can claim
to have a perfect security system, and even less that it
will last forever.

Another interesting topic is privacy, which is beyond 
the scope of this article. It has recently been discussed 
in Faundez-Zanuy (2005a). 
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KEY TERMS 

Automatic Identification: The system must determine 
who the user is. 

Automatic Verification: The system must determine 
whether the person is who he/she claims to be. 

Behavioral Biometrics: Based on measurements 
and data derived from an action performed by the user, 
and thus an indirect measure of some characteristics of the 
human body. Signature, gait, gesture and keystroke 
recognition belong to this group. 

Equal Error Rate: The error rate at the operating point 
where the False Acceptance Rate equals the False Rejection Rate. 

False Acceptance Rate: Proportion of impostor attempts 
that are incorrectly accepted. 

False Rejection Rate: Proportion of genuine-user attempts 
that are incorrectly rejected. 

Physiological Biometrics: Based on direct measurements 
of a part of the human body. Fingerprint, face, 
iris and hand-scan recognition belong to this group. 
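
As a rough illustration of how the three error rates defined above relate to each other, the following Python sketch computes the False Acceptance Rate and the False Rejection Rate for a given decision threshold, and then scans the observed scores for the operating point where the two rates are closest, which approximates the Equal Error Rate. The function names, the convention that a higher score means a better match, and the score lists are assumptions made for this example; they do not come from any particular biometric system.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when a user is accepted if score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Return (threshold, FAR, FRR) at the candidate threshold where
    |FAR - FRR| is smallest; the candidates are the observed scores."""
    candidates = sorted(set(genuine) | set(impostor))

    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)

    best = min(candidates, key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, far, frr


# Hypothetical similarity scores (higher means a better match).
genuine_scores = [0.91, 0.85, 0.78, 0.66, 0.95, 0.72, 0.88, 0.59]
impostor_scores = [0.12, 0.35, 0.48, 0.61, 0.27, 0.70, 0.33, 0.41]

threshold, far, frr = equal_error_rate(genuine_scores, impostor_scores)
print(f"threshold={threshold}, FAR={far}, FRR={frr}")
```

Lowering the threshold makes the system more permissive (FAR grows, FRR shrinks) and raising it has the opposite effect, which is why a single operating point such as the Equal Error Rate is often used to summarize performance.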
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INTRODUCTION 

This work presents a brief introduction to blind 
source separation using independent component 
analysis (ICA) techniques. The main objective of 
blind source separation (BSS) is to recover, from 
observations composed of different mixed signals, the 
original signals that compose them. This objective 
can be reached using two different techniques: the 
spatial and the statistical one. The first is based on 
a microphone array and depends on the position of the 
microphones and the separation between them. It also 
uses the directions of arrival (DOA) of the different audio signals. 

Statistical separation, on the other hand, assumes 
that the signals are statistically independent, that 
they are mixed in a linear way and that it is possible 
to capture the mixtures with suitable sensors (Hyvarinen, 
Karhunen & Oja, 2001; Parra, 2002). 

The latter technique is the one studied in this work, 
because it is the newest and is in continuous 
development. It is used 
in different fields such as natural language processing 
(Murata, Ikeda & Ziehe, 2001; Saruwatari, Kawamura 
& Shikano, 2001), bioinformatics and image processing 
(Cichocki & Amari, 2002), and in different real-life 
applications such as mobile communications (Saruwatari, 
Sawai, Lee, Kawamura, Sakata & Shikano, 2003). 

Specifically, the technique used 
is Independent Component Analysis (ICA). ICA 
derives from an older technique called Principal 
Component Analysis (PCA) (Hyvarinen, Karhunen & Oja, 
2001; Smith, 2006). PCA is used in a wide range of 
fields, such as face recognition or image compression, 
and is a very common technique for finding patterns in 
high-dimensional data. 

The BSS problem can take two different forms. In the 
first, the mixtures are linear, meaning that the 
data are mixed without echoes or reverberations. In the 
second, echoes and reverberations are present: the mixtures 
are convolutive, and they are not totally independent 
because of the signal propagation through dynamic 
environments. This is the "cocktail party problem". 
Depending on the type of mixture, there are several methods to 
solve the BSS problem. The first case can be seen as 
a simplification of the second one. 

Blind source separation methods based on ICA are also 
divided into three groups: methods 
that work in the time domain, methods 
that work in the frequency domain, and methods 
that combine both domains. 
A review of the state of the art of 
these methods is presented in this work. 



BACKGROUND 

The problem consists of several sources that are mixed 
in a system; these mixtures are recorded, and then 
they have to be separated to obtain estimates of 
the original sources. As mentioned above, BSS 
problems can be of two different types: the first, 
when the mixtures are linear (see Equation 3), and the 
second, when the mixtures are convolutive (see 
Equation 5). 

In the first case, each source signal is multiplied by 
a constant that depends on the environment, and the 
results are then added. Convolutive mixtures are not totally 
independent, due to the signal propagation through 
dynamic environments, so the signals are 
not simply added. The first case is the ideal one, and 
the second is the most common, because the mixing 
systems in real room recordings are of this type. 
The following figure shows the mixing system in the 
case of two sources and two mixtures: 

Figure 1. 2 sources - 2 mixtures system. 

where $X_1$ and $X_2$ are the independent signals, $Y_1$ 
and $Y_2$ are the mixtures of the different $X_i$, and $H$ is the 
mixing system, which can be seen in a general form as: 

$$H = [h_{ij}] \qquad (1)$$

The $h_{ij}$ are FIR filters; each one represents the multipath 
acoustic transfer function from source $i$ to sensor $j$, 
where $i$ and $j$ index the sources and the sensors. 
Now it is necessary to recall the first condition that 
makes blind source separation possible: 

"The number of sensors must be greater than or equal 
to the number of sources." 

Taking this into account, the problem for two sensors, 
in a general form, can be represented as: 

$$Y_1 = X_1 * h_{11} + X_2 * h_{21} \qquad (2.1)$$

$$Y_2 = X_1 * h_{12} + X_2 * h_{22} \qquad (2.2)$$

where $*$ denotes convolution. 
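
The two-sensor convolutive model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `convolve` and `convolutive_mix`, the impulse-like test signals and the filter taps are all hypothetical, and `filters[i][j]` is taken to hold the taps of the FIR filter from source i to sensor j, following the definition given in the text.

```python
def convolve(signal, fir):
    """Full linear convolution of a signal with the taps of an FIR filter."""
    out = [0.0] * (len(signal) + len(fir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(fir):
            out[n + k] += s * h
    return out


def convolutive_mix(sources, filters):
    """Return one mixture per sensor.

    filters[i][j] holds the taps of the filter from source i to sensor j.
    """
    n_sensors = len(filters[0])
    mixtures = []
    for j in range(n_sensors):
        # Each source reaches sensor j through its own filter.
        parts = [convolve(x, filters[i][j]) for i, x in enumerate(sources)]
        length = max(len(p) for p in parts)
        mixtures.append(
            [sum(p[t] for p in parts if t < len(p)) for t in range(length)]
        )
    return mixtures


# Hypothetical sources (unit impulses) and filter taps.
x1 = [1.0, 0.0, 0.0, 0.0]
x2 = [0.0, 1.0, 0.0, 0.0]
filters = [
    [[1.0], [0.3, 0.1]],    # source 1 -> sensor 1, source 1 -> sensor 2
    [[0.5, 0.25], [1.0]],   # source 2 -> sensor 1, source 2 -> sensor 2
]

y1, y2 = convolutive_mix([x1, x2], filters)
print(y1)
print(y2)
```

When every filter has a single tap, the convolutions reduce to multiplications by constants, which is the linear (instantaneous) case described next.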



Generally, there are $n$ statistically independent source 
signals, $X(t) = [X_1(t), \ldots, X_n(t)]$, and $m$ observed 
mixtures that are linear and instantaneous combinations 
of the previous signals, $Y(t) = [Y_1(t), \ldots, Y_m(t)]$. 
Beginning with the linear case, the simplest one, the 
mixtures are: 



$$Y_j(t) = \sum_{i=1}^{n} h_{ij} \, X_i(t) \qquad (3)$$



Now we need to recover X(t) from Y(t). To do so, it is 
necessary to estimate the inverse of the matrix H, which 
contains the coefficients $h_{ij}$. Once we have this matrix: 



$$\hat{X}(t) = W \cdot Y(t) \qquad (4)$$

where $\hat{X}(t)$ contains the estimates of the original 
source signals, and $W$ is the inverse mixing matrix. 
Now that we have defined the simplest case, it is time to 
explain the general case, which involves convolutive 
mixtures. 
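
The linear case can be checked numerically with a small Python sketch. It assumes, purely for illustration, that the mixing gains are known, so that the inverse matrix W can be computed directly; in a real blind separation problem the gains are unknown and W has to be estimated from the mixtures alone. The helper names, the sources and the gains are hypothetical, and each row of `mixing` is taken to hold the gains seen by one sensor.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be recovered")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_matrix(matrix, signals):
    """Apply a 2x2 matrix to a pair of signals, sample by sample."""
    s1, s2 = signals
    out1 = [matrix[0][0] * u + matrix[0][1] * v for u, v in zip(s1, s2)]
    out2 = [matrix[1][0] * u + matrix[1][1] * v for u, v in zip(s1, s2)]
    return out1, out2


# Hypothetical sources and mixing gains (one row per sensor).
x1 = [1.0, 2.0, 3.0]
x2 = [0.0, -1.0, 1.0]
mixing = [[1.0, 0.5],
          [0.25, 1.0]]

y1, y2 = apply_matrix(mixing, (x1, x2))      # observed mixtures
w = invert_2x2(mixing)                       # inverse mixing matrix W
x1_hat, x2_hat = apply_matrix(w, (y1, y2))   # estimates of the sources
print(x1_hat, x2_hat)
```

The estimates match the sources up to rounding error. ICA methods aim at the same result without access to `mixing`, which is why they typically recover the sources only up to order and scale.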

The whole process, which includes the mixing and 
separation stages described above 
for linear mixtures, is shown in Figure 2. 

The process is the following: first, a set of source 
signals ($X$) passes through an unknown system $H$. The 
output ($Y$) contains all the mixtures. $Y$ is then equalized with 
an estimated inverse system $W$, which has to give an 
estimate of the original source signals ($\hat{X}$). 

Given access to N sensors, with a number of sources 
less than or equal to N and all direct and cross 
channels unknown, the objective is to recover all the unknown 
sources. Here the second condition for 
source separation arises: 

"In blind source separation using ICA, it is assumed 
that we only know the probability density functions of 
the non-Gaussian and independent sources." 

So we have to obtain $W$ such that: 

$$\hat{X} = W * Y \qquad (5)$$

Here, as mentioned above, $\hat{X}$ contains the estimates 
of the original source signals, $Y$ the observations, 
and $W$ is the inverse mixing filter. 

BLIND SOURCE SEPARATION BY ICA 

This article presents different methods to solve the blind 
source separation problem, specifically those based on 
independent component analysis (ICA). First, methods 
for linear mixtures are described; 
then the methods for convolutive mixtures are divided 
into three groups depending on the domain: 
frequency domain, time domain or both. 

Blind Source Separation for Linear 
Mixtures 

Blind source separation for linear mixtures is a 
particular case of the convolutive one, so methods 
designed for convolutive mixtures should, in theory, 
also solve the linear problem. Nevertheless, 
we describe some linear methods 
separately, because there are important methods 
specialized in this case. 

Figure 2. BSS general problem 

There are different methods, such as Infomax, which 
is based on the maximization of information (Bell 
& Sejnowski, 1995), methods based on the minimization of 
mutual information (Hyvarinen, Karhunen & Oja, 
2001), and methods that use tensors. The methods 
described in this article are FastICA and 
JADE; they have been selected because they are two of 
the best-known methods for the linear case of BSS 
and because they are a good first step into BSS. 

The first, and perhaps the best known, is FastICA 
(Hyvarinen, Karhunen & Oja, 2001). The FastICA 
algorithm is a computationally highly efficient 
method for performing ICA estimation. It uses 
a fixed-point iteration scheme that has been found in 
independent experiments to be 10-100 times faster 
than conventional gradient descent methods for ICA. 
Another advantage of the FastICA algorithm is that 
it can also be used to perform projection pursuit, 
thus providing a general-purpose data analysis method 
that can be used both in an exploratory fashion and for 
the estimation of independent components (or sources) 
(Hyvarinen, Karhunen & Oja, 2001; Parra, 2002). This 
algorithm is available in an easy-to-use toolbox 
(Helsinki University of Technology, 2006). 
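
To give a feeling for what a fixed-point ICA iteration does, the following Python sketch separates two linearly mixed signals. It is a minimal illustration and not the toolbox mentioned above: it assumes exactly two mixtures, whitens them with a closed-form 2x2 eigen-decomposition, and then runs a one-unit fixed-point update with the kurtosis (cubic) nonlinearity, taking the second component orthogonal to the first. All function names, the test sources and the mixing gains are hypothetical.

```python
import math


def whiten(y1, y2):
    """Center two mixtures, decorrelate them and scale them to unit variance."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    c1 = [v - m1 for v in y1]
    c2 = [v - m2 for v in y2]
    # Covariance matrix [[a, b], [b, d]] of the centered mixtures.
    a = sum(u * u for u in c1) / n
    d = sum(v * v for v in c2) / n
    b = sum(u * v for u, v in zip(c1, c2)) / n
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    root = math.hypot((a - d) / 2, b)
    lam1, lam2 = (a + d) / 2 + root, (a + d) / 2 - root
    theta = 0.5 * math.atan2(2 * b, a - d)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    z1 = [(cos_t * u + sin_t * v) / math.sqrt(lam1) for u, v in zip(c1, c2)]
    z2 = [(-sin_t * u + cos_t * v) / math.sqrt(lam2) for u, v in zip(c1, c2)]
    return z1, z2


def fixed_point_ica(z1, z2, iterations=50):
    """One-unit fixed-point iteration with the kurtosis nonlinearity."""
    n = len(z1)
    w = (1.0, 0.0)
    for _ in range(iterations):
        proj = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]
        # Update w <- E[z (w.z)^3] - 3 w, then normalize to unit length.
        g1 = sum(u * p ** 3 for u, p in zip(z1, proj)) / n - 3 * w[0]
        g2 = sum(v * p ** 3 for v, p in zip(z2, proj)) / n - 3 * w[1]
        norm = math.hypot(g1, g2)
        w = (g1 / norm, g2 / norm)
    s1 = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]
    s2 = [-w[1] * u + w[0] * v for u, v in zip(z1, z2)]  # orthogonal direction
    return s1, s2


def abs_corr(p, q):
    """Absolute correlation coefficient between two equal-length signals."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((u - mp) * (v - mq) for u, v in zip(p, q))
    var_p = sum((u - mp) ** 2 for u in p)
    var_q = sum((v - mq) ** 2 for v in q)
    return abs(cov) / math.sqrt(var_p * var_q)


# Two hypothetical non-Gaussian sources: a sine wave and a sawtooth.
t = range(2000)
src1 = [math.sin(0.05 * k) for k in t]
src2 = [2.0 * ((k % 11) / 11.0) - 1.0 for k in t]
mix1 = [1.0 * u + 0.6 * v for u, v in zip(src1, src2)]
mix2 = [0.4 * u + 1.0 * v for u, v in zip(src1, src2)]

est1, est2 = fixed_point_ica(*whiten(mix1, mix2))
for est in (est1, est2):
    print(round(abs_corr(est, src1), 3), round(abs_corr(est, src2), 3))
```

Each estimate should correlate strongly with one source and weakly with the other. The order and sign of the recovered components are arbitrary, which is a general property of ICA and not a defect of this sketch.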

Another method that is very common in blind 
source separation of linear mixtures is JADE (Cardoso, 
1993; Hyvarinen, Karhunen & Oja, 2001). JADE 
(joint approximate diagonalization of eigenmatrices) 
refers to one principle for solving the problem of equal 
eigenvalues of the cumulant tensor. In this algorithm, 
the tensor EVD is considered more as a preprocessing 
step. A method closely related to JADE is given by the 
eigenvalue decomposition of the weighted correlation 
matrix. For historical reasons, the basic method is 
simply called fourth-order blind identification (FOBI) 
(Hyvarinen, Karhunen & Oja, 2001). 

Blind Source Separation for Convolutive 
Mixtures 

Now that the linear problem has been described, it is time 
to explain how to solve the problem with convolutive 
mixtures. This case is more complex than the linear 
one, as presented in the Background section. When the 
mixtures are convolutive, the problem is also called 
blind deconvolution. 

To solve this problem, several methods have been 
designed. They can be divided into three groups 
depending on the domain: time domain, frequency 
domain or both. 

When the algorithm works in the frequency domain, 
the convolutive mixtures can be simplified into 
instantaneous ones by means of the frequency transform. This 
makes the convergence of the separation filter easier 
(Choi, Cichocki, Park & Lee, 2005), so these algorithms 
can increase the speed of convergence and reduce 
the computational load. But this has a cost: to maintain 
computational efficiency, these algorithms need 
to increase the length of the frame when the window 
frame increases. The amount of data is thereby reduced, 
which can leave insufficient learning data, so the 
efficiency of the algorithm is degraded. 
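
The reason why the frequency transform turns a convolutive mixture into an instantaneous one can be verified with a short Python sketch: for a circular convolution, the discrete Fourier transform of the filtered signal equals, bin by bin, the product of the two transforms. The sketch uses a circular convolution so that the identity is exact; with real frames it holds only approximately, when the frame is long compared with the filter. The helper names, the frame and the filter taps are hypothetical.

```python
import cmath


def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(x, h):
    """Circular convolution of two sequences of the same length."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


# Hypothetical source frame and FIR filter, zero-padded to the frame length.
frame = [1.0, 2.0, 0.0, -1.0, 3.0, 0.0, 0.0, 0.0]
fir = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Spectrum of the filtered frame versus the bin-by-bin product of spectra.
mixed_spectrum = dft(circular_convolve(frame, fir))
product = [a * b for a, b in zip(dft(frame), dft(fir))]
max_error = max(abs(a - b) for a, b in zip(mixed_spectrum, product))
print(max_error)
```

In each frequency bin the filter acts as a single complex gain, so the separation problem in that bin has the same form as the linear case.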

Some examples of these algorithms are: blind source 
separation in the wavelet domain, recursive method 
(Ding, Hikichi, Niitsuma, Hamatsu & Sugai, 2003), 
time delay decorrelation (Lee, Ziehe, Orglmeister & 
Sejnowski, 1998), ICA algorithm and beamforming 
(Saruwatari, Kawamura & Shikano, 2001) or the FIR 
matrix toolbox 5.0 (Lambert, 1996). 

If the algorithm works in the time domain, we can 
work with wide-band audio signals, where the assumption 
of independence holds. The disadvantages are that 
these algorithms produce a high computational load when they work 
with large separation matrices, and that convergence is 
slow, especially when the signals are voices. 

Some algorithms that work in the time domain are 
SIMO-ICA (Saruwatari, Sawai, Lee, Kawamura, 
Sakata & Shikano, 2003), filter banks (Choi, Cichocki, 
Park & Lee, 2005) or time-domain fast fixed-point 
algorithms for convolutive ICA (Thomas, Deville & 
Hosseini, 2006). 

We can also combine the two previous approaches 
so that the advantages of each compensate for the 
disadvantages of the other, for example in Multistage ICA 
(Nishikawa, Saruwatari, Shikano, Araki & Makino, 2003), where 
FDICA (frequency-domain ICA) and TDICA (time-domain 
ICA) are combined with the objective of attaining the best 
possible efficiency. This algorithm has a stable 
behaviour, but the computational load is high. 



FUTURE TRENDS 

Blind source separation is a wide field in continuous 
development, and several authors are working 
on the design or modification of different methods. In 
this work, different algorithms have been presented to 
solve the blind source separation problem. 

Future trends will be the design of online methods 
that allow the implementation of these algorithms in 
real-life applications such as voice recognition or 
hands-free devices. The algorithms can also be improved 
with the aim of reaching a more efficient and accurate 
separation of the mixtures, whether linear or convolutive. This 
will help in different systems as a first step for identifying 
speakers, for example in conferences, videoconferences 
or similar settings. 



CONCLUSION 

This article shows different ways to solve blind source 
separation. It also illustrates that the problem can take 
two forms depending on the mixtures: a system with linear 
mixtures behaves differently from one with 
convolutive mixtures. 

As described above, the methods can 
also be divided into two types, the frequency-domain 
and the time-domain algorithms. The first type 
converges faster than the time-domain type, but 
it can work incorrectly if the data are insufficient. Time-domain 
methods have more stable behaviour than the 
frequency-domain algorithms, but a higher computational load. 
To trade off the advantages and disadvantages of 
each type of method, multistage algorithms have been 
proposed. 
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KEY TERMS 

Blind Source Separation: The problem of recovering, 
from a set of mixtures, the source signals that compose 
them. 

Cocktail Party Problem: A particular case of 
blind source separation where the mixtures are 
convolutive. 

Convolutive Mixtures: Mixtures that are not simple linear 
(instantaneous) combinations of the sources, due to echoes 
and reverberations; they are not totally 
independent because of the signal propagation through 
dynamic environments. 

HOS (Higher Order Statistics): A field of statistical 
signal processing that uses 
more information than autocorrelation functions and 
spectra. It uses moments, cumulants and polyspectra, 
which can be used to get better estimates of parameters 
in noisy situations, or to detect nonlinearities in the 
signals. 

ICA (Independent Component Analysis): Techniques 
based on statistical concepts such as higher order 
statistics. 

Linear Mixtures: Mixtures that are linear combinations 
of the different sources that compose them. 

Statistically Independent: Given two events, the 
occurrence of one event makes it neither more nor less 
probable that the other occurs. 
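
As a small illustration of a higher order statistic, the following Python sketch estimates the excess kurtosis of a sample, that is, the normalized fourth central moment minus 3, which is zero for a Gaussian distribution. Measures of this kind are one way to quantify the non-Gaussianity that ICA relies on. The function name, the random seed and the sample sizes are assumptions made for the example.

```python
import random


def excess_kurtosis(samples):
    """Normalized fourth central moment minus 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)

# A uniform source is clearly non-Gaussian (theoretical value: -1.2).
uniform = [rng.uniform(-1.0, 1.0) for _ in range(20000)]

# A sum of 12 independent uniform samples is much closer to Gaussian.
near_gaussian = [
    sum(rng.uniform(-1.0, 1.0) for _ in range(12)) for _ in range(20000)
]

print(round(excess_kurtosis(uniform), 2))
print(round(excess_kurtosis(near_gaussian), 2))
```

The second value is much closer to zero than the first, which reflects the fact that mixtures of independent signals tend to look more Gaussian than the signals themselves.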
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INTRODUCTION 

Artificial Neural Networks have proven, over the last 
four decades, to be an important tool for modelling 
the functional structures of the nervous system, as well 
as for modelling non-linear and adaptive systems 
in general, both biological and non-biological (Haykin, 
1999). They have also become a powerful biologically 
inspired general computing framework, particularly 
important for solving non-linear problems with reduced 
formalization and structure. At the same time, methods 
from the area of complex systems and non-linear 
dynamics have proven useful for understanding 
phenomena in brain activity and nervous system 
activity in general (Freeman, 1992; Kelso, 1995). 
Joining these two areas, the development of artificial 
neural networks employing rich dynamics is a growing 
subject in both theory and practice. In particular, 
model neurons with rich bifurcation and chaotic 
dynamics have been developed in recent decades, for 
the modelling of complex phenomena in biology as 
well as for application in neuro-like computing. 
Some models that deserve attention in this context are 
those developed by Kazuyuki Aihara (1990), Nagumo 
and Sato (1972), Walter Freeman (1992), K. Kaneko 
(2001), and Nabil Farhat (1994), among others. The 
following topics develop the subject of Chaotic Neural 
Networks, presenting several of the important models 
of this class and briefly discussing associated tools of 
analysis and typical target applications. 



BACKGROUND 

Artificial Neural Networks (ANNs) are one of the important 
frameworks for biologically inspired computing. 
A central characteristic of this paradigm is the desire 
to bring to computing models some of the interesting 
properties of the nervous system, such as adaptation, 
robustness, non-linearity, and learning through 
examples. 



When we focus on biology (real neural networks), 
we see that the signals generated in real neurons 
are used in different ways by the nervous system to 
code information, according to the context and the 
functionality (Freeman, 1992). Because of that, in ANNs 
we have distinct model neurons, such as models with 
graded activity based on frequency coding, models with 
binary outputs, and spiking models (or pulsed models), 
among others, each one giving emphasis to different 
aspects of neural coding and neural processing. Under 
this scenario, the role of neurodynamics is one of the 
target aspects in neural modelling and neuro-inspired 
computing; some model neurons include aspects of 
neurodynamics, which are mathematically represented 
through differential equations in continuous time, or 
difference equations in discrete time. As described in 
the following topic, dynamic phenomena happen at 
several levels in neural activity and neural assembly 
activity (in internal neural structures, in simple networks 
of interacting neurons, and in large populations of 
neurons). The model neurons particularly important for 
our discussion are those that emphasize the relationship 
between neurocomputing and non-linear dynamical 
systems with bifurcation and rich dynamic behaviour, 
including chaotic dynamics. 



NEUROCOMPUTING AND THE ROLE 
OF RICH DYNAMICS 

Dynamics is present in neural functionality 
even at the most detailed cellular level: the well-known 
Hodgkin and Huxley model for the generation 
and propagation of action potentials in the active 
membrane of real neurons is an example; time-dependent 
processes related to synaptic activity and 
post-synaptic signals are another. Dynamics also 
appears when we consider the oscillatory behaviour of 
real neurons under consistent stimulation. Additionally, 
when we consider neural assemblies, we also observe 
the emergence of important global dynamic behaviour 
for the production of complex functions. 

As discussed below, non-linearity is an essential 
ingredient for complex functionality and for complex 
dynamics; there is a clear contrast between linear 
and non-linear dynamic systems 
with respect to their potential for the production of rich 
and diverse behaviour. 

Role of Non-Linear Dynamics in the 
Production of Rich Behaviour 

In linear dynamical systems, both in continuous time 
and in discrete time, the autonomous dynamical behav- 
iour is completely characterized through the system's 
natural modes, either the harmonic oscillatory modes, 
or the exponentially decaying modes (in the theory 
of linear dynamical systems, these are represented by 
frequencies and complex frequencies). The possible 
dynamic outcomes in linear systems are thus limited 
to the universe of linear combinations of these natural 
modes. These modes can have their properties of ampli- 
tudes and frequencies controlled through parameters of 
the system, but not their central properties such as the 
nature of the produced waveforms. Since the number 
of natural modes of linear systems is closely related 
to the number of state variables, we have that small 
networks (of linear dynamic elements) can produce 
only limited diversity of dynamical behaviour. 

The scenario becomes completely different in non-linear 
systems. Non-linearity promotes rich dynamic 
behaviour, obtained by changing the stability and 
instability of different attractors. These changes give 
rise to bifurcation phenomena (transitions between 
dynamic modalities with distinct characteristics) 
and therefore to diversity of dynamic behaviour. In 
non-linear systems, we can have a large diversity of 
dynamical behaviours, with the potential production 
of an infinite number of distinct waveforms (or time 
series, for discrete-time systems). This can happen in 
systems with a very small number of state variables: 
just three in continuous time, or just one state variable 
in discrete time, are enough to allow bifurcation among 
different attractors and potential cascades of infinite 
bifurcations leading to chaos. In our context, this 
means obtaining rich attractor behaviour even from 
very simple neural networks (i.e., networks with a 
small number of neurons). 

In summary, the operation of chaotic neural 
networks exploits the concepts of attractors, repellers, 
limit cycles, and stability (see the topic Terms and 
Definitions for details on these concepts) of trajectories 
in the multidimensional state space of the neural 
network and, more specifically, the dense production 
of destabilization of cyclic trajectories with cascading 
to chaotic behaviour. This scenario allows for the blend 
of ordered behaviour and chaotic dynamics, and the 
presence of fractal structure and self-similarity in the 
rich landscape of dynamic attractors. 



MODEL NEURONS WITH RICH 
DYNAMICS, BIFURCATION AND CHAOS 

We can look at the chaotic elements that compose neuro-like 
architectures from several different perspectives. They 
can be seen as emergent units with rich dynamics 
that are produced by the interaction of classical model 
neurons, such as the sigmoidal model neurons based 
on frequency coding (Haykin, 1999), or the 
integrate-and-fire spiking model neurons (Farhat, 1994). They 
can also correspond to the modelling of the dynamical 
behaviour of neural assemblies, approached as a unit 
(Freeman, 1992). Finally, they can be tools for the 
approximate representation of aspects of complex dynamics 
in the nervous system, paying attention mainly to the 
richness of attractors and the blend of ordered and erratic 
dynamics, and not exactly to the details of the biological 
dynamics (DelMoral, 2005; Kaneko, 2001). Below we 
briefly describe some of the relevant model neurons in 
the context of chaotic neural networks. 

Aihara's Chaotic Model Neuron. One important 
work in the context of chaotic neural networks is the 
model neuron proposed by Kazuyuki Aihara and 
collaborators (1990). In it, we have self-feedback of the 
neuron's state variable, representing the refractory 
period in real neurons. This makes rich 
bifurcation and cascading to chaos possible. Their work extends 
previous models in which some elements of dynamics 
were already present. In particular, we have to mention 
the work by Caianiello (1961), in which past 
inputs have an impact on the value of the present state of 
the neuron, and the work by Nagumo and Sato (1972), 
which incorporates an exponentially decaying memory. 
Aihara's model included memory for the inputs of the 
model neuron as well as for its internal state. It also 
included continuous transfer functions, an essential 
ingredient for rich bifurcation, fractal structure and 
cascading to chaos. Equation 1 shows a simplified form 
of this model: $x_i$ is the node state, while the $x_j$ refer to 
neighboring nodes, $t$ is discrete time, $f$ a continuous 
function, $k$ and $k_r$ decay constants, and $w_{ij}$ generic 
coupling strengths: 

$$x_i(t+1) = f\left( \sum_j w_{ij} \sum_d k^d \, x_j(t-d) - \alpha \sum_d k_r^d \, x_i(t-d) \right) \qquad (1)$$



Adachi's Associative Memory. Another proposal 
that can be mentioned here is that of Adachi, 
co-authored with Aihara. It uses Aihara's chaotic neuron 
for the implementation of associative memories, with 
coupling strengths among nodes, $w_{ij}$, given by 
Hebbian-like correlation measures (Adachi, 1997). Equation 
2 defines $w_{ij}$ for this model, in a memory storing $M$ 
binary strings $x^p$. 

$$w_{ij} = \sum_{p=1}^{M} \left( x_i^p - \bar{x} \right) \left( x_j^p - \bar{x} \right) \qquad (2)$$
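
The Hebbian-like correlation rule of Equation 2 can be sketched in Python as follows. The sketch assumes that the rule sums, over the M stored binary strings, the product of the offset-corrected bits of nodes i and j, and that the offset is the mean value of all stored bits; both the reading of the offset and the function name `hebbian_weights` are assumptions made for illustration, not details confirmed by the source.

```python
def hebbian_weights(patterns, offset=None):
    """Return the matrix w[i][j] for a list of equal-length binary patterns."""
    n = len(patterns[0])
    if offset is None:
        # Assumption: the subtracted term is the mean of all stored bits.
        offset = sum(sum(p) for p in patterns) / (len(patterns) * n)
    return [
        [sum((p[i] - offset) * (p[j] - offset) for p in patterns)
         for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical stored binary strings of four bits each.
stored = [[1, 0, 1, 0],
          [1, 1, 0, 0]]

w = hebbian_weights(stored)
for row in w:
    print(row)
```

Nodes that take the same value in most stored strings receive a positive coupling, and nodes that tend to differ receive a negative one, which is the sense in which the rule measures correlation.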



Nabil Farhat's Bifurcating Neuron. Nabil Farhat 
and collaborators introduced the "Bifurcation Neuron", 
in which the phenomena of bifurcation and cascade 
to chaotic dynamics emerge from a pulsed model of 
the "Integrate and Fire" type (Farhat, 1994). This work 
and some of its subsequent developments are directly 
related to a class of association and pattern 
recovery architectures developed by the author of this 
article: chaotic neural networks based on Recursive 
Processing Elements, or RPEs (DelMoral, 2005; 
DelMoral, 2007). RPEs are parametric recursions that 
are coupled through modulation of their bifurcation 
parameters (we say we have parametric coupling), 
for the formation of meaningful collective patterns. 
Node dynamics is mathematically defined through a 
first order parametric recursion: 

$$x(t+1) = R_p\left(x(t)\right) \qquad (3a)$$

example: 

$$x(t+1) = p \, x(t) \left(1 - x(t)\right) \qquad (3b)$$



This recursion $R_p$ (parameterized by the "bifurcation 
parameter" $p$) links consecutive values of the state 
variable $x$, which evolves in discrete time $t$. It is 
worth noting that first order non-linear 
recursions are very simple mathematical systems 
with very rich dynamical behavior (Fig. 1 shows an 
illustrative bifurcation diagram). 
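
The richness of this first order recursion can be observed with a few lines of Python. The sketch below iterates the logistic recursion of Equation 3b, discards an initial transient and collects the distinct values visited afterwards, which is essentially how one column of a bifurcation diagram is obtained. The function names, the initial condition, the transient length and the rounding precision are assumptions chosen for the example.

```python
def logistic(x, p):
    """Logistic recursion R_p(x) = p * x * (1 - x)."""
    return p * x * (1.0 - x)


def long_term_values(p, x0=0.2, transient=1000, keep=64, digits=6):
    """Distinct values visited by the recursion after the transient."""
    x = x0
    for _ in range(transient):
        x = logistic(x, p)
    seen = set()
    for _ in range(keep):
        x = logistic(x, p)
        seen.add(round(x, digits))
    return sorted(seen)


for p in (2.5, 3.2, 3.5, 3.9):
    values = long_term_values(p)
    print(p, len(values), values[:4])
```

For the smaller values of p the recursion settles on a fixed point or on a short cycle (one, two and four distinct values for p = 2.5, 3.2 and 3.5), while for p = 3.9 the visited values do not repeat within the observation window, as expected in the chaotic regime.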

Kunihiko Kaneko's Coupled Map Lattices - 
CMLs. These structures, initially conceived for the 
modelling of spatio-temporal physical systems, employ 
the idea of chaotic recursive maps (similar to the RPEs 
described above) that interact through diffusive-like 
coupling (Kaneko, 2001). Equation 4 represents a 
CML: $i$ identifies a node of a linear lattice, and $\varepsilon$ the 
amount of coupling. 

$$x_i(t+1) = (1-\varepsilon) \, R_p\left(x_i(t)\right) + \frac{\varepsilon}{2} R_p\left(x_{i-1}(t)\right) + \frac{\varepsilon}{2} R_p\left(x_{i+1}(t)\right) \qquad (4)$$

Figure 1. Bifurcation diagram for the logistic recursion, where $R_p(x) = p \, x \, (1-x)$. Horizontal axis: value of the bifurcation parameter $p$. 
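
A coupled map lattice of this kind can be simulated with the short Python sketch below. It assumes that the local map is the logistic recursion and that the lattice is closed into a ring (periodic boundary conditions); the boundary treatment, the function names and the parameter values are assumptions made for illustration.

```python
def logistic(x, p):
    """Logistic recursion R_p(x) = p * x * (1 - x)."""
    return p * x * (1.0 - x)


def cml_step(state, p, eps):
    """One synchronous update of a ring-shaped coupled map lattice."""
    n = len(state)
    mapped = [logistic(x, p) for x in state]
    # Each node keeps (1 - eps) of its own mapped value and receives
    # eps / 2 from each of its two neighbours (index -1 wraps around).
    return [
        (1.0 - eps) * mapped[i]
        + 0.5 * eps * (mapped[i - 1] + mapped[(i + 1) % n])
        for i in range(n)
    ]


# Hypothetical initial lattice state, bifurcation parameter and coupling.
lattice = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
for _ in range(200):
    lattice = cml_step(lattice, p=3.9, eps=0.2)

print([round(x, 3) for x in lattice])
```

With zero coupling the nodes evolve as independent logistic maps. Increasing the coupling lets the neighbours influence each other, which is what makes collective phenomena such as clustering and synchronization possible.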



Non-Linear Dynamics Tools 

The classical tools developed for the study of non-linear 
dynamic systems with bifurcation and diverse behaviour 
(Hilborn, 1994) are important elements in the study and 
characterization of chaotic neural networks (Kaneko, 
2001; DelMoral, 2007). Among the important ones, 
we can mention Bifurcation Diagrams (see Fig. 1), 
which are useful for representing the long-term 
behavior of parametric non-linear dynamical systems, 
as well as their bifurcations 
and their parameter ranges for ordered and 
chaotic behavior. We can also mention Lyapunov 
exponents, for the quantitative evaluation of sensitivity 
to initial conditions; Entropy measures, for the 
quantification of trajectory complexity; Return 
Maps, for the characterization of recursive rules; and 
Web Diagrams, for the illustration of attractor and 
repeller trajectories (Hilborn, 1994; Devaney, 1989; 
Kaneko, 2001). 
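
As an example of one of these tools, the Lyapunov exponent of the logistic recursion can be estimated by averaging the logarithm of the absolute slope of the map along a trajectory: a negative value indicates ordered behavior and a positive value indicates sensitivity to initial conditions. The Python sketch below is a minimal estimate under assumed settings; the function name, the initial condition, the trajectory lengths and the small floor that guards the logarithm are choices made for illustration.

```python
import math


def lyapunov_logistic(p, x0=0.2, transient=500, steps=5000):
    """Estimate the Lyapunov exponent of x -> p * x * (1 - x)."""
    x = x0
    for _ in range(transient):
        x = p * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # Slope of the map at x is p * (1 - 2x); the floor avoids log(0).
        total += math.log(max(abs(p * (1.0 - 2.0 * x)), 1e-12))
        x = p * x * (1.0 - x)
    return total / steps


for p in (2.5, 3.2, 3.9):
    print(p, round(lyapunov_logistic(p), 3))
```

The estimate is negative for p = 2.5 and p = 3.2, where the recursion settles on a fixed point and on a cycle of period two, and positive for p = 3.9, in agreement with the ordered and chaotic regions of the bifurcation diagram.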



COLLECTIVE DYNAMIC BEHAVIOUR, 
ATTRACTOR NETWORKS AND 
APPLICATION SCENARIOS 

A more complex and richer scenario can be created 
through the coupling of several units having rich dy- 
namic behavior at the single node level. The follow- 
ing paragraphs detail some of the emergent collective 
phenomena that appear in networks of coupled chaotic 
elements and are explored for information coding and 
processing: 

Multidimensional attractors. In multidimensional 
attractor behavior, similarly to attractors at 
the single node level (see the entry in the topic Terms 
and Definitions, at the end of this article), the 
network state evolves in time towards 
a limited repertoire of preferential collective 
trajectories, which emerge in the long term. The 
concept of Multidimensional Attractors is central 
to the Attractor Networks paradigm: high-dimensional 
dynamical systems whose long-term states 
represent relevant information. In the particular 
case of chaotic neural networks, the evolution 
of the network state is usually composed of an 
initial chaotic phase and a gradual approach 
to ordered limit cycles (Kaneko, 2001; DelMoral, 
2005; Freeman, 1992). 

Clustering of nodes' activities. Here we have, 
due to coupling and network self organization, 
the formation of groups of nodes exhibiting 
activities which are identical or similar in some 
sense (Kaneko, 2001). 

Synchronization of the nodes' cycling (or phase 
locking). In this type of collective phenomenon, 
which is a particular case of clustering, the cyclic 
activities of nodes belonging to a cluster have 
the same period, and they operate with constant 
relative phases (DelMoral, 2005; DelMoral, 
2007). 

With collective structures (multiple coupled 
neurons), complex functionalities can potentially 
be implemented, through the exploration of the 
multidimensional nature of the state variables: 
image understanding, processing of multiple sensory 
information, multidimensional logical reasoning, 
complex motor control, memory, association, hetero- 
association, decision making and pattern recognition 
(DelMoral, 2007). In networks of coupled elements 
with rich dynamics, the above collective phenomena 
are explored for the representation and processing 
of meaningful information. A stored memory, in 
association and pattern recovery tasks, or a class 
label, in pattern recognition tasks, for example, can be 
represented through the specific collective attractors 
of the network (DelMoral, 2005). Concretely, the 
representation of information can happen through 
different quantitative features of the relevant attractors. 
We can have the coding of analog information through 
the amplitude of oscillations of the state variables, or 
even through the sequences of values visited by the state 
variables in limit cycles. We can also have the coding of 
class labels through the periods of closed trajectories, 
through the phase of cycling of closed trajectories, 
or even through mixed forms involving several of 
these coding modalities. In addition, clustering and 
synchronization can be used for spatial segmentation 
of information. 

Blend of Order and Complex Dynamics 

A macroscopic phenomenon that relates to the global 
behavior of a coupled structure composed of several 
neurons with rich dynamics (such as the models described 
in previous topics) is the interplay between ordered 
behavior and disordered behavior. In many 
circumstances, we can look at the network's state evolution 
as switching between situations of ordered behavior 
and situations of apparently erratic behavior. The blend 
of ordered and erratic behaviour is exploited for the 
representation of meaningful information (order) and 
the rich search in the state space for stored patterns 
(chaotic search). 

This blend of ordered and erratic behaviour 
appears in many different classes of model neurons 
and associated architectures, such as, for example, 
the Bifurcation Neuron, K-sets structures, RPE 
architectures and many others. 



FUTURE TRENDS 

Since chaotic neural networks are a relatively recent 
subject of research, there are many different directions 
in which the field can potentially progress. We will just 
mention some of these directions. 

An important possibility that we can identify is 
the exploration of rich dynamics, fractal structure and 
diversity of dynamical behaviour in the modelling 
and emulation of higher cognitive functions. We can 
mention, for example, that part of the current research 
on consciousness and cognition addresses the possible 
roles of complex dynamics in these high-level functions 
(Perlovsky & Kozma, 2007). 

We also see spiking model neurons, a fast-growing 
research area, and neural oscillators as natural scenarios 
for the emergence of rich dynamic phenomena and as 
potential substrates for computing with rich dynamics 
(Gerstner, 2002). We add that there are several efforts 
towards the implementation of spiking models in electronic 
form (DelMoral, 2003); these efforts can also have 
an important role in the context of brain-computer 
interfaces based on microelectrode arrays and associated 
electronics, another fast-growing research area. 



CONCLUSION 

Chaotic Neural Networks and the associated chaotic 
model neurons discussed here are part of a current 
trend in neural modelling and artificial neurocomputing 
that moves the emphasis of artificial model neurons, 
from functional analysis and functional synthesis, 
particularly evident in neural architectures such as the 
MLPs, to a more balanced blend which involves also 
elements of neurodynamics and explores extensively 
the paradigm of attractor networks. With this ongoing 
move, the tendency is to have more powerful model- 
ling tools for the study of the nervous system and more 
powerful elements for the development of neuro-like 
computing environments. 
KEY TERMS

Artificial Neurons / Model Neurons: Mathematical
description of the biological neuron, as regards the
representation and processing of information. These
models are the processing elements that compose an 
artificial neural network. In the context of chaotic neural 
networks, these models include the representation of 
aspects of complex neurodynamics. 

Attractors, Repellers and Limit Cycles: These 
three concepts are related to the concept of dynamic 
modality and they regard the long-term behaviour of 
a dynamical system. Attractors are trajectories of the 
system state variable that emerge in the long-term, 
with relative independence with respect to the exact 
values of the initial conditions. These long-term trajectories
can be either a point in the state space (a static
asymptotic behaviour), named fixed-point, a cyclic 
pattern (named limit cycle), or even a chaotic trajec- 
tory. Repellers correspond, qualitatively speaking, to 
the opposite behaviour of attractors : given a fixed-point 
or a cyclic trajectory of a dynamic system, they are 
called repeller-type trajectories if small perturbations 
can make the system evolve to trajectories that are far 
from the original one. 

Bifurcation and Diverse Dynamics: The concept 
of bifurcation, present in the context of non-linear 
dynamic systems and theory of chaos, refers to the 
transition between two dynamic modalities qualitatively 
distinct; both of them are exhibited by the same dynamic 
system, and the transition (bifurcation) is promoted by 
the change in value of a relevant numeric parameter 
of such system. Such parameter is named "bifurcation 
parameter", and in highly non-linear dynamic systems, 
its change can produce a large number of bifurcations 
between distinct dynamic modalities, with self-similar- 
ity and fractal structure. In many of these systems, we 
have a cascade of numberless bifurcations, culminating 
with the production of chaotic dynamics. 

Chaotic Dynamics: Dynamics with specific features
indicating complex behaviour, only produced in
highly non-linear systems. These indicative features,
formalized by the discipline of "Theory of Chaos",
are high sensitivity to initial conditions, non-periodic
behaviour, and production of a large number of different
trajectories in the state space, according to the change
of some meaningful parameter of the dynamical system
(see Bifurcation and Diverse Dynamics). For some
of the tools related to chaotic dynamics, see the related
topic in the main text: Non-linear dynamics tools.

Chaotic Model Neurons: Model neurons that in- 
corporate aspects of complex dynamics observed either 
in the isolated biological neuron or in assemblies of 
several biological neurons. Some of the models with 
complex dynamics, mentioned in the main text of this 
article, are the Aihara's model neuron, the Bifurcation 
Neuron proposed by Nabil Farhat, RPEs networks, 
Kaneko's CMLs, and Walter Freeman's K Sets. 



Spatio-Temporal Collective Patterns: The ob- 
served dynamic configurations of the collective state 
variable in a multi neuron arrangement (network). The 
temporal aspect comes from the fact that in chaotic 
neural networks the model neurons' states evolve in 
time. The spatial aspect comes from the fact that the 
neurons that compose the network can be viewed as 
sites of a discrete (grid-like) spatial structure. 

Stability: The study of repellers and attractors 
is done through stability analysis, which quantifies 
how infinitesimal perturbations in a given trajectory 
performed by the system are either attenuated or am- 
plified with time. 
INTRODUCTION

Machine learning has provided powerful algorithms 
that automatically generate predictive models from 
experience. One specific technique is supervised learn- 
ing, where the machine is trained to predict a desired 
output for each input pattern x. This chapter will focus 
on classification, that is, supervised learning when the 
output to predict is a class label. An example is predicting
whether a patient in a hospital will develop cancer
or not. In this example, the class label c is a variable
having two possible values, "cancer" or "no cancer",
and the input pattern x is a vector containing patient
data (e.g. age, gender, diet, smoking habits, etc.). In
order to construct a proper predictive model, supervised
learning methods require a set of examples $x_i$ together
with their respective labels $c_i$. This dataset is called
the "training set". The constructed model is then used
to predict the labels of a set of new cases $x_j$, called the
"test set". In the cancer prediction example, this is
the phase when the model is used to predict cancer in
new patients.

One common assumption in supervised learning
algorithms is that the statistical structure of the training
and test datasets is the same (Hastie, Tibshirani
& Friedman, 2001). That is, the test set is assumed
to have the same attribute distribution p(x) and the same
class distribution p(c|x) as the training set. However,
this is not usually the case in real applications, for
several reasons. For instance, in many problems the
training dataset is obtained in a specific manner that
differs from the way the test dataset will be generated
later. Moreover, the nature of the problem may evolve
in time. These phenomena cause $p_{Tr}(x, c) \neq p_{Test}(x, c)$,
which can degrade the performance of the model
constructed in training.

Here we present a new algorithm that re-estimates
a model constructed in training using the
unlabelled test patterns. We show the convergence
properties of the algorithm and illustrate its performance
with an artificial problem. Finally we demonstrate its
strengths in a heart disease diagnosis problem where
the training set is taken from a different hospital than
the test set.



BACKGROUND 

In practical problems, the statistical structure of training
and test sets can be different, that is, $p_{Tr}(x, c) \neq p_{Test}(x, c)$.
This effect can have different causes. One is a bias in
the sampling selection of the training set (Heckman,
1979; Salganicoff, 1997). Another possible cause is that
training and test sets are related to different contexts.
For instance, a heart disease diagnosis model may be used
in a hospital different from the one where the training
dataset was collected. Then, if the hospitals are located in
cities where people have different habits, average age,
etc., the test set will have a different statistical
structure than the training set.

The special case $p_{Tr}(x) \neq p_{Test}(x)$ and $p_{Tr}(c \mid x) = p_{Test}(c \mid x)$
is known in the literature as "covariate shift" (Shimodaira,
2000). In the context of machine learning, covariate
shift can degrade the performance of standard
machine learning algorithms. Different techniques have
been proposed to deal with this problem, see for example
(Heckman, 1979; Salganicoff, 1997; Shimodaira, 2000;
Sugiyama, Krauledat & Müller, 2007). Transductive
learning has also been suggested as another way to
improve performance when the statistical structure of
the test set is shifted with respect to the training set
(Vapnik, 1998; Chen, Wang & Dong, 2003; Wu, Bennett,
Cristianini & Shawe-Taylor, 1999).

The statistics of the patterns x can also change in
time, for example in a company that has a continuous
flow of new and leaving clients (figure 1). If we are
interested in constructing a model for prediction, the
statistics of the clients when the model is exploited will
differ from the statistics in training. Finally, often the
concept to be learned is not static but evolves in time
(for example, predicting which emails are spam or not),
causing $p_{Tr}(x, c) \neq p_{Test}(x, c)$. This problem is known
as "concept drift" and different algorithms have been
proposed to cope with it (Black & Hickey, 1999; Wang,
Fan, Yu, & Han, 2003; Widmer & Kubat, 1996).

Figure 1. Changes across time of the statistics of clients in a car insurance company. The histograms of two different
variables (a, b) related to the clients' use of their insurance are shown. Dash: data collected four months
later than data shown in solid.



A NEW ALGORITHM FOR
CONSTRUCTING CLASSIFIERS
WHEN TRAINING AND TEST SETS
HAVE DIFFERENT DISTRIBUTIONS

Here we present a new learning strategy for problems
where the statistical distributions of the training and
test sets are different. This technique can be used in
problems where concept drift, sampling biases, or
any other phenomena exist that cause the statistical
structure of the training and test sets to be different.
On the other hand, our strategy constructs an explicit
estimation of the statistical structure of the problem in
the test data set. This allows us to construct a classifier
that is optimized with respect to the new test statistics,
and provides the user with relevant information about
which aspects of the problem have changed.



Algorithm 

1. Construct a statistical model {P(x | c), P(c)} for
the training set using a standard procedure (for
example, using the standard EM algorithm).

2. Re-estimate this statistical model using the
unlabelled patterns x of the test set. For this purpose,
we have developed a semi-supervised extension
of EM.

3. Use this re-estimated statistical model {P'(x | c),
P'(c)} to construct a classifier optimized for the test
set.



Model Re-Estimation: A Semi-Supervised Extension of EM

The standard EM algorithm (Dempster, Laird & Rubin, 
1977) is an iterative procedure designed to find the 
optimal parameters of a statistical model in a maximum 
likelihood sense. Here we present an extension of the 
EM algorithm that re-estimates the statistical model 
learned in training using the unlabelled test set, under 
the assumption that it should resemble the statistical 
model learned at training. That is, the algorithm finds 
the minimum amount of change that one has to assume 
for the model constructed in training (where we know 
the pattern classes) to explain the global distribution 
of attributes x in the test set. 

Let us call $\theta$ the set of different parameters in the
statistical model of our problem. For example, if we
model $P(x \mid c = 1)$ and $P(x \mid c = 2)$ by a mixture of
two and three Gaussians respectively, then $\theta$ would
be composed of the averages, covariance matrices and
prior probabilities of the 2+3 different Gaussians in the model.
An estimate $\theta_{Tr}$ should first be made on the training
set using a standard technique such as applying EM
to each individual class. Then, we should recalculate
it as $\theta_{Te}$ using the unlabelled test set by maximizing the
probability $P(\theta_{Te} \mid D_{Te}, \theta_{Tr})$, where $D_{Te}$ is the unlabelled
test set (test patterns without the class information).
The maximization of this quantity with respect to $\theta_{Te}$ is
equivalent to maximizing the quantity

$$L' = \ln P(D_{Te} \mid \theta_{Te}) + \ln P(\theta_{Te} \mid \theta_{Tr})$$

which we will call the "extended log-likelihood".
The term $P(\theta_{Te} \mid \theta_{Tr})$ implements the bias in the
re-estimation of the parameters, which in our case is a
preference for small changes. Using this extended
log-likelihood it is possible to derive a new version of
EM that maximizes $L'$.
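
The equivalence stated above can be made explicit with Bayes' rule. The short derivation below is a sketch, writing $\theta_{Tr}$ and $\theta_{Te}$ for the training and test parameter sets and $D_{Te}$ for the unlabelled test set. It assumes that, once $\theta_{Te}$ is given, the test data do not depend on $\theta_{Tr}$; this assumption is not spelled out in the text but appears to be implied by the form of the extended log-likelihood.

```latex
\begin{align}
P(\theta_{Te} \mid D_{Te}, \theta_{Tr})
  &= \frac{P(D_{Te} \mid \theta_{Te}, \theta_{Tr})\, P(\theta_{Te} \mid \theta_{Tr})}
          {P(D_{Te} \mid \theta_{Tr})} \\
  &\propto P(D_{Te} \mid \theta_{Te})\, P(\theta_{Te} \mid \theta_{Tr}) \\
\ln P(\theta_{Te} \mid D_{Te}, \theta_{Tr})
  &= \underbrace{\ln P(D_{Te} \mid \theta_{Te}) + \ln P(\theta_{Te} \mid \theta_{Tr})}_{L'}
     + \text{const.}
\end{align}
```

The denominator does not depend on $\theta_{Te}$, so it only contributes the additive constant and can be ignored in the maximization.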

To achieve this we consider, as in standard EM
applications, the existence of additional but latent (or
'hidden') variables h in the problem (Dempster, Laird
& Rubin, 1977). For example, if we model the
statistics as a mixture of Gaussians, the latent variable
indicates which of the Gaussians actually generated
the pattern. The parameters of our statistical model
are then of two types: those, $\alpha$, affecting the probability
distributions of the hidden variables h, and the
remaining parameters of the model, $\beta$. Therefore,
$\theta = \{\alpha, \beta\}$. For instance, if the statistical model is a
mixture of Gaussians, $\alpha$ contains the prior probabilities
of the different Gaussians, and $\beta$ is composed of the
averages and covariance matrices. Finally, we assume
that the penalization term can be written as:

$$\ln P(\theta_{Te} \mid \theta_{Tr}) = \ln P(\alpha_{Te} \mid \alpha_{Tr}) + \ln P(\beta_{Te} \mid \beta_{Tr})$$

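We are now in a position to develop the semi-supervised
extension of EM. Following the same scheme of
reasoning as in exercise 44 of chapter 3 in (Duda, Hart
& Stork, 2001) we arrive at the following algorithm:

1. Initialize $\alpha'_{Te} \leftarrow \alpha_{Tr}$ and $\beta'_{Te} \leftarrow \beta_{Tr}$.

2. (E step): Compute for all h and $x_i$:

$$P(h \mid x_i, \alpha'_{Te}, \beta'_{Te}) = \frac{P(x_i \mid h, \beta'_{Te}) \cdot P(h \mid \alpha'_{Te})}{\sum_{h'} P(x_i \mid h', \beta'_{Te}) \cdot P(h' \mid \alpha'_{Te})}$$

3. (M step):

Calculate the $\alpha^*_{Te}$ that maximizes the following
quantity (fixing $\alpha'_{Te}$ and $\beta'_{Te}$):

$$\ln P(\alpha^*_{Te} \mid \alpha_{Tr}) + \sum_{h, x_i} P(h \mid x_i, \alpha'_{Te}, \beta'_{Te}) \cdot \ln P(h \mid \alpha^*_{Te})$$

Calculate the $\beta^*_{Te}$ that maximizes the following
quantity (fixing $\alpha'_{Te}$ and $\beta'_{Te}$):

$$\ln P(\beta^*_{Te} \mid \beta_{Tr}) + \sum_{h, x_i} P(h \mid x_i, \alpha'_{Te}, \beta'_{Te}) \cdot \ln P(x_i \mid h, \beta^*_{Te})$$

4. Update the parameters: $\alpha'_{Te} \leftarrow \alpha^*_{Te}$ and
$\beta'_{Te} \leftarrow \beta^*_{Te}$.

5. Go to step 2 until convergence of $L'$.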

In this derivation it is also guaranteed that $L'$ does
not decrease at each step, so our algorithm will find
at least a local optimum of $L'$. In addition, the
penalization term ensures the stability of the algorithm
(small changes in test data lead to small changes
in the re-estimation), and makes it possible to associate the
correct class with the different clusters. Our algorithm
contains as a special case the standard EM algorithm
when the distribution $P(\theta_{Te} \mid \theta_{Tr})$ is not considered or
is assumed to be homogeneous.

In the examples we will show in this chapter we
use a simple case of this algorithm where the statistical
model consists of one Gaussian per class, and only
the averages of the Gaussians are re-estimated. The
penalization term $\ln P(\theta_{Te} \mid \theta_{Tr})$ used in these examples
is proportional to the Mahalanobis distance (Duda,
Hart & Stork, 2001) between the averages estimated
in training and those re-estimated in test (we will refer
to the proportionality factor as $\gamma$). Then we arrive at the
following simplified algorithm:



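1. Initialize $\mu'_{1,Te} \leftarrow \mu_{1,Tr}$ and $\mu'_{2,Te} \leftarrow \mu_{2,Tr}$.

2. (E step): Compute for c = 1, 2 and all $x_i$ in the test
dataset:

$$P(c \mid x_i) = \frac{P(c) \cdot p(x_i \mid \mu'_{c,Te}, M_c)}{P(1) \cdot p(x_i \mid \mu'_{1,Te}, M_1) + P(2) \cdot p(x_i \mid \mu'_{2,Te}, M_2)}$$

3. (M step):

$$\mu'_{c,Te} \leftarrow \frac{\gamma \cdot (\mu_{c,Tr} + d) + \frac{1}{N_{Te}} \sum_{x_i} P(c \mid x_i) \cdot x_i}{\gamma + \frac{1}{N_{Te}} \sum_{x_i} P(c \mid x_i)}$$

and go to step 2 until convergence of $L'$.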



where $c \in \{1, 2\}$ is the class of the pattern; P(c) is the
prior probability of class c estimated in the training
set; $\mu_{c,Tr}$ and $\mu_{c,Te}$ are the averages of the patterns of
class c in training and test sets respectively; $M_c$ is the
covariance matrix of class c estimated in the training
set; $p(x \mid \mu, M)$ is the probability density function of
x given that it has been generated by a Gaussian
process of average $\mu$ and covariance matrix M; d is the
difference between the global average of the attributes in
test and the global average of the attributes in training;
and $N_{Te}$ is the number of patterns in test.
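
The simplified algorithm can be sketched in a few lines of Python. The version below is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single scalar attribute (so each covariance matrix reduces to a variance), it takes d as the global test average minus the global training average, it normalizes the M step by gamma plus the mean posterior so that the update is a weighted average, and it stops when the means stop changing rather than monitoring the extended log-likelihood. The names `gauss_pdf` and `reestimate_means` and the toy data are hypothetical.

```python
import math
import random


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reestimate_means(x_test, mu_tr, var_tr, priors, d, gamma, n_iter=200, tol=1e-9):
    """Re-estimate the class means on unlabelled test data (scalar attribute).

    mu_tr, var_tr, priors: per-class mean, variance and prior from training.
    d: shift of the global test average relative to training (assumed sign).
    gamma: weight of the penalization term (preference for small changes).
    """
    n = len(x_test)
    n_classes = len(mu_tr)
    mu = list(mu_tr)                      # step 1: start from the training means
    for _ in range(n_iter):
        # E step: posterior P(c | x_i) under the current test means
        post = []
        for x in x_test:
            joint = [priors[c] * gauss_pdf(x, mu[c], var_tr[c]) for c in range(n_classes)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M step: penalized weighted average of the test patterns
        new_mu = []
        for c in range(n_classes):
            w = sum(p[c] for p in post) / n
            wx = sum(p[c] * x for p, x in zip(post, x_test)) / n
            new_mu.append((gamma * (mu_tr[c] + d) + wx) / (gamma + w))
        converged = max(abs(a - b) for a, b in zip(new_mu, mu)) < tol
        mu = new_mu
        if converged:                     # simplified stopping rule (assumption)
            break
    return mu


# Toy usage: class 2 has drifted from 3.0 in training to 5.0 in test.
rng = random.Random(0)
x_test = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(300)]
d = sum(x_test) / len(x_test) - 1.5      # 1.5 = global training average for equal priors
mu_te = reestimate_means(x_test, [0.0, 3.0], [1.0, 1.0], [0.5, 0.5], d, gamma=0.01)
print(mu_te)
```

A small gamma lets the test data dominate, so the re-estimated means follow the shifted clusters; a very large gamma keeps them pinned near the training means displaced by d.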

Application to a Synthetic Problem 

First we will illustrate our algorithm using a simple
synthetic problem with two classes (figure 2). In training,
class "grey" was generated from a Gaussian distribution
with average [1.0 ; 1.0] and covariance matrix [0.5, 0.0
; 0.0, 0.5]; class "black" was generated from a Gaussian
distribution with average [1.0 ; 1.5] and covariance
matrix [0.5, 0.4 ; 0.4, 0.5]. The statistical structure of
the test set differs from that of the training set due to a shift
of class "black" (figure 2a and 2b), now with average
[2.0 ; 1.0]. The minimum Bayes errors of the training and
test sets are 27.7% and 18.8% respectively.
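
For readers who want to reproduce a problem of this kind, the following sketch draws the two classes with the averages and covariance matrices given above. The sample size per class, the random seed and the assumption that class "black" keeps its training covariance in the test set are not stated in the text and are chosen here only for illustration; `sample_gaussian_2d` is a hypothetical helper.

```python
import math
import random


def sample_gaussian_2d(rng, mean, cov, n):
    """Draw n points from a 2-D Gaussian using a hand-written 2x2 Cholesky factor."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return points


rng = random.Random(0)   # fixed seed, illustrative only
N = 1000                 # patterns per class: an assumption, not given in the text
cov_grey = [[0.5, 0.0], [0.0, 0.5]]
cov_black = [[0.5, 0.4], [0.4, 0.5]]

train_grey = sample_gaussian_2d(rng, (1.0, 1.0), cov_grey, N)
train_black = sample_gaussian_2d(rng, (1.0, 1.5), cov_black, N)
test_grey = sample_gaussian_2d(rng, (1.0, 1.0), cov_grey, N)
test_black = sample_gaussian_2d(rng, (2.0, 1.0), cov_black, N)   # shifted class
```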

First we construct a statistical model of the training
set consisting of one Gaussian for each class (figure
2c). This statistical model is then used to construct a
classifier based on Linear Discriminant Analysis (LDA).
The error of this classifier in training is 34.1%. When
applied to the test set, the error of this classifier increases
to 66.0%. Our algorithm was then used to re-estimate
the statistical model constructed in training using the
unlabelled test set patterns (figure 2d). We can observe
that our algorithm finds the appropriate statistical model
for the test set (figure 2e). In fact, when we recalculated
the LDA classifier using the re-estimated statistical
model, the error in test decreased to 20.5%.

Application to Heart Disease Diagnosis 
when the Training and Test Hospitals are 
Different 

We tested our algorithm using the Heart disease database
from the UCI machine learning repository (Asuncion
& Newman, 2007). The goal is to predict whether a
patient has heart disease or not, given some personal
data (age, sex, smoking habits, etc.) and the results of
several medical examinations such as blood pressure
and electrocardiograms. The Heart disease database
is in fact the union of four datasets, each one obtained
at a different hospital. One would expect that the
statistical structure of the problem is different in different
hospitals since the patients live in different cities (and
thus the environmental factors can be different), the
measurement devices are not exactly the same, etc.
Thus we expect that our algorithm will improve the
classification performance of a model constructed in
a different hospital.

We have checked this hypothesis using the data from
the Cleveland Clinic Foundation as training, and the data
from the Hungarian Institute of Cardiology as test. In
both cases we removed the attributes in columns 12 and
13 since they are frequently missing in the Hungarian
dataset, and performed a normalization in each set so
the attributes have zero mean and unit variance. The
result is 297 examples from the Cleveland Clinic (149
of class 1, 148 of class 2), and 294 from the Hungarian
Institute of Cardiology (188 of class 1, 106 of class 2).
First, we use PCA to reduce the dimensionality of the
problem in the training set. Therefore the statistical
model of our problem now consists of the average and
covariance matrix of each class in the principal axes.
Then, a simple classifier that assigns to pattern x the
class with nearest center is considered. The number of
principal components is automatically selected on the
basis of the performance of the classifier on a random
subset of the training dataset which was reserved for
this task. Once the statistical model of the training set is
constructed, we evaluated the performance of the classifier
on the test set obtaining an error rate of 18.7%.

Figure 2. Re-estimation of the statistical model of the problem in a synthetic example. The two different classes
are shown as two different clouds (grey and black). (a, b) The statistical structure of the problem in training is
different than in test. (c) Statistical model constructed from the training set (dashed: model for class "grey";
solid: model for class "black"). (d) Recalculation of the statistical model using the unlabelled test set (note that
we have drawn all patterns in grey since the re-estimation algorithm does not have access to the classes of the
test patterns). (e) Re-estimated statistical model superposed with the labelled test set (dashed: model for class
"grey"; solid: model for class "black").

In a second step we used our algorithm for re-estimating
the statistical model using $\gamma = 0.01$, and then
classified the test patterns using the same procedure as
before, that is, assigning to each test pattern the class
with nearest center. The error was reduced in this case
to 15.3%. If we repeat the same strategy using a more
elaborate classifier such as one based on Linear
Discriminant Analysis, we observe a reduction in the error
from 17.0% to 13.9% after using our algorithm.
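
The nearest-center rule used in both steps is simple enough to sketch directly. The example below assumes Euclidean distance in the (already reduced) attribute space; the function names and the toy centers and patterns are made up for illustration and are unrelated to the Heart disease data.

```python
import math


def nearest_center(x, centers):
    """Return the label whose center is closest to x in Euclidean distance."""
    return min(centers, key=lambda label: math.dist(x, centers[label]))


def error_rate(patterns, labels, centers):
    """Fraction of patterns whose nearest center disagrees with the true label."""
    wrong = sum(nearest_center(x, centers) != c for x, c in zip(patterns, labels))
    return wrong / len(patterns)


# Toy illustration: the center of class 2 has moved between training and test.
centers_train = {1: (0.0, 0.0), 2: (2.0, 0.0)}
centers_reestimated = {1: (0.0, 0.0), 2: (4.0, 0.0)}   # hypothetical re-estimate
patterns = [(0.1, 0.0), (1.8, 0.0), (3.9, 0.0), (4.2, 0.1)]
labels = [1, 1, 2, 2]

print(error_rate(patterns, labels, centers_train))
print(error_rate(patterns, labels, centers_reestimated))
```

Only the centers change between the two calls, which mirrors the procedure in the text: the classifier is the same, but it is rebuilt from the re-estimated statistical model.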



FUTURE TRENDS 

In the examples we have presented here we have used
a simplified version of our algorithm that assumes a
statistical model consisting of one Gaussian for each
class, and that only the averages need to be re-estimated
with test. However, our semi-supervised extension
of EM is a general algorithm that is applicable to
arbitrary parametric statistical models (e.g. mixtures
of an arbitrary number of non-Gaussian models), and
allows any parameter of the model to be re-estimated.
Future work will include the study of the performance
and robustness of different types of statistical models
in practical problems. In addition, since our
approach is based on an extension of the extensively
studied standard EM algorithm, we expect that analytical
results such as generalization error bounds can be
derived, making the algorithm attractive also from a
theoretical perspective.



CONCLUSION 

We have presented a new learning strategy for
classification problems where the statistical structure
of training and test sets is different. In this kind of
problem, the performance of traditional machine
learning algorithms can be severely degraded. Different
techniques have been proposed to cope with
different aspects of the problem, such as strategies for
addressing the sample selection bias (Heckman, 1979;
Salganicoff, 1997; Shimodaira, 2000; Zadrozny, 2004;
Sugiyama, Krauledat & Müller, 2007), strategies for
addressing the concept drift (Black & Hickey, 1999;
Wang, Fan, Yu & Han, 2003; Widmer & Kubat, 1996)
and transductive learning (Vapnik, 1998; Wu, Bennett,
Cristianini & Shawe-Taylor, 1999).

The learning strategy we have presented addresses
all these different aspects, so that it can be used
in problems where concept drift, sampling biases, or any
other phenomena exist that cause the statistical structure
of training and test sets to be different. Moreover, our
algorithm constructs an explicit statistical model of
the new structure in the test dataset, which is of great
value for understanding the dynamics of the problem
and exploiting this knowledge in practical applications.
To achieve this we have extended the EM algorithm,
obtaining a semi-supervised version that re-estimates
the statistical model constructed in training
using the attribute distribution of the unlabelled test
set patterns.
KEY TERMS

Attribute: Each of the components that constitute 
an input pattern. 

Classifier: Function that associates a class c to each
input pattern x of interest. A classifier can be directly
constructed from a set of pattern examples with their
respective classes, or indirectly from a statistical model
P(x, c).



EM (Expectation-Maximization Algorithm):
Standard iterative algorithm for estimating the parameters
of a parametric statistical model. EM finds the
specific parameter values that maximize the likelihood
of the observed data D given the statistical model,
$P(D \mid \theta)$. The algorithm alternates between the Expectation
step and the Maximization step, finishing when
$P(D \mid \theta)$ meets some convergence criterion.

Missing Value: Special value of an attribute that
denotes that it is not known or cannot be measured.

Semi-Supervised Learning: Machine learning 
technique that uses both labelled and unlabelled data 
for constructing the model. 

Statistical Model: Mathematical function that models
the statistical structure of the problem. For classification
problems, the statistical model is P(x, c) or equivalently
{P(x | c), P(c)}, since P(x, c) = P(x | c) · P(c).

Supervised Learning: Type of learning where the
objective is to learn a function that associates a desired
output ('label') to each input pattern. Supervised learn-
ing techniques require a training dataset of examples 
with their respective desired outputs. Supervised learn- 
ing is traditionally divided into regression (the desired 
output is a continuous variable) and classification (the 
desired output is a class label). 

Training/Test Sets: In the context of this chapter,
the training set is composed of all labelled examples
that are provided for constructing a classifier. The test
set is composed of the new unlabelled patterns whose
classes should be predicted by the classifier. 
INTRODUCTION

Important insights into gene function can be gained by 
gene expression analysis. For example, some genes are 
turned on (expressed) or turned off (repressed) when 
there is a change in external conditions or stimuli. 
The expression of one gene is often regulated by the 
expression of other genes. A detailed analysis of gene
expression information will provide an understanding 
about the inter-networking of different genes and their 
functional roles. 

DNA microarray technology allows massively
parallel, high throughput genome-wide profiling of
gene expression in a single hybridization experiment
(Lockhart & Winzeler, 2000). It has been widely used
in numerous studies over a broad range of biological
disciplines, such as cancer classification (Armstrong et
al., 2002), identification of genes relevant to a certain
diagnosis or therapy (Muro et al., 2003), investigation
of the mechanism of drug action and cancer prognosis
(Kim et al., 2000; Duggan et al., 1999). Due to the large
number of genes involved in microarray experiments
and the complexity of biological networks, clustering
is an important exploratory technique for gene
expression data analysis. In this article, we present a
succinct review of some of our work in cluster analysis
of gene expression data.



BACKGROUND 

Cluster analysis is a fundamental technique in exploratory
data analysis (Jain & Dubes, 1988). It aims at
finding groups in a given data set such that objects in
the same group are similar to each other while objects in
different groups are dissimilar. It aids in the discovery of
gene function because genes with similar gene expression
profiles can be an indicator that they participate
in related cellular processes. Clustering of genes may
suggest possible roles for genes with unknown functions
based on the known functions of some other genes in
the same cluster. Clustering of gene expression data
has been applied to, for example, the study of temporal
expression of yeast genes in sporulation (Chu et al.,
1998), the identification of gene regulatory networks
(Chen, Filkov, & Skiena, 1999), and the study of cancer
(Tamayo et al., 1999).

Many clustering algorithms have been applied to 
the analysis of gene expression data (Sharan, Elkon, 
& Shamir, 2002). They can be broadly classified as 
either hierarchical or partition-based depending on how 
they group the data. Hierarchical clustering is further 
subdivided into agglomerative methods and divisive 
methods. The former proceed by successive merging 
of the N objects into larger groups, whereas the latter 
divide a larger group successively into finer group- 
ings. Agglomerative techniques are more common in 
hierarchical clustering. 

Hierarchical clustering was among the first clustering
techniques applied to gene expression data
(Eisen et al., 1998). In hierarchical clustering, each
gene expression profile is initially considered as a
cluster. Then, the pair of clusters with the smallest
distance between them is merged to form
a single cluster. This process is repeated until there
is only one cluster left. The hierarchical clustering
algorithm arranges the gene expression data into a
hierarchical tree structure known as a dendrogram,
which allows easy visualization and interpretation of
results. However, the hierarchical tree cannot indicate
the optimal number of clusters in the data. The user
has to interpret the tree topologies and identify branch
points that segregate clusters of biological relevance. In
addition, once a data point is assigned to a node in the tree,
it cannot be reassigned to a different node even if
it is later found to be closer to that node.
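
The merging procedure described above can be sketched as follows. The text does not say how the distance between two clusters is measured, so average linkage with Euclidean distance is assumed here; the function names and the toy profiles are illustrative only. The returned merge history is the information a dendrogram displays.

```python
import math
from itertools import combinations


def average_linkage(a, b, data):
    """Mean Euclidean distance between all cross-cluster pairs of profiles."""
    return sum(math.dist(data[i], data[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(data):
    """Merge the closest pair of clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(data))]   # every profile starts as its own cluster
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], data))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)                    # a merge is never undone later
        merges.append((a, b))
    return merges


# Made-up two-point "expression profiles"
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0), (20.0, 0.0)]
for left, right in agglomerate(profiles):
    print(left, "+", right)
```

Because merged clusters are only ever combined further, the sketch also shows the limitation noted in the text: an early assignment cannot be revised.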

In partition-based clustering algorithms, such as
K-means clustering (Jain & Dubes, 1988), the number
of clusters is fixed by the user at the start. Setting
the correct number of clusters can be a difficult
problem and many heuristics are used. The basic idea
of K-means clustering is to partition the data into a
predefined number of clusters such that the variability of
the data within each cluster is minimized. Clustering is
achieved by first generating K random cluster centroids,
then alternately updating the cluster assignment of
each data vector and the cluster centroids. The Euclidean
distance is usually employed in K-means clustering to
measure the closeness of a data vector to the cluster
centroids. However, such a distance metric inevitably
imposes an ellipsoidal structure on the resulting clusters.
Hence, data that do not conform to this structure are
poorly clustered by the K-means algorithm.
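
A minimal sketch of the alternating procedure is given below. It assumes one common variant in which the K initial centroids are data vectors drawn at random, and it stops when the assignments no longer change; both choices, together with the function name `kmeans` and the toy data, are assumptions made for illustration.

```python
import math
import random


def kmeans(data, k, rng, n_iter=100):
    """Plain K-means with Euclidean distance; returns (centroids, assignments)."""
    centroids = rng.sample(data, k)       # initial centroids drawn from the data
    assign = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its closest centroid
        new_assign = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
                      for x in data]
        if new_assign == assign:          # no change: the partition is stable
            break
        assign = new_assign
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [x for x, a in zip(data, assign) if a == j]
            if members:                   # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assign


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assign = kmeans(data, k=2, rng=random.Random(0))
print(assign)
```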

Another approach to clustering is the model-based
approach. In contrast to model-free partition-based
algorithms, model-based clustering uses certain
distribution models for clusters and attempts to optimize
the fit between the data and the model. Each cluster is
represented by a parametric distribution, like a Gaussian,
and the entire data set is modeled by a mixture of these
distributions. The most widely used clustering method
of this kind is the one based on a mixture of Gaussians
(McLachlan & Basford, 1988; Yeung et al., 2001).



CLUSTERING OF GENE EXPRESSION 
DATA 

Binary Hierarchical Clustering (BHC): In Szeto et al.
(2003), we proposed the BHC algorithm for clustering
gene expression data based on the hierarchical binary
subdivision framework of Clausi (2002). Consider the
dataset with three distinct classes shown in Fig. 1.
The algorithm starts by assuming that the data consists
of one class. The first application of binary subdivision
generates two clusters, A and BC. As the projections of
class A and class BC have a large enough Fisher criterion
on the A-BC discriminant line, the algorithm splits the
original dataset into two clusters. Then, binary
subdivision is applied to each of the two clusters. The
BC cluster is separated into clusters B and C because its
Fisher criterion is large. However, the Fisher criterion
of cluster A is too small to allow further division, so it
remains a single cluster. This subdivision process is
repeated until the Fisher criterion of every cluster is
too low for further splitting, at which point the process
stops.

Figure 1. The binary subdivision framework. (a) Original data treated as one class. (b) Partition into two
clusters, A and BC. (c) Cluster A cannot be split further, but cluster BC is split into two clusters, B and C.
(d) Neither cluster B nor cluster C can be split any further, and we have three clusters A, B, and C. (Figure
adapted from Clausi (2002).)

The BHC algorithm proceeds in two steps: (1) binary
partition using a mixed FCM-HC algorithm to partition
the data into two classes; (2) Fisher discriminant analysis
on the two classes, where the partition is accepted if the
Fisher criterion exceeds a set threshold; otherwise, the
data are not subdivided any further.
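
The acceptance test in step (2) can be illustrated with a small sketch. It assumes the standard two-class Fisher criterion, J = (w^T (m_a - m_b))^2 / (w^T S_w w) with w = S_w^{-1} (m_a - m_b), and is restricted to two-dimensional data so that the matrix inverse can be written out by hand. The function names, the threshold value, and the toy classes are hypothetical, and the exact normalization used in BHC may differ.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def scatter(vectors, m):
    """2x2 within-class scatter matrix: sum over (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        dx = [v[0] - m[0], v[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += dx[r] * dx[c]
    return s


def fisher_criterion(a, b):
    """Fisher criterion of two 2-D classes along their Fisher discriminant direction."""
    ma, mb = mean(a), mean(b)
    sa, sb = scatter(a, ma), scatter(b, mb)
    sw = [[sa[r][c] + sb[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # w = S_w^{-1} (m_a - m_b), using the closed-form inverse of a 2x2 matrix.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    between = (w[0] * diff[0] + w[1] * diff[1]) ** 2
    within = sum(w[r] * sw[r][c] * w[c] for r in range(2) for c in range(2))
    return between / within


THRESHOLD = 1.0  # hypothetical value; the text only says "a set threshold"
a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]   # far from a
c = [(0.5, 0.0), (1.5, 0.0), (0.5, 1.0), (1.5, 1.0)]   # overlaps a
print(fisher_criterion(a, b) > THRESHOLD)  # well-separated pair: split accepted
print(fisher_criterion(a, c) > THRESHOLD)  # overlapping pair: split rejected
```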

A novel aspect of the BHC algorithm is the use of
the FCM-HC algorithm to partition the data. The idea
is illustrated in Fig. 2a-2c. In Fig. 2a, there are three
clusters. The desirable two-class partition would be for
the two closer clusters to be classified as one partition
and the remaining cluster as the second partition.
However, a K-means partitioning of the data would favor
splitting the middle cluster into two classes. With
this wrong partition, subsequent Fisher discriminant
analysis would conclude that the split is not valid
and that the data should be considered as just one class.
To overcome this problem, we first over-cluster the data
into several clusters with the Fuzzy C-means (FCM)
algorithm. Then, the clusters are merged by the
average linkage hierarchical clustering (HC) algorithm
until only two classes are left. In Fig. 2a, the data is
over-clustered into six clusters by the FCM algorithm.
A hierarchical tree is constructed from the six clusters
using the average linkage clustering algorithm, as in
Fig. 2b. Using the cutoff shown in Fig. 2b, a final
partitioning of the data into two classes is obtained,
as shown in Fig. 2c. We can see that A and B are merged
into one class, and C, E, F, and D are merged into the
second class. FCM-HC also allows non-ellipsoidal clusters
to be partitioned correctly. Fig. 2d shows a dataset
containing two non-ellipsoidal clusters that is
over-clustered into six clusters. After merging, we obtain
the correct two-class partition, as shown in Fig. 2e.

The BHC algorithm makes no assumption about 
the class distributions and the number of clusters 
is determined automatically. The only parameter 
required is the Fisher threshold. The binary hierarchical 
framework naturally leads to a tree structure 
representation where similar clusters are placed adjacent 
to each other in the tree. Near the root of the tree, only 
gross structures in the data are shown, whereas near the 
end branches, fine details in the data can be visualized. 
Figure 3 shows the BHC clustering results on Spellman's
yeast dataset (http://cellcycle-www.stanford.edu). The
dataset contains expression profiles for 6178 genes
under different experimental conditions, i.e., the cdc15,
cdc28, alpha factor, and elutriation experiments.

Figure 2. Partition of the data into two classes, where the original data is clustered into six clusters (a) with the
resulting hierarchical tree (b), and finally the two-class partition after merging (c). For two non-ellipsoidal
clusters, FCM-HC over-clusters them into six classes (d), and the resulting two-class partition after merging (e).

Figure 3. BHC clustering of (a) alpha factor experiment dataset, (b) cdc15 experiment dataset, (c) elutriation
experiment dataset, and (d) cdc28 experiment dataset.

Self-Splitting and Merging Competitive Learning
(SSMCL) Clustering: In Wu, Liew, & Yan (2004),
we proposed a new clustering framework based on
the one-prototype-take-one-cluster (OPTOC) learning
paradigm (Zhang & Liu, 2002) for clustering
gene expression data. The new algorithm is able to
identify natural clusters in the data as well as provide
a reliable estimate of the number of distinct clusters in
the data. In conventional clustering, if the number of
prototypes is less than the number of natural clusters in
the data, then at least one prototype will win patterns
from two or more clusters. In contrast, the OPTOC idea
allows one prototype to characterize only one natural
cluster in the data, regardless of the actual number of
clusters in the data. This is achieved by constructing
a dynamic neighborhood around the prototype such that
patterns inside the neighborhood contribute more to its
learning than those outside. Eventually, the prototype
settles at the center of a natural cluster, while ignoring
competition from other clusters, as shown in Fig. 4b.

The SSMCL algorithm starts with a single cluster.
If the actual number of clusters in the data is more than
one, additional prototypes are generated to search for
the remaining clusters. Let C_i denote the center of all
the patterns that prototype P_i wins according to the
minimum distance rule. The distortion |P_i - C_i| measures
the discrepancy between the prototype P_i found by OPTOC
learning and the actual cluster structure in the data.
For example, in Fig. 4b, C_1 would be located at the
center of the three clusters S1, S2, and S3 (since there
is only one prototype, it wins all input patterns), while
P_1 eventually settles at the center of S3. After the
prototypes have all settled down, a large |P_i - C_i|
indicates the presence of other natural clusters in the
data. A new prototype is generated from the prototype
with the largest distortion when this distortion exceeds
a certain threshold s.
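
The distortion test can be sketched as follows. The sketch covers only the splitting decision, not OPTOC learning itself: the prototypes are assumed to have already settled, and the function names, the threshold value, and the toy data (three small groups standing in for S1, S2, and S3) are hypothetical.

```python
from math import dist


def distortions(data, prototypes):
    """Return |P_i - C_i| per prototype, where C_i is the mean of the patterns P_i wins."""
    won = [[] for _ in prototypes]
    for x in data:
        # Minimum distance rule: the closest prototype wins the pattern.
        i = min(range(len(prototypes)), key=lambda p: dist(x, prototypes[p]))
        won[i].append(x)
    result = []
    for p, patterns in zip(prototypes, won):
        if not patterns:
            result.append(0.0)  # assumption: a prototype that wins nothing has no distortion
            continue
        c = tuple(sum(v) / len(patterns) for v in zip(*patterns))
        result.append(dist(p, c))
    return result


def prototype_to_split(data, prototypes, threshold):
    """Index of the prototype with the largest distortion, or None if below the threshold."""
    d = distortions(data, prototypes)
    worst = max(range(len(d)), key=d.__getitem__)
    return worst if d[worst] > threshold else None


# Three groups standing in for S1, S2 and S3.
data = [(-1.0, 0.0), (1.0, 0.0), (9.0, 0.0), (11.0, 0.0), (4.0, 8.0), (6.0, 8.0)]
one_prototype = [(5.0, 8.0)]  # a single prototype settled at the centre of S3
three_prototypes = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]

print(distortions(data, one_prototype))                    # large: other clusters exist
print(prototype_to_split(data, one_prototype, 1.0))        # prototype 0 should spawn a new one
print(prototype_to_split(data, three_prototypes, 1.0))     # None: every cluster is covered
```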

Ideally, with a suitable threshold, the algorithm will
find all natural clusters in the data. Unfortunately, the
complex structure exhibited by gene expression data
makes setting an appropriate threshold difficult. Instead,
we proposed an over-clustering and merging strategy.
The over-clustering step minimizes the chance of missing
any natural clusters in the data, while the merging
step ensures that the final clusters are all visually
distinct from each other. In over-clustering, a natural
cluster might be split into more than one cluster.
However, no single cluster should contain data from
several natural clusters, since the OPTOC paradigm
discourages a cluster from winning data from more than
one natural cluster. Clusters that are visually similar
are merged during the merging step. Together with the
OPTOC framework, the over-clustering and merging
framework allows a systematic estimation of the correct
number of natural clusters in the data.

Figure 4. Two learning methods: conventional versus OPTOC. (a) One prototype takes the arithmetic center of
three clusters (conventional learning). (b) One prototype takes one cluster (OPTOC learning) and ignores the
other two clusters.

Figure 5. (a) The 22 distinct clusters found by SSMCL for the yeast cell cycle data. (b) Five patterns that
correspond to the five cell cycle phases. The genes presented here are only those that belong to these clusters and
are biologically characterized to a specific cell cycle phase.

Fig. 5a shows the SSMCL clustering result for
the yeast cell cycle expression data
(http://genomics.stanford.edu). We observe that the 22
clusters found have no apparent visual similarity. We
checked the 22 clusters against the study of Cho et al.
(1998), in which 416 genes have been interpreted
biologically. Those gene expression profiles include five
fundamental patterns that correspond to five cell cycle
phases: early G1, late G1, S, G2, and M phase. Fig. 5b
shows the five clusters that contain most of the genes
belonging to these five different patterns. These five
clusters clearly correspond to the five cell cycle phases.



FUTURE TRENDS

Microarray data is usually represented as a matrix whose
rows and columns correspond to genes and conditions,
respectively. Conventional clustering algorithms can
be applied to either rows or columns, but not to both
simultaneously. However, an interesting cellular process
is often active only in a subset of conditions, and a
single gene may participate in multiple pathways that may
not be co-active under all conditions. Biclustering
methods allow the clustering of rows and columns
simultaneously. Existing biclustering algorithms often
iteratively search for the best possible sub-grouping of
the data by permuting rows and columns of the data matrix
such that an appropriate merit function is improved
(Madeira & Oliveira, 2004). When different bicluster
patterns co-exist in the data, no single merit function
can adequately cater for all possible patterns. Although
recent approaches such as geometric biclustering (Gan,
Liew & Yan, 2008) have shown great potential, much
work is still needed here. It is also important to
develop visualization tools for high-dimensional gene
expression clusters. The parallel coordinate plot is a
well-known visualization technique for high-dimensional
data, and we are currently investigating it for bicluster
visualization (Cheng et al., 2007).



CONCLUSION

Cluster analysis is an important exploratory tool for
gene expression data analysis. This article describes
two recently proposed clustering algorithms for gene
expression data. In the BHC algorithm, the data are
successively partitioned into two classes using the
Fisher criterion. The binary partitioning leads to a
tree structure representation which facilitates easy
visualization. In the SSMCL algorithm, the OPTOC
learning framework allows the detection of natural
clusters. The subsequent over-clustering and merging
step then allows a systematic estimation of the correct
number of clusters in the data. Finally, we discussed
some possible avenues of future research in this area.
The problem of biclustering of gene expression data
is a particularly interesting topic that warrants further
investigation.



KEY TERMS

BHC Clustering: A clustering algorithm based on
the hierarchical binary subdivision framework. The
algorithm proceeds by partitioning a cluster into two
classes and then checking the validity of the partition
with Fisher discriminant analysis. The algorithm
terminates when no further valid subdivision is possible.

Biclustering: Also called two-way clustering or
co-clustering. In biclustering, not only the objects but
also the features of the objects are clustered. If the data
is represented in a data matrix, the rows and columns
are clustered simultaneously.

Cluster Analysis: An exploratory data analysis
technique that aims at finding groups in data such
that objects in the same group are similar to each other
while objects in different groups are dissimilar.

DNA Microarray Technology: A technology that allows
massively parallel, high-throughput, genome-wide
profiling of gene expression in a single hybridization
experiment. The method is based on the complementary
hybridization of DNA sequences.

Gene Expression: Gene expression is the process by
which a gene's DNA sequence is converted into functional
proteins. Some genes are turned on (expressed)
or turned off (repressed) when there is a change in
external conditions or stimuli.

Hierarchical Clustering: A clustering method that
finds successive clusters using previously established
clusters. Hierarchical algorithms can be agglomerative
(bottom-up) or divisive (top-down). Agglomerative
algorithms begin with each element as a separate cluster
and merge them into successively larger clusters.
Divisive algorithms begin with the whole set and proceed
to divide it into successively smaller clusters.

K-Means Clustering: K-means clustering is the 
most well-known partition-based clustering algorithm. 
The algorithm starts by choosing k initial centroids, 
usually at random. Then the algorithm alternates 
between updating the cluster assignment of each data 
point by associating with the closest centroid and 
updating the centroids based on the new clusters until 
convergence. 

SSMCL Clustering: A partition-based clustering
algorithm that is based on the one-prototype-take-one-cluster
(OPTOC) learning paradigm. OPTOC learning is
achieved by constructing a dynamic neighborhood that
favors patterns inside the neighborhood. Eventually, the
prototype settles at the center of a natural cluster, while
ignoring competition from other clusters. Together
with the over-clustering and merging process, SSMCL
is able to find all natural clusters in the data.



INTRODUCTION

Clustering analysis is an intrinsic component of 
numerous applications, including pattern recognition, 
life sciences, image processing, web data analysis, 
earth sciences, and climate research. As an example, 
consider the biology domain. In any living cell that 
undergoes a biological process, different subsets 
of its genes are expressed in different stages of the 
process. To facilitate a deeper understanding of these 
processes, a clustering algorithm was developed (Ben- 
Dor, Shamir, & Yakhini, 1999) that enabled detailed 
analysis of gene expression data. Recent advances in 
proteomics technologies, such as two-hybrid, phage 
display and mass spectrometry, have enabled the 
creation of detailed maps of biomolecular interaction 
networks. To further understanding in this area, a 
clustering mechanism that detects densely connected 
regions in large protein-protein interaction networks that 
may represent molecular complexes was constructed 
(Bader & Hogue, 2003). In the interpretation of remote 
sensing images, clustering algorithms (Sander, Ester, 
Kriegel, & Xu, 1998) have been employed to recognize
and understand the content of such images. In the 
management of web directories, document annotation 
is an important task. Given a predefined taxonomy, 
the objective is to identify a category related to the 
content of an unclassified document. Self-Organizing 
Maps have been harnessed to influence the learning 
process with knowledge encoded within a taxonomy 
(Adami, Avesani, & Sona, 2005). Earth scientists are 
interested in discovering areas of the ocean that have a 
demonstrable effect on climatic events on land, and the 
SNN clustering technique (Ertoz, Steinbach, & Kumar, 
2002) is one example of a technique that has been 
adopted in this domain. Also, scientists have developed 
climate indices, which are time series that summarize 
the behavior of selected regions of the Earth's oceans 
and atmosphere. Clustering techniques have proved
crucial in the production of climate indices (Steinbach,
Tan, Kumar, Klooster, & Potter, 2003).

In many application domains, clusters of data are 
of arbitrary shape, size and density, and the number 
of clusters is unknown. In such scenarios, traditional 
clustering algorithms, including partitioning methods, 
hierarchical methods, density-based methods and grid- 
based methods, cannot identify clusters efficiently or 
accurately. Obviously, this is a critical limitation. In 
the following sections, a number of clustering methods
are presented and discussed, after which the design of
a clustering algorithm based on density and
density-reachable (CADD) is presented. CADD seeks to remedy
some of the deficiencies of classical clustering approaches
by robustly clustering data that is of arbitrary shape,
size, and density in an effective and efficient manner.



BACKGROUND 

Clustering aims to identify groups of objects (clusters) 
that satisfy some specific criteria, or share some com- 
mon attribute. Clustering is a rich and diverse domain, 
and many concepts have been developed as the un- 
derstanding of clustering develops and matures (Tan, 
Steinbach, & Kumar, 2006). As an example, consider 
spatial distribution. A typology of clusters based on 
this includes: Well-separated clusters, Center-based 
clusters, Contiguity-based clusters, and Density-based 
clusters. Given the diversity of domains in which
clustering can be applied, and the diverse characteristics
and requirements of each, it is not surprising that
numerous clustering algorithms have been developed. The
interested reader is referred to the academic literature 
(Qiu, Zhang, & Shen, 2005), (Ertoz, Steinbach, & Ku- 
mar, 2003), (Zhao, Song, Xie, & Song, 2003), (Ayad 
& Kamel, 2003), (Karypis, Han, & Kumar, 1999) for 
further information.

Though the range of clustering algorithms that have
been developed is broad, it is possible to classify them
according to the broad approach or method adopted by
each:

A partitioning method creates an initial set of k
partitions, where the parameter k is the number
of partitions to be constructed. Then it uses an
iterative relocation technique that attempts to
improve the partitioning by moving objects from
one group to another. Typical partitioning methods
include K-means, K-medoids, CLARANS, and
their derivatives.

A hierarchical method creates a hierarchical
decomposition of the given set of data objects. The
method can be classified as being either agglomerative
(bottom-up) or divisive (top-down), based
on how the hierarchical decomposition is formed.
To compensate for the rigidity of the merge or
split, the quality of hierarchical agglomeration
can be improved by analyzing object linkages at
each hierarchical partition - an approach adopted
by the CURE and Chameleon algorithms - or by
integrating other clustering techniques such as
iterative relocation - an approach adopted by
BIRCH.

A density-based method clusters objects based on
the concept of density. It either grows the cluster
according to the density of the neighborhood
objects (an approach adopted by DBSCAN), or
according to some density function (such as that
used by DENCLUE).

A grid-based method first quantizes the object
space into a finite number of cells, thus forming
a grid structure, and then performs clustering on
the grid structure. STING is a typical example of a
grid-based method based on statistical information
stored in grid cells. CLIQUE and WaveCluster
are examples of two clustering algorithms that
are both grid-based and density-based.

A model-based method hypothesizes a model for
each of the clusters and finds the best fit of the
data to that model. Typical model-based methods
involve statistical approaches (such as COBWEB,
CLASSIT, and AutoClass).

In essence, practically all clustering algorithms
attempt to cluster data by trying to optimize some
objective function.



DEVELOPMENT OF A CLUSTERING
ALGORITHM

Before the development of a clustering algorithm can 
be considered, it is necessary to consider some prob- 
lems intrinsic to clustering. In data mining, efforts 
have focused on finding methods for efficient and 
effective cluster analysis in large databases. Active 
themes of research focus on the effectiveness of meth- 
ods for clustering complex shapes and types of data, 
high-dimensional clustering techniques, scalability of 
clustering methods, and methods for clustering mixed 
numerical and categorical data in large databases. When 
clustering algorithms are analyzed, it is obvious that 
there are some intrinsic weaknesses that affect their 
applicability: 

Reliance on parameter selection - for partitioning
methods and hierarchical methods, it is necessary
to input parameters both for the number of clusters
and the initial centroids of the clusters. This is
difficult for unsupervised data mining when there
is a lack of relevant domain knowledge (Song &
Meng, 2005 July), (Song & Meng, 2005 June).
At the same time, different random initializations
of the number of clusters and the cluster centroids
produce diverse clustering results, indicating
a lack of stability on the part of the clustering
method.

Sensitivity to noise and outliers - noise and
outliers can unduly influence the clusters derived
by partitioning methods, hierarchical methods,
grid-based methods, and model-based methods;
however, partitioning methods and hierarchical
methods are particularly susceptible.

Selectivity to cluster shapes - partitioning methods,
hierarchical methods, and grid-based methods are
not suitable for all types of data distribution, and
cannot handle non-globular clusters of different
shapes, sizes and densities.

Ability to detect outliers - density-based methods
are relatively resistant to noise and can handle
clusters of arbitrary shapes and sizes, but cannot
detect outliers effectively.

Experimental Analysis of Traditional
Clustering Algorithms

In order to illustrate the clustering results visually,
two-dimensional data sets are used as experimental data.
However, it should be noted that the conclusions of the
analysis also apply to higher-dimensional data.

Partitioning Method 

K-means, a partitioning method, is one of the most 
commonly used clustering algorithms, but it does not 
perform well on data with outliers or with clusters of 
different sizes or non-globular shapes. This clustering 
method is the most suitable for capturing clusters with 
globular shapes, but this approach is very sensitive to 
noise and cannot handle clusters of varying density. 

Hierarchical Method 

Hierarchical clustering techniques are a second
important category of clustering methods, of which the
most commonly used are agglomerative hierarchical
algorithms. Agglomerative hierarchical algorithms are
expensive in terms of their computational and storage
requirements. The space and time complexity of such
algorithms severely restricts the size of the data sets
that can be used. Agglomerative hierarchical algorithms
identify globular clusters but cannot find non-globular
clusters, and also cannot identify any outliers or noise
points. CURE relies on an agglomerative hierarchical
scheme to perform the actual clustering. The distance
between two clusters is defined as the minimum
distance between any two of their representative points
(after the representative points are shrunk toward their
respective centres). During this hierarchical clustering
process, CURE eliminates outliers by eliminating small,
slowly growing clusters. Although the concept of
representative points does allow CURE to find clusters of
different sizes and shapes in some data sets, CURE is
still biased towards finding globular clusters, as it still
incorporates the notion of a cluster centre.

Density-Based Methods 

Density-based clustering locates regions of high density
that are separated from one another by regions of low
density. DBSCAN is a typical and effective density-based
clustering algorithm that can find clusters of different
types and identify outliers and noise, but it cannot
handle clusters of varying density. DBSCAN can have
difficulty if the densities of clusters and noise vary
widely. Consider Figure 1, which illustrates three
clusters embedded in noise of unequal densities. The
noise around clusters A and B has the same density as
cluster C. If the density threshold implied by Eps is high
enough that DBSCAN finds A and B as separate clusters,
and the points surrounding them are marked as noise,
then C and the points surrounding it will also
be marked as noise.
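
The effect of a single global density setting can be reproduced with a minimal DBSCAN sketch. This is a simplified, brute-force rendering for illustration only: the function name, the `min_pts` parameter name, the use of -1 for noise, and the toy data (one dense and one sparse group) are assumptions, not details taken from the text.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force Eps-neighbourhood query (includes the point itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels


dense = [(0.1 * k, 0.0) for k in range(6)]     # spacing 0.1
sparse = [(10.0 + k, 0.0) for k in range(6)]   # spacing 1.0
points = dense + sparse

print(dbscan(points, eps=0.15, min_pts=3))  # tuned to the dense group: sparse group is noise
print(dbscan(points, eps=1.1, min_pts=3))   # looser setting: both groups are found
```

With the tighter setting the sparse group is indistinguishable from noise, which is the failure mode described for cluster C. In this toy case a looser setting recovers both groups only because no background noise is present.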

An Algorithm Based on Density and 
Density-Reachable 

Based on the notions of density and density-reachable
(Meng & Zhang, 2006), (Meng & Song, 2005), a clustering
algorithm can be designed that finds clusters of
arbitrary shapes and sizes, handles clusters and noise of
varying density, minimizes the reliance on domain
knowledge, and identifies outliers efficiently. This
first requires some key definitions:




Figure 1. Clusters embedded in noise points of unequal 
density 




1. The density of a data point is defined as the
sum of the influence functions associated with
each point:

$$\mathrm{density}(x_i) = \sum_{j=1}^{n} e^{-\frac{d(x_i, x_j)^2}{2\sigma^2}}$$

where the Gaussian influence function

$$f(x_i, x_j) = e^{-\frac{d(x_i, x_j)^2}{2\sigma^2}}$$

indicates the density influence of each data point
$x_j$ on the density of point $x_i$, and $\sigma$ is the
density adjustment parameter, which is analogous to
the standard deviation and governs how quickly the
influence of a point drops off.

2. The density-reachable distance is used to determine
a circular area around data point $x_i$, labeled as
$\delta = \{ x_j \mid 0 < d(x_i, x_j) < R \}$, the data points of
which all belong to the same cluster. The formula
is:

$$R = \frac{\mathrm{mean}(D)}{\mathrm{coefR}}$$

where mean(D) is the mean distance between all
data points in the data set, and coefR is the adjustment
coefficient of the density-reachable distance.

3. Local density attractors are the data points at
which the value of the density function is a local
maximum.

4. Density-reachable is defined as follows: if there
is a chain of objects $p_1, p_2, \ldots, p_n$ with $p_n = q$,
where $q$ is a local density attractor and $p_{n-1}$ is
density-reachable from $q$, then for $p_i \in D$
$(1 \le i \le n-1)$ with $d(p_i, p_{i+1}) < R$, we define
the object $p_i$ $(1 \le i \le n-1)$ as being
density-reachable from $q$.
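
The four definitions, together with the loop outlined in Exhibit A, can be put together in a short sketch. It is written in Python for brevity (the chapter reports a C++ implementation) and should be read as one possible interpretation, not as the authors' code. In particular, it reads the density-reachable distance as R = mean(D) / coefR, takes the densest remaining object as the next attractor, and treats groups smaller than a hypothetical `min_size` as outliers. The function names, `min_size`, and the toy data are all assumptions.

```python
from math import dist, exp


def densities(points, sigma):
    """density(x_i) = sum over j of exp(-d(x_i, x_j)^2 / (2 sigma^2))."""
    return [sum(exp(-dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points) for p in points]


def reach_distance(points, coef_r):
    """R = mean(D) / coefR, with mean(D) the mean pairwise distance."""
    n = len(points)
    total = sum(dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2) / coef_r


def cadd(points, sigma, coef_r, min_size=2):
    """Grow one cluster per density attractor by following density-reachable chains."""
    dens = densities(points, sigma)
    r = reach_distance(points, coef_r)
    remaining = set(range(len(points)))  # stands in for the "data chain table"
    clusters, outliers = [], []
    while remaining:
        attractor = max(remaining, key=dens.__getitem__)  # densest unclustered object
        remaining.discard(attractor)
        members, frontier = {attractor}, [attractor]
        while frontier:  # follow chains p_i -> p_(i+1) with d(p_i, p_(i+1)) < R
            p = frontier.pop()
            reached = [q for q in remaining if dist(points[p], points[q]) < r]
            for q in reached:
                remaining.discard(q)  # delete clustered objects from the table
                members.add(q)
                frontier.append(q)
        if len(members) >= min_size:
            clusters.append(sorted(members))
        else:
            outliers.extend(members)  # assumption: tiny groups are reported as outliers
    return clusters, sorted(outliers)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (30.0, -20.0)]  # the last point is an isolated outlier
clusters, outliers = cadd(points, sigma=1.0, coef_r=5.0)
print(sorted(clusters), outliers)
```

Only `sigma` and `coef_r` need to be supplied. The number of clusters is not an input but falls out of how many attractors are needed to exhaust the data.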



THE CADD ALGORITHM

See Exhibit A.

Exhibit A.

Algorithm: Clustering algorithm based on density and density-reachable (CADD)
Input: Data set, adjustment coefficient of density-reachable distance.
Output: Number of clusters, the members of each cluster, outliers or noise points.
Method:

1: Compute the density of each data point and construct the original data chain table of clustering
objects.
2: i ← 1
3: repeat
4: Seek the maximum density attractor DensityMax_i in the original data chain table of clustering
objects as the first cluster center of C_i.
5: Assign the objects in the data chain which are density-reachable from DensityMax_i to cluster C_i, and
at the same time delete the clustered objects from the original data chain table.
6: i ← i + 1

Experimental Results

The clustering algorithm was developed using C++. In
order to verify the effectiveness of the algorithm, a large
number of experiments were carried out with different
data sets which contain clusters of different shape, size,
and density. The results are now considered.

Clusters of Complex Shapes

The clustering results for clusters of arbitrary shapes
and sizes are shown in Figure 2 and Figure 3, the data
set being taken from (Karypis et al., 1999).

In contrast to partitioning and hierarchical methods,
the clustering algorithm CADD identified the clusters
of arbitrary shapes and sizes, and identified the outliers
effectively.

Figure 2. Clustering result of winded clusters

Figure 3. Clustering result of complex clusters

Clusters of Varying Density

Density-based methods can perform poorly when
clusters have widely differing densities. The clustering
algorithm CADD assigns each object to a cluster according
to density-reachability. When managing clusters of
varying density, it is important to adjust the
density-reachable distance. This adjustment is carried out
by multiplying the original density-reachable distance by
the ratio of the density of the first density attractor to
the density of the second density attractor, so as to
succeed in identifying clusters of varying density. This is
because when the density of the local density attractor
of a cluster is larger, the distance between objects in
the cluster is smaller, and conversely, when the density
of the local density attractor of a cluster is smaller, the
distance between objects in the cluster is larger.

Figure 4 demonstrates that CADD identifies three
clusters C1, C2, and C3 of varying density which are
embedded in outliers or noise points of unequal density.
The clustering result reflects the characteristics of the
data distribution and is not affected by varying density.

Figure 4. The result of clusters embedded in noise of
unequal density

Experimental Analysis of Computational
Complexity

After increasing the size of the experimental data set,
the run times of DBSCAN and CADD were recorded, as
shown in Figure 5. The workstation used to perform the
experiment was equipped with a 1.6GHz CPU, 512MB
storage, and an 80GB hard disk. All objects in the data
set possessed 117 attributes.

Figure 5. Run times of DBSCAN and CADD

The basic time complexity of DBSCAN is O(n ×
time to find objects in the Eps-neighborhood), where
n is the number of data objects. In the worst case, this
complexity is O(n^2). However, in low-dimensional
spaces, there are data structures, such as kd-trees, that
allow efficient retrieval of all objects within a given
distance of a specified point, and the time complexity
can be as low as O(n log n). The space complexity
of DBSCAN is O(n). The basic time complexity of
CADD is O(kn), where k is the number of clusters. This
is because it is only necessary to search the
density-reachable objects in the data set once for each
cluster. In the worst case, the time complexity will be
O(n^2). The space requirement of CADD is O(n).



FUTURE TRENDS

As many different clustering algorithms have been
developed in a variety of domains for different types of
applications, details of experimental clustering results
are discussed in many academic texts and research
papers. None of these algorithms is suitable for
every kind of data, cluster, and application, and so
there is significant scope for developing new clustering
algorithms that are more efficient or better suited
to a particular type of data, cluster, or application. This
trend is likely to continue, as new sources of data are
becoming increasingly available. Wireless Sensor
Networks (WSNs) are an example of a new technology that
can be deployed in a multitude of domains; and such
networks will almost invariably give rise to significant
quantities of data that will require sophisticated analysis
if meaningful interpretations are to be attained.



CONCLUSION 

Clustering forms an important component in many 
applications, and the need for sophisticated, robust 
clustering algorithms is likely to increase over time. 
One example of such an algorithm is CADD. Based
on the concepts of density and density-reachable,
it overcomes some of the intrinsic limitations of
traditional clustering mechanisms, and its improved
computational efficiency and scalability have been
verified experimentally.

KEY TERMS

Cluster Analysis: Cluster analysis groups data 
objects based only on information found in the data 
that describes the objects and their relationships. 

CADD: A clustering algorithm based on the con- 
cepts of Density and Density-reachable. 



Centre-Based Clusters: Each object in a centre- 
based cluster is closer to the centre of the cluster than 
to the centres of any other clusters. 

Contiguity-Based Clusters: Each object in a con- 
tiguity-based cluster is closer to some other object in 
the cluster than to any point in a different cluster. 

Density-Based Clusters: Each object in a density-based
cluster is closer to some other object within its
Eps neighbourhood than to any object not in the cluster,
resulting in dense regions of objects being surrounded
by regions of lower density.

Eps: Maximum radius of the neighbourhood.

Well-Separated Cluster: A cluster is a set of objects
in which each object is significantly closer (or more
similar) to every other object in the cluster than to any
object not in the cluster.

INTRODUCTION

Automated diagnosis and prognosis of tumors of the
central nervous system (CNS) pose overwhelming
challenges because of the heterogeneous phenotypic and
genotypic behavior of tumor cells (Yang et al., 2003,
Pomeroy et al., 2002). Unambiguous characterization
of these tumors is essential for accurate prognosis and
therapy. Although the present imaging techniques help
to explore the anatomical features of brain tumors, they
do not provide an effective means of early detection.
Currently, the histological examination of brain tumors
is widely used for an accurate diagnosis; however,
tumor classification and grading based on histological
appearance do not always guarantee absolute accuracy
(Yang et al., 2003, Pomeroy et al., 2002). In many cases,
a histological examination may not be sufficient to detect
detailed changes at the molecular level (Yang et al.,
2003), since such an examination may
not allow accurate prediction of therapeutic responses
or prognosis. If the biopsy sample is too small, the
problems are aggravated further.

Toward achieving a more reliable diagnosis and
prognosis of brain tumors, gene expression measures
from microarrays are the center of attention for many
researchers who are working on tumor prediction
schemes. Our proposed tumor prediction scheme is
discussed in two chapters in this volume. In part I (this
chapter), we use an analysis of variance (ANOVA)
model for characterizing the Affymetrix gene expression
data from CNS tumor samples (Pomeroy et al.,
2002), while in part II we discuss the prediction of
tumor classes based on marker genes selected using the
techniques developed in this chapter. In this chapter,
we estimate the tumor-specific gene expression measures
based on the ANOVA model and exploit them to
locate the significantly differentially expressed marker
genes among different types of tumor samples. We also
provide a novel visualization method to validate the
marker gene selection process.



BACKGROUND 

Numerous statistical methods have evolved that are
focused on the problem of finding the marker genes
that are differentially expressed among tumor samples
(Pomeroy et al., 2002, Islam et al., 2005, Dettling et
al., 2002, Boom et al., 2003, Park et al., 2001). For
example, Pomeroy et al. (2002) use Student's t-test to
identify such genes in embryonal CNS tumor samples.
Because of the non-normality of gene expression
measurements, several investigators have adopted
nonparametric methods, such as the Wilcoxon
Rank Sum Test (Wilcoxon, 1945), as a robust alternative
to the parametric procedures. In this chapter, we
investigate a Wilcoxon-type approach and adapt the
resulting procedures for locating marker genes.
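
As an illustration of the kind of gene-specific rank test mentioned above, the following minimal sketch computes a Wilcoxon rank sum statistic for one gene across two sample groups. It assumes the large-sample normal approximation without a tie correction, and the function names are hypothetical; it is not the procedure used by the authors.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Rank sum W of sample x, its z-score, and a two-sided p-value.

    Uses the normal approximation (an assumption that is only reasonable
    for moderately large samples) and no tie correction.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return w, z, p


w, z, p = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

In a microarray setting, such a test would be applied once per gene, which is why the multiplicity adjustments discussed next are needed.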

Typically, statistical procedures for microarray data
analysis involve performing gene-specific tests. Since
the number of genes under consideration is usually
large, it is common practice to control the potentially
large number of false-positive conclusions and family-wise
error rates (the probability of at least one
false positive statement) through the use of P-value
adjustments. Pollard et al. (2003) and Van der Laan
et al. (2004a, 2004b, 2005c) proposed methods to
control family-wise error rates based on the bootstrap
resampling technique of Westfall & Young (1993).
Benjamini & Hochberg (1995), Efron et al. (2001) and
Storey et al. (2002, 2003a, 2003b, 2004) introduced
various techniques for controlling the false discovery
rate (FDR), which is defined as the expected proportion
of falsely rejected null hypotheses of no differential
gene expression among all rejected hypotheses. These
adjustment techniques have
gained prominence in statistical research relating to
microarray data analysis. Here, we use FDR control
because it is less conservative than control of family-wise
error rates for adjusting the observed P-values for false
discovery. In addition, we propose a novel marker gene
visualization technique to explore appropriate cutoff
selection in the marker gene selection process.
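
To make the idea of FDR-based P-value adjustment concrete, the sketch below implements the step-up adjustment of Benjamini & Hochberg (1995). It is shown only as an example of the family of methods cited above; the chapter itself relies on the q-values of Storey et al., which additionally estimate the proportion of true null hypotheses. The function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Genes whose adjusted value falls below a chosen FDR level would be declared differentially expressed; the choice of that level is the cutoff question addressed by the visualization technique in this chapter.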

Before performing formal analysis, one should 
identify the actual gene expression levels associated 
with different tissue groups and discard or minimize 
other sources of variations. Such an approach has been 
proposed by Townsend & Hartl (2002) who use a Bayes- 
ian model with both multiplicative and additive small 
error terms to detect small, but significant differences 
in gene expressions. As an alternative, an ANOVA 
model appears to be a natural choice for estimating 
true gene expression (Kerr et al., 2000, Pavlidis et al.,
2001, Wolfinger et al., 2001). In the context of cDNA
microarray data, the ANOVA model was first proposed 
by Kerr et al. (2000). 



TUMOR-SPECIFIC GENE EXPRESSION 
ESTIMATION AND VISUALIZATION 
TECHNIQUES 

To illustrate our procedure, we use the microarray
data set of Pomeroy et al. (2002), obtained from patients
with different types of embryonal tumors. The patients include
60 children with medulloblastomas, 10 young adults 
with malignant gliomas, 5 children with AT/RTs, 5 with 
renal/extra-renal rhabdoid tumors, and 8 children with 
supratentorial PNETs. First, we preprocess the data 
to remove extraneous background noise and array ef- 
fects. To facilitate our analysis, we divide the dataset 
into groups as shown in Fig. 1. We rescale the raw 
expression data obtained from Affymetrix's GeneChip 
to account for different chip intensities. 

Microarray data typically suffer from unwanted 
sources of variation, such as large-and-small-scale 
intensity fluctuations within spots, non-additive 
background, fabrication artifacts, probe coupling and 
processing procedures, target and array preparation in 
the hybridization process, background and over-shining 
effects, and scanner settings (McLachlan & Ambroise, 
2005). To model these variations, a number of methods 
have been reported in the literature (Kerr et al., 2000, 
Lee et al., 2000, Pavlidis, 2001, Wolfinger et al., 2001, 
Ranz et al., 2003, Townsend, 2004, Tadesse et al., 2005). 
An ANOVA model similar to the one used by Kerr et
al. (2000) is adopted in our current work and facilitates
obtaining the tumor-specific gene expression measures
from the preprocessed microarray data. Our two-way
ANOVA model is given as:

$$ y_{jgk} = \mu + \alpha_g + \beta_j + \gamma_{jg} + \varepsilon_{jgk} \tag{1} $$

where $y_{jgk}$ denotes the log of the gene expression
measure for the kth replicate of the gth gene in the jth
tumor group ($k = 1, \ldots, K_j$; $g = 1, \ldots, G$;
$j = 1, \ldots, J$); $\mu$, $\alpha_g$, $\beta_j$, and
$\gamma_{jg}$ refer to the overall effect, the gth gene main
effect, the jth tumor-group main effect, and the gth
gene-jth tumor group interaction effect, respectively,
and $\varepsilon_{jgk}$ is a random error with zero mean and
constant variance. We assume that these terms are independent;
however, we do not make any assumption about their
distribution.

Figure 1. Dataset grouping

This model assumes that background noise and array
effects have been eliminated at previous preprocessing
steps. This assumption fits well with our preprocessed
data. A reason for selecting the model is that it fits well
with our goal to build suitable tumor prototypes for
prediction. We believe that tumor prototypes should be
built based only on the tumor-specific gene expression
measures. In this model, the interaction term $\gamma_{jg}$
constitutes the actual gene expression of gene g attributed
to the tumor type j (McLachlan & Ambroise, 2005).
Hence, the contribution to the tumor-specific
gene expression value by the kth replication of the
measurement on gene g in tissue j may be written,
as given in McLachlan & Ambroise (2005), as:

$$ \hat{\gamma}_{jgk} = y_{jgk} - \hat{\mu} - \hat{\alpha}_g - \hat{\beta}_j \tag{2} $$

and the tumor-specific expression is estimated by
Equation (3):

$$ \hat{\gamma}_{jg} = \frac{1}{K_j} \sum_{k=1}^{K_j} \hat{\gamma}_{jgk} \tag{3} $$

where $\hat{\mu} = \bar{y}_{\cdot\cdot\cdot}$,
$\hat{\alpha}_g = \bar{y}_{\cdot g \cdot} - \bar{y}_{\cdot\cdot\cdot}$, and
$\hat{\beta}_j = \bar{y}_{j \cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}$ are the
least squares estimates of the overall effect, the gth gene
main effect, and the jth tumor-group main effect based on
replications (Lee et al.,
2000, McLachlan & Ambroise, 2005). Here, we regard
these estimates as fixed-effects estimates. In a Bayesian
analysis, all of the parameters could be treated as
random effects. The $\hat{\gamma}_{jgk}$ values are considered in our
subsequent steps as the contribution of replicate k to
the gene expression of gene g for a patient in the jth
tumor group.

Using several different procedures, such as the
Shapiro-Wilk test (Shapiro & Wilk, 1965) and normal
probability plots, we observe that the gene expression
levels in our dataset are not normally distributed. Hence,
we choose nonparametric tests, namely the Wilcoxon
rank sum test (Wilcoxon, 1945) for the two-category problem and
the Kruskal-Wallis test (NIST, 2007) for the five-category
problem in our dataset, to identify significantly
differentially expressed genes among the tissue sample
types. We adjust for multiplicity of the tests involved
by controlling the False Discovery Rate (FDR) using
q-values as proposed by Storey et al. (2004).

In the selection of differentially expressed genes,
a tight cutoff may miss some of the important marker
genes, while a generous threshold increases the number
of false positives. To address this issue, we use the
parallel coordinate plot (Inselberg et al., 1990) of the
group-wise average gene expressions. In this plot, the
parallel and equally spaced axes represent individual
genes. Separate polylines are drawn for each group of
tumor samples. The more the average gene expression
levels differ between groups, the more space appears
among the polylines in the plot. To effectively visualize
the differentially expressed genes, we first obtain the
average of the tumor-specific gene expression values
within any specific tissue sample type, $\hat{\gamma}_{jg}$, as
specified in Equation (3). We then standardize the average
gene expression values $\hat{\gamma}_{jg}$ obtained in Equation (3)
as follows:

[Equation (4), which defines the standardized value $\tilde{\gamma}_{jg}$, is illegible in the source.]

where

$$ S_g^2 = \frac{\sum_{j=1}^{J} (K_j - 1)\, S_{jg}^2}{\sum_{j=1}^{J} (K_j - 1)} \tag{5} $$

and

$$ S_{jg}^2 = \frac{\sum_{k=1}^{K_j} (\hat{\gamma}_{jgk} - \hat{\gamma}_{jg})^2}{K_j - 1} \tag{6} $$
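
The estimates in Equations (2) and (3), and the variance terms of Equations (5) and (6), can be sketched as follows. This is a minimal illustration under stated assumptions: the nested-list layout `y[j][g]`, the function names, and the use of plain means over all observations for the dotted averages are choices made here, not details given in the source.

```python
from statistics import mean, variance


def tumor_specific_expression(y):
    """Estimate replicate-level and averaged tumor-specific expression.

    y[j][g] is the list of log-expression replicates of gene g in
    tumor group j (hypothetical layout).
    """
    J, G = len(y), len(y[0])
    grand = mean(v for j in range(J) for g in range(G) for v in y[j][g])
    gene_mean = [mean(v for j in range(J) for v in y[j][g]) for g in range(G)]
    group_mean = [mean(v for g in range(G) for v in y[j][g]) for j in range(J)]
    alpha = [m - grand for m in gene_mean]    # gene main effects
    beta = [m - grand for m in group_mean]    # tumor-group main effects
    # Equation (2): remove overall, gene, and group effects per replicate.
    gamma_rep = [[[v - grand - alpha[g] - beta[j] for v in y[j][g]]
                  for g in range(G)] for j in range(J)]
    # Equation (3): average over the replicates of each group.
    gamma_avg = [[mean(gamma_rep[j][g]) for g in range(G)] for j in range(J)]
    return gamma_rep, gamma_avg


def pooled_variance(gamma_rep, g):
    """Equations (5) and (6): pooled within-group variance for gene g."""
    groups = [gamma_rep[j][g] for j in range(len(gamma_rep))]
    numerator = sum((len(v) - 1) * variance(v) for v in groups)
    denominator = sum(len(v) - 1 for v in groups)
    return numerator / denominator


y = [[[1.0, 3.0], [2.0, 2.0]],
     [[5.0, 7.0], [2.0, 2.0]]]
gamma_rep, gamma_avg = tumor_specific_expression(y)
s2_gene0 = pooled_variance(gamma_rep, 0)
```

In this toy data set the first gene is higher in the second group and the second gene is flat, so after the main effects are removed the averaged interaction estimates have opposite signs in the two groups.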



Next, we divide the genes into two groups. Let
$\tilde{\gamma}_{jg}$ denote the standardized average expression
of gene g in tumor group j. The first
group consists of the genes for which
$\tilde{\gamma}_{1g} > \tilde{\gamma}_{2g}$, and the
remainder of the genes are kept in the second group.
We group the genes in this way so that the lines representing
the tumor types in our plot do not cross. Now, within each gene group,
we again partition the genes into subgroups so that
similarly expressed genes are grouped together. The
Self-Organizing Map (SOM) (Kohonen, 1987) analysis
of variance approach is exploited for this partitioning.
Then, within each of the subgroups, genes are ordered
according to $|\tilde{\gamma}_{1g} - \tilde{\gamma}_{2g}|$. This further
partitioning and ordering confers suitable shapes on the
lines representing the tumor types, so that the user can quickly
visualize the tumor-type discriminating power of the
selected genes. Before generating the final plot, we
normalize the standardized average expression values
$\tilde{\gamma}_{jg}$ as follows:

$$ \gamma^{*}_{jg} = \frac{\tilde{\gamma}_{jg} - \min(\tilde{\gamma}_{jg})}{\max(\tilde{\gamma}_{jg}) - \min(\tilde{\gamma}_{jg})} \tag{7} $$

Finally, the normalized expression values $\gamma^{*}_{jg}$ are
plotted using parallel coordinates, where each parallel
axis corresponds to a specific gene and each polyline
corresponds to a specific tumor type. Each gene
subgroup is plotted in a separate plot. Algorithm 1
specifies our formalized approach to gene visualization.
The purpose of such plots is to qualitatively assess
the performance of the gene selection process and to
find the appropriate cutoff that reduces the number
of false positives while keeping the number of true
positives high.
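
The grouping, ordering, and normalization steps described above can be sketched for the two-group case as follows. The sketch omits the SOM sub-clustering, orders genes by increasing absolute difference, and takes the minimum and maximum over all values in the group being plotted; each of these is an assumption, since the source does not fix them, and the function name is hypothetical.

```python
def prepare_for_parallel_plot(std_avg):
    """Prepare polylines for a two-group parallel coordinate plot.

    std_avg[g] is a pair (value in tumor group 1, value in tumor group 2)
    holding the standardized average expression of gene g.
    Returns a list of (gene order, [polyline of group 1, polyline of group 2]).
    """
    first = [g for g, (a, b) in enumerate(std_avg) if a > b]
    second = [g for g, (a, b) in enumerate(std_avg) if a <= b]
    result = []
    for genes in (first, second):
        if not genes:
            continue
        # Order genes by the absolute difference between the two groups.
        genes = sorted(genes, key=lambda g: abs(std_avg[g][0] - std_avg[g][1]))
        values = [v for g in genes for v in std_avg[g]]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # guard against a constant group
        # Min-max normalization in the spirit of Equation (7).
        lines = [[(std_avg[g][j] - low) / span for g in genes] for j in (0, 1)]
        result.append((genes, lines))
    return result


plots = prepare_for_parallel_plot([(2.0, 1.0), (0.0, 3.0), (4.0, 0.0)])
```

Each returned pair of polylines could then be drawn with any plotting library, one axis position per gene; because the genes are split by the sign of the difference, the two lines within a plot do not cross.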

The following results illustrate the usefulness of
our visualization method provided in Algorithm 1. The
expression patterns of the marker genes associated with
the medulloblastoma survivor and failure groups are shown
in Figs. 2 and 3. The solid and dotted lines represent the
failure (death) and survivor (alive) groups, respectively.
Genes are selected using the Wilcoxon method
described previously; depending on the
q-value cutoff, different numbers of genes are selected. In
both Figs. 2 and 3, each individual graph represents a
group of similarly expressed genes clustered together
as specified in step 5 of Algorithm 1. Figs. 2 and 3
show 280 and 54 selected marker genes, respectively.
We observe that among the 280 selected genes in Fig. 2,
many show similar expression patterns in both the failure
and survivor sample groups; that is, the polylines of the two
sample groups lie close to one another on the parallel axes.
Since each axis represents a different gene, we conclude
that many of the genes in Fig. 2 are falsely identified
as marker genes (false positives). In comparison, the
solid and dotted lines are far apart on most of the parallel
axes in Fig. 3. This indicates that the average gene
expression values of the selected 54 genes are quite
different between the two groups of tissue samples.
Thus, this visualization aids in selecting the correct
threshold in the marker gene selection process.




Algorithm 1 (DATA) to visualize the expression pattern of the selected marker genes

DATA contains the expression levels of the marker genes for all the patient samples, where each row holds the expression values of a different gene and each column represents a different patient sample.

1. Sort the genes in ascending order of their q-values (most significant first) and select the top G genes from the sorted list.

2. Estimate the average of the tumor-specific gene expression values $\hat{\gamma}_{jg}$ within any specific tissue sample type using Eq. (3).

3. Obtain the standardized average gene expression values $\tilde{\gamma}_{jg}$ using Eq. (4).

4. Partition the genes into two groups: (i) $C_1$, where $\tilde{\gamma}_{1g} \geq \tilde{\gamma}_{2g}$, and (ii) $C_2$, where $\tilde{\gamma}_{1g} < \tilde{\gamma}_{2g}$.

5. Within each group $C_c$ (c = 1, 2), again partition the gene expression values $\tilde{\gamma}_{jg}$ into P clusters $\{C_{c1}, \ldots, C_{cP}\}$ exploiting SOM, where each cluster consists of a group of similarly expressed genes.

6. For each of the clusters obtained in the previous step:

a. Order the genes according to $|\tilde{\gamma}_{1g} - \tilde{\gamma}_{2g}|$.

b. Obtain $\gamma^{*}_{jg}$ by normalizing the standardized average expression values $\tilde{\gamma}_{jg}$ using Eq. (7).

c. Plot the normalized average expression values $\gamma^{*}_{jg}$ using parallel coordinates, where each parallel axis corresponds to a specific gene and each polyline corresponds to a specific tissue sample type.

Figure 2. Expression patterns of the marker genes associated with medulloblastoma survivor and failure groups.
Genes are selected using the Wilcoxon method and FDR is controlled using q-values; depending on the q-values,
different numbers of genes are selected (here, 280 genes).

Figure 3. Expression patterns of the marker genes associated with medulloblastoma survivor and failure groups.
Genes are selected using the Wilcoxon method and FDR is controlled using q-values; depending on the q-values,
different numbers of genes are selected (here, 54 genes).





















FUTURE TRENDS 

In this chapter, we estimate the tumor-specific gene
expression measure exploiting an ANOVA model. We
specify the model parameters as fixed effects; however,
such a specification may not always be appropriate.
Rather, considering the model parameters as random
effects may be more appropriate for microarray datasets
(Kerr et al., 2000). Thus, one possible improvement
is to consider all of the effects in our ANOVA model as
random. Further, specifying the random effect parameters
in a Bayesian framework provides a formal way
of exploiting any prior knowledge about the parameter
distribution, if available. We are currently in the process
of adopting a hierarchical Bayesian approach to our
work, following the more recent relevant work of
Ibrahim et al. (2002) and Lewin et al. (2006).



CONCLUSION 

We estimated tumor-specific gene expression
measures using an ANOVA model. These estimates
are then used to identify differentially expressed marker
genes. For evaluating the marker gene identification,
we proposed a novel approach to visualize the average
gene expression values for a specific tissue type. The
proposed visualization plot is useful to qualitatively
evaluate the performance of marker gene selection
methods, as well as to locate the appropriate cutoffs in
the selection process. The research in this chapter was
supported in part through research grants [RG-01-0125,
TG-04-0026] provided by the Whitaker Foundation with
Khan M. Iftekharuddin as the principal investigator.

KEY TERMS

DNA Microarray: A collection of microscopic 
DNA spots, commonly representing single genes, 
arrayed on a solid surface by covalent attachment to 
chemically suitable matrices. 

False Discovery Rate (FDR): Controls the expected 
proportion of false positives instead of controlling 
the chance of any false positives. An FDR threshold 
is determined from the observed p-value distribution 
from multiple single hypothesis tests. 

Histologic Examination: The examination of tissue 
specimens under a microscope. 

Kruskal-Wallis Test: A nonparametric mean test
that can be applied when the number of sample groups is
more than two, unlike the Wilcoxon Rank Sum Test.

Parallel Coordinates: A data visualization scheme
that exploits 2D pattern recognition capabilities of 
humans. In this plot, the axes are equally spaced and 
are arranged parallel to one another rather than being 
arranged mutually perpendicular as in the Cartesian 
scenario. 

q-values: A measure of the false discovery rate
incurred when a particular test is called significant.

Wilcoxon Rank Sum Test: A nonparametric
alternative to the two-sample t-test that is based
on the order in which the observations from the two
samples fall.

INTRODUCTION

In this chapter, we propose a novel algorithm for
characterizing a variety of CNS tumors. The proposed
algorithm is illustrated with an analysis of Affymetrix
gene expression data from CNS tumor samples
(Pomeroy et al., 2002). As discussed in the previous
chapter, entitled CNS Tumor Prediction Using Gene
Expression Data Part I, we used an ANOVA model to
normalize the microarray gene expression measurements.
In this chapter, we introduce a systematic way
of building tumor prototypes to facilitate automatic
prediction of CNS tumors.



BACKGROUND 

DNA microarrays, also known as genome or DNA
chips, have become an important tool for predicting
CNS tumor types (Pomeroy et al., 2002, Islam et al.,
2005, Dettling et al., 2002). Several researchers have
shown that cluster analysis of DNA microarray gene
expression data is helpful in finding functionally
similar genes and in predicting different cancer types.
Eisen et al. (1998) used average linkage hierarchical
clustering with the correlation coefficient as the similarity
measure in organizing gene expression values from
microarray data. They showed that functionally similar
genes group into the same cluster. Herwig et al.
(1999) proposed a variant of the K-means algorithm
to cluster genes of cDNA clones. Tomayo et al. (1999)
used self-organized feature maps (SOFMs) to organize
genes into biologically relevant groups. They found
that SOFMs reveal true cluster structure compared
to the rigid structure of hierarchical clustering and
the structureless K-means approach. Considering the
many-to-many relationships between genes and their
functions, Dembele et al. (2003) proposed a fuzzy C-means
clustering technique. The central goal of these
clustering procedures (Eisen et al., 1998, Herwig et al.,
1999, Tomayo et al., 1999, Dembele et al., 2003) was
to group genes based on their functionality. However,
none of these works provides a systematic way of
discovering or predicting tissue sample groups as we
propose in our current work.

To identify tissue sample groups, Alon et al. (1999)
proposed a clustering algorithm that uses a deterministic-annealing
algorithm to organize the data in a binary
tree. Alizadeh et al. (2000) demonstrated a successful
molecular classification scheme for cancers from gene
expression patterns by using an average linkage hierarchical
clustering algorithm with Pearson's correlation
as the similarity measure. However, no formal way of
predicting the category of a new tissue sample is reported
by Alon et al. (1999) or Alizadeh et al. (2000). Such class
prediction problems were addressed by Golub et al.
(1999), who used SOFMs to successfully discriminate
between two types of human acute leukemia. Dettling
et al. (2002) incorporated the response variables into
gene clustering and located differentially expressed
groups of genes from the clustering result. These gene
groups were then used to predict the categories of new
samples. However, none of the above-mentioned works
(Dettling et al., 2002, Golub et al., 1999, Alon et al.,
1999, Alizadeh et al., 2000) considered the correlation
among the genes in classifying and/or predicting tissue
samples. Moreover, none of these provided a systematic
way of handling the probable subgroups within
the known groups. In this chapter, we consider both
correlations among the genes and probable subgroups
within the known groups by forming appropriate tumor
prototypes. Further, a major drawback of these analyses
(Dettling et al., 2002, Eisen et al., 1998, Herwig et al.,
1999, Tomayo et al., 1999, Dembele et al., 2003, Golub
et al., 1999, Alon et al., 1999, Alizadeh et al., 2000)
is insufficient normalization. Although most of these
methods normalize the dataset to remove the array effects,
they do not concentrate on removing other sources
of variation present in the microarray data.

Our primary objective in this chapter is to develop 
an automated prediction scheme for CNS tumors, 
based on DNA microarray gene expressions of tissue 
samples. We propose a novel algorithm for deriving 
prototypes for different CNS tumor types, based on Af- 
fymetrix HuGeneFL microarray gene expression data 
from Pomeroy et al. (2002). In classifying the CNS 
tumor samples based on gene expression, we consider 
molecular information, such as the correlations among 
gene expressions and probable subgroupings within the 
known histological tumor types. We demonstrate how 
the model can be utilized in CNS tumor prediction. 



CNS TUMOR PROTOTYPE FOR 
AUTOMATIC TUMOR DETECTION 

The workflow to build the tumor prototypes is shown
in Fig. 1. In the first step, we obtain the tumor-type-specific
gene expression measures. Then, we identify
the marker genes that are significantly differentially
expressed among tissue types. Next, a visualization
technique is used to analyze the appropriateness of
the marker gene selection process. We organize the
marker genes in groups so that highly correlated genes
are grouped together. In this clustering process, genes
are grouped based on their tumor-type-specific gene
expression measures. Then, we obtain eigengene
expression measures from each individual gene group by
projection of gene expressions onto the first few principal
components. At the end of this step, we replace the
gene expression measurements with eigengene expression
values that conserve correlations between strongly
correlated genes. We then divide the tissue samples of
known tumor types into subgroups. The centroids of
these subgroups of tissue samples with eigengene
expressions represent the prototype of the corresponding
tumor type. Finally, any new tissue sample is predicted
as the tumor type of the closest centroid. This proposed
novel prediction scheme considers both the correlation
among the highly correlated genes and the probable
phenotypic subgrouping within the known tumor
types. These issues are often ignored in the literature
on predicting tumor categories. Details of the steps
up to the identification of marker genes are provided in
the previous chapter, entitled CNS Tumor Prediction
Using Gene Expression Data Part I. In this section, we
provide the details of the subsequent steps.

Figure 1. Simplified workflow to build the tumor prototypes

Now, we discuss the creation of the tumor prototypes
using the tumor-specific expression values of
our significantly differentially expressed marker genes
identified in the previous step. Many of the marker
genes are likely to be highly correlated. Such correlations
among the genes affect successful tumor classification.
However, this gene-to-gene correlation may provide
important biological information. Hence, the inclusion
of the appropriate gene-to-gene correlations in the
tumor model may help to obtain a more biologically
meaningful tumor prediction. To address this non-trivial
need, we first group the highly correlated genes using
the complete linkage hierarchical approach, wherein the
correlation coefficient is considered as the pair-wise
similarity measure of the genes. Next, for each of the
clusters, we compute the principal components (PCs)
and project the genes of the corresponding cluster onto
the first 3 PCs to obtain eigengene expressions (Speed,
2003). Note that the PCs and the eigengene expressions
are computed separately for each cluster. Such
eigengenes encode the correlation information among
the highly correlated genes that are clustered together.
Recently, molecularly distinct subgrouping within the
same histological tumor type has been reported (Taylor
et al., 2005). To find a subgrouping within the same
histological tumor type, we again use self-organizing
maps (SOMs) (Kohonen, 1987) to cluster the tissue
samples within each tumor group. This subgrouping
within each group captures the possible genotypic
variations within the same histological tumor type.
Now, the prototype of any specific histological tumor
type is composed of the centroids obtained from the
corresponding SOM grid. Algorithm 1 shows our steps
for building the tumor prototype.
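
The gene-grouping step described above can be illustrated with a small complete-linkage sketch. It assumes that the dissimilarity between two genes is one minus their Pearson correlation and that merging stops at a requested number of clusters; both are assumptions, as are the function names, and gene profiles with zero variance are not handled.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering of gene profiles with complete linkage."""
    def dist(i, j):
        return 1.0 - pearson(profiles[i], profiles[j])

    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance of the farthest pair of members.
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0], [6.0, 4.0, 1.0]]
gene_clusters = complete_linkage(profiles, 2)
```

Here the two increasing profiles end up in one cluster and the two decreasing profiles in the other. The cubic cost of this naive loop is acceptable only for small marker gene sets.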

To predict the tumor category of any new sample,
we calculate the distances between the new sample
and each of the prototype subgroups obtained using
Algorithm 1. The category of the sample is predicted
as that of the closest subgroup. The distance between
the new sample and the xth subgroup, $d_x$, is calculated
based on the Euclidean distance as follows:

$$ d_x = \sqrt{\sum_{k=1}^{N} (g_{xk} - g_k)^2} \tag{1} $$

where $g_{xk}$ is the center value of the kth eigengene in the xth
subgroup, $g_k$ is the expression measure of the kth eigengene
of the new sample, and N is the total number of eigengenes.
This distance measure deliberately ignores the
non-representative correlations among the eigengene
expressions since they are not natural and hence
difficult to interpret.
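
The prediction rule above amounts to a nearest-centroid search over all prototype subgroups. The following minimal sketch assumes the prototypes are stored as a dictionary from a tumor-type label to a list of subgroup centroids; this layout, the labels, and the function name are illustrative only.

```python
import math


def predict_tumor_type(sample, prototypes):
    """Return the label and distance of the closest prototype subgroup.

    sample     -- eigengene expression vector of the new tissue sample
    prototypes -- dict mapping a tumor-type label to its subgroup centroids
    """
    best_label, best_dist = None, math.inf
    for label, centroids in prototypes.items():
        for centroid in centroids:
            # Euclidean distance between the sample and this centroid.
            d = math.sqrt(sum((c - s) ** 2 for c, s in zip(centroid, sample)))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist


prototypes = {
    "type A": [[0.0, 0.0], [1.0, 1.0]],  # two subgroups of one tumor type
    "type B": [[5.0, 5.0]],
}
label, distance = predict_tumor_type([4.0, 4.0], prototypes)
```

Because every subgroup centroid is compared separately, a tumor type with several molecular subgroups can still be matched by a sample that lies close to only one of them.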

Table 1 shows the efficacy of our model in classifying
five categories of tissues simultaneously. Table
2 shows the performance comparison between our
proposed prediction scheme and the method adopted by
Pomeroy et al. (2002). We observe that our prediction
scheme outperforms the other prediction method in all
three cases. The most noticeable difference is with data
group C, where we obtain 100% prediction accuracy
compared with 78% accuracy. More detailed results
and discussion can be found in Islam et al. (2006a,
2006b, 2006c).



FUTURE TRENDS 

In this work, we estimated the tumor-specific gene 
expression measure exploiting an ANOVA model 
with the parameters as fixed effects. As discussed in 
the future trends section of the chapter entitled CNS
Tumor Prediction Using Gene Expression Data Part I,
we may consider all of the effects in our ANOVA model
as random and specify the random effect parameters
in a Bayesian framework. Once the distributions of
tumor-specific gene expression measures are obtained 
with satisfactory confidence, more representative tumor 
prototypes may be obtained and a more accurate tumor 
prediction scheme can be formalized. Representing the 
tumor prototypes with a mixture of Gaussian models 
may provide a better representation. In that case, finding 
the number of components in the mixture is another 
research question left for future work. 



CONCLUSION 

For automatic tumor prediction, we have proposed a 
novel algorithm for building CNS tumor prototypes 



Algorithm 1 (DATA) to build tumor prototype 



DATA contains the expression levels of the marker genes for all the patient samples; where each row represents different gene expression values 
and each column represents different patient samples. 

1. Cluster genes into K partitions, C= (C h C 2 ,. . ., C k }, using Complete Linkage Hierarchical approach and correlation coefficient as the pair 
wise similarity measure. 

2. For each cluster Q 

a. Compute the principal components (PCs). 

b. If the cluster cardinality is greater than 3, project the genes onto the first 3 PCs else project the genes onto the first PC. 
Note: the projected vectors are considered as eigengene expressions. 

3. Considering the eigengene expressions as feature vectors, cluster each histological tumor group into subgroups exploiting SOM. 

4. The set of centroid of the corresponding SOM grid is designated as the tumor prototype of that particular histological tumor type. 
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Table 1. Confusion matrix for five categories of tumor samples 
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Precision 


1.0 


0.9 


0.77 


1.0 


1.0 






Overall Classification Accuracy: 90% 



Table 2. Comparison table 

Data    Number of    Number of   Classification Accuracy   Classification Accuracy 
Group   Categories   Samples     (our method)              (Pomeroy et al., 2002) 
A       5            42          90%                       83% 
B       2            34          100%                      97% 
C       2            60          100%                      78% 

based on Affymetrix microarray gene expression values. 
We derived prototypes for different histological tumor 
types considering their genotype heterogeneity within 
groups. The eigengenes encode the correlations among 
gene expressions into the prototypes. Also, the eigen- 
gene expression measures are derived from estimated 
tumor-specific gene expression measures that are free 
from other unwanted sources of variations. We proposed 
a novel, seamless procedure that integrates normaliza- 
tion and tumor prediction considering both probable 
subgroupings within known tumor types and probable 
correlations among genes. The strong compliance of 
our results with the current molecular classification of 
the available tumor types suggests that our proposed 
model and its unique solution have significant practical 
value for automatic CNS tumor detection. 

The research in this chapter was supported in part 
through research grants [RG-01-0125, TG-04-0026] 
provided by the Whitaker Foundation with Khan M. 
Iftekharuddin as the principal investigator. 

KEY TERMS 

DNA Microarray: Also known as a DNA chip, it 
is a collection of microscopic DNA spots, commonly 
representing single genes, arrayed on a solid surface by 
covalent attachment to chemically suitable matrices. 

False Discovery Rate (FDR): FDR controls the 
expected proportion of false positives among the tests 
called significant, instead of controlling the chance of 
any false positive. An FDR threshold is determined from 
the observed p-value distribution of multiple single 
hypothesis tests. 

Histologic Examination: The examination of tissue 
specimens under a microscope. 

Kruskal-Wallis Test: A nonparametric, rank-based 
test for differences between groups, which can be 
applied when the number of sample groups is more 
than two, unlike the Wilcoxon Rank Sum Test. 

Parallel Coordinates: A multidimensional data 
visualization scheme that exploits 2D pattern recogni- 
tion capabilities of humans. In this plot, the axes are 
equally spaced and are arranged parallel to one another 
rather than being arranged mutually perpendicular as 
in the Cartesian scenario. 

q-Values: A measure of the FDR incurred when any 
particular test is called significant. 

Self-Organizing Maps (SOMs): A method that 
learns to cluster input vectors according to how they 
are naturally grouped in the input space. In its simplest 
form, the map consists of a regular grid of units, and 
the units learn to represent statistical data described by 
model vectors. Each map unit contains a vector used 
to represent the data. During the training process, the 
model vectors are changed gradually, and the map then 
forms an ordered non-linear regression of the model 
vectors into the data space. 

Wilcoxon Rank Sum Test: A nonparametric 
alternative to the two sample t-test which is based 
on the order in which the observations from the two 
samples fall. 
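
The FDR entry above says that a threshold is determined from the observed p-values. One common way to do this is the Benjamini-Hochberg step-up procedure, sketched below. The chapter does not state which procedure it used, so the choice of Benjamini-Hochberg, the function name, and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank r (1-based) with p_(r) <= (r / m) * fdr.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])


# Hypothetical p-values from six single hypothesis tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals, fdr=0.05))  # → [0, 1]
```

Only the two smallest p-values pass here, although four of the six lie below 0.05 when each test is considered on its own.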
INTRODUCTION 

Expert combination is a classic strategy that has been 
widely used in various problem-solving tasks. A team 
of individuals with diverse and complementary skills 
tackles a task jointly, so that integrating the strengths 
of the individuals yields a performance better than any 
single individual can achieve. Starting from the late 
1980s in the handwritten character recognition literature, 
studies have been made on combining multiple classifiers. 
Also, from the early 1990s in the fields of neural networks 
and machine learning, efforts have been made, under the 
names of ensemble learning and mixture of experts, on how 
to learn jointly a mixture of experts (parametric models) 
and a combining strategy for integrating them in an 
optimal sense. 

This article gives a general sketch of these two streams 
of studies. It re-elaborates the essential tasks, basic 
ingredients, and typical combining rules, and it suggests 
a general combination framework (especially a concise and 
more useful one-parameter special case, called 
α-integration) that unifies a number of typical classifier 
combination rules and several mixture based learning 
models, as well as the max rule and min rule used in the 
literature on fuzzy systems. 



BACKGROUND 

Both streams of studies feature two periods of 
development. The first period runs roughly from the late 
1980s to the early 1990s. In the handwritten character 
recognition literature, various classifiers were 
developed from different methodologies and different 
features, which motivated studies on combining multiple 
classifiers for a better performance. A systematic 
effort in this early stage was made in (Xu, Krzyzak & 
Suen, 1992), with an attempt at setting up a general 
framework for classifier combination. As re-elaborated 
in Tab. 1, not only were two essential tasks identified 
and a framework of three-level combination presented for 
the second task, to cope with different types of 
classifier output information, but several rules were 
also investigated for two of the three levels, especially 
with the Bayes voting rule, product rule, and 
Dempster-Shafer rule proposed. Subsequently, the 
remaining level (i.e., the rank level) was soon studied 
in (Ho, Hull, & Srihari, 1994) via the Borda count. 

Interestingly and complementarily, in almost the 
same period the first task happened to be the focus of 
studies in the neural network learning literature. 
Two problems were encountered: the same type of neural 
net admits different choices of scale (e.g., the number 
of hidden units in a three-layer net), and different 
initializations lead to different locally optimal results 
on the same net. Studies were therefore made on how to 
train an ensemble of diverse and complementary networks 
via cross-validation partitioning, correlation reduction 
pruning, performance guided re-sampling, etc., such that 
the resulting combination produces a better 
generalization performance (Hansen & Salamon, 1990; Xu, 
Krzyzak, & Suen, 1991; Wolpert, 1992; Baxt, 1992; 
Breiman, 1992&94; Drucker, et al, 1994). In addition to 
classification, this stream also handles function 
regression by integrating individual estimators through 
a linear combination (Perrone & Cooper, 1993). 
Furthermore, this stream progressed to consider the two 
tasks in Tab. 1 jointly with the help of 
mixture-of-experts (ME) models (Jacobs, et al, 1991; 
Jordan & Jacobs, 1994; Xu & Jordan, 1993; Xu, Jordan & 
Hinton, 1994), which can learn either or both of the 
combining mechanism and the individual experts in a 
maximum likelihood sense. 

Table 1. Essential tasks and their implementations 

Two Tasks (a quotation from Xu, Krzyzak and Suen, 1992): 

Task 1: "How many and what type of classifiers should be used for a specific application? 
and for each classifier what type of features should we use?", as well as other problems that 
are related to the construction of those individual and complementary classifiers. 

Task 2: "How to combine the results from different existing classifiers so that a better 
result can be obtained?" 

Two Styles of Implementations 

Two Stage Implementation 
• Task 1 is completed in advance, with the resulting classifiers being diverse and complementary. 
• Perform Task 2 in one of three levels (Xu, Krzyzak and Suen, 1992). 

Joint Implementation 
Two tasks made jointly or alternatively: 
• under a same criterion: Mixture of experts (ME) (Jacobs, et al, 1991; Jordan & Jacobs, 1994); 
  Alternative ME (Xu & Jordan, 1993; Xu, Jordan & Hinton, 1994); EM-RBF (Xu, 1998); 
  three layer nets, etc. 
• others: Stacking, Boosting, ..., etc. (Breiman, 1992&94; Wolpert, 1992) 

The studies of the two streams in the first period 
jointly set up the landscape of this emerging research 
area, together with a number of typical topics or 
directions. Thereafter, further studies have been 
conducted on each of these typical directions. First, 
theoretical analyses have been made for deeper insights 
and improved performance. For example, convergence 
analyses of the EM algorithm for mixture based learning 
are conducted in (Jordan & Xu, 1995; Xu & Jordan, 1996). 
In Turner & Ghosh (1996), the additive errors of the 
posteriori probabilities produced by classifiers or 
experts are considered, with the variances and 
correlations of these errors investigated for improving 
the performance of a sum based combination. In Kittler, 
et al (1998), the effect of these errors on the 
sensitivity of the sum rule versus the product rule is 
further investigated, with the conclusion that summation 
is much preferred. Also, a theoretical framework is 
suggested for taking several combining rules as special 
cases (Kittler, 1998), without awareness that this 
framework is actually the mixture-of-experts model that 
was proposed first for combining multiple function 
regressions in (Jacobs, et al, 1991) and then for 
combining multiple classifiers in (Xu & Jordan, 1993). 
In addition, another theoretical study is made on six 
classifier fusion strategies in (Kuncheva, 2002). Second, 
there are further studies on the Dempster-Shafer rule 
(Al-Ania, 2002) and other combining methods, such as 
rank based and boosting based methods, as well as local 
accuracy estimates (Woods, Kegelmeyer, & Bowyer, 1997). 
Third, there are a large number of applications. Due to 
space limits, the reader is referred to Ranawana & 
Palade (2006) and Sharkey & Sharkey (1999) for details. 



A GENERAL ARCHITECTURE, TWO 
TASKS, AND THREE INGREDIENTS 

We consider a general architecture shown in Fig. 1. 
There are $\{e_j(x)\}_{j=1}^{k}$ experts, with each $e_j(x)$ as either a 
classifier or an estimator. As shown in Tab. 2, a classifier 
outputs one of three types of information, on which 
we have three levels of combination. The first two 
can be regarded as special cases of the third one, which 
outputs a vector of measurements. A typical example 
is $[p_j(1|x), \ldots, p_j(m|x)]^T$, with each $1 \ge p_j(\ell|x) \ge 0$ 
expressing a posteriori probability that $x$ is classified to 
the $\ell$-th class. Also, $p_j(\ell|x) = p_j(y = \ell|x)$ can be further 
extended to $p_j(y|x)$, which describes a distribution for a 
regression $x \to y \in R^m$. In Figure 1, there is also a gating 
net that generates signals $\{\alpha_j(x)\}_{j=1}^{k}$ to modulate 
experts by a combining mechanism $M(x)$. 

Figure 1. A general architecture for expert combination 
[Figure: experts $e_1(x), \ldots, e_k(x)$ are combined into the output $y$ under signals $\alpha_1(x), \ldots, \alpha_k(x)$ from a gating net.] 
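
The architecture of Fig. 1 can be sketched in a few lines of Python. The sketch assumes the simplest combining mechanism, a weighted sum of the experts' posteriors, and a softmax gate that turns arbitrary scores into weights that are positive and sum to one. The softmax gate, the two constant experts, and all names are illustrative assumptions rather than the authors' implementation.

```python
from math import exp


def softmax(scores):
    """Turn arbitrary gate scores into weights that are > 0 and sum to 1."""
    top = max(scores)
    exps = [exp(s - top) for s in scores]   # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def combine(x, experts, gate_scores):
    """M(x) = sum_j alpha_j(x) * p_j(y|x), evaluated for every class y."""
    alphas = softmax([g(x) for g in gate_scores])
    outputs = [e(x) for e in experts]       # each is [p_j(1|x), ..., p_j(m|x)]
    m = len(outputs[0])
    return [sum(a * p[y] for a, p in zip(alphas, outputs)) for y in range(m)]


# Two hypothetical experts over m = 2 classes and a hypothetical gate:
# the gate favors expert 0 for positive x and expert 1 for negative x.
experts = [lambda x: [0.9, 0.1], lambda x: [0.2, 0.8]]
gate_scores = [lambda x: x, lambda x: -x]

print(combine(0.0, experts, gate_scores))   # equal weights at x = 0
print(combine(5.0, experts, gate_scores))   # dominated by expert 0
```

Because the weights sum to one and each expert outputs a probability vector, the combined output is again a probability vector.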



Table 2. Three levels of combination 

Three Rules for Combination on Level 3 

Sum rule (Bayes Voting): Given $k$ classifiers, the $j$-th classifier classifies $x$ 
to $y$ with a probability $p_j(y|x)$; we sum up to get a combination 
$p(y|x) \propto \sum_{j=1}^{k} p_j(y|x)$. 
See eqn. (4) in (Xu, Krzyzak and Suen, 1992). 

Product rule: If $k$ classifiers are independent, another combination is given by 
$p(x \in C_y | x) = p(x \in C_y) \prod_{j=1}^{k} \frac{p(x \in C_y | e_j(x))}{p(x \in C_y)}$, 
or concisely $p(y|x) = p^{1-k}(y) \prod_{j=1}^{k} p_j(y|x)$. 
See eqn. (31) in (Xu, Krzyzak and Suen, 1992). 

Dempster-Shafer rule: $bel(A) = \sum_{B \subseteq A} m(B)$. 
See Sec. VI in (Xu, Krzyzak and Suen, 1992). 
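
The sum rule and the product rule of Tab. 2 can be compared on a small example. The sketch below averages the posteriors for the sum rule and, for the product rule, uses the concise form with the class prior and then renormalizes. The renormalization and the numbers are assumptions made for illustration.

```python
from math import prod


def sum_rule(posteriors):
    """Average the experts' posteriors class by class."""
    k, m = len(posteriors), len(posteriors[0])
    return [sum(p[y] for p in posteriors) / k for y in range(m)]


def product_rule(posteriors, prior):
    """P(y|x) proportional to prior(y)**(1 - k) * prod_j p_j(y|x), normalized."""
    k, m = len(posteriors), len(prior)
    raw = [prior[y] ** (1 - k) * prod(p[y] for p in posteriors) for y in range(m)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical posteriors of k = 3 classifiers over m = 2 classes.
# Two classifiers favor class 0; the third almost rules it out.
posteriors = [[0.9, 0.1], [0.9, 0.1], [0.005, 0.995]]

print(sum_rule(posteriors))                  # class 0 wins
print(product_rule(posteriors, [0.5, 0.5]))  # class 1 wins
```

The two rules disagree here because a single near-zero posterior dominates a product but not a sum.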



Based on this architecture, the two essential tasks of 
expert combination can still be quoted from (Xu, 
Krzyzak & Suen, 1992) with a slight modification of 
Task 1, as shown in Tab. 1: the phrase "for a specific 
application?" should be deleted in consideration of the 
previously introduced studies (Hansen & Salamon, 
1990; Xu, Krzyzak, & Suen, 1991; Wolpert, 1992; 
Baxt, 1992; Breiman, 1992&94; Drucker, et al, 1994; 
Turner & Ghosh, 1996). 

Insights can be obtained by considering the three basic 
ingredients of the two streams of studies, as shown in 
Fig. 2. Combinatorial choices of different ingredients 
lead to different specific models for expert combination, 
and differences in the role played by each ingredient 
highlight the different focuses of the two streams. In 
the stream of neural networks and machine learning, 
provided with a structure for each $e_j(x)$, a gating 
structure, and a combining structure $M(x)$, all the 
remaining unknowns are determined under the guidance of 
a learning theory in terms of minimizing an error cost. 
Such a minimization is implemented via an optimizing 
procedure by a learning algorithm, based on a training 
set $\{x_t, y_t\}_{t=1}^{N}$ that teaches a target $y_t$ for each 
mapping $x_t \to R^m$. In the stream of combining 
classifiers, by contrast, all $\{p_j(y|x)\}_{j=1}^{k}$ are known, 
with no unknowns left to be specified. Also, $M$ is 
designed according to certain heuristics or principles, 
with or without the help of a training set, and studies 
are mainly placed on developing and analyzing different 
combining mechanisms, which we discuss further below. 
The final combining performance is empirically evaluated 
by the misclassification rate, but there is no effort 
yet on developing a theory for an $M$ that minimizes the 
misclassification rate or a cost function, though there 
are some investigations on how estimated posteriori 
probabilities can be improved by a sum rule and on the 
error sensitivity of estimated posteriori probabilities 
(Turner & Ghosh, 1996; Kittler, et al, 1998). This 
under-explored direction also motivates the future 
studies discussed later. 



f-COMBINATION 

The arithmetic, geometric, and harmonic means of 
non-negative numbers $b_j \ge 0$, $j = 1, \ldots, k$, have been 
further extended into one called the f-mean: 

$$m_f = f^{-1}\Big(\sum_{j=1}^{k} \alpha_j f(b_j)\Big),$$ 

where $f(r)$ is a monotonic scalar function, and 
$\alpha_j > 0$, $\sum_{j=1}^{k} \alpha_j = 1$ 
(Hardy, Littlewood, & Polya, 1952). 

We can further generalize this f-mean to the general 
architecture shown in Fig. 1, resulting in the following 
f-combination: 

$$M(x) = f^{-1}\Big(\sum_{j=1}^{k} \alpha_j(x) f(p_j(y|x))\Big), \quad \text{or} \quad f(M(x)) = \sum_{j=1}^{k} \alpha_j(x) f(p_j(y|x)),$$ 

where $\alpha_j(x) > 0$, $\sum_{j=1}^{k} \alpha_j(x) = 1$. 

In the following, we discuss its use as a general 
framework to unify not only typical classifier combining 
rules but also mixture-of-experts learning and RBF 
net learning, as shown in Tab. 3. 

We observe the three columns for three special cases 
of $f(r)$. The first column is the case $f(r) = r$, where we 
return to the ME model: 

$$M(x) = \sum_{j=1}^{k} \alpha_j(x) p_j(y|x),$$ 

which was proposed first for combining multiple 
regressions in (Jacobs, et al, 1991) and then for 
combining classifiers in (Xu & Jordan, 1993). For different 
special cases of $\alpha_j(x)$, we are led to a number of existing 
typical examples. As already pointed out in (Kittler, et 
al, 1998), the first three rows are four typical classifier 
combining rules (the 2nd row directly applies to the 
min-rule too). The next three rows are three types of 
ME learning models; for a rather systematic summary, 
the reader is referred to Sec. 4.3 in (Xu, 2001). The 
last row is a recent development of the 3rd row. 

Figure 2. Three basic ingredients 
[Figure: combination versus learning; design and then evaluation; error reduction learning.] 

Table 3. Typical examples (including several existing rules and potential topics) 

Rows: choices of $\alpha_j(x)$. Columns: choices of $f(r)$ in $f(M)$. 

$\alpha_j(x)$ | $f(r) = r$ | $f(r) = \ln r$ | $f(r) = 1/r$ 
$\alpha_j(x) = 0$, except $\alpha_{j^*}(x) = 1$ with (a) $j^* = \arg\max_j p_j(y|x)$ or (b) $j^* = \arg\min_j p_j(y|x)$ | (a) max-rule, (b) min-rule | (a) max-rule, (b) min-rule | (a) max-rule, (b) min-rule 
$\alpha_j(x) = 1/k$ | Average Bayes or Bayes voting (Xu, Krzyzak & Suen, 1992) | Product rule (Xu, Krzyzak & Suen, 1992; Kittler, et al, 1998; Hinton, 2002) | Harmonic mean 
... | Mixture using variances (MUV) | To be explored | To be explored 
... | Mixture-of-experts (ME) (Jacobs, et al, 1991) | To be explored | To be explored 
... | Alternative ME (Xu, Jordan & Hinton, 1994) | To be explored | To be explored 
... subject to ... = const | Extended Normalized RBF (Xu, 1998) | To be explored | To be explored 
... | Belief net based MUV (Lee, et al) | To be explored | To be explored 

The 2nd row of the 2nd column is the geometric 
mean: 

$$M(x) = \sqrt[k]{\prod_{j=1}^{k} p_j(y|x)},$$ 

which is equal to the product rule (Xu, Krzyzak and 
Suen, 1992; Kittler, et al, 1998; Hinton, 2002) if each 
a priori is equal, i.e., $\alpha_j(x) = 1/m$. Generally, if 
$\alpha_j(x) \ne 1/m$, there is a difference by a scaling factor 
$\alpha_j(x)^{1/k-1}$. The product rule works in a probability 
theory sense under the condition that the classifiers are 
mutually independent. In (Kittler, et al, 1998), in an 
attempt to discuss a number of rules under a unified 
system, the sum rule is approximately derived from the 
product rule, under an extra condition that is usually 
difficult to satisfy. Actually, such an imposed link 
between the product rule and the sum rule is 
unnecessary: the sum 

$$M(x) = \sum_{j=1}^{k} \alpha_j(x) p_j(y|x)$$ 

is just a marginal probability 

$$\sum_{j=1}^{k} p(y, j|x),$$ 

which is already in the framework of probability theory. 
That is, both the sum rule and the product rule already 
coexist in the framework of probability theory. 

On the other hand, it can be observed that the 
sum 

$$\sum_{j=1}^{k} \alpha_j(x) \ln p_j(y|x)$$ 

is dominated by a $p_j(y|x)$ if it is close to 0. That is, 
this combination expects every expert to cast enough 
votes; otherwise the combined vote will still be very 
low just because a single expert casts a very low vote. 
In other words, this combination can be regarded as a 
relaxed logical AND that is beyond the framework of 
probability theory when $\alpha_j(x) \ne 1/m$. However, staying 
within the framework of probability theory does not mean 
that a rule is better, not only because it requires that 
the classifiers be mutually independent, but also because 
a theoretical analysis of both rules in the sense of 
classification errors is lacking, for which further 
investigations are needed. 

In Tab. 3, the 2nd row of the third column is the 
harmonic mean. It can be observed that the problem 
of combining degrees of support is changed into 
a problem of combining degrees of disagreement. This 
is interesting; unfortunately, efforts of this kind are 
still seldom found. There are also exceptions that 
cannot be included in the f-combination, such as the 
Dempster-Shafer rule (Xu, Krzyzak and Suen, 
1992; Al-Ania, 2002) and the rank based rule (Ho, Hull, 
& Srihari, 1994). 
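
The three columns discussed above differ only in the choice of f. A minimal sketch, with hypothetical weights and votes for a single candidate label y, shows how the same weighted inputs yield the arithmetic, geometric, and harmonic combinations. The function and variable names are assumptions.

```python
from math import exp, log


def f_combination(alphas, probs, f, f_inv):
    """M = f^{-1}( sum_j alpha_j * f(p_j) ) for one candidate label y."""
    return f_inv(sum(a * f(p) for a, p in zip(alphas, probs)))


# The three special cases of f(r), each paired with its inverse.
RULES = {
    "arithmetic": (lambda r: r, lambda s: s),               # f(r) = r
    "geometric": (log, exp),                                # f(r) = ln r
    "harmonic": (lambda r: 1.0 / r, lambda s: 1.0 / s),     # f(r) = 1/r
}

alphas = [1 / 3, 1 / 3, 1 / 3]
votes = [0.9, 0.9, 0.01]   # hypothetical: one expert nearly vetoes label y

for name, (f, f_inv) in RULES.items():
    print(name, round(f_combination(alphas, votes, f, f_inv), 4))
```

With these inputs the arithmetic combination stays high while the geometric and harmonic ones are pulled toward the single low vote, which illustrates the relaxed logical AND behavior described in the text.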



α-INTEGRATION 

After completing the above f-combination, the first 
author became aware of the work by (Hardy, Littlewood, 
& Polya, 1952) through a forthcoming paper 
(Amari, 2007), which studies a much more concise and 
useful one-parameter special case called 
α-integration, with the help of a concrete mathematical 
foundation from an information geometry perspective. 
Imposing the additional but reasonable requirement that 
the f-mean be linear scale-free, i.e.: 

$$c\, m_f = f^{-1}\Big(\sum_{j=1}^{k} \alpha_j f(c\, b_j)\Big)$$ 

for any scale $c$, the alternative choices of $f(r)$ reduce 
to only the following one: 

$$f_\alpha(r) = \begin{cases} r^{(1-\alpha)/2}, & \alpha \ne 1, \\ \ln r, & \alpha = 1. \end{cases}$$ 

It is not difficult to check that 

$$f_\alpha(r) = \begin{cases} r, & \alpha = -1, \\ \ln r, & \alpha = 1, \\ 1/r, & \alpha = 3. \end{cases}$$ 

Thus, the discussions on the examples in Tab. 3 are 
applicable to this $f_\alpha(r)$. Moreover, the first row in 
Tab. 3 holds when $\alpha = -\infty$ and $\alpha = +\infty$ for whatever 
gating net, which thus includes two typical operators of 
fuzzy systems as special cases too. Also, the family is 
systematically modulated by a parameter 
$-\infty < \alpha < +\infty$, which provides not only a spectrum 
from the most optimistic integration to the most 
pessimistic integration as $\alpha$ varies from $-\infty$ to 
$+\infty$, but also a possibility of adapting $\alpha$ for the 
best combining performance. 

Furthermore, Amari (2007) also provides a theoretical 
justification that α-integration is optimal in the sense 
of minimizing a weighted average of α-divergence. 
Moreover, it provides a potential road for studies on 
combining classifiers and learning mixture models from 
the perspective of information geometry. 

FUTURE TRENDS 

Further studies are expected along several directions 
as follows: 

• Empirical and analytical comparisons of performance 
  are needed for those unexplored or less explored 
  items in Tab. 3. 

• Is there a best structure for $\alpha_j(x)$? Comparisons 
  need to be made on different types of $\alpha_j(x)$, 
  especially the ones of the MUV type in the last row 
  and the ME types from the 4th to the 7th rows. 

• Is it necessary to relax the constraint 
  $\alpha_j(x) > 0$, $\sum_{j=1}^{k} \alpha_j(x) = 1$, 
  e.g., by removing the non-negativity requirement, and 
  to relax the distribution $p_j(y|x)$ to other types of 
  functions? 

• How can the weights $\alpha_j(x)$ be learned under a 
  generalization error bound? 

• As discussed in Fig. 2, classifier combination and 
  mixture based learning are two aspects with different 
  features. How can each part take its best role in an 
  integrated system? 



CONCLUSION 



Updating the purpose of (Xu, Krzyzak & Suen, 1992), 
the article provides not only a general sketch of studies 
on combining classifiers and learning mixture models, 
but also a general combination framework to unify a 
number of classifier combination rules and mixture 
based learning models, as well as a number of 
directions for further investigations. 

KEY TERMS 

Conditional Distribution $p(y|x)$: Describes the 
uncertainty that an input $x$ is mapped into an output $y$ 
that simply takes one of several labels. In this case, 
$x$ is classified into the class label $y$ with a probability 
$p(y|x)$. Also, $y$ can be a real-valued vector, for which $x$ is 
mapped into $y$ according to a density distribution $p(y|x)$. 

Classifier Combination: Given a number of classifiers, 
each classifies the same input $x$ into a class label, 
and the labels may be different for different classifiers. 
We seek a rule $M(x)$ that combines these classifiers into a 
new one that performs better than any one of them. 

Sum Rule (Bayes Voting): A classifier that classifies 
$x$ to a label $y$ can be regarded as casting one vote for 
this label; the simplest combination is to count the votes 
received by every candidate label. That the $j$-th classifier 
classifies $x$ to a label $y$ with a probability $p_j(y|x)$ 
means that one vote is divided among different candidates 
in fractions. We can sum up: 

$$\sum_{j=1}^{k} p_j(y|x)$$ 

to count the votes on a candidate label $y$, which is 
called Bayes voting since $p(y|x)$ is usually called the 
Bayes posteriori probability. 

Product Rule: When $k$ classifiers $\{e_j(x)\}_{j=1}^{k}$ are 
mutually independent, a combination is given by 

$$p(x \in C_y | x) = p(x \in C_y) \prod_{j=1}^{k} \frac{p(x \in C_y | e_j(x))}{p(x \in C_y)}$$ 

or concisely 

$$p(y|x) = p^{1-k}(y) \prod_{j=1}^{k} p_j(y|x),$$ 

which is also called the product rule. 

Mixture of Experts: Each expert is described 
by a conditional distribution $p_j(y|x)$, either with 
$y$ taking one of several labels for a classification 
problem or with $y$ being a real-valued vector for 
a regression problem. A combination of experts 
is given by: 

$$M(x) = \sum_{j=1}^{k} \alpha_j(x) p_j(y|x), \quad \alpha_j(x) = p(j|x) > 0, \quad \sum_{j=1}^{k} \alpha_j(x) = 1,$$ 

which is called a mixture-of-experts model. Particularly, 
for $y$ being a real-valued vector, its regression form is 

$$E(y|x) = \sum_{j=1}^{k} \alpha_j(x) f_j(x), \quad f_j(x) = \int y\, p_j(y|x)\, dy.$$ 

f-Mean: Given a set of non-negative numbers 
$b_j \ge 0$, $j = 1, \ldots, k$, the f-mean is given by: 

$$m_f = f^{-1}\Big(\sum_{j=1}^{k} \alpha_j f(b_j)\Big),$$ 

where $f(r)$ is a monotonic scalar function and 
$\alpha_j > 0$, $\sum_{j=1}^{k} \alpha_j = 1$. 
Particularly, one most interesting special case is that 
$f(r)$ satisfies 

$$c\, m_f = f^{-1}\Big(\sum_{j=1}^{k} \alpha_j f(c\, b_j)\Big)$$ 

for any scale $c$, which is called the $f_\alpha$-mean. 

Performance Evaluation Approach: It is typically 
used in the literature on classifier combination, with 
a flow chart that runs: considering a set of classifiers 
$\{e_j(x)\}_{j=1}^{k}$ → designing a combining mechanism 
$M(x)$ according to certain principles → evaluating 
the performance of the combination empirically via 
misclassification rates, with the help of samples with 
known correct labels. 

Error-Reduction Approach: It is typically used in 
the literature on mixture based learning, where what 
needs to be pre-designed is the structures of classifiers 
or experts, as well as the combining structure $M(x)$ with 
unknown parameters. A cost or error measure is evaluated 
via a set of training samples, and then minimized 
through learning all the unknown parameters. 
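
The $f_\alpha$-mean defined under f-Mean above can be checked numerically. The sketch below assumes the form $f_\alpha(r) = r^{(1-\alpha)/2}$ for α ≠ 1 and $\ln r$ for α = 1, which reproduces the special cases r, ln r and 1/r at α = -1, 1 and 3 listed in the article. The function name and the sample values are hypothetical, and all inputs are assumed to be strictly positive.

```python
from math import exp, log


def alpha_mean(values, weights, alpha):
    """f_alpha-mean with f_alpha(r) = r**((1 - alpha) / 2), or ln r at alpha = 1."""
    if alpha == 1:
        return exp(sum(w * log(v) for w, v in zip(weights, values)))
    q = (1 - alpha) / 2
    return sum(w * v ** q for w, v in zip(weights, values)) ** (1 / q)


values = [0.2, 0.5, 0.9]          # hypothetical expert outputs, all > 0
weights = [1 / 3, 1 / 3, 1 / 3]   # positive weights that sum to 1

for alpha in (-199, -1, 1, 3, 201):
    print(alpha, round(alpha_mean(values, weights, alpha), 4))

# Linear scale-free property: scaling every input by c scales the mean by c.
c = 7.0
scaled = alpha_mean([c * v for v in values], weights, 3)
print(abs(scaled - c * alpha_mean(values, weights, 3)) < 1e-9)   # → True
```

As α decreases the mean moves toward the largest input (the max rule), and as α increases it moves toward the smallest input (the min rule), which matches the optimistic-to-pessimistic spectrum described in the article.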
INTRODUCTION 

Significant advances in artificial intelligence, including 
machines that play master level chess, or make medi- 
cal diagnoses, highlight an intriguing paradox. While 
systems can compete with highly qualified experts in 
many fields, there has been much less progress in con- 
structing machines that exhibit simple commonsense, 
the kind expected of any normally intelligent child. As 
a result, commonsense has been identified as one of the 
most difficult and important problems in AI (Doyle, 
1984; Waltz, 1982). 



BACKGROUND 

The Importance of Commonsense 1 

It may be useful to begin by listing a number of reasons 
why commonsense is so important: 

1. Any general natural language processor must 
possess the commonsense that is assumed in the 
text. 

2. In building computerized systems, many assumptions 
are made about the way in which they will be 
used and the users' background knowledge. The 
more commonsense that can explicitly be built 
into systems, the less will depend on the implicit 
concurrence of the designer's commonsense with 
that of the user. 

3. Many expert systems have some commonsense 
knowledge built into them, much of it reformulated 
time and again for similar systems. It would be 
advantageous if commonsense knowledge could 
be standardized for use in different systems. 

4. Commonsense has a large element that is environ- 
ment and culture specific. A study and formaliza- 
tion of commonsense knowledge may permit 
people of different cultures to better understand 
one another's assumptions. 



Defining Commonsense 

No attempt will be made here to define commonsense 
rigorously. Intuitively, however, commonsense is 
generally meant to include the following capabilities, 
as defined for any given culture: 

a. knowing the generally known facts about the 
world, 

b. knowing, and being able to perform, generally 
performed behaviors, and to predict their out- 
comes, 

c. being able to interpret or identify commonly oc- 
curring situations in terms of the generally known 
facts - i.e. to understand what happens, 

d. the ability to relate causes and effects, 

e. the ability to recognize inconsistencies in descrip- 
tions of common situations and behaviors and be- 
tween behaviors and their situational contexts, 

f. the ability to solve everyday problems. 

In summary, commonsense is the knowledge that 
any participant in a culture expects any other participant 
in that culture to possess, as distinct from specialized 
knowledge that is possessed only by specialists. 

The necessary conditions for a formalization to lay 
claim to representing commonsense are implicit in the 
above definition; a formalism must exhibit at least one 
of the attributes listed there. Virtually all work in the 
field has attempted to satisfy only some subset of the 
commonsense criteria. 



COMMONSENSE REPRESENTATION 
FORMALISMS 

In AI research, work on commonsense is generally 
subsumed under the heading of Knowledge Representation. 
The objective of this article is to survey the various 
formalisms that have been suggested for representing 
commonsense knowledge. 

Four major knowledge representation schemes are 
discussed in the literature: production rules, semantic 
nets, frames, and logic. Production systems are 
frequently adopted in building expert systems. Virtually 
all the discussions of commonsense representations, 
however, are in terms of semantic net, frame-like, or 
logic systems. These schemes are applied within three 
main paradigms for commonsense representation: 
propositional, truth maintenance, and dispositional 
(see Figure 1). Very briefly, propositional models are 
descriptions of representations of things or concrete 
facts. When the knowledge represented is imprecise 
or variable, propositional formalisms are no longer 
sufficient and one needs to consider the beliefs about 
the world engendered by the system's current state of 
knowledge, and to allow for changes in those beliefs 
as circumstances dictate; this is the nature of belief or 
truth maintenance systems. Finally, when the knowledge 
is both imprecise and not factual, but relates rather to 
feelings, insights and understandings, the dispositional 
representations are evoked. 

Within each representational paradigm, there are a 
number of specific formalisms. Figure 1 indicates the 
existence of eight different knowledge representation 
formalisms. Each of these formalisms is presented via 
discussion of one or more representatives. 

Several lines of evidence support the view of the mind, 
or self, as composed of a considerable number of 
cooperating subagents, which Minsky (1981) refers to as 
a society of mind: the need for different types of 
formalisms, the difficulty in representing multiple 
domain knowledge, psychological theories of various 
levels of consciousness, the physiological evidence of 
different levels of the brain and their association with 
specific functions, the functional specialization of 
specific areas of the brain, and similar evidence 
concerning the two sides of the brain. It is useful to 
keep this concept in mind while studying the variety of 
representation schemes; it suggests that a number of 
such formalisms may coexist in any rational agent and 
that little can be gained by attempts to choose the 
"right" formalism in any general sense. 



PROPOSITIONAL MODELS 

Virtually all the propositional models of commonsense 
knowledge are perceived as consisting of nodes that are 
associated with words or tokens representing concepts. 
The nodes are hierarchically structured, with lower 
level nodes elaborating or representing instantiations of 
higher-level nodes; the higher-level nodes impart their 
properties to those below them, which are said to inherit 
those properties. Thus, all the propositional models 
are hierarchically structured networks consisting of 
nodes and arcs joining the nodes. From this point, the 
representational structures begin to diverge according 
to the distribution of information between the arcs and 
the nodes. At one extreme, nodes are self-contained 
descriptions of concepts with only hierarchical 
relations between concepts expressed in the arcs; frames 
and scripts are of this nature. At the other extreme, 
nodes contain only names and the descriptive content 
resides in a multiplicity of types of relations, which 
give meaning to the names. This last form is generally 
referred to as a semantic net. Between the extremes lie 
representations in which the descriptive knowledge is 
distributed between nodes and relations. 

Figure 1. Commonsense knowledge representation formalisms 

The representations described may be discussed as 
theoretical models or as computer implementations, 
which generally derive from one of the theoretical 
models (see Commonsense Knowledge Representation 
II - Implementation). 

Semantic Nets 

Semantic nets were developed to represent propositions 
and their syntactic structure rather than the knowledge 
contained in the propositions. A semantic net consists 
of triples of nodes, arcs j oining nodes, and labels (Rich, 
1983). The words of a proposition are contained in the 
nodes while its syntactic structure is captured by the 
labeled arcs. 
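
As a minimal sketch of the triple structure described above, the fragment below stores one proposition as (node, label, node) triples and looks up the labeled arcs leaving a node. The example sentence, the arc labels (`agent`, `object`, `modifier`) and the helper name `arcs_from` are illustrative assumptions and are not taken from Rich (1983).

```python
# A semantic net held as a set of (source node, arc label, target node) triples.
# The proposition "The dog chased the black cat" and the labels are hypothetical.
net = {
    ("chased", "agent", "dog"),
    ("chased", "object", "cat"),
    ("cat", "modifier", "black"),
}


def arcs_from(node, triples):
    """Return {label: target} for every labeled arc leaving `node`."""
    return {label: target for source, label, target in triples if source == node}


print(arcs_from("chased", net))
```

The words of the proposition appear only as node names; everything about how they relate is carried by the labels on the arcs.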

Frames 

Within the community utilizing frames (or schema), 
it seems to be universally agreed that frames define 
concepts via the contents of slots, which specify com- 
ponents of the concepts. The content of a slot may be an 
instantiation, a default or usual value, a condition, or a 
sub-schema. The last of these slot contents generates 
the generally accepted hierarchical nature of concept 
representations. 
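
The slot mechanism can be illustrated with a small sketch, assuming a hypothetical `Frame` class in which a lower-level frame inherits default slot values from a higher-level one. The class, the slot names and the bird example are invented for illustration and do not reproduce any particular frame system.

```python
class Frame:
    """A frame: a named concept whose slots may be inherited from a parent."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Look up a slot locally, then up the hierarchy (inheritance)."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# Default values live in the higher-level frame ...
bird = Frame("bird", covering="feathers", locomotion="flies")
# ... and a lower-level frame overrides only what differs.
penguin = Frame("penguin", parent=bird, locomotion="swims")

print(penguin.get("covering"))    # inherited default from bird
print(penguin.get("locomotion"))  # overridden locally
```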

Scripts 

In his seminal paper on frames, Minsky (1975) suggests 
that frames are not exactly the same as Schank's (1981) 
scripts. The major difference is in the concepts repre- 
sented. Scripts describe temporal sequences of events 
represented by the sequence of slots in a script. Here 
the slot structure is significant. For frames describing 
objects the exact order of slots is probably not signifi- 
cant. The temporal ordering of slots is necessary in 
many other representations, including those for most 
behaviors. 
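
The difference in slot structure can be sketched as follows, assuming a script is held as an ordered list of event slots and an object frame as an unordered mapping. The event names, the chair slots and the helper `expected_next` are hypothetical.

```python
# A script: the order of the slots is part of the knowledge.
restaurant_script = ["enter", "order", "eat", "pay", "leave"]  # hypothetical events

# An object frame: the order of the slots carries no meaning.
chair_frame = {"legs": 4, "material": "wood"}                  # hypothetical slots


def expected_next(script, event):
    """Return the event the script expects after `event`, or None at the end."""
    i = script.index(event)
    return script[i + 1] if i + 1 < len(script) else None


print(expected_next(restaurant_script, "order"))
# Reordering a frame's slots leaves the frame unchanged.
print(chair_frame == {"material": "wood", "legs": 4})
```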



In summary, frames, scripts, and semantic nets are 
logically equivalent. All three are hierarchies consist- 
ing of nodes and relations; they are differentiated by 
the location of information in the network. Scripts 
are further differentiated by the temporal and causal 
ordering of slots within nodes. 

Predicate Calculus 

The first order predicate calculus is a method for 
representing and processing propositional knowledge 
and was designed expressly for that purpose. There 
are many workers in the field who view this formalism 
as the most appropriate for commonsense knowledge 
representation. 

In discussing logic as a vehicle for knowledge rep- 
resentation, one should distinguish between the use 
of logic as the representational formalism, and the use 
of logistic languages for implementation. Logistic lan- 
guages can implement any representational formalism, 
logistic or other. Another distinction that needs to be 
made is between the use of logic for inference and its 
use for representation (Hayes, 1979). Thus, one might 
apply logical inference to any knowledge representation, 
provided that it is amenable to that kind of manipula- 
tion; knowledge expressed in predicate calculus is the 
most amenable to such manipulation. 

A vigorous debate in AI centered on the choice of
frames or logic for commonsense knowledge representation.
In favor of frames were those who attempted
to implement knowledge representation schemes (cf.
Hayes, 1979). Their argument is twofold. First, logistic
formalisms do not have sufficient expressive power
to represent commonsense knowledge adequately.
This applies particularly to the dynamic adjustment of
what is held to be true as new information is acquired:
classical logic does not allow for reinterpretation as
new facts are learned; it is monotonic. Second, the
argument posits that human knowledge representation
does not follow the rules of formal logic, so that logic
is psychologically inadequate in addition to being
expressively inadequate.

The logicians' reply to the first argument is that anything
can be expressed in logic, although this may sometimes
be difficult; this school does not accept the claim
of greater expressive power for the frame paradigm.
Hayes (1979) claimed that "most of 'frames' is just
a new syntax for parts of first-order logic". Thus, all
the representational formalisms seem to have much in
common, even if they are not completely equivalent.
The reply to the second argument is that the object of
study is Artificial Intelligence, so the formalisms
adopted need not resemble human mechanisms.

One point of agreement that seems to have emerged 
between the two sides is that the classical predicate 
calculus indeed lacks sufficient expressive power. 
Evidence of this is a number of attempts to expand 
the logical formalisms in order to provide sufficient 
expressiveness, especially by alleviating some of the 
restrictions of monotonicity. 

The essence of monotonicity is that if a theory A 
is expanded by additional axioms to become B, then 
all the theorems of A are still theorems of B. Thus, 
conclusions in monotonic logic are irreversible and 
this constraint must be relaxed if a logical inference 
system is to be able to reevaluate its conclusions on 
learning new facts. The conclusions of a non-monotonic 
logic are regarded as beliefs or dispositions based on 
what is currently known, and amenable to change if 
necessary. 
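
The definition of monotonicity can be checked on a toy example. The sketch below assumes a hypothetical forward-chaining function `theorems` over simple if-then rules; the rule and the axioms are invented for illustration.

```python
def theorems(axioms, rules):
    """Forward-chain over rules of the form (premises, conclusion)."""
    known = set(axioms)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


rules = [(("bird",), "flies")]        # hypothetical rule: birds fly
theory_a = {"bird"}
theory_b = theory_a | {"penguin"}     # A expanded by an additional axiom

# Monotonicity: every theorem of A is still a theorem of B, so the
# conclusion "flies" cannot be withdrawn on learning "penguin".
print(theorems(theory_a, rules) <= theorems(theory_b, rules))
```

However many axioms are added, the set of conclusions can only grow, which is exactly the constraint a non-monotonic system has to relax.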

The attempts to overcome the limitations of monotonicity
include non-monotonic reasoning (McCarthy,
1980), non-monotonic logics (McDermott and Doyle,
1980; Reiter, 1980), fuzzy logic (Zadeh, 1983), and a
functional approach (Levesque, 1984) represented by
the KL-One (Woods, 1983) and KRYPTON languages
(Brachman et al., 1983). It should be noted that
non-monotonic logic proceeds not by changing the logic
representation, but rather by strengthening the reasoning
by which representations are evaluated.



LOGISTIC REPRESENTATIONS:
BELIEF MAINTENANCE SYSTEMS

Fuzzy Logic (Zadeh, 1983)

The extensive work by Zadeh in applying fuzzy logic
to knowledge representation is based on the premise
that predicate calculus is inadequate for commonsense
knowledge representation because it does not allow for
fuzzy predicates (e.g. small, cheap) or fuzzy quantifiers
(e.g. many, most) as in the phrase "most small
birds fly".

Fuzzy logic proceeds in the representation of dispositions
by the following steps:

1. A proposition is regarded as a collection of (usually
implicit) fuzzy constraints.

2. An explanatory database contains lists of samples
of the subject of the proposition together with
the degree to which each predicate or constraint
is fulfilled. These data are used to compute test
scores of the degree to which each constraint is
met. These test scores then become the meaning
of the fuzzy constraints.

Circumscription (McCarthy, 1980)

Circumscription is a form of non-monotonic reasoning
(as distinct from non-monotonic logic) which reduces
the context of a sentence to anything that is deducible
from the sentence itself, and no more. Without
such a mechanism, any statement may invoke all the
knowledge the system possesses that is associated with
whatever the topic of the statement is, much of which
would probably be irrelevant. The question

"Crows and canaries are both birds; why is Tweetie
afraid of crows?"

could give rise to consideration of the facts that ostriches
and penguins are also birds and that ostriches don't fly
and put their heads in the sand while penguins swim
and eat fish, all of which is irrelevant to the problem
in hand. The purpose of circumscription is to limit
evaluation of problem statements to the facts of the
statement - what McCarthy describes as the ability to
jump to conclusions.



Default Reasoning (Reiter, 1980) 

Reiter's Default Reasoning is based on the first-order
predicate calculus and attempts to solve a particu- 
lar problem that arises in expressing commonsense 
knowledge in the classical formalism. The problem is 
that of drawing conclusions on the basis of incomplete 
knowledge when the missing information can be as- 
sumed, provided there is no evidence to the contrary. 
Such assumptions are possible because much real world 
knowledge about classes of objects is almost always 
true. Thus, most birds are capable of flight with a few 
exceptions such as penguins, ostriches, and kiwis, and 
provided they do not suffer from specific inabilities 
such as death or feet embedded in concrete. 
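
The idea of assuming missing information unless there is evidence to the contrary can be illustrated with a deliberately simple sketch. It is not Reiter's formalism: the function `can_fly`, the fact tuples and the individual `tweety` are hypothetical, and the default is blocked by nothing more than the presence of an explicit contrary fact.

```python
def can_fly(individual, facts):
    """Assume a bird flies unless the known facts say otherwise."""
    if ("bird", individual) not in facts:
        return False
    # The default is blocked only by explicit evidence to the contrary.
    return ("cannot_fly", individual) not in facts


facts = {("bird", "tweety")}
print(can_fly("tweety", facts))          # default applies

facts.add(("cannot_fly", "tweety"))      # e.g. Tweety turns out to be a penguin
print(can_fly("tweety", facts))          # earlier conclusion is withdrawn
```

Adding a fact here removes a conclusion, which is the behavior that classical, monotonic inference does not permit.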

Non-Monotonic Logic (McDermott & Doyle, 1980)

Non-monotonic logic attempts to formalize the repre- 
sentation of knowledge in such a way that the models 
maintained on the basis of the premises supplied can 
change, if necessary, as new premises are added. Thus, 
rather than determining the truth or falsity of statements 
as in classical logic, non-monotonic logics hold beliefs 
based on their current level of knowledge. 



DISPOSITIONAL MODELS 

Dispositional models attempt to establish the relation- 
ship between knowledge representation and memory. 
To paraphrase Schank (1981), understanding sentences 
involves adding information, generally commonsense, 
not explicit in the original sentence. Thus, a memory 
full of facts about the commonsense world is necessary 
in order to understand language. Furthermore, a model, 
or set of beliefs is necessary to provide expectations 
or explanations of an actor's behavior. Adding beliefs 
to a representation changes the idea of inference from 
simply adding information to permit parsing sentences 
to integrating data with a given memory model; this 
leads to the study of facts and beliefs in memory. 

K-Lines (Minsky, 1981) 

Minsky's major thesis is that the function of memory 
is to recreate a state of mind. Each memory must em- 
body information that can later serve to reassemble 
the mechanisms that were active when the memory 
was formed. 

Memory is posited to consist, at the highest level, of a 
number of loosely interconnected specialists - a "society 
of mind". Each of the specialists, in turn, comprises 
three lattices with connections between them. 

The most basic lattice comprises "mental agents" 
or P-nodes. Some of these agents participate in any 
memorable event - an experience, an idea, or a problem 
solution - and become associated with that event. Re- 
activation of some of those agents recreates a "partial 
mental state" resembling the original. 

Reactivation of P-nodes and consequent recreation
of partial mental states is performed by K-Lines attached
to nodes. The K-Lines originate in K-nodes
embedded in a second lattice, the K-Pyramid. The
establishment of a K-Line between a K-node and some
P-nodes occurs when a memorable mental event takes
place. Information in the K-Pyramid flows downward,
but not upward.
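
The relationship between K-nodes, K-Lines and P-nodes can be pictured with a very rough sketch. The class `KNode`, the agent names and the event label are hypothetical, and the sketch ignores the hierarchy of the K-Pyramid altogether; it only shows a K-Line being fixed when a memory forms and later used to recreate a partial mental state.

```python
class KNode:
    """A K-node whose K-Line records which P-nodes were active at an event."""

    def __init__(self, event, active_p_nodes):
        self.event = event
        self.k_line = frozenset(active_p_nodes)   # fixed when the memory forms

    def reactivate(self):
        """Recreate the partial mental state by re-arousing the attached P-nodes."""
        return set(self.k_line)


# Hypothetical agents active while solving some problem.
current_state = {"see-red", "grasp", "stack"}
memory = KNode("built a tower", current_state)

current_state = {"hear-music"}            # the mind moves on
partial_state = memory.reactivate()       # later recall of the earlier state
print(sorted(partial_state))
```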

The third structure in Minsky's model is the N-Pyra- 
mid. Its function is to permit learning to take place, and 
it does so by constructing new K-nodes for P. 

It may be useful to think of this structure as analogous 
to an ivory mandarin ball - spheres within spheres. 

Memory Organization Packets: MOPs
(Schank, 1981) 

While Minsky presents a general theory of memory 
without reference to the representation of concepts, 
Schank's model is highly dependent on the inter- 
relationship of representations and memory. This is, 
perhaps, the result of the particular domain explored 
by Schank — temporal sequences of events. 

The scripts, plans, goals, and themes of Schank's 
model are reminiscent of Minsky's P, K, and N pyramids, 
although including an additional level. An attempt to 
specify the relationship more precisely suggests the 
equivalence of scripts with the P-Pyramid, of plans 
with the K-Pyramid, and of goals with the N-Pyramid. 
The level of themes does not seem to be included in 
Minsky's model, although this may be a reflection of 
the "society of mind", with each theme representing a 
member of the society. 

In addition to differentiating levels of description, 
Schank's model distinguishes four levels of memory 
by the degree of detail represented. These levels are: 

1. Event Memory (EM) - specific remembrances of 
particular situations. After a while, the less salient 
aspects of an event fade away leaving generalized 
events plus the unusual or interesting parts of the 
original event. 

2. Generalized Event Memory (GEM) - collocations 
of events whose common features have been 
abstracted. This is where general information is 
held about situations that have been experienced 
numerous times - e.g. dentists' waiting rooms. 

3. Situational Memory (SM) - contains information 
about specific situations - e.g. going to medical 
practitioners' offices. 

4. Intentional Memory (IM) - remembrances of
generalizations at a higher level than SM - e.g.
getting a problem taken care of by an organization.

Scripts of particular situations do not exist as perma- 
nent memory structures but are reconstructed from more 
general, higher level structures, to aid in interpreting 
events as they unfold. The parts of scripts necessary for 
interpreting an event are retrieved from situational level 
memory structures referred to as Memory Organiza- 
tion Packets - MOPs. MOPs are collections of events 
that have become generalized ("mushed together" in 
Schank's words) and stored under them. MOPs are the 
means by which an appropriate episode in memory 
can be retrieved to aid in interpreting a specific event. 
Thus, connections must be established in memory 
from MOPs to the specific events that they are invoked 
to help process. These connections are evocative of 
Minsky's K-Lines. 

A MOP aids in processing an event by virtue of 
the fact that it contains abstractions of similar previ- 
ous events and, together with particularly memorable 
exceptions stored in EM, provides expectations about 
the development of events. Thus, a MOP is a high level 
script and is related to even higher-level MOPs. In 
Schank's example, a restaurant script is a lower level 
MOP with connections to more general MOPs such 
as social situations, contracts, and obtaining services. 
Like MOPs, scripts are subject to temporal precedence 
search, produce conceptual dependencies, and contain 
memories, so they may be thought of as sub-MOPs. 



FUTURE TRENDS AND CONCLUSION 

Although considerable effort has been spent on attempts 
to formalize commonsense knowledge representation, 
none of these has yet produced an entirely satisfactory 
result. Thus, there is still considerable room for work
in this area.

Furthermore, there has been little theoretical 
work on commonsense representation formalisms in 
recent years. The bulk of the efforts have shifted to 
projects utilizing various formalisms to implement 
commonsense knowledge bases. See "Commonsense 
Knowledge Representation II - Implementation" in 
this Encyclopedia.



KEY TERMS

Belief Maintenance Systems: Systems of logic 
that permit theorems to be updated as new knowledge 
becomes available. 

Commonsense Knowledge: Knowledge of the 
basic facts and behaviors of the everyday world. 

Dispositional Models: Representations of things 
or facts. 

Logistic Models: Modified logics that attempt to 
overcome the problems of representing commonsense 
knowledge in the classic predicate calculus. 

Monotonicity: A characteristic of logic that prevents 
changes to existing theorems, when new information 
becomes available. 

Non-Monotonic Logic: A logic that attempts to 
overcome the restrictions of monotonicity. 

Propositional Models: Descriptions of representa- 
tions of things or concrete facts. 

Representation Formalisms: Theoretical frame- 
works for representing commonsense knowledge. 



ENDNOTE 



In this article, "commonsense" is written as 
one word, to distinguish such knowledge from 
the more usual "common sense" defined in the 
Oxford English Dictionary as "Good sound 
practical sense; combined tact and readiness in 
dealing with the every-day affairs of life; general 
sagacity." Others use "commonsense" only as an 
adjective.



INTRODUCTION

Early attempts to implement systems that understand
commonsense knowledge did so for very restricted
domains. For example, the Planes system [Waltz, 1978]
knew real-world facts about a fleet of airplanes and
could answer questions about them put to it in English.
It had no behaviors, however, and could not interpret the
facts, draw inferences from them, or solve problems
other than those involved in understanding the
questions. At the other extreme, SHRDLU (Winograd,
1973) understood situations in its domain of discourse
(which it perceived visually), accepted commands in
natural language to perform behaviors in that domain,
and solved problems arising in execution of the commands;
all these capabilities were restricted, however,
to SHRDLU's artificial world of colored toy blocks.
Thus, in implemented systems it appears that there
may be a trade-off between the degree of realism of
the domain and the number of capabilities that can be
implemented.

In the frames versus logic debate (see Common- 
sense Knowledge Representation I - Formalisms in 
this Encyclopedia), the real problem, in Israel's (1983) 
opinion, is not the representation formalism itself, but 
rather that the facts of the commonsense world have 
not been formulated, and this is more critical than 
choice of a particular formalism. A notable attempt to 
formulate the "facts of the commonsense world" is that 
of Hayes [1978a, 1978b, 1979] under the heading of 
naive physics. This work employs first-order predi- 
cate calculus to represent commonsense knowledge 
of the everyday physical world. The author of this 
survey has undertaken a similar effort with respect to 
commonsense business knowledge (Ein-Dor and 
Ginzberg 1989). Some broader attempts to formulate 
commonsense knowledge bases are cited in the section 
Commonsense Knowledge Bases. 



COMMONSENSE AND EXPERT 
SYSTEMS 

The perception that expert systems are not currently
sufficient for commonsense representation is
strengthened by the conscious avoidance in that field
of commonsense problems. An excellent example is the
following maxim for expert system construction:

Focus on a narrow specialty area that does not involve 
a lot of commonsense knowledge. ...to build a system 
with expertise in several domains is extremely difficult, 
since this is likely to involve different paradigms and 
formalisms. (Buchanan et al., 1983) 

In this sense, much of the practical work on expert 
systems has deviated from the tradition in Artificial 
Intelligence research of striving for generality, an ef- 
fort well exemplified by the General Problem Solver 
(Ernst and Newell, 1969) and by work in natural 
language processing. Common sense research, on the 
other hand, seems to fit squarely into the AI tradition 
for, to the attributes of common sense (Commonsense 
Knowledge Representation I), it is necessary to add one 
more implicit attribute, namely the ability to apply any 
commonsense knowledge in ANY relevant domain. This 
need for generality appears to be one of the greatest 
difficulties in representing common sense. 

Consider, for example, commonsense information 
about measurement; knowledge of appropriate mea- 
sures, conversions between them, and the duration of 
their applicability are necessary in fields as diverse as 
medicine, business, and physics. However, each expert 
system represents knowledge, including the necessary 
knowledge about measuring scales, in the manner most 
convenient for its specific purposes. No such represen- 
tation is likely to be very useful in any other system 
in the same domain, and certainly not for systems in 
other domains. Thus, it appears that the inability of
expert systems as currently developed to
represent general-purpose common sense is primarily
a function of the generality of commonsense versus
the specificity of expert systems.

From a positive point of view, one of the major 
aims of commonsense systems must be to represent 
knowledge in such a way that it can be useful in any 
domain; i.e. when storage strategies cannot be based on 
prior information about the uses to which the knowl- 
edge will be put. 

This, then, is the major difference between expert 
systems and commonsense systems; while the former 
deal mainly with the particular knowledge and be- 
haviors of a strictly bounded activity, common sense 
must deal with all areas of knowledge and behavior not 
specifically claimed by a body of experts. An expert 
system that knows about internal medicine does not 
know about skin diseases or toxicology and certainly 
not about drilling rigs or coal mining. Common sense 
systems, on the other hand, should know about colds and 
headaches and cars and the weather and supermarkets 
and restaurants and "chalk and cheese and sealing wax 
and cabbages and kings" (Carroll, 1872). 



COMMONSENSE KNOWLEDGE BASE 
IMPLEMENTATIONS 

Given the importance of commonsense knowledge, 
and because such knowledge is necessary for a wide 
range of applications, a number of efforts have been 
made to construct universally applicable commonsense 
knowledge bases. Three of the most prominent are Cyc, 
ConceptNet, and WordNet. 

Cyc 

The Cyc project (Lenat et al. 1990; Lenat, 2006) was 
initiated in 1984 by Douglas Lenat who has been at 
its head ever since. The objective of the project was 
to build a knowledge base of all the commonsense 
knowledge necessary to understand the set of articles 
in an encyclopedia. As of 2005, the knowledge base 
contained about 15,000 predicates, 300,000 concepts, 
and 3,200,000 assertions - facts, rules of thumb and 
heuristics for reasoning about everyday objects and 
events. The project is still active and the knowledge 
base continues to grow. 

The formalism employed in Cyc is the predicate
calculus, and assertions are entered manually (Cycorp,
2007). OpenCyc, a freely available version of Cyc, may
be downloaded from http://www.opencyc.org/.

ConceptNet 

ConceptNet (Liu and Singh, 2004) is a commonsense
knowledge base and natural-language-processing toolkit
that supports many practical textual-reasoning tasks.
Rather than assertions being registered manually as in
Cyc, in ConceptNet they are generated automatically
from 700,000 sentences of the Open Mind Common
Sense Project (Singh, 2002) provided by over 14,000
authors. There is a concise version with 200,000 assertions
and a full version of 1.6 million assertions.
ConceptNet is constructed as a semantic net.
A freely available version of the system may be
downloaded at
http://web.media.mit.edu/~hugo/conceptnet/#download.

WordNet 

WordNet (Fellbaum, 1998) is described as follows
(WordNet, 2007): "Nouns, verbs, adjectives and adverbs
are grouped into sets of cognitive synonyms
(synsets), each expressing a distinct concept. Synsets
are interlinked by means of conceptual-semantic and
lexical relations. The resulting network of meaningfully
related words and concepts can be navigated
with the browser. . . . WordNet's structure makes it a
useful tool for computational linguistics and natural
language processing."

WordNet contains about 155,000 words, 118,000 
synsets, and 207,000 word-sense pairs. 

WordNet is available for free download at
http://wordnet.princeton.edu/obtain.



FUTURE TRENDS AND CONCLUSION 

Any system designed to process natural language must 
contain commonsense knowledge as do many other 
types of systems. Thus, the development of common- 
sense knowledge bases is sure to continue. 

As a complete commonsense knowledge base must 
contain very large quantities of knowledge, the devel- 
opment of such a base is a very lengthy process that 
must be cumulative if it is to achieve its goal. Thus, 
commonsense knowledge base implementations will 
expand and improve over a lengthy period of time.



KEY TERMS

Commonsense Knowledge: Knowledge of the 
basic facts of the everyday world; the knowledge that 
any participant in a culture expects any other participant 
in that culture to possess. 

Commonsense Knowledge Base: A knowledge 
base containing commonsense knowledge and mecha- 
nisms for drawing inferences or processing natural 
language on the basis of that knowledge. 

ConceptNet: A commonsense knowledge base 
implementation structured as a semantic net. 

Cyc: A large commonsense knowledge base implementation
utilizing predicate calculus as the representation
mechanism.

Expert Knowledge: Knowledge possessed by 
experts in a particular domain. Systems representing 
expert knowledge are generally rule-based. 

Implementation: The construction of a computer- 
ized system to represent and manipulate commonsense 
knowledge. 

WordNet: A commonsense knowledge base implementation
based on a semantic net structure.



INTRODUCTION

In all walks of life, individuals are involved in a cumulative
and incremental process of knowledge acquisition.
This involves accessing, processing and understanding
information, which can be gained in many
different ways: deliberately, by picking up a book,
or passively, by listening to someone.
The content of knowledge is interpreted by individuals
and often recorded through note-taking, a skill whose
method differs from one person to another. This
article presents an investigation into note-taking
techniques, including the most popular, the Cornell method.
A comparative analysis with the Outlining and Mapping
methods is carried out, stating the strengths and
weaknesses of each in terms of simplicity, usefulness
and effectiveness. Developing such
skills is neither easy nor straightforward, and performance
is much influenced by cognition. The cognitive
aspects explored therefore include the note-taking
processes of encoding and storage, attention and
concentration, memory, and other stimulus factors
such as multimedia.

Education is shifting from the traditional manner of
study to electronic study, and institutes are adapting
to this change, which ranges from computerising
a sub-component of learning to simulating an entire
lecture environment. This has enabled students to
explore academia more conveniently; however, its
feasibility is still arguable. The article discusses the
underlying pedagogical principles, deriving instructions
for the development of an e-learning environment.
Furthermore, the use of Tablet PCs to replace the
blackboard, in combination with annotation applications,
is investigated. Semantic analysis of the paradigm
shift in e-learning and knowledge management replacing
classroom interaction presents its potential in the
learning domain. The article concludes with ideas for
the design and development of an electronic note-taking
platform.



BACKGROUND



Over the years, research into note-taking has been carried
out intensively. The paper aims to analyse the
various note-taking techniques comparatively, explaining
their effectiveness and simplicity.
The relationship between cognition and note-taking
is studied, presenting a breakdown of the processes
involved. Given the vast amount of research into
cognition, its relevance is imperative. Although extensive
research within both areas has been undertaken to design
an electronic note-taking tool, an analysis of existing
applications has also been conducted, with Microsoft
OneNote being the most favourable. This is an annotation
application that has no predefined technique to
record notes or annotations and saves handwriting as an
image. Throughout, the work of the many authors
contributing to this study will be presented.



COMPARATIVE STUDY 

This article presents an insight into note-taking, the various
methods, cognitive psychology and the paradigm
shift from the traditional manner of study to electronic study.

Note-Taking Techniques 

Theoretically, note-taking is perceived as the transfer
of information from one mind to another. Today,
the most popular note-taking technique is the Cornell
note-taking method, also referred to as
'Do-it-right-in-the-first-place'. This note-taking method
was developed over 40 years ago by Professor Walter Pauk
at Cornell University (Pauk & Owens, 2005). The
main purpose of developing this method was to assist
students in organising their notes in a meaningful manner.
The technique involves a systematic approach
to arranging and condensing notes without the need
for multiple recopying. The method is simple and
effective, specifying only three areas: Area A, keywords;
Area B, note-taking; and Area C, summary.

Area A is assigned to keywords or phrases, which
are formed by students towards the end of the lecture.
Over the years an alternative has been to write questions,
which aid recall rather than recognition. These cues are
known to assist memory and act as reminders, as well as
helping to identify relationships; this variant is also
referred to as the Q-System
(Pauk & Owens, 2005). Area B is reserved for recording
notes during the lecture. Here the student attempts
to capture as much information as possible. Finally,
Area C is left for the student to summarise the notes
and reflect upon the main ideas of the lecture (Pauk
& Owens, 2005).

The main advantage of this technique is its clear-cut
and organised structure. The technique is suitable
both for technical modules, including Mathematics and
Physics, and for non-technical modules such as English and
History. During an engineering and applied sciences
workshop, experiments involving 70 student participants
revealed that this note-taking method is straightforward
(Anderson-Rowland, Aroz, Blaisdell, Cosgrove,
Fussell, McCartney & Reyes, 1996). Anderson-Rowland
et al. (1996) state that this method
enables the organisation of notes and entails interaction
and concentration; therefore, a scheduled review can
be conducted immediately, highlighting keywords.
Moreover, as the students can summarise content, this
facilitates learning by increasing understanding.
Additionally, a strength of the technique is the ability
to take notes instantaneously, saving time and effort
due to its systematic structure.

Figure 1. The Cornell note-taking method (adapted
from Pauk & Owens, 2005)

Figure 2. An example of the outlining note-taking
method

In comparison, the Outlining method (see Figure
2) has a more spatial and meaningful layout, implicitly
encoding conceptual relations. For example, indentation
may imply grouping, and proximity may imply conceptual
closeness (Ward & Tatsukawa, 2003).

The method consists of dashes or indentation and is
not suitable for technical modules such as Mathematics
or Physics. Specific facts are indented with
spaces towards the right, so that relationships
are represented through indentation. Note-takers are
required to specify points in an organised format, arranging
a pattern and sequence built by space indentation.
Important points are kept furthest to the
left, with more specific points placed further to the right.
Distance from the major point indicates the level of
importance. The main advantage of this technique is
the neatly organised structure, which allows reviewing to be
conducted without difficulty. However, the Outlining
method requires the student's full concentration to
achieve maximum organisation of notes. Consequently,
the technique is not appropriate if the lecturer is going
at a fast pace. The method has also been criticised
by Fox (1959) for its confusing organisational
structure, mainly due to the arrangement of
numerals, capitalised letters and so forth.
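
For an electronic note-taking tool, the way indentation encodes importance in the Outlining method could be recovered mechanically. The following is a minimal sketch under stated assumptions: notes are plain text, each level is indented by two spaces, and points may start with a dash. The function name `parse_outline` and the sample notes are hypothetical.

```python
def parse_outline(text, indent=2):
    """Return (level, point) pairs; level 0 is a major point furthest left."""
    points = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        stripped = line.lstrip(" ")
        level = (len(line) - len(stripped)) // indent
        points.append((level, stripped.lstrip("- ").rstrip()))
    return points


notes = """\
Memory
  - Encoding
    - Attention
  - Storage
"""

for level, point in parse_outline(notes):
    print(level, point)
```

The level number plays the role of the distance from the major point: the larger it is, the more specific the fact.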

In contrast to the Cornell and Outlining methods, 
the Mapping method (see Figure 3) is a graphical 
representation of the lecture content. Students are 
stimulated to visually determine links illustrating re- 
lationships between facts and concepts. Concept maps 
enable brainstorming, breakdown and representation of 
complex scenarios, identifying and providing solutions 
for flaws and summarising information. To enhance 
accuracy students must actively participate and initiate 
critical thinking. However, the structural organisation and
relationship of nodes is arguably a drawback, and the
hierarchical structure in particular has been questioned
(Hibberd, Jones, & Morris, 2002). Alternative structures
have been discussed including chain, spider maps and 
networks (Derbentseva & Safayeni, 2004). 

E-Learning 

E-learning, also known as distance education, employs
numerous technological devices. Today, educational
institutes routinely deliver courses over the
Internet. The growth of the Internet is ever-increasing,
with great potential for not only learning material but 
also as a collaborative learning environment. The major 
difference between traditional learning and electronic 
learning is the mode of instruction by which informa- 
tion is communicated. 

To have a successful e-learning system all sub-com- 
ponents and interrelated processes must be considered, 
because if one process fails then the entire system fails. 
Therefore, it is necessary to derive a series of peda- 
gogical principles. Underlying pedagogical principles 
include considering the user's behaviour towards the
system, as e-learning is an isolated activity that can leave
users frustrated. As the Internet holds a great deal of
knowledge, it can be presented in a biased manner, providing
users with partial information. Considerations for
the environment and the user actions to be performed 
to achieve a specific goal must be outlined. Moreover,
users' interpersonal skills, including their attitudes,
perceptions and behaviour, are central to influencing
the effectiveness of the system. It has been reported that
e-learning reduces teaching time, increases proficiency
and improves retention. However, this is not always
the case: in one study of online lecture notes,
students performed worse (Barnett, 2003).
From the students' perspective, they continuously seek
communication and support to influence their learning.
They also welcome constructive feedback (Mason, 
2001). 

The major challenges faced by the e-learning society 
are the culture clash and lack of motivation towards 
an electronic learning environment. People are just 
not prepared to accept the change. A major factor 
determining successful outcomes of such systems is 
the level of interactivity provided. Other significant 
pedagogical principles include the level of control 
users feel, in comparison to traditional manner of 
learning where the lecturer has full control over the 
lecture environment. Moreover, the development of 
a suitable interface is necessary considering usability 
factors such as efficiency, error rates, memorability,
learnability and subject satisfaction. For a tutor, the planning
behind the course structure is important, paying
close attention towards the structure of content. Tutors 
must ensure students receive feedback in an appropriate 
time maintaining time management. Overall, before 
considering the design and development of an e-learn- 
ing environment the main factors to consider include 
the learners, content, technology, teaching techniques, 
and so forth (Hamid, 2002). 

Figure 3. Example of the mapping method

Technologies associated with e-learning have
increased usage of bandwidth and internet access.
Presently, there are two key technologies used to deliver
e-learning: 'scheduled delivery platforms' and
'on-demand delivery platforms'. Scheduled delivery 
platforms include multicasts, virtual libraries and remote 
laboratories, yet the constraints of these include time 
and area restraints. To further improve these systems, 
on-demand delivery platforms provide 24 hour support 
maintaining flexibility of learning (Hamilton, Richards 
& Sharp, 2001). 

In one particular study, electronic notes improved
students' understanding
of lecture content, with an overwhelming 96% of
users feeling that e-notes are an effective tool (Wirth, 2003).
Users are able to annotate, collaborate and discuss 
subject content. This increases learning efficiency by 
allowing the user to engage into the text, improving 
their comprehension and supporting memorisation. 
Recall of specific details is also enhanced (Wolfe, 
2000). Numerous annotation applications have been
introduced, including Microsoft Word and OneNote,
which are primarily concerned with annotation. SharePoint™
allows manipulation, editing and annotation simultaneously,
as does Re:Mark™. Microsoft OneNote is the more popular
annotation tool; however, the end-user
is required to possess a copy of it in order to use it,
unlike Microsoft Journal, which can be exported as a
.mhtml file (Cicchino, 2003). Additionally, handwriting
is translated into an image rather than text that can be
explored (McCall, 2005). The benefits include the ability
to record a lecture and annotate, referencing to specific 
points within the recording. These can then be used 
later (McCall, 2005). 

Tablet PCs are being used to replace the blackboard,
presenting course material through a data projector.
Many academic institutes are adopting these as a teach- 
ing tool (Clark, 2004) or are providing students with 
them as teaching devices (Lowe, 2004). The major 
strength of a Tablet PC is its interactivity between 
face-to-face instructions (Cicchino, 2003). However, a 
study that provided students with pen-based computers 
for the purpose of taking notes presented the project 
as unsuccessful because the devices were unsuitable
in terms of performance, poor resolution, and network 
factors (Truong, Abowd & Brotherton, 1999). 

Cognition 

The most common mode of instruction in higher edu- 
cation is lectures where attending students take notes 
based on the lecture content (Tran & Lawson, 2001). In
class, students spend approximately 80% of their time
listening to the lecture (Armbruster, 2000) and the reason 
students take notes is because of their usefulness towards 
learning and due to social pressures (Tran & Lawson, 
2001). Students vary their note-taking technique ac- 
cording to personal experience, existing knowledge, and 
appropriateness to the lecture format. Problems within 
the classroom are caused by students' inability to
copy information presented by the lecturer (Komagata,
Ohira, Kurakawa & Nakakoji, 2001). Whilst engaging
in reading or listening metacognition, the human mind 
has a tendency to wander off thinking about thoughts 
other than what is being taught or learnt. 

During the learning process, effects on learning 
occur during the encoding and storage processes. 
The encoding stage is when students attend the lecture
and record lecture notes, whereas the storage phase is
when students review their notes. To achieve optimum
performance, both the encoding and storage processes 
should be combined. The reviewing of lecture notes 
is significantly important especially when conducted 
in close proximity to an exam. Additionally, students 
should also monitor their individual progress and un- 
derstanding of information before an exam. This can be 
achieved by carrying out self-testing. This method can
also be encouraged by the lecturer providing relevant
material, for example past exam papers.

Students with greater memory-ability benefit from 
note-taking with studies finding students with lower 
memory-ability record a lower number of words and 
complete ideas (Kiewra & Benton, 1988). This is due 
to variations within the working memory as information 
is stored and manipulated there. Therefore, the ability 
of note-takers to pick out relevant details, maintain 
knowledge and combine new knowledge to existing 
knowledge are essential factors. Human memory can
be broken down into three types: short-term, long-term
and sensory memory. Short-term memory allows information
to be stored for a short period before being
forgotten or transferred to long-term memory. Long-term
memory retains information for a longer period
in the memory circuit. The brain circuit includes the
cerebral cortex and hippocampus. Finally, the sensory 
memory is the initial storage of information lasting an 
instant, consisting of visual and auditory memories. 
Information here is typically gathered through the sight 
and sound senses. 

The incorporation of multimedia within education 
can enhance the learning experience in a number of 
ways. Significant sensory aids can be provided, interactivity
can be increased and a richer learning experience
can be initiated. The presentation of learning material 
and the manner in which content is captured is impor- 
tant. This is because if during a lecture organisational 
cues are explicitly defined the organisational process 
is guided (Titsworth & Kiewra, 2004). Moreover, 
the use of non-speech audio amalgamated within 
user-interfaces is becoming increasingly popular. If 
complemented with visual output, it can increase the
amount of information communicated to the user. A
typical student captures 29% visual, 34% auditory and
37% haptic metaphors (Dryden & Vos, 1999). Sound
provides greater flexibility as it can be heard 360° 
without having to concentrate; this is in comparison to 
visual output where the retina subtends an angle of 2° 
around the point of fixation. Consequently sound is a 
superb way of capturing user attention. Furthermore, 
graphical displays including icons, menu graphics and 
so forth can be used as an iconic representation of a 
user's action. 



FUTURE TRENDS

The use of electronic learning tools and e-learning
environments within education is ever-increasing, and
so the need to derive a flexible learning structure
with interactivity is paramount. Studies relating to
the computerisation of note-taking tools are not so
prominent. Nonetheless, the comparative study does
demonstrate the effectiveness of using a note-taking
method, especially the Cornell method, which is the most
popular. Additionally, the growing popularity of
multimodality, including audio and in particular
earcons, shows there is potential to combine these
within learning to benefit the experience and enhance
interactivity. Moreover, in-depth research into
psychological learning parameters, especially the encoding
and storage processes, has demonstrated the need to
combine the two to optimise performance. The
introduction of annotation applications has instigated a
trend towards electronic learning platforms, although
usability issues in personalising the student experience
must be studied further.



CONCLUSION

This article provides a comparative study of note-taking
techniques, the processes involved and the effects
of cognition upon learning. The literature suggests that
current electronic learning tools amalgamate many
learning functions, including lecture notes, handouts,
discussions and so forth, into one application. Integration
issues, including cost factors, hardware deficiencies
and malfunctions in sub-components leading to system
failures, are the major issues brought to light. Moreover,
although the skill and technique of note-taking have been
the primary focus of analysis, no such technique is
provided as an electronic tool. Therefore, the design
and development of a note-taking tool as a sub-component
of e-learning is being considered, based upon the
studied research. The Cornell note-taking method has
arguably been the most effective with regard to simplicity,
appropriateness and the ability to deploy it within any
subject area. Therefore, future work will consider
the influences of multimedia upon this technique,
including earcons, visual structure, layout, and possibly
the incorporation of speech.

KEY TERMS

Annotation: The activity of briefly describing or 
explaining information. It can also involve summaris- 
ing or evaluating content. 

Cognitive Psychology: A study into cognition such 
as mental processes describing human behaviour, un- 
derstanding perceptions, examining memory, attention 
span, concentration, and forgetfulness. The purpose 
of understanding humans and the way they mentally 
function. 

Information Processing: The ability to capture, 
store and manipulate information. This consists of 
two main processes; encoding and storage. Students 
record notes during the encoding stage and conduct 
reviewing thereafter, in the storage phase. 

Metacognition: The ability and skills of learners to 
be aware of and monitor their learning processes. 

Multimodality: An electronic system that enhances
interactivity by amalgamating audio, visual and speech 
metaphors. 

Pedagogical Principles: Key issues to instruct
the design and development of an electronic learning 
environment. 

Platform: Computer framework allowing software 
to run for a specific purpose. 

Traditional Manner of Study: Typically a class- 
room environment with the tutor writing content on a 
blackboard and students using pen and paper to record 
the content as their own notes. The tutor dominates the 
classroom environment unlike in electronic learning 
where, the user has a sense of control due to flexibility 
in learning. 



INTRODUCTION

Simulated annealing is one of the most important 
metaheuristics or general-purpose algorithms of com- 
binatorial optimization, whose properties of conver- 
gence towards high quality solutions are well known, 
although with a high computational cost. For this reason,
a considerable number of research works have been produced
on the convergence speed of the algorithm, especially
on the treatment of the temperature parameter, which is
known as the cooling schedule or strategy. In this article
we make a comparative study of the performance of
simulated annealing using the most important cooling
strategies (Kirkpatrick, S., Gelatt, C.D. & Vecchi, M.P.,
1983), (Dowsland, K.A., 2001), (Luke, B.T., 1995),
(Locatelli, M., 2000). Two classical problems of com- 
binatorial optimization are used in the practical analysis 
of the algorithm: the travelling salesman problem and 
the quadratic assignment problem. 



BACKGROUND 

The main aim of combinatorial optimization is the
analysis and the algorithmic solving of constrained
optimization problems with discrete variables. Problems
for which no algorithm of polynomial time complexity
with respect to the problem size is known, such as the
NP-complete problems, are the most important ones.
The general solving techniques of this type of 
problems belong to three different, but related, re- 
search fields. First, we can mention heuristic search 
algorithms, such as the deterministic algorithms of 
local search (Johnson, D.S., Papadimitriou, C.H. & 
Yannakakis, M., 1985), (Aarts, E.H.L. & Lenstra, J., 
1997), the stochastic algorithm of simulated annealing 
(Kirkpatrick, S., Gelatt, C.D. & Vecchi, M.P., 1983),
and the taboo search (Glover, F., 1986). A second
kind of solving technique comprises algorithms inspired by
genetics and the evolution theory, such as genetic and
evolutionary algorithms (Holland, J.H., 1973), (Goldberg,
D.E., 1989), and memetic algorithms (Moscato,
P., 1999). Finally, due to the collective computation 
properties of some neural models, the area of artificial 
neural networks has contributed a third approach, al- 
though possibly not so relevant as the former ones, to 
the solving of combinatorial optimization problems, with
Hopfield nets (Hopfield, J.J. & Tank, D., 1985), the
Boltzmann machine (Aarts, E.H.L. & Korst, J., 1989),
and the self-organizing map (Kohonen, T., 1988).

Simulated Annealing Algorithm 

Simulated annealing is a stochastic variant of
local search that incorporates a stochastic criterion
of acceptance of worse quality solutions, in order to 
prevent the algorithm from being prematurely trapped 
in local optima. This acceptance criterion is based on 
the Metropolis algorithm for simulation of physi- 
cal systems subject to a heat source (Metropolis, N., 
Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, 
E., 1953). 

Algorithm (simulated annealing). Let (X,S,f,R) be a
combinatorial optimization problem, with generator
function of random k-neighbour feasible solutions
$g : S \times [0,1[ \to S$. Supposing, without loss of generality,
that f must be minimized, the simulated annealing
algorithm can be described in the following way:

1. Set an initial random feasible solution as current
solution, $s_i = s_0$.

2. Set initial temperature or control parameter $T = T_0$.

3. Obtain a new solution that differs from the current
one in the value of k variables using the generator
function of random k-neighbour feasible solutions,
$s_j = g(s_i, \mathrm{random}[0,1[)$.

4. If the new solution $s_j$ is better than the current
one, $f(s_j) < f(s_i)$, then $s_j$ is set as current solution,
$s_i = s_j$. Otherwise, if

$$e^{\frac{f(s_i) - f(s_j)}{T}} > \mathrm{random}[0,1[$$

$s_j$ is equally accepted as current solution, $s_i = s_j$.

5. If the number of executed state transitions (steps
3 and 4) for the current value of temperature is
equal to L, then the temperature T is decreased.

6. If there are some k-neighbour feasible solutions
near to the current one that have not been processed
yet, steps 3 to 5 must be repeated. The algorithm
ends in case the set of k-neighbour solutions near
to the current one has been processed completely
with a probability close to 1 without obtaining any
improvement in the quality of the solutions.

The most important feature of the simulated an- 
nealing algorithm is that, besides accepting transitions 
that imply an improvement in the solution cost, it also
accepts a decreasing number of transitions
that entail a loss of solution quality.
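
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the comparison: the names `simulated_annealing`, `cost`, `neighbour`, `transitions` and `t_min` are hypothetical, the temperature is decreased with the exponential rule, and the stopping rule of step 6 is simplified to a minimum temperature.

```python
import math
import random


def simulated_annealing(initial, cost, neighbour, t0, alpha=0.95,
                        transitions=100, t_min=1e-3, rng=None):
    """Minimise `cost` starting from `initial` (illustrative sketch)."""
    rng = rng or random.Random()
    current, current_cost = initial, cost(initial)        # step 1
    best, best_cost = current, current_cost
    t = t0                                                # step 2
    while t > t_min:          # simplified stopping rule (assumption)
        for _ in range(transitions):   # L transitions per temperature
            candidate = neighbour(current, rng)           # step 3
            candidate_cost = cost(candidate)
            delta = current_cost - candidate_cost         # f(s_i) - f(s_j)
            # Step 4: improvements are always accepted; worse solutions
            # are accepted with probability exp(delta / t).
            if delta > 0 or rng.random() < math.exp(delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha                                        # step 5
    return best, best_cost


# Usage sketch: minimise (x - 3)^2 over the integers, moving one unit at a time.
solution, value = simulated_annealing(
    50,
    cost=lambda x: (x - 3) ** 2,
    neighbour=lambda x, rng: x + rng.choice((-1, 1)),
    t0=100.0,
    rng=random.Random(0),
)
```

The sketch keeps track of the best solution seen so far, which the algorithm as described does not require but which is convenient because the current solution may get worse before the search ends.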

The simulated annealing algorithm converges 
asymptotically towards the set of global optimal solu- 
tions of the problem. E. Aarts and J. Korst provide 
a complete proof in their book Simulated Annealing 
and Boltzmann Machines: A Stochastic Approach to 
Combinatorial Optimization and Neural Computing 
(1989). Essentially, the condition for convergence towards
the global optimum states that the temperature T of the system
must be decreased logarithmically according to the
equation:



$$T_k = \frac{T_0}{1 + \log(1 + k)} \qquad (1)$$

where k = 0,1,...,n indicates the temperature cycle.
However, this function of system cooling requires a 
prohibitive computing time, so it is necessary to con- 
sider faster methods of temperature decrease. 
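
To give a rough sense of scale, consider how many cycles each rule needs to bring the temperature down to one hundredth of its initial value. The target $T_0/100$ is an arbitrary choice made for this illustration, and the exponential rule with factor 0.95 is used only as a point of comparison.

```latex
% Logarithmic cooling, equation (1), with natural logarithm:
\frac{T_0}{1 + \ln(1 + k)} = \frac{T_0}{100}
\;\Longrightarrow\; \ln(1 + k) = 99
\;\Longrightarrow\; k = e^{99} - 1 \approx 9.9 \times 10^{42}

% Exponential cooling with factor 0.95:
T_0 \cdot 0.95^{k} = \frac{T_0}{100}
\;\Longrightarrow\; k = \frac{\ln 0.01}{\ln 0.95} \approx 89.8
\quad\text{(about 90 cycles)}
```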

Cooling Schedules 

A practical simulated annealing implementation
requires generating a finite sequence of decreasing
values of temperature T, and a finite number L of state
transitions for each temperature value. To achieve this
aim, a cooling schedule must be specified.

The following cooling schedule, frequently used in 
the literature, was proposed by Kirkpatrick, Gelatt and 
Vecchi (1983), and it consists of three parameters: 

Initial temperature, $T_0$. The initial value of temperature
must be high enough so that any new
solution generated in a state transition should be
accepted with a certain probability close to 1.

Temperature decrease function. Generally, an
exponential decrease function is used, such as
$T_k = T_0 \cdot \alpha^k$, where $\alpha$ is a constant smaller than
one. Usual values of $\alpha$ range between 0.8
and 0.99.

Number of state transitions, L, for each temperature
value. Intuitively, the number of transitions
for each temperature must be high enough so that,
if no solution changes were accepted, the whole
set of k-neighbour feasible solutions near to the
current one could be visited with a probability
close to 1.

The initial temperature, $T_0$, and the number of state
transitions, L, can be easily obtained. On the other hand,
the temperature decrease function has been studied in 
numerous research works (Laarhoven, P.J.M. Van & 
Aarts, E.H.L., 1987), (Dowsland, K.A., 2001), (Luke, 
B.T., 2005), (Locatelli, M., 2000). 



MAIN FOCUS OF THE CHAPTER 

In this section nine different cooling schedules used in
the comparison of the simulated annealing algorithm
are described. They all consist of, at least, three
parameters: initial temperature $T_0$, temperature decrease
function, and number of state transitions L for each
temperature.

Multiplicative Monotonic Cooling 

In the multiplicative monotonic cooling, the system
temperature $T_k$ at cycle k is computed multiplying the
initial temperature $T_0$ by a factor that decreases with
respect to cycle k. Four variants are considered:







A Comparison of Cooling Schedules 



Exponential multiplicative cooling (Figure 1-A),
proposed by Kirkpatrick, Gelatt and Vecchi
(1983), and used as reference in the comparison
among the different cooling criteria. The temperature
decrease is made multiplying the initial
temperature $T_0$ by a factor that decreases exponentially
with respect to temperature cycle k:

$$T_k = T_0 \cdot \alpha^k \qquad (0.8 \le \alpha \le 0.9) \qquad (2)$$

Logarithmical multiplicative cooling (Figure 1-B),
based on the asymptotical convergence condition
of simulated annealing (Aarts, E.H.L. & Korst,
J., 1989), but incorporating a factor $\alpha$ of cooling
speeding-up that makes possible its use in practice.
The temperature decrease is made multiplying the
initial temperature $T_0$ by a factor that decreases
in inverse proportion to the natural logarithm of
temperature cycle k:

$$T_k = \frac{T_0}{1 + \alpha \log(1 + k)} \qquad (\alpha > 1) \qquad (3)$$

Linear multiplicative cooling (Figure 1-C). The
temperature decrease is made multiplying the
initial temperature $T_0$ by a factor that decreases in
inverse proportion to the temperature cycle k:

$$T_k = \frac{T_0}{1 + \alpha k} \qquad (\alpha > 0) \qquad (4)$$

Quadratic multiplicative cooling (Figure 1-D).
The temperature decrease is made multiplying the
initial temperature $T_0$ by a factor that decreases in
inverse proportion to the square of temperature
cycle k:

$$T_k = \frac{T_0}{1 + \alpha k^2} \qquad (\alpha > 0) \qquad (5)$$

Figure 1. Multiplicative cooling curves: (A) exponential, (B) logarithmical, (C) linear, (D) quadratic

Additive Monotonic Cooling

In the additive monotonic cooling, we must take into
account two additional parameters: the number n of
cooling cycles, and the final temperature $T_n$ of the system.
In this type of cooling, the system temperature $T_k$
at cycle k is computed adding to the final temperature
$T_n$ a term that decreases with respect to cycle k. Four
variants based on the formulae proposed by B. T. Luke
(2005) are considered:

Linear additive cooling (Figure 2-A). The temperature
decrease is computed adding to the final
temperature $T_n$ a term that decreases linearly with
respect to temperature cycle k:

$$T_k = T_n + (T_0 - T_n) \left( \frac{n - k}{n} \right) \qquad (6)$$

Quadratic additive cooling (Figure 2-B). The
temperature decrease is computed adding to
the final temperature $T_n$ a term that decreases in
proportion to the square of temperature cycle k:

$$T_k = T_n + (T_0 - T_n) \left( \frac{n - k}{n} \right)^2 \qquad (7)$$

Exponential additive cooling (Figure 2-C). The
temperature decrease is computed adding to the
final temperature $T_n$ a term that decreases in
inverse proportion to the number e raised to the
power of temperature cycle k:

$$T_k = T_n + (T_0 - T_n) \, \frac{1}{1 + e^{\frac{2 \ln(T_0 - T_n)}{n} \left( k - \frac{1}{2} n \right)}} \qquad (8)$$

Trigonometric additive cooling (Figure 2-D).
The temperature decrease is computed adding to
the final temperature $T_n$ a term that decreases in
proportion to the cosine of temperature cycle k:

$$T_k = T_n + \frac{1}{2} (T_0 - T_n) \left( 1 + \cos\left( \frac{k \pi}{n} \right) \right) \qquad (9)$$
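
As an illustrative sketch, the eight monotonic schedules of equations (2) to (9) can be written as plain Python functions. The function and argument names are assumptions made for this example, not names used in the chapter: `k` is the temperature cycle, `t0` the initial temperature, `tn` the final temperature and `n` the number of cycles. The logarithm is taken as natural, following the description of the logarithmical schedule.

```python
import math


def exp_mult(k, t0, alpha=0.95):
    """Equation (2): exponential multiplicative cooling."""
    return t0 * alpha ** k


def log_mult(k, t0, alpha):
    """Equation (3): logarithmical multiplicative cooling (alpha > 1)."""
    return t0 / (1 + alpha * math.log(1 + k))


def lin_mult(k, t0, alpha):
    """Equation (4): linear multiplicative cooling (alpha > 0)."""
    return t0 / (1 + alpha * k)


def quad_mult(k, t0, alpha):
    """Equation (5): quadratic multiplicative cooling (alpha > 0)."""
    return t0 / (1 + alpha * k ** 2)


def lin_add(k, t0, tn, n):
    """Equation (6): linear additive cooling."""
    return tn + (t0 - tn) * (n - k) / n


def quad_add(k, t0, tn, n):
    """Equation (7): quadratic additive cooling."""
    return tn + (t0 - tn) * ((n - k) / n) ** 2


def exp_add(k, t0, tn, n):
    """Equation (8): exponential additive cooling."""
    exponent = 2 * math.log(t0 - tn) / n * (k - n / 2)
    return tn + (t0 - tn) / (1 + math.exp(exponent))


def trig_add(k, t0, tn, n):
    """Equation (9): trigonometric additive cooling."""
    return tn + 0.5 * (t0 - tn) * (1 + math.cos(k * math.pi / n))
```

With these definitions the linear, quadratic and trigonometric additive schedules start exactly at `t0` and end exactly at `tn`, whereas the exponential additive schedule only approaches those two values and passes through their midpoint at `k = n / 2`.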

Figure 2. Additive cooling curves: (A) linear, (B) quadratic, (C) exponential, (D) trigonometric.

Non-Monotonic Adaptive Cooling

In the non-monotonic adaptive cooling, the system
temperature T at each state transition is computed
multiplying the temperature value $T_k$, obtained by any
of the former criteria, by an adaptive factor $\mu$ based on
the difference between the current solution objective,
$f(s_i)$, and the best objective achieved until that moment
by the algorithm, noted $f^*$:

$$T = \mu T_k = \left[ 1 + \frac{f(s_i) - f^*}{f(s_i)} \right] T_k \qquad (10)$$

Note that the inequality $1 \le \mu < 2$ is verified. This
factor $\mu$ means that the greater the distance between
current solution and best achieved solution is, the greater
the temperature is, and consequently the allowed energy
hops. This criterion is a variant of the one proposed by
M. Locatelli (2000), and it can be used in combination
with any of the former criteria to compute $T_k$. In the
comparison, the standard exponential multiplicative
cooling has been used for this purpose. So the cooling
curve is characterized by a fluctuating random behaviour
comprised between the exponential curve defined by
$T_k$ and its double value $2T_k$ (Figure 3).

Figure 3. Curve of non-monotonic adaptive cooling

Combinatorial Optimization Problems
Used in the Comparison

Travelling Salesman Problem. The Travelling Salesman
Problem, TSP, consists of finding the shortest
cyclic path to travel round n cities so that each city is
visited only once. The objective function f to minimize
is given by the expression:

$$f(s) = \sum_{i=1}^{n} d(x_i, x_{(i \bmod n) + 1})$$

where each variable $x_i$ means the city that is visited at
position i of the tour, and $d(x_i, x_j)$ is the distance between
the cities $x_i$ and $x_j$. The tests have been made using two
different instances of the Euclidean TSP. The first one
obtains a tour of 47 European cities and the second
one a tour of 80 cities.

Quadratic Assignment Problem. The Quadratic
Assignment Problem, QAP, consists of finding the
optimal location of n workshops in p available places
($p \ge n$), considering that between each two shops a
specific amount of goods must be transported with a
cost per unit that is different depending on where the
shops are. The objective is minimizing the total cost of
goods transport among the workshops. The objective
function f to minimize is given by the expression:

$$f(s) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c(x_i, x_j) \, q(i, j)$$

where each variable $x_i$ means the place in which
workshop i is located, $c(x_i, x_j)$ is the cost per unit of
goods transport between the places where shops i and
j are, and $q(i, j)$ is the amount of goods that must be
transported between these shops. The tests have been
made using two different instances of the QAP: the
first one with 47 workshops to be located in 47 European
cities, and the second one with 80 workshops to
be located in 80 cities.
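
The adaptive factor of equation (10) and the two objective functions can be sketched as follows. The names `adaptive_temperature`, `tsp_cost` and `qap_cost` are hypothetical, indices are 0-based rather than 1-based, distances, unit costs and flows are assumed to be given as nested lists, and objective values are assumed to be strictly positive so that the division in equation (10) is defined.

```python
import math


def adaptive_temperature(t_k, current_cost, best_cost):
    """Equation (10): T = mu * T_k, mu = 1 + (f(s_i) - f*) / f(s_i)."""
    mu = 1 + (current_cost - best_cost) / current_cost
    return mu * t_k


def tsp_cost(tour, dist):
    """Length of the cyclic tour; dist[a][b] is the distance from a to b."""
    n = len(tour)
    # The modulo closes the cycle from the last city back to the first.
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def qap_cost(place, unit_cost, flow):
    """Total transport cost; place[i] is the location of workshop i."""
    n = len(place)
    return sum(unit_cost[place[i]][place[j]] * flow[i][j]
               for i in range(n - 1) for j in range(i + 1, n))


# Usage sketch: four cities on the corners of a unit square.
points = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
print(tsp_cost([0, 1, 2, 3], dist))  # perimeter of the square: 4.0
```

When the current solution is also the best one found so far, the factor equals 1 and the temperature is simply the value given by the underlying schedule.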

Selection of parameters 

For each problem instance, all variants of the algorithm
use the same values of initial temperature $T_0$ and
number L of state transitions for each temperature. The
initial temperature $T_0$ must be high enough to accept
any state transition to a worse solution. The number L
of state transitions must guarantee with a probability
$\eta$ close to 1 that, if no solution changes are accepted,
any k-neighbour solution near to the current one could
be processed.

In order to determine the other temperature decrease
parameters of each cooling schedule under similar
conditions of execution time, we consider the mean
final temperature $\bar{T}$ and the temperature standard
error $\sigma$ of the exponential multiplicative cooling of
Kirkpatrick, Gelatt and Vecchi with decreasing factor
$\alpha = 0.95$. The objective is to determine the temperature
decrease parameters in such a way that a temperature
in the interval $[\bar{T} - \sigma, \bar{T} + \sigma]$ is reached in the same
number of cycles as the exponential multiplicative
cooling. We distinguish three cases:

Multiplicative monotonic cooling. The decrease
factor $\alpha$ that appears in the temperature decrease
equations must allow the temperature $\bar{T} + \sigma$, the
highest temperature from which the end of the
algorithm is very probable, to be reached in the same
number n of cycles as the exponential multiplicative
cooling. Knowing that for this cooling the
number n of cycles is:

$$\bar{T} + \sigma = T_0 \cdot (0.95)^n \;\Rightarrow\; n = \frac{\log\left( \frac{\bar{T} + \sigma}{T_0} \right)}{\log(0.95)} \qquad (11)$$

it results for the logarithmic multiplicative cooling:

$$\bar{T} + \sigma = \frac{T_0}{1 + \alpha \log(1 + n)} \;\Rightarrow\; \alpha = \frac{\frac{T_0}{\bar{T} + \sigma} - 1}{\log(1 + n)} \qquad (12)$$

for the linear multiplicative cooling:

$$\bar{T} + \sigma = \frac{T_0}{1 + \alpha n} \;\Rightarrow\; \alpha = \frac{\frac{T_0}{\bar{T} + \sigma} - 1}{n} \qquad (13)$$

and for the quadratic multiplicative cooling:

$$\bar{T} + \sigma = \frac{T_0}{1 + \alpha n^2} \;\Rightarrow\; \alpha = \frac{\frac{T_0}{\bar{T} + \sigma} - 1}{n^2} \qquad (14)$$

Additive monotonic cooling. Final temperature is
$T_n = \bar{T} - \sigma$, and the number n of temperature cycles
is equal to the corresponding number of cycles
of the exponential multiplicative cooling for that
final temperature, that is:

$$\bar{T} - \sigma = T_0 \cdot (0.95)^n \;\Rightarrow\; n = \frac{\log\left( \frac{\bar{T} - \sigma}{T_0} \right)}{\log(0.95)} \qquad (15)$$

Non-monotonic adaptive cooling. As the adaptive
cooling is combined in the tests with the
exponential multiplicative cooling, the decrease
factor $\alpha$ that appears in the temperature decrease
equation must also be $\alpha = 0.95$.
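
The parameter matching of equations (11) to (15) can be sketched as below. The function names are hypothetical, and the target temperature is left as an input because the mean final temperature and its standard error are not listed in the chapter. The usage line relies only on the additive row of Table 1, where the final temperature and the number of cycles are both given.

```python
import math


def exp_cycles(t0, target, alpha=0.95):
    """Equations (11) and (15): cycles for exponential cooling to reach target."""
    return math.log(target / t0) / math.log(alpha)


def log_alpha(t0, target, n):
    """Equation (12): factor for logarithmical multiplicative cooling."""
    return (t0 / target - 1) / math.log(1 + n)


def lin_alpha(t0, target, n):
    """Equation (13): factor for linear multiplicative cooling."""
    return (t0 / target - 1) / n


def quad_alpha(t0, target, n):
    """Equation (14): factor for quadratic multiplicative cooling."""
    return (t0 / target - 1) / n ** 2


# TSP47 in Table 1: T_0 = 18000 and final temperature T_n = 6.06.
print(round(exp_cycles(18000, 6.06)))  # 156, the n listed for TSP47
```

Each factor is obtained by solving the corresponding cooling equation for the value that makes the schedule hit the target temperature at cycle n, so substituting it back into the schedule returns the target.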

Analysis of Results 

Table 1 shows the cooling parameters used on each 
instance of the TSP and QAP problems. 

For each instance of the problems 100 runs have 
been made with the nine cooling schedules, computing 
minimum, maximum and average values, and standard 
error both of the objective function and of the number 
of iterations. For reasons of space we only provide
results for the QAP 80-workshops instance (Table 2).
Local search results are also included as reference. 

Considering both objective quality and number of iterations,
we can conclude that the best cooling schedule
we have studied is the non-monotonic adaptive cooling 
schedule based on the one proposed by M. Locatelli 
(2000), although without significant differences with 
respect to the exponential multiplicative and quadratic 
multiplicative cooling schedules. 



FUTURE TRENDS 

In line with the above results, we propose some
lines of further research on simulated annealing
cooling schedules:

Analyse the influence of the initial temperature on
the performance of the algorithm. An interesting
idea could be assessing the algorithm behaviour
when the temperature is initialized with a percentage
between 10% and 90% of the usual estimated
value for $T_0$.

Determine an optimal monotonic temperature
decrease curve, based on the exponential multiplicative
cooling. A possibility could be changing the
parameter $\alpha$ dynamically in the temperature decrease
equation, with lower initial values ($\alpha = 0.8$) and
final values closer to 1 ($\alpha = 0.9$), but achieving an
average number of temperature cycles equal to the
standard case with constant parameter $\alpha = 0.95$.
Study new non-monotonic temperature decrease 
methods, combined with the exponential mul- 
tiplicative cooling. These methods could be 
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Table 1. Parameters of the cooling schedules 





TSP47 

T =18000 
L = 3384 


TSP80 

T =110000 
L = 9720 


QAP47 

T = 3000000 
L = 3384 


QAP80 

T = 25000000 
L = 9720 


Exp M SA 


a = 0.95 


a = 0.95 


a = 0.95 


a = 0.95 


Log M SA 


a = 267.24 


a = 2307.6 


a = 657.19 


a = 1576.64 


Lin M SA 


a = 9.45 


a = 65.76 


a = 21.08 


a = 46.37 


Qua M SA 


a = 0.0675 


a = 0.3593 


a = 0.13344 


a = 0.2635 


Additive 

SA 


n=156 
T n = 6.06 


n = 209 
T n = 2.46 


n = 181 
T n = 276.5 


n=197 
T n = 1023 


Adapt SA 


a = 0.95 


a = 0.95 


a = 0.95 


a = 0.95 



Table 2. Results for QAP80

                         Min          Max          Average        Std. Error
LS          Objective    251313664    254639411    252752757.8    716436.9
            Iterations   27521        81118        46850.9        12458.1
Exp M SA    Objective    249876139    251490352    250525795.8    344195.3
            Iterations   684213       2047970      1791246.1      170293.2
Log M SA    Objective    250455151    253575468    251745332.1    666407.2
            Iterations   120248       2023073      696470.2       394540.2
Lin M SA    Objective    249896097    251907025    250635912.2    391761.6
            Iterations   653162       2250714      1428446.1      324475.7
Qua M SA    Objective    249847174    251611701    250470415.8    350861.7
            Iterations   1075019      2614896      1655619.1      294039
Lin A SA    Objective    250641396    253763581    251874846.8    591713.9
            Iterations   1946664      2552978      2008897.2      79027.1
Qua A SA    Objective    250033262    251800841    250780492.4    391747.2
            Iterations   1909770      2474159      1974620        89007.6
Exp A SA    Objective    249833632    251808711    250665143      379160.5
            Iterations   1799372      2939097      2080203.9      222456.5
Trig A SA   Objective    250053171    252148088    250908285.3    431834.7
            Iterations   1919964      2458885      1974441.2      77503.3
Adapt SA    Objective    249902310    251553959    250481008.4    361964.3
            Iterations   1564222      2455305      1803218.6      129649.9

based, like Locatelli's method in the comparison,
on modifying the temperature according to the
distance from the current objective to a reference
best objective.

Although these three lines of work are independent,
the ultimate objective would clearly be to integrate
their results in order to build a high-quality
combinatorial optimization metaheuristic based on
simulated annealing.
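
The second research line can be illustrated with a minimal sketch. It assumes the usual form of the exponential multiplicative schedule, T_k = T_0 * a^k, takes T_0 = 18000 from Table 1 (TSP47), and uses a hypothetical stopping temperature t_min = 1.0. The function names and the linear interpolation of a are illustrative assumptions, not the authors' method; the number of cycles of the dynamic variant is simply forced to match the standard case.

```python
def exponential_schedule(t0, a, t_min):
    """Yield T_k = t0 * a**k while the temperature is at least t_min."""
    t = t0
    while t >= t_min:
        yield t
        t *= a


def dynamic_schedule(t0, a_start, a_end, cycles):
    """Yield `cycles` temperatures; the factor a moves linearly from a_start to a_end."""
    t = t0
    for k in range(cycles):
        yield t
        frac = k / (cycles - 1) if cycles > 1 else 0.0
        t *= a_start + (a_end - a_start) * frac


# Standard case with constant a = 0.95 (t_min is an assumed stopping value).
standard = list(exponential_schedule(t0=18000.0, a=0.95, t_min=1.0))
# Dynamic case: a grows from 0.8 to 0.9 over the same number of cycles.
dynamic = list(dynamic_schedule(18000.0, 0.8, 0.9, len(standard)))

print(len(standard), standard[-1], dynamic[-1])
```

Both schedules are monotonic; only the rate of decrease differs. Choosing the start and end values of a so that the search quality is preserved is exactly the open question raised above.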



CONCLUSION 

The main conclusions that can be drawn from the
comparison among the cooling schedules of the simulated
annealing algorithm are:

Considering objective quality in relation to the shape
of the temperature decrease curve, we can affirm
that simulated annealing escapes from local minima
effectively when the curve has a moderate slope in the
initial and central parts of the run and a gentler slope
in the final part, as occurs in the standard
exponential multiplicative cooling. The specific
shape (convex, sigmoid) of the curve in the initial
and central parts does not seem to be significant.

Considering execution time, the standard error
of the number of iterations seems to be related
to the tail of the temperature decrease curve in the
final part of the algorithm. An inversely logarithmic
tail produces a gentler final temperature fall and a
higher standard error, while inversely quadratic
and exponential tails vanish faster, providing
the best standard error values of the algorithm.

Considering the use of a non-monotonic temperature
decrease method, we can affirm that the criterion
used not only does not worsen the general performance
of the algorithm but also seems to have a favourable
effect that deserves to be taken into account and
studied in greater depth.
KEY TERMS

Combinatorial Optimization: Area of the opti- 
mization theory whose main aim is the analysis and 
the algorithmic solving of constrained optimization 
problems with discrete variables. 

Cooling Schedule: Temperature control method in
the simulated annealing algorithm. It must specify the
initial temperature T_0, the finite sequence of decreasing
values of temperature, and the finite number L of state
transitions for each temperature value.

Genetic and Evolutionary Algorithms: Genetic
Algorithms (GAs) are approximate optimization
algorithms inspired by genetics and the theory of
evolution. The search space of solutions is seen as a set
of organisms grouped into populations that evolve in
time by means of two basic techniques: crossover and
mutation. Evolutionary Algorithms (EAs) are special
genetic algorithms that use only mutation as the
organism generation technique.

Local Search: Local search (LS) is a metaheuristic 
or general class of approximate optimization algorithms, 
based on the deterministic heuristic search technique 
called hill-climbing. 

Memetic Algorithms: Memetic Algorithms (MAs)
are optimization techniques based on the synergistic
combination of ideas taken from two other
metaheuristics: genetic algorithms and local search.

Simulated Annealing: Simulated Annealing (SA) 
is a variant of the metaheuristic of local search that in- 
corporates a stochastic criterion of acceptance of worse 
quality solutions, in order to prevent the algorithm from 
being prematurely trapped in local optima. 

Taboo Search: Taboo Search (TS) is a metaheuristic 
superimposed on another heuristic (usually local search 
or simulated annealing) whose aim is to avoid search 
cycles by forbidding or penalizing moves which take 
the solution to points previously visited in the solu- 
tion space. 
INTRODUCTION

In recent years, the notion of complex systems has proved
to be a very useful concept for defining, describing, and
studying various natural phenomena observed in a vast
number of scientific disciplines. Disciplines that benefit
greatly from this concept range from physics, mathematics,
and computer science through biology, medicine, and
economics to the social sciences and psychology. Various
techniques have been developed to describe natural
phenomena observed in these complex systems. Among these
are artificial life, evolutionary computation, swarm
intelligence, neural networks, parallel computing, cellular
automata, and many others. In this text, we focus our
attention on one of them: cellular automata.

We present a truly discrete modelling universe,
discrete in time, space, and state: Cellular Automata
(CAs) (Sloot & Hoekstra, 2007; Kroc, 2007; Sloot,
Chopard & Hoekstra, 2004). It is worth emphasizing
the importance of CAs in solving certain classes of
problems that are not tractable by other techniques.
CAs, despite their simplicity, are able to describe and
reproduce many complex phenomena that are closely
related to processes such as self-organization and
emergence, which are often observed within the
above-mentioned scientific disciplines.



BACKGROUND 

We briefly explain the idea of complex systems and 
cellular automata and provide references to a number 
of essential publications in the field. 



Complex Systems 

The concept of complex systems (CSs) emerged
simultaneously and often independently in various scientific
disciplines (Fishwick, 2007; Bak, 1996; Resnick, 1997).
This could be interpreted as an indication of their
universality. Despite the diversity of those fields, there
exist a number of features common to all complex
systems. Typically, a complex system consists of a vast
number of simple and locally operating parts, which
interact with one another and produce a complex global
response. Self-organization (Bak, 1996) and emergence,
often observed within complex systems, are driven by
dissipation of energy and/or information.

Self-organization can be easily explained with studies
of ant-colony behaviour, where a vast number of identical
processes, called ants, interact locally by physical
contact or by using pheromone-marked traces. There
is no leader providing every ant with information or
instructions about what it should do. Despite the lack of
such a leader or a hierarchy of leaders, ants are able to
build complicated ant colonies, feed their larvae, protect
the colony, fight against other colonies, etc. All this is
done automatically through a set of simple local
interactions among the ants. It is well known that ants
respond to each stimulus with one of 20 to 40 reactions
(depending on the ant species); these are enough to
produce the observed complexity.

Emergence is defined as the occurrence of new
processes operating at a higher level of abstraction than
the level at which the local rules operate. Each level
usually has its own local rules, different from the rules
operating at other levels. An emergent, like an ant colony,
is a product of the process of emergence. There can
be a whole hierarchy of emergents, e.g. as in the
human body, which consists of chemicals and DNA, going
through polypeptides, proteins, cellular infrastructures
and cycles, further on to cells and tissues, organs, and
bodies. We see that self-organization and emergence
are often closely linked to one another.

Cellular Automata 

Early development of CAs dates back to A. Turing,
S. Ulam, and J. von Neumann. We can define CAs
by four mutually interdependent parts: the lattice,
its variables, the neighbourhood, and the local rules
(Toffoli & Margolus, 1987; Toffoli, 1984; Vichniac,
1984; Ilachinski, 2001; Wolfram, 2002; Wolfram, 1994;
Sloot & Hoekstra, 2007; Kroc, 2007). These are briefly
explained below.

Lattices and Networks 

A lattice is created by a grid of elements, for historical
reasons called cells, which can be arranged in a one-,
two-, three-, or higher-dimensional space. The lattice is
typically composed of uniform cells such as squares,
hexagons or triangles in two dimensions.

CAs operating on networks and graphs represent a 
generalization of classical CAs, which are working on 
regular lattices. Networks can be random or regular. 
Networks can have various topologies, which are clas- 
sified by the degree of regularity and randomness. A 
lattice of cells can be interpreted as a regular network 
of vertices interconnected by edges. When we leave 
this regularity and allow some random neighbours, 
more precisely, if a major part of a network is regular 
and a smaller fraction of it is random, then we enter the 
domain of small-world networks. The idea of
small-world networks provides a unique tool, which allows
us to capture many essential properties of naturally
observed phenomena, especially those linked to social
networks and, surprisingly, to (metabolic and other)
networks operating within living cells. Whereas
small-world networks are a mixture of regular and random
networks, pure random networks have a completely
different scope of use. It is also worth mentioning the
concept of scale-free networks, whose connectivity
does not depend on scale (Kroc, 2007; Sloot,
Chopard & Hoekstra, 2004).



Variables 

A CA contains an arbitrary number of discrete
variables. Their number and range are dictated by
the phenomenon under study. The simplest CAs are
built using only one Boolean variable in one dimension
(1D); see, e.g., Wolfram (2002). Some of these simple
1D CAs exhibit high complexity and have been shown
to be capable of universal computation.
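
A minimal sketch of such a one-dimensional Boolean CA follows. It assumes Wolfram's rule numbering, in which the eight bits of the rule number give the new state for each of the eight possible (left, centre, right) neighbourhoods, and it assumes periodic boundaries. Rule 110 is used as the example because it is a well-known elementary rule proved capable of universal computation; the function name and the lattice size are illustrative choices.

```python
def step(cells, rule):
    """Apply an elementary (1D, Boolean, radius-1) rule once, with periodic boundaries."""
    n = len(cells)
    new = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood as a 3-bit number
        new.append((rule >> index) & 1)              # look up the matching bit of the rule
    return new


# Start from a single active cell and print a few generations of Rule 110.
cells = [0] * 31
cells[15] = 1
for _ in range(8):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells, 110)
```

The whole rule is encoded in one integer, which is why there are exactly 256 elementary CAs of this kind.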

Neighbourhoods 

The neighbourhood, which is used to evaluate a local
rule, is defined by a set of neighbouring cells including,
in the case of regular lattices, the updated cell itself
(Figure 1). Neighbours with relative coordinates [i, j+1],
[i-1, j], [i, j-1], [i+1, j] of the updated cell [i, j],
located to the North, West, South, and East, respectively,
define the so-called von Neumann neighbourhood
with radius r = 1. The Moore neighbourhood with
radius r = 1 contains the same cells as the von Neumann
neighbourhood plus the diagonal cells located at relative
positions [i-1, j+1], [i-1, j-1], [i+1, j-1], [i+1, j+1], i.e.
North-west, South-west, South-east, and North-east,
respectively.

There are many other types of neighbourhoods 
possible; neighbourhoods can even be spatially or 
temporally non-uniform. One example is the Margolus 
neighbourhood, used in diffusion modelling. 
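
The two standard neighbourhoods can be written as lists of relative offsets. The sketch below assumes the common convention that a von Neumann neighbourhood of radius r contains the cells within Manhattan distance r, and a Moore neighbourhood of radius r those within Chebyshev distance r. The function names are illustrative, and the updated cell itself (offset (0, 0)) is left out so that the caller can add it when the local rule needs it.

```python
def von_neumann(r=1):
    """Offsets (di, dj) at Manhattan distance 1..r from the updated cell."""
    return [(di, dj)
            for di in range(-r, r + 1)
            for dj in range(-r, r + 1)
            if 0 < abs(di) + abs(dj) <= r]


def moore(r=1):
    """Offsets (di, dj) at Chebyshev distance 1..r from the updated cell."""
    return [(di, dj)
            for di in range(-r, r + 1)
            for dj in range(-r, r + 1)
            if (di, dj) != (0, 0)]


print(sorted(von_neumann(1)))                       # North, West, South, East
print(sorted(set(moore(1)) - set(von_neumann(1))))  # the four diagonal cells
```

With r = 1 the von Neumann list has 4 offsets and the Moore list has 8, matching the cells enumerated above.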

The boundaries of each CA can be fixed, reflecting
or periodic. Periodic boundary conditions represent
infinite lattices. Periodic means that, e.g. in one
dimension, the rightmost cell of a lattice is connected to
the leftmost cell. Fixed boundary cells are kept at
predefined values. Reflecting boundary cells reflect
values back to the bulk of the lattice.
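
The three boundary types can be sketched for a one-dimensional lattice as follows. This is only one possible reading: the boundary is modelled as a "ghost" cell one step outside the lattice, and the reflecting case is assumed to mirror the value of the edge cell. The function name and the parameters `boundary` and `fixed_value` are hypothetical.

```python
def neighbour(cells, i, boundary="periodic", fixed_value=0):
    """Return the value seen at index i, which may lie one step outside the lattice."""
    n = len(cells)
    if 0 <= i < n:
        return cells[i]                 # ordinary cell inside the lattice
    if boundary == "periodic":
        return cells[i % n]             # wrap around: the lattice closes into a ring
    if boundary == "fixed":
        return fixed_value              # boundary kept at a predefined value
    if boundary == "reflecting":
        return cells[0] if i < 0 else cells[n - 1]  # mirror the edge cell (assumed reading)
    raise ValueError(f"unknown boundary type: {boundary}")


cells = [1, 2, 3]
for kind in ("periodic", "fixed", "reflecting"):
    print(kind, neighbour(cells, -1, kind), neighbour(cells, 3, kind))
```

Keeping the boundary logic in one helper lets the local rule itself stay independent of the boundary type.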

Local Rules 

A local rule defines the evolution of each CA. Usually,
it is realized by taking all variables from all cells
within the neighbourhood and evaluating a set
of logical and/or arithmetical operations written in the
form of an algorithm. The vector s of those variables
is updated according to the following local rule in the
case of the von Neumann neighbourhood:

s[i, j] = f(s[i, j+1], s[i-1, j], s[i, j-1], s[i+1, j]),

where i represents the x coordinate and j the y
coordinate of the cell, and f is the local rule. The updated
cell has coordinates [i, j]. Figure 1 shows a 5 x 5
two-dimensional CA with neighbourhoods having various
radii.

Figure 1. Four types of neighbourhood shown on a lattice of 5 x 5 cells: (from left) the von Neumann with
r = 1 and r = 2, the Moore with r = 1, and finally a random one
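
The update equation for the von Neumann neighbourhood can be sketched as below. The sketch assumes a single integer variable per cell, periodic boundaries, and synchronous updating (all cells read the old state); the parity rule used as f is a hypothetical example, not a rule taken from the text.

```python
def update(s, f):
    """Synchronously apply s[i][j] = f(north, west, south, east) on a periodic lattice."""
    nx, ny = len(s), len(s[0])
    new = [[0] * ny for _ in range(nx)]  # write into a copy so every cell sees the old state
    for i in range(nx):
        for j in range(ny):
            new[i][j] = f(s[i][(j + 1) % ny],   # North: [i, j+1]
                          s[(i - 1) % nx][j],   # West:  [i-1, j]
                          s[i][(j - 1) % ny],   # South: [i, j-1]
                          s[(i + 1) % nx][j])   # East:  [i+1, j]
    return new


def parity(north, west, south, east):
    """Illustrative local rule: the new state is the parity of the four neighbours."""
    return (north + west + south + east) % 2


# A 5 x 5 lattice with one active cell in the centre.
grid = [[0] * 5 for _ in range(5)]
grid[2][2] = 1
grid = update(grid, parity)
for row in grid:
    print(row)
```

Writing the new values into a separate array is what makes the update synchronous; updating in place would let some cells see already-updated neighbours.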

Modelling 

Computational modelling is defined as a mathematical,
numerical and/or computational description of a
naturally observed phenomenon. It is essential in situations
where the observed phenomena are not tractable by
analytical means. Results are often validated against
analytical solutions in special or simplified cases. The
importance of modelling has been shown in physics and
chemistry and is continuously increasing in new fields
such as biology, medicine, sociology, and psychology.



CELLULAR AUTOMATA MODELLING OF 
COMPLEX SYSTEMS 

There is a constant influx of new ideas and approaches
enriching the CA method. Within CA modelling of
complex systems, there are distinct streams of research
and applications in various disciplines; these are
briefly discussed in this section.

Classical cellular automata, with a regular lattice
of cells, are used to model ferromagnetic and
anti-ferromagnetic materials, solidification, static and
dynamic recrystallization, laser dynamics, traffic flow,
escape and pedestrian behaviour, voting processes,
self-replication, self-organization, earthquakes, volcanic
activity, secure coding of information and cryptography,
immune systems, living cells and tissue behaviour,
morphological development, ecosystems, and many
other natural phenomena (Sloot, Chopard & Hoekstra,
2004; Kroc, 2007; Ilachinski, 2001). CAs were first
used in the modelling of excitable media, such as heart
tissue. CAs often outperform other methods, e.g.
the Monte Carlo method, especially for highly
dissipative systems. The main reason why CAs represent the
best choice in modelling many naturally observed
complex phenomena is that CAs are defined over
truly spatio-temporally discretized worlds. The inherent
properties of CAs bring new qualities to models that
are, in principle, not achievable by other computational
techniques.

An example of an advanced CA method is the
Lattice Boltzmann method, consisting of a triangular
network of vertices interconnected by edges, where
generalized 'liquid particles' move and undergo
collisions according to a collision table. A model of a gas is
created in which conservation of mass, momentum and
energy during collisions is enforced, which produces
fully discrete and simplified, yet physically correct,
microdynamics. When operated in the right limits, such
models reproduce the incompressible Navier-Stokes
equations and are therefore a model for fluid dynamics.
Averaged quantities resulting from such simulations
correspond to solutions of the Navier-Stokes equations
(Sloot & Hoekstra, 2007; Rivet & Boon, 2001).

Classical CAs, using lattices, have many advantages
over other approaches, but some known disadvantages
have to be mentioned. One disadvantage of CAs
could lie in the use of discrete variables. Some authors
remove this restriction by using continuous
variables, leading to generalized CAs. The biggest
disadvantage of classical CAs is often found in the
restricted topology of the lattice. Classical regular
lattices fail to reproduce properties of many naturally
observed phenomena. This led to the following
development in CAs.

Generalized cellular automata (Darabos, Giacobini
& Tomassini, in Kroc, 2007) are built on general
networks, which can be regular, random, scale-free
or small-world networks. A regular network can be
created from a classical CA and its lattice, where each
cell represents a node and each neighbour is linked by
an edge. A random graph is made from nodes that have
randomly chosen nodes as neighbours. Within scale-free
networks, some nodes are highly connected to other
nodes whereas other nodes are less connected. Their
properties are independent of their size. The distribution
of the degree of links at a node follows a power-law
relationship $P(k) \sim k^{-\gamma}$, where P(k) is the
probability that a node is connected to k other
nodes. The coefficient $\gamma$ is in most cases between
2 and 3. Such networks occur, for instance, in the Internet,
in social networks, and in biologically produced
networks such as gene regulatory networks within living
cells or food chains within ecosystems (Sloot, Ivanov,
Boukhanovsky, van de Vijver & Boucher, 2007).
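
A generalized CA on a network can be sketched by replacing the lattice with a neighbour list per node. The sketch below makes several simplifying assumptions that are not taken from the cited works: links are stored as directed neighbour lists, random links are produced by replacing each regular link with probability p (so p = 0 gives a regular network, a small p a small-world-like mixture, and p = 1 a random one), and the local rule is a hypothetical majority vote. All function names are illustrative.

```python
import random


def ring_network(n, k=2):
    """Regular network: each node is linked to its k nearest nodes on each side of a ring."""
    return {v: [(v + d) % n for d in range(-k, k + 1) if d != 0] for v in range(n)}


def rewire(net, p, rng):
    """Replace each link by a link to a randomly chosen node with probability p."""
    n = len(net)
    return {v: [rng.randrange(n) if rng.random() < p else u for u in nbrs]
            for v, nbrs in net.items()}


def step(state, net):
    """Synchronous majority rule over each node's neighbour list (ties keep the old state)."""
    new = {}
    for v, nbrs in net.items():
        ones = sum(state[u] for u in nbrs)
        if 2 * ones > len(nbrs):
            new[v] = 1
        elif 2 * ones < len(nbrs):
            new[v] = 0
        else:
            new[v] = state[v]
    return new


rng = random.Random(0)
net = rewire(ring_network(20, k=2), p=0.1, rng=rng)
state = {v: rng.randint(0, 1) for v in net}
for _ in range(5):
    print("".join(str(state[v]) for v in sorted(state)))
    state = step(state, net)
```

The local rule is unchanged from the lattice case; only the definition of "neighbour" moves from geometry to the network's links.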

In general, the behaviour of a given CA is
unpredictable, a property that is often used in cryptography.
There exist a number of mostly statistical techniques for
studying the behaviour of a given CA, but none of them is
exact. The easiest way, and often the only one, to find
out the state of a CA is to execute it.



CASE STUDIES 

Understanding the morphological growth and branching of
stony corals with the lattice Boltzmann method is a good
example of studying a natural complex system with CAs
(Kaandorp, Lowe, Frenkel & Sloot, 1996; Kaandorp,
Sloot, Merks, Bak, Vermeij & Maier, 2005). A deep
insight into those processes is important for assessing the
role of corals in marine ecosystems and, e.g., their relation
to global climate change. Simulation of the growth and
branching of a coral involves multiphysics processes
such as nutrient diffusion, fluid flow, light absorption
by the zooxanthellae that live in symbiosis with the coral
polyps, as well as mechanical stress.

It has been demonstrated that nutrient gradients determine
the morphogenesis of branching in phototropic corals.
In this specific case, we deal with diffusion-limited
processes fully determining the morphological shape of
the growing corals. It is known from tank experiments
and simulation studies that those diffusion-dominant
regions operate for relatively high flow velocities. It has
been demonstrated that simulated coral morphologies
are indistinguishable from real corals (Kaandorp, Sloot,
Merks, Bak, Vermeij & Maier, 2005); see Figure 2.

Modelling of dynamic recrystallization represents
another active application of CAs within the field of
solid state physics (Kroc, 2002). Metals in polycrystalline
form, composed of many single crystals,
are deformed at elevated temperatures. The stored
energy increases due to deformation and is in
turn released by recrystallization, where nuclei grow
and form new grains. Growth is driven by the release
of stored energy. The response of the deformed
polycrystalline material is reflected in complex changes
within the microstructure and the deformation curve.

Stress-strain curves measured during the deformation
of metallic samples exhibit either single-peak
or multiple-peak behaviour. This complex response
of the deformed material is a direct result of concurrent
processes operating within it. CAs, so
far, represent the only computational technique that
is able to describe such complex material behaviour
(Kroc, 2002); see Figure 3.



FUTURE TRENDS 

There are a number of distinct tracks within CA
research, with a constant flux of new discoveries (Kroc,
2007; Sloot & Hoekstra, 2007). CAs are used to model
physical phenomena, but they are increasingly used
to model biological, medical and social phenomena.
Most CAs are designed by hand, but the future requires
the development of automatic and self-adjusting
optimization techniques to design local rules according to
the needs of the described natural phenomena.
Figure 2. Morphological growth of the coral Mandracis mirabilis: a 3D visualization of a CT scan
of the coral (top) and two simulated growth forms (bottom) with different morphological shapes
(Kaandorp, Sloot, Merks, Bak, Vermeij & Maier, 2005). Simulated structures are indistinguishable from real
corals.





It is important to stress that the CA technique
brings about cross-fertilization among many scientific
disciplines. It has happened many times in the past that
two or more very similar techniques were developed in
distinct scientific fields such as, e.g., physics and social
science.

The spatial structure of CAs is evolving from
regular lattices to networked CAs (Darabos, Giacobini &
Tomassini, in Kroc, 2007) and to multilevel CAs (Hoekstra,
Lorentz, Falcone & Chopard, 2007). In the future, updating
schemes of CAs will address two regimes:
synchronous (the classical one) and asynchronous
(Sloot, Overeinder & Schoneveld, 2001).



CONCLUSIONS 

We have briefly discussed complex systems and demonstrated
the usefulness of cellular automata in modelling those
systems. It has been shown that cellular automata
provide a simple but extremely efficient numerical
technique, which is able to describe and simulate such
complicated behaviour as self-organization and
emergence. This extraordinary combination of simplicity and
expressivity brings a constant flux of new discoveries
in the description of many naturally observed phenomena
in almost all scientific fields.
Figure 3. Simulation of dynamic recrystallization is represented by: stress-strain curves (top-left), relevant mean
grain size D-strain curves (top-right), an abrupt change of loading strain rate (bottom-left), and relevant D-strain
curves (bottom-right). Strain is represented by the number of CA steps (Kroc, 2002).

Finally, it is worth emphasizing that CAs represent
a generic method often used in the development of
prototypes of completely new numerical methods
describing naturally observed phenomena. We believe that
CAs have great potential for the future development
of computational modelling and the understanding of
the dynamics of complex systems.

KEY TERMS

Cellular Automaton: (plural: cellular automata.) A
cellular automaton is defined as a lattice (network) of
cells (automata) where each automaton contains a set
of discrete variables, which are updated in discrete time
steps according to a local rule operating over the
neighbours of a given cell. Cellular automata are typically
used as simplified but not simple models of complex
systems.

Generalized Cellular Automaton: It is based on 
use of networks instead of regular lattices. 

Complex Network: Most biological and social
networks reflect topological properties not observed
within simple networks (regular, random). Two
examples are small-world and scale-free networks.

Complex System: A typical complex system
consists of a vast number of identical copies of several
generic processes, which operate and interact
only locally or with a limited number of not necessarily
close neighbours. There is no global leader or controller
associated with such systems, and the resulting behaviour
is usually very complex.

Emergence: Emergence is defined as the occurrence
of new processes operating at a higher level of
abstraction than the level at which the local rules operate.
A typical example is an ant colony, where this large
complex structure emerges through local interactions
of ants. As another example, a whole hierarchy of
emergents exists and operates in a human body. An
emergent is the product of an emergence process.

Lattice Gas Automata: Typically, it is a triangular 
network of vertices interconnected by edges where 
generalized liquid particles move and undergo colli- 
sions. Averaged quantities resulting from such simu- 
lations correspond to solutions of the Navier-Stokes 
equations. 

Modelling: It is a description of naturally observed
phenomena using analytical, numerical, and/or
computational methods. Computational modelling is
classically used in fields such as physics and engineering.
Its importance is increasing in other fields such as
biology, medicine, sociology, and psychology.

Random Network: A neighbourhood of a vertex is 
created by a set of randomly chosen links to neighbour- 
ing vertices (elements) within a network of vertices. 



Regular Lattice: A perfectly regular and uniform 
neighbourhood for each lattice element called cell 
characterizes such lattices. 

Self-Organization: Self-organization is a process
typically occurring within complex systems, where a
system is continuously fed with energy, which is
transformed into a new system state or operational mode by
dissipation of energy and/or information.

Self-Organized Criticality: A complex system
exhibiting self-organized criticality (SOC) is continuously
fed with energy, whose release is discrete and typically
occurs in the form of avalanches. Most of the time, an
SOC system operates at a critical point where avalanches
occur. Earthquakes and volcanic eruptions are prototypical
examples of SOC among naturally observed phenomena.

Small-World Network: A mixture of two different
types of connections within each neighbourhood
characterizes small-world networks. Typically, the
neighbourhood of a given vertex is composed of a greater
fraction of neighbours having regular short-range
connectivity (regular network) and a smaller fraction of
random connections (random network). Such a type of
neighbourhood provides unique properties to each model
built on top of it.
INTRODUCTION

The usual real-valued artificial neural networks
have been applied to various fields such as
telecommunications, robotics, bioinformatics, image
processing and speech recognition, in which complex
numbers (two dimensions) are often used with the
Fourier transformation. This indicates the usefulness
of complex-valued neural networks, whose input and
output signals and parameters such as weights and
thresholds are all complex numbers, and which are an
extension of the usual real-valued neural networks.
In addition, in the human brain, an action potential
may have different pulse patterns, and the distance
between pulses may be different. This suggests that it is
appropriate to introduce complex numbers representing
phase and amplitude into neural networks.

Aizenberg, Ivaskiv, Pospelov and Hudiakov (1971)
(former Soviet Union) were the first to propose a
complex-valued neuron model. Although it was originally
available only in the Russian literature, their work can
now be read in English (Aizenberg, Aizenberg & Vandewalle,
2000). Prior to that time, most researchers outside
Russia had assumed that the first persons to propose
a complex-valued neuron were Widrow, McCool and
Ball (1975). Interest in the field of neural networks
started to grow around 1990, and various types of
complex-valued neural network models were subsequently
proposed. Since then, their characteristics have been
researched, making it possible to solve some problems
which could not be solved with the real-valued neuron,
and to solve many complicated problems more simply
and efficiently.



BACKGROUND 

The generic definition of a complex-valued neuron is
as follows. The input signals, weights, thresholds and
output signals are all complex numbers. The net input
$U_n$ to a complex-valued neuron $n$ is defined as:

    $U_n = \sum_m W_{nm} X_m + V_n$    (1)

where $W_{nm}$ is the complex-valued weight connecting
complex-valued neurons $n$ and $m$, $X_m$ is the
complex-valued input signal from the complex-valued neuron
$m$, and $V_n$ is the complex-valued threshold of the neuron
$n$. The output value of the neuron $n$ is given by
$f_C(U_n)$, where $f_C : \mathbb{C} \to \mathbb{C}$ is called
the activation function ($\mathbb{C}$ denotes the set of
complex numbers). Various types of activation
functions used in the complex-valued neuron have
been proposed, which influence the properties of the
complex-valued neuron, and a complex-valued neural
network consists of such complex-valued neurons.

For example, the component-wise activation
function or real-imaginary type activation function is
often used (Nitta & Furuya, 1991; Benvenuto & Piazza,
1992; Nitta, 1997), which is defined as follows:

    $f_C(z) = f_R(x) + i f_R(y)$    (2)

where $f_R(u) = 1/(1+\exp(-u))$, $u \in \mathbb{R}$
($\mathbb{R}$ denotes the set of real numbers), $i$ denotes
$\sqrt{-1}$, and the net input $U_n$ is converted into its
real and imaginary parts as follows:

    $U_n = x + iy = z.$    (3)

That is, the real and imaginary parts of an output of
a neuron mean the sigmoid functions of the real part x
and imaginary part y of the net input z to the neuron,
respectively.
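
Eqns (1) to (3) can be sketched directly with Python's built-in complex numbers. This is a minimal illustration for a single neuron; the function names and the example weight, input and threshold values are hypothetical.

```python
import math


def f_r(u):
    """Real sigmoid f_R(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def f_c(z):
    """Component-wise activation, eqn (2): f_C(z) = f_R(x) + i f_R(y) for z = x + iy."""
    return complex(f_r(z.real), f_r(z.imag))


def neuron_output(weights, inputs, threshold):
    """Net input of eqn (1), U_n = sum_m W_nm X_m + V_n, followed by f_C."""
    u = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return f_c(u)


# Hypothetical neuron with two complex-valued inputs.
out = neuron_output(weights=[0.5 + 0.5j, -1.0 + 0.2j],
                    inputs=[1.0 - 1.0j, 0.3 + 0.7j],
                    threshold=0.1 - 0.1j)
print(out)
```

Because each part passes through a sigmoid, both the real and the imaginary part of the output lie strictly between 0 and 1, which is the boundedness discussed next.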

Note that the component-wise activation function
(eqn (2)) is bounded but non-regular as a
complex-valued function because the Cauchy-Riemann equations
do not hold. Here, as several researchers have pointed
out (Georgiou & Koutsougeras, 1992; Nitta, 1997),
we should recall Liouville's theorem in the complex
domain, which states that if a function g is regular at
all $z \in \mathbb{C}$ and bounded, then g is a constant
function. That is, we need to choose either regularity or
boundedness for an activation function of complex-valued
neurons. In addition, it has been proved that the
complex-valued neural network with the component-wise
activation function (eqn (2)) can approximate any continuous
complex-valued function, whereas a network with
a regular activation function (for example, $f_C(z) =
1/(1+\exp(-z))$ (Kim & Guest, 1990), and $f_C(z) = \tanh(z)$
(Kim & Adali, 2003)) cannot approximate any
non-regular complex-valued function (Arena, Fortuna, Re
& Xibilia, 1993; Arena, Fortuna, Muscato & Xibilia,
1998). That is, the complex-valued neural network
with the non-regular activation function (eqn (2)) is a
universal approximator, but a network with a regular
activation function is not. It should be noted here that
the complex-valued neural network with a regular
complex-valued activation function such as $f_C(z) =
\tanh(z)$, which has poles, can be a universal approximator
on the compact subsets of the deleted neighbourhood
of the poles (Kim & Adali, 2003). This fact is very
important theoretically. Unfortunately, however, the
complex-valued neural network used in that analysis is
not the usual one: the output of the hidden neuron is
defined as the product of several activation functions.
The statement therefore seems insufficient for comparison
with the case of the component-wise complex-valued
activation function. Thus, the ability of complex-valued
neural networks to approximate complex-valued functions
depends heavily on the regularity of the activation
functions used.
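
The failure of the Cauchy-Riemann equations for the component-wise activation function (eqn (2)) can be checked by a short calculation. The following is a sketch of that check, writing $f_C = u + iv$ with real-valued $u$ and $v$; the symbols $u$, $v$ and $t$ are introduced here only for the derivation.

```latex
\begin{align*}
f_C(z) &= u(x,y) + i\,v(x,y), \qquad u(x,y) = f_R(x), \quad v(x,y) = f_R(y), \\
\frac{\partial u}{\partial y} &= 0 = -\frac{\partial v}{\partial x}
  \quad \text{(second Cauchy-Riemann equation: holds everywhere)}, \\
\frac{\partial u}{\partial x} &= f_R'(x), \qquad
\frac{\partial v}{\partial y} = f_R'(y), \qquad
f_R'(t) = f_R(t)\,\bigl(1 - f_R(t)\bigr), \\
\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}
  &\iff f_R'(x) = f_R'(y) \iff |x| = |y| .
\end{align*}
```

The last equivalence uses the fact that $f_R'$ is an even function that decreases strictly with $|t|$. The first Cauchy-Riemann equation therefore holds only on the two lines $|x| = |y|$, which contain no open set, so the function is regular nowhere, while both of its parts stay within the interval (0, 1).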

On the other hand, several complex-valued activa- 
tion functions based on polar coordinates have been 
proposed. For example, Hirose (1992) proposed the
following amplitude-phase type activation function:

$f(z) = \tanh\left(\frac{a}{m}\right) \cdot \exp(i\varphi), \quad z = a \cdot \exp(i\varphi),$ (4)

where m is a constant. Although this amplitude-phase
activation function is not regular, Hirose noted that the
non-regularity did not cause serious problems in real
applications and that the amplitude-phase framework
is suitable for applications in many engineering fields
such as optical information processing systems, and
amplitude modulation, phase modulation and frequency
modulation in electromagnetic wave communications
and radar. Aizenberg et al. (2000) proposed the
following activation function:

$f(z) = \exp\left(\frac{i 2\pi j}{k}\right), \quad \text{if } \frac{2\pi j}{k} \le \varphi < \frac{2\pi (j+1)}{k}, \quad z = a \cdot \exp(i\varphi); \quad j = 0, 1, \ldots, k-1,$ (5)

where k is a constant. Eqn (5) can be regarded as a type
of amplitude-phase activation function. Only phase
information is used and the amplitude information is
discarded; however, many successful applications show
that the activation function is sufficient.
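
The two polar-coordinate activation functions can be sketched in a few lines of Python. This is only an illustration of eqns (4) and (5) as written above; the function names `amplitude_phase` and `multi_valued` and the default values of m and k are assumptions made for the example, not names or settings taken from Hirose (1992) or Aizenberg et al. (2000).

```python
import cmath
import math


def amplitude_phase(z: complex, m: float = 1.0) -> complex:
    """Eqn (4): squash the amplitude with tanh(a/m) and keep the phase."""
    a, phi = cmath.polar(z)  # z = a * exp(i * phi)
    return math.tanh(a / m) * cmath.exp(1j * phi)


def multi_valued(z: complex, k: int = 4) -> complex:
    """Eqn (5): output the k-th root of unity of the sector containing the phase."""
    phi = cmath.phase(z) % (2.0 * math.pi)        # bring the phase into [0, 2*pi)
    sector = int(phi // (2.0 * math.pi / k))      # index j with 2*pi*j/k <= phi
    sector = min(sector, k - 1)                   # guard against rounding at 2*pi
    return cmath.exp(1j * 2.0 * math.pi * sector / k)


if __name__ == "__main__":
    print(amplitude_phase(3 + 4j, m=5.0))  # amplitude tanh(1), phase of 3+4j
    print(multi_valued(1 + 1j, k=4))       # phase pi/4 lies in sector j = 0
```

Note that `multi_valued` ignores the amplitude a entirely, which is the sense in which eqn (5) discards amplitude information.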



INHERENT PROPERTIES OF THE 
MULTI-LAYERED TYPE 
COMPLEX-VALUED NEURAL NETWORK 

This article presents the essential differences between 
multi-layered type real-valued neural networks and 
multi-layered type complex-valued neural networks, 
which are very important because they expand the real 
application fields of the multi-layered type complex- 
valued neural networks. To the author's knowledge, 
the inherent properties of complex-valued neural 
networks with regular complex-valued activation 
functions have not been revealed so far, except for their
learning performance. Thus, mainly the inherent properties
of the complex-valued neural network with the
non-regular complex-valued activation function (eqn (2))
are described: (a) the learning performance, (b)
the ability to transform geometric figures, and (c) the 
orthogonal decision boundary. 

Learning Performance 

In the applications of multi-layered type real-valued 
neural networks, the error back-propagation learning 
algorithm (called here, Real-BP) (Rumelhart, Hinton 
& Williams, 1986) has often been used. Naturally, the 
complex-valued version of the Real-BP (called here,
Complex-BP) can be considered, and was actually proposed
by several researchers (Kim & Guest, 1990; Nitta
& Furuya, 1991; Benvenuto & Piazza, 1992; Georgiou
& Koutsougeras, 1992; Nitta, 1993, 1997; Kim & Adali,
2003). This algorithm enables the network to learn
complex-valued patterns naturally.

It is known that the learning speed of the Complex-BP
algorithm is faster than that of the Real-BP algorithm.
Nitta (1991, 1997) showed in some experiments on
learning complex-valued patterns that the learning speed
is several times faster than that of the conventional
technique, while the space complexity (i.e., the number 
of learnable parameters needed) is only about half that 
of Real-BP. Furthermore, De Azevedo, Travessa and 
Argoud (2005) applied the Complex-BP algorithm 
of the literature (Nitta, 1997) to the recognition and 
classification of epileptiform patterns in EEG, in par- 
ticular, dealing with spike and eye-blink patterns, and 
reconfirmed the superiority of the learning speed of 
the Complex-BP described above. As for the regular 
complex-valued activation function, Kim and Adali 
(2003) compared the learning speed of the Complex-BP 
using nine regular complex-valued activation functions 
with those of the Complex-BP using three non-regular 
complex-valued activation functions (including eqn (4)) 
through a computer simulation for a simple nonlinear 
system identification example. The experimental 
results suggested that the Complex-BP with the regular
activation function $f(z) = \operatorname{arcsinh}(z)$ was the fastest
among them.

Ability to Transform Geometric Figures 

The Complex-BP with the non-regular complex-valued
activation function (eqn (2)) can transform geometric 
figures, e.g. rotation, similarity transformation and 
parallel displacement of straight lines, circles, etc., 
whereas the Real-BP cannot (Nitta, 1991, 1993, 1997). 
Numerical experiments suggested that the behaviour 
of a Complex-BP network which learned the transfor- 
mation of geometric figures was related to the Identity 
Theorem in complex analysis. 

Only an illustrative example on a rotation is given
below. In the computer simulation, a 1-6-1 three-layered
complex-valued neural network was used, which
transformed a point (x, y) into (x', y') in the complex
plane. Although the Complex-BP network generates
a value z within the range 0 < Re[z], Im[z] < 1 due to
the activation function used (eqn (2)), for the sake of
convenience it is presented in the figure given below
as having a transformed value within the range -1 <
Re[z], Im[z] < 1. The learning rate used in the experiment
was 0.5. The initial real and imaginary components
of the weights and the thresholds were chosen to be
random real numbers between 0 and 1. The experiment
consisted of two parts: a training step, followed by a
test step. The training step consisted of learning a set
of (complex-valued) weights and thresholds, such that
the input set of (straight line) points (indicated by black
circles in Fig. 1) gave as output the (straight line) points
(indicated by white circles) rotated counterclockwise
over π/2 radians. Input and output pairs were presented
1,000 times in the training step. These complex-valued
weights and thresholds were then used in a test
step, in which the input points lying on a straight line
(indicated by black triangles in Fig. 1) would hopefully
be mapped to an output set of points lying on
the straight line (indicated by white triangles) rotated
counterclockwise over π/2 radians. The actual output
test points for the Complex-BP did, indeed, lie on the
straight line (indicated by white squares). It appears that
the complex-valued network has learned to generalize
the transformation of each point $z_k$ (= $r_k \exp[i\theta_k]$) into
$z_k \exp[i\alpha]$ (= $r_k \exp[i(\theta_k + \alpha)]$), i.e., the angle of each
complex-valued point is updated by a complex-valued
factor $\exp[i\alpha]$, while the absolute length of each
input point is preserved. In the above experiment, the
11 training input points lay on the line y = -x + 1 (0 <
x < 1) and the 11 training output points lay on the line
y = x + 1 (-1 < x < 0). The seven test input points lay
on the line y = 0.2 (-0.9 < x < 0.3). The desired output
test points should lie on the line x = -0.2.
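
The target mapping of this experiment can be reproduced directly with complex arithmetic. The sketch below does not train a Complex-BP network; it only computes the desired outputs by multiplying each input point by exp[iα] with α = π/2, assuming evenly spaced points on the stated lines (the spacing is an assumption, since the original point positions are only shown in the figure).

```python
import numpy as np

alpha = np.pi / 2                      # counterclockwise rotation angle
rotate = np.exp(1j * alpha)            # complex factor exp[i*alpha]

# 11 training inputs on the line y = -x + 1 (evenly spaced, an assumption)
x_train = np.linspace(0.0, 1.0, 11)
train_in = x_train + 1j * (-x_train + 1.0)
train_out = train_in * rotate          # desired training outputs

# 7 test inputs on the line y = 0.2 (evenly spaced, an assumption)
x_test = np.linspace(-0.9, 0.3, 7)
test_in = x_test + 0.2j
test_out = test_in * rotate            # desired test outputs

# Rotated training points lie on y = x + 1, rotated test points on x = -0.2,
# and the absolute length of every point is preserved.
print(np.allclose(train_out.imag, train_out.real + 1.0))
print(np.allclose(test_out.real, -0.2))
print(np.allclose(np.abs(test_out), np.abs(test_in)))
```

Multiplication by exp[iα] changes only the argument of each point, which is the behaviour the trained network is reported to generalize to the unseen test line.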

Watanabe, Yazawa, Miyauchi and Miyauchi (1994)
applied the Complex-BP in the field of computer 
vision. They successfully used the ability to transform 
geometric figures of the Complex-BP network to 
complement the 2D velocity vector field on an image, 
which was derived from a set of images and called an 
optical flow. The ability to transform the geometric 
figure of the Complex-BP can also be used to generate 
fractal images. Actually, Miura and Aiyoshi (2003) 
applied the Complex-BP to the generation of fractal 
images and showed in computer simulations that some 
fractal images such as snow crystals could be obtained 
with high accuracy where the iterated function systems 
(IFS) were constructed using the ability to transform 
geometric figure of the Complex-BP. 

Orthogonal Decision Boundary 

The decision boundary of the complex-valued neu- 
ron with the non-regular complex-valued activation 
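function (eqn (2)) has a different structure from that
of the real-valued neuron. Consider a complex-valued
neuron with M input neurons. Let the weights be denoted by
$w = {}^t[w_1 \cdots w_M] = w^r + i w^i$, $w^r = {}^t[w_1^r \cdots w_M^r]$,
$w^i = {}^t[w_1^i \cdots w_M^i]$, and let the threshold be denoted by
$\theta = \theta^r + i\theta^i$. Then, for M input signals (complex numbers)
$z = {}^t[z_1 \cdots z_M] = x + iy$, $x = {}^t[x_1 \cdots x_M]$,
$y = {}^t[y_1 \cdots y_M]$, the complex-valued neuron generates

$X + iY = f_R\left( \begin{bmatrix} {}^t w^r & -{}^t w^i \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \theta^r \right) + i f_R\left( \begin{bmatrix} {}^t w^i & {}^t w^r \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \theta^i \right)$ (6)

as an output. Here, for any two constants $C^R, C^I \in (0, 1)$, let

$X(x, y) = f_R\left( \begin{bmatrix} {}^t w^r & -{}^t w^i \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \theta^r \right) = C^R,$ (7)

$Y(x, y) = f_R\left( \begin{bmatrix} {}^t w^i & {}^t w^r \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \theta^i \right) = C^I.$ (8)
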

Note here that eqn (7) is the decision boundary for 
the real part of an output of the complex-valued neuron
with M inputs. That is, input signals $(x, y) \in \mathbb{R}^{2M}$
are classified into two decision regions $\{(x, y) \in \mathbb{R}^{2M} \mid
X(x, y) > C^R\}$ and $\{(x, y) \in \mathbb{R}^{2M} \mid X(x, y) < C^R\}$ by the
hypersurface given by eqn (7). Similarly, eqn (8) is the
decision boundary for the imaginary part. Noting that 
the inner product of the normal vectors of the decision 
boundaries (eqns (7) and (8)) is zero, we find that the 
decision boundary for the real part of an output of a 
complex-valued neuron and that for the imaginary part 
intersect orthogonally. 
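
The orthogonality can be made explicit with a short calculation. The following is a sketch that assumes the real activation function $f_R$ is differentiable with non-vanishing derivative, and writes $u^r$ and $u^i$ (symbols introduced here only for the derivation) for the arguments of $f_R$ in the real and imaginary parts of the output.

```latex
% Normal vectors of the two decision boundaries (gradients w.r.t. (x, y)):
\nabla X(x, y) = f_R'(u^r) \begin{bmatrix} w^r \\ -w^i \end{bmatrix},
\qquad
\nabla Y(x, y) = f_R'(u^i) \begin{bmatrix} w^i \\ w^r \end{bmatrix}.

% Their inner product vanishes for every choice of weights:
\nabla X \cdot \nabla Y
  = f_R'(u^r)\, f_R'(u^i) \left( {}^t w^r w^i - {}^t w^i w^r \right)
  = 0 .
```

The cancellation comes from the structure of complex multiplication itself: the real part uses the pair $(w^r, -w^i)$ and the imaginary part the pair $(w^i, w^r)$, so no condition on the learned weights is needed.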

As is well known, the XOR problem and the detection 
of symmetry problem cannot be solved with a single 
real-valued neuron (Rumelhart, Hinton & Williams,
1986). Contrary to expectation, it was proved that such 
problems could be solved by a single complex-valued 
neuron with the orthogonal decision boundary, which 
revealed the potent computational power of complex- 
valued neural networks (Nitta, 2004a). 



FUTURE TRENDS 

Many application results (Hirose, 2003) such as associa- 
tive memories, adaptive filters, multi-user communica- 
tion and radar image processing suggest directions for 
future research on complex-valued neural networks. 

It is natural that the inherent properties of the 
multi-layered type complex-valued neural network 
with non-regular complex-valued activation function 
(eqn (2)) are not limited to the ones described above. 
Furthermore, to the author's knowledge, the inherent 
properties except the learning performance of recurrent 
type complex-valued neural networks have not been 
reported. The same is also true of the complex-valued 
neural network with regular complex-valued activation
functions. Such exploration will expand the application 
fields of complex-valued neural networks. 

In the meantime, efforts have already been made 
to increase the dimensionality of neural networks, for 
example, three dimensions (Nitta, 2006), quaternions 
(Arena, Fortuna, Muscato & Xibilia, 1998; Isokawa, 
Kusakabe, Matsui & Peper, 2003; Nitta, 2004b), 
Clifford algebras (Pearson & Bisset, 1992; Buchholz 
& Sommer, 2001), and N dimensions (Nitta, 2007), 
which is a new direction for enhancing the ability of
neural networks.



CONCLUSION 

This article outlined the inherent properties of complex- 
valued neural networks, especially those of the case 
with non-regular complex-valued activation functions, 
that is, (a) the learning performance, (b) the ability to 
transform geometric figures, and (c) the orthogonal 
decision boundary. Successful applications of such 
networks were also described. 



REFERENCES 

Aizenberg, I. N., Aizenberg, N. N., & Vandewalle, J. 
(2000). Multi-valued and universal binary neurons. 
Boston: Kluwer Academic Publishers. 

Aizenberg, N. N., Ivaskiv, Yu. L., Pospelov, D. A., &
Hudiakov, G. F. (1971). Multiple-valued threshold
functions, boolean complex-threshold functions and
their generalization. Kibernetika (Cybernetics), 4,
44-51 (in Russian).



Arena, P., Fortuna, L., Muscato, G., & Xibilia, 
M. G. (1998). Neural networks in multidimensional
domains. Lecture Notes in Control and Information 
Sciences, 234, London: Springer. 

Arena, P., Fortuna, L., Re, R., & Xibilia, M. G. (1993). 
On the capability of neural networks with complex 
neurons in complex valued functions approximation. 
Proc. IEEE Int. Conf. on Circuits and Systems, 2168- 
2171. 

Benvenuto, N., & Piazza, F. (1992). On the complex 
backpropagation algorithm. IEEE Trans. Signal Pro- 
cessing, 40(4), 967-969. 

Buchholz, S., & Sommer, G. (2001). Introduction to 
neural computation in Clifford algebra. In Sommer, G. 
(Ed.), Geometric computing with Clifford algebras (pp. 
291-314). Berlin Heidelberg: Springer. 

De Azevedo, F. M., Travessa, S. S., & Argoud, F.
I. M. (2005). The investigation of complex neural 
network on epileptiform pattern classification. Proc. 
The 3rd European Medical and Biological Engineering 
Conference (EMBEC05), 2800-2804. 

Georgiou, G. M., & Koutsougeras, C. (1992). Complex
domain backpropagation. IEEE Trans. Circuits and
Systems II: Analog and Digital Signal Processing,
39(5), 330-334. 

Hirose, A. (1992). Continuous complex-valued back- 
propagation learning. Electronics Letters, 28(20), 
1854-1855. 

Hirose, A. (2003). Complex-valued neural networks, 
Singapore: World Scientific Publishing. 

Isokawa, T., Kusakabe, T., Matsui, N., & Peper, F. (2003).
Quaternion neural network and its application. In Palade,
V., Howlett, R. J., & Jain, L. C. (Eds.), Lecture notes
in artificial intelligence, 2774 (KES2003) (pp. 318-324),
Berlin Heidelberg: Springer. 

Kim, T., & Adali, T. (2003). Approximation by fully 
complex multilayer perceptrons. Neural Computation, 
15(7), 1641-1666. 

Kim, M. S., & Guest, C. C. (1990). Modification of 
backpropagation networks for complex-valued signal 
processing in frequency domain. Proc. Int. Joint Conf. 
on Neural Networks, 3, 27-31. 



Miura, M., & Aiyoshi, E. (2003). Approximation and 
designing of fractal images by complex neural networks. 
IEEJ Trans. on Electronics, Information and Systems,
123(8), 1465-1472 (in Japanese). 

Nitta, T., & Furuya, T. (1991). A complex back- 
propagation learning. Transactions of Information 
Processing Society of Japan, 32(10), 1319-1329 (in 
Japanese). 

Nitta, T. (1993). A complex numbered version of the 
back-propagation algorithm. Proc. World Congress on 
Neural Networks, 3, 576-579. 

Nitta, T. (1997). An extension of the back-propagation 
algorithm to complex numbers. Neural Networks, 
10(8), 1392-1415. 

Nitta, T. (2004a). Orthogonality of decision boundaries 
in complex-valued neural networks. Neural Computation,
16(1), 73-97.

Nitta, T. (2004b). A Solution to the 4-bit parity problem 
with a single quaternary neuron. Neural Information 
Processing - Letters and Reviews, 5(2), 33-39. 

Nitta, T. (2006). Three-dimensional vector valued 
neural network and its generalization ability. Neural 
Information Processing - Letters and Reviews, 10(10),
237-242. 

Nitta, T. (2007). N-dimensional vector neuron.
Proc. IJCAI Workshop on Complex Valued Neural
Networks and Neuro-Computing: Novel Methods, Applications
and Implementations, 2-7.

Pearson, J., & Bisset, D. (1992). Backpropagation in 
a Clifford algebra. Proc. International Conference on 
Artificial Neural Networks, Brighton. 

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. 
(1986). Parallel distributed processing (Vol.1). MA: 
The MIT Press. 

Watanabe, A., Yazawa, N., Miyauchi, A., & Miyauchi, 
M. (1994). A method to interpret 3D motions using 
neural networks. IEICE Transactions on Fundamentals 
of Electronics, Communications and Computer 
Sciences, E77-A(8), 1363-1370. 

Widrow, B., McCool, J., & Ball, M. (1975). The complex
LMS algorithm. Proceedings of the IEEE, 63(4), 719-720.







KEY TERMS 

Artificial Neural Network: A network composed 
of artificial neurons. Artificial neural networks can be 
trained to find nonlinear relationships in data. 

Back-Propagation Algorithm: A supervised learning
technique used for training neural networks, based
on minimizing the error between the actual outputs and 
the desired outputs. 

Clifford Algebras: An associative algebra, which 
can be thought of as one of the possible generaliza- 
tions of complex numbers and quaternions. 

Complex Number: A number of the form a + ib 
where a and b are real numbers, and i is the imaginary 
unit such that i^2 = -1. a is called the real part, and b
the imaginary part. 



Decision Boundary: A boundary which pattern 
classifiers such as the real-valued neural network use 
to classify input patterns into several classes. It generally 
consists of hypersurfaces. 

Identity Theorem: A theorem for regular complex 
functions: given two regular functions f and g on a 
connected open set D, if f = g on some neighborhood 
of z that is in D, then f = g on D. 

Quaternion: A four-dimensional number which is 
a non-commutative extension of complex numbers. 

Regular Complex Function: A complex function 
that is complex-differentiable at every point. 



Figure 1. Rotation of a straight line. A black circle denotes an input training point, a white circle an output train- 
ing point, a black triangle an input test point, a white triangle a desired output test point, and a white square an 
output test point generated by the Complex-BP network.

INTRODUCTION

The typical recognition/classification framework in
Artificial Vision uses a set of object features for
discrimination. Features can be either numerical measures
or nominal values. Once obtained, these feature values
are used to classify the object. The output of the
classification is a label for the object (Mitchell, 1997).

The classifier is usually built from a set of "training"
samples. This is a set of examples that comprise feature
values and their corresponding labels. Once trained,
the classifier can produce labels for new samples that
are not in the training set.

Obviously, the extracted features must be discriminative.
Finding a good set of features, however, may
not be an easy task. Consider, for example, the face
recognition problem: recognize a person using the
image of his/her face. This is currently a hot topic of
research within the Artificial Vision community, see
the surveys (Chellappa et al., 1995), (Samal & Iyengar,
1992) and (Chellappa & Zhao, 2005). In this problem,
the available features are all of the pixels in the image.
However, only a number of these pixels are normally
useful for discrimination. Some pixels are background,
hair, shoulders, etc. Even inside the head zone of the
image some pixels are less useful than others. The eye
zone, for example, is known to be more informative
than the forehead or cheeks (Wallraven et al., 2005).
This means that some features (pixels) may actually
increase recognition error, for they may confuse the
classifier.

Apart from performance, from a computational cost
point of view it is desirable to use a minimum number
of features. If fed with a large number of features, the
classifier will take too long to train or classify.

BACKGROUND

Feature Selection aims at identifying the most informa- 
tive features. Once we have a measure of "informative- 
ness" for each feature, a subset of them can be used for 
classifying. In this case, the features remain the same, 
only a selection is made. The topic of feature selec- 
tion has been extensively studied within the Machine 
Learning community (Duda et al, 2000). Alternatively, 
in Feature Extraction a new set of features is created 
from the original set. In both cases the objective is both 
reducing the number of available features and using 
the most discriminative ones. 

The following sections describe two techniques 
for Feature Extraction: Principal Component Analysis 
and Independent Component Analysis. Linear Discriminant
Analysis (LDA) is a similar dimensionality
reduction technique that will not be covered here for
space reasons; we refer the reader to the classical text
(Duda et al., 2000).



Figure 1.

As an example problem we will consider face recognition.
The face recognition problem is particularly
interesting here for a number of reasons. First,
it is a topic of increasingly active research in Artificial
Vision, with potential applications in many domains.
Second, it has images as input (see Figure 1, from the
Yale Face Database (Belhumeur et al., 1997)), which
means that some kind of feature processing/selection
must be done prior to classification.



PRINCIPAL COMPONENT ANALYSIS 

Principal Component Analysis (PCA), see (Turk & 
Pentland, 1991), is an orthogonal linear transformation 
of the input feature space. PCA transforms the data to
a new coordinate system in which the data variance
along the new dimensions is maximized. Figure 2 shows a
2-class set of samples in a 2-feature space. These data
have a certain variance along the horizontal and vertical
axes. PCA maps the samples to a new orthogonal
coordinate system, shown in bold, in which the sample
variances are maximized. The new coordinate system
is centered on the data mean.

The new set of features (note that the coordinate
axes are features) is better from a discrimination point
of view, for samples of the two classes can be readily
separated. Besides, the PCA transform provides an
ordering of features, from the most discriminative (in
terms of variance) to the least. This means that we can
select and use only a subset of them. In Figure 2,
for example, the coordinate with the largest axis is the
most discriminative.

When the input space is an image, as in face recogni- 
tion, training images are stored in a matrix T. Each row 
of T contains a training image (the image rows are laid 
consecutively, forming a vector). Thus, each image pixel 
is considered a feature. Let there be n training images. 
PCA can then be done in the following steps: 



1. Subtract the mean image vector m from T, where:

   $m = \frac{1}{n} \sum_{i=1}^{n} T_i$

   and $T_i$ denotes the i-th row of T.

2. Calculate the covariance matrix C:

   $C = \frac{1}{n} \sum_{i=1}^{n} {}^t(T_i - m)\,(T_i - m)$

3. Perform Singular Value Decomposition over C,
   which gives an orthogonal transform matrix W.

4. Choose a set of "eigenfaces" (see below).

Figure 2. (axis label: Feature 1)

Figure 3.


The new feature axes are the columns of W. These 
features can be considered as images (i.e. by arrang- 
ing each vector as a matrix), and are commonly called 
basis images or eigenfaces within the face recognition 
community. The intensity of the pixels of these images 
represents their weight or contribution in the axis. Figure 
3 shows the typical aspect of eigenfaces. 

Normally, in step 4 above only the best K eigen- 
vectors are selected and used in the classifier. That is 
achieved by discarding a number of columns in W. 
Once we have the appropriate transform matrix, any 
set X of images can be transformed to this new space 
simply by: 

1. Subtract the mean m from the images in X 

2. Calculate Y = XW 

The transformed image vectors Y are the new feature
vectors that the classifier will use for training and/or
classifying.
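
The PCA steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names `pca_fit` and `pca_transform` are hypothetical, T holds one flattened image per row, and the covariance is normalized by n.

```python
import numpy as np


def pca_fit(T: np.ndarray, K: int):
    """Return the mean image m (d,) and the transform matrix W (d, K).

    T has shape (n, d): n training images, each flattened to d pixels.
    """
    m = T.mean(axis=0)                 # step 1: mean image vector
    D = T - m                          # centered training images
    C = D.T @ D / T.shape[0]           # step 2: covariance matrix (d x d)
    U, s, _ = np.linalg.svd(C)         # step 3: columns of U are the new axes,
                                       #         ordered by decreasing variance
    return m, U[:, :K]                 # step 4: keep the best K eigenvectors


def pca_transform(X: np.ndarray, m: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project images X (one per row) onto the retained axes: Y = (X - m) W."""
    return (X - m) @ W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.standard_normal((20, 6))   # toy data standing in for images
    m, W = pca_fit(T, K=2)
    Y = pca_transform(T, m, W)
    print(Y.shape)                     # (20, 2)
```

Because C has one row and column per pixel, forming it explicitly as done here is only practical for small images; the sketch is meant to mirror the listed steps, not to be efficient.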



INDEPENDENT COMPONENT ANALYSIS 

Independent Component Analysis (ICA) is a feature extraction
technique based on extracting statistically independent
variables from a mixture of them (Jutten & Herault,
1991; Comon, 1994). ICA has been successfully applied
to many different problems such as MEG and
EEG data analysis, blind source separation (i.e.
separating mixtures of sound signals simultaneously
picked up by several microphones), finding hidden
factors in financial data and face recognition, see (Bell
& Sejnowski, 1995).

The ICA technique aims at finding a linear trans- 
form for the input data so that the transformed data 
is as statistically independent as possible. Statistical 
independence implies decorrelation (but note that the 
opposite is not true). Therefore, ICA can be considered 
a generalization of PCA. 

The basis images obtained with ICA are more local 
than those obtained with PCA, which suggests that they 
can lead to more precise representations. Figure 4 shows 
the typical basis images obtained with ICA. 

ICA routines are available in a number of different 
implementations, particularly for Matlab. ICA has a 
higher computational cost than PCA. FastICA is the 
most efficient implementation to date, see (Gavert et 
al, 2005). 

As opposed to PCA, ICA does not provide an in- 
trinsic order for the representation coefficients of the 
face images, which does not help when extracting a 
subset of K features. In (Bartlett & Sejnowski, 1997) 
the best results were obtained with an order based on 
the ratio of between-class to within-class variance for 
each coefficient: 



$\frac{V}{v}$

where V is the variance of the class means and v is the
sum of the variances within each class.

Figure 4.
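
The ordering criterion can be sketched as follows. The function names `class_discriminability` and `order_by_ratio` are hypothetical, and the code assumes the coefficients are stored one sample per row with an integer class label per sample; it is an illustration of the ratio V/v, not the implementation used by Bartlett and Sejnowski.

```python
import numpy as np


def class_discriminability(Y: np.ndarray, labels) -> np.ndarray:
    """Ratio V / v for each coefficient (column) of Y.

    Y has shape (n, K); labels has length n. V is the variance of the class
    means and v is the sum of the variances within each class.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.array([Y[labels == c].mean(axis=0) for c in classes])
    V = class_means.var(axis=0)                              # between classes
    v = sum(Y[labels == c].var(axis=0) for c in classes)     # within classes
    return V / v


def order_by_ratio(Y: np.ndarray, labels) -> np.ndarray:
    """Indices of the coefficients, from most to least discriminative."""
    return np.argsort(-class_discriminability(Y, labels))


if __name__ == "__main__":
    Y = np.array([[0.0, 1.0], [0.2, -1.0], [5.0, 1.0], [5.2, -1.0]])
    labels = [0, 0, 1, 1]
    print(order_by_ratio(Y, labels))   # coefficient 0 separates the classes
```

A coefficient whose within-class variance is zero would cause a division by zero here; a practical version would need to handle that case.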



FUTURE TRENDS 

PCA has been shown to be an invaluable tool in 
Artificial Vision. Since the seminal work of (Turk & 
Pentland, 1991) PCA-based methods are considered 
standard baselines in the problem of face recognition. 
Many other techniques have evolved from it: robust 
PCA, nonlinear PCA, incremental PCA, kernel PCA, 
probabilistic PCA, etc. 

As mentioned above, ICA can be considered an 
extension of PCA. Some authors have shown that in 
certain cases the ICA transformation does not provide 
performance gain over PCA when a "good" classifier is 
used (like Support Vector Machines), see (Deniz et al, 
2003). This may be of practical significance, since PCA
is faster than ICA. ICA is not being used as extensively 
within the Artificial Vision community as it is in other 
disciplines like signal processing, especially where the 
problem of interest is signal separation. 

On the other hand, Graph Embedding (Yan et al, 
2005) is a framework recently proposed that constitutes 
an elegant generalization of PCA, LDA (Linear Discriminant
Analysis), LPP (Locality Preserving Projections)
and other dimensionality reduction techniques. As well 
as providing a common formulation, it facilitates the 
designing of new dimensionality reduction algorithms 
based on new criteria. 



CONCLUSION 

Component analysis is a useful tool for Artificial Vision 
Researchers. PCA, in particular, can now be considered
indispensable to reduce the high dimensionality of images.
Both computation time and error ratios can be
reduced. This article has described both PCA and the 
related technique ICA, focusing on their application 
to the face recognition problem. 

Both PCA and ICA act as a feature extraction stage,
prior to training and classification. ICA is computationally
more demanding; however, its advantage over
PCA has not yet been established in the context of face
recognition. Thus, it is foreseeable that the eigenfaces
technique introduced by Turk and Pentland will remain
a face recognition baseline in the near future.

KEY TERMS

Classifier: Algorithm that produces class labels 
as output, from a set of features of an object. A clas- 
sifier, for example, is used to classify certain features 
extracted from a face image and provide a label (an 
identity of the individual). 

Eigenface: A basis vector of the PCA transform, 
when applied to face images. 

Face Recognition: The AV problem of recogniz- 
ing an individual from one or more images of his/her 
face. 

Feature Extraction: The process by which a new 
set of discriminative features is obtained from those 
available. Classification is performed using the new 
set of features. 

Feature Selection: The process by which a subset 
of the available features (usually the most discrimina- 
tive ones) is selected for classification. 

Independent Component Analysis: Feature extrac- 
tion technique in which the statistical independence of 
the data is maximized. 

Principal Component Analysis: Feature extraction 
technique in which the variance of the data is maximized.
It provides a new feature space in which the
dimensions are ordered by sample variance. Thus,
a subset of these dimensions can be chosen in which 
samples are minimally correlated.

INTRODUCTION

Biomedical imaging represents a practical and concep- 
tual revolution in the applied sciences of the last thirty 
years. Two basic ingredients permitted such a break- 
through: the technological development of hardware 
for the collection of detailed information on the organ 
under investigation in a less and less invasive fashion; 
the formulation and application of sophisticated math- 
ematical tools for signal processing within a method- 
ological setting of truly interdisciplinary flavor. 

A typical acquisition procedure in biomedical imaging
requires the probing of the biological tissue by means
of some emitted, reflected or transmitted radiation. Then 
a mathematical model describing the image formation 
process is introduced and computational methods for 
the numerical solution of the model equations are 
formulated. Finally, methods based on or inspired by 
Artificial Intelligence (AI) frameworks like machine 
learning are applied to the reconstructed images in order 
to extract clinically helpful information. 

Important issues in this research activity are the 
intrinsic numerical instability of the reconstruction 
problem, the convergence properties and the computa- 
tional complexity of the image processing algorithms. 
Such issues will be discussed in the following with the 
help of several examples of notable significance in the 
biomedical practice. 



BACKGROUND 

The first breakthrough in the theory and practice of 
recent biomedical imaging is represented by X-ray 
Computerized Tomography (CT) (Hounsfield, 1973). 
On October 11 1979 Allan Cormack and Godfrey 
Hounsfield gained the Nobel Prize in medicine for 
the development of computer assisted tomography. 
In the press release motivating the award, the Nobel 
Assembly of the Karolinska Institut wrote that in
this revolutionary diagnostic tool "the signals [...] are
stored and mathematically analyzed in a computer. The
computer is programmed to reconstruct an image of
the examined cross-section by solving a large number
of equations including a corresponding number of
unknowns". Starting from this crucial milestone,
biomedical imaging has represented a lively melting pot
of clinical practice, experimental physics, computer
science and applied mathematics, providing mankind
with numerous non-invasive and effective instruments for
early detection of diseases, and scientists with a prolific
and exciting area for research activity.

The main imaging modalities in biomedicine can 
be grouped into two families according to the kind of 
information content they provide. 

Structural imaging: the image provides infor- 
mation on the anatomical features of the tissue 
without investigating the organic metabolism. 
Structural modalities are typically characterized 
by a notable spatial resolution but are ineffective 
in reconstructing the dynamical evolution of the 
imaging parameters. Further to X-ray CT, other 
examples of such approach are Fluorescence 
Microscopy (Rost & Oldfield, 2000), Ultrasound 
Tomography (Greenleaf, Gisvold & Bahn, 1982),
structural Magnetic Resonance Imaging (MRI)
(Haacke, Brown, Venkatesan & Thompson, 1999)
and some kinds of prototypal non-linear tomographies
like Microwave Tomography (Boulyshev,
Souvorov, Semenov, Posukh & Sizov, 2004), 
Diffraction Tomography (Guo & Devaney, 2005), 
Electrical Impedance Tomography (Cheney, Isaa- 
cson & Newell, 1999) and Optical Tomography 
(Arridge, 1999). 

Functional imaging: during the acquisition many 
different sets of signals are recorded according 
to a precisely established temporal paradigm. 
The resulting images can provide information 
on metabolic deficiencies and functional diseases
but are typically characterized by a spatial resolution
which is lower (sometimes much lower)
than that of anatomical imaging. Emission
tomographies like Single Photon Emission Computerized
Tomography (SPECT) (Duncan, 1997)
or Positron Emission Tomography (PET) (Valk,
Bailey, Townsend & Maisey, 2004) and Magnetic
Resonance Imaging in its functional setup (fMRI)
(Huettel, Song & McCarthy, 2004) are examples of
these dynamical techniques, together with Electro-
and Magnetoencephalography (EEG and MEG)
(Zschocke & Speckmann, 1993; Hamalainen,
Hari, Ilmoniemi, Knuutila & Lounasmaa, 1993),
which reproduce the neural activity at a millisecond
time scale and in a completely non-invasive
fashion.

In all these imaging modalities the correct mathematical
modeling of the imaging problem, the formulation
of computational algorithms for the solution
of the model equations and the application of image
processing algorithms for data interpretation are the
crucial steps which allow the exploitation of the visual
information from the measured raw data.



MAIN FOCUS 

From a mathematical viewpoint the inverse problem
of synthesizing the biological information in a visual
form from the collected radiation is characterized by
a peculiar pathology.

The concept of ill-posedness was introduced
by Jules Hadamard (Hadamard, 1923) to indicate mathematical
problems whose solution does not exist for
all data, or is not unique, or does not depend continuously
on the data. In biomedical imaging this last feature has
particularly deleterious consequences: indeed, the presence
of measurement noise in the raw data may produce
notable numerical instabilities in the reconstruction
when naive approaches are applied.

Most (if not all) biomedical imaging problems are
ill-posed inverse problems (Bertero & Boccacci, 1998)
whose solution is a difficult mathematical task and often
requires a notable computational effort. The first step
toward the solution is represented by an accurate modeling
of the mathematical relation between the biological
organ to be imaged and the data provided by the imaging
device. Under the most general assumptions the model
equation is a non-linear integral equation, although, for
several devices, the non-linear imaging equation can
be reliably approximated by a linear model where the
integral kernel encodes the impulse response of the
instrument. Such linearization can be either performed
through a precise technological realization, as in MRI,
where acquisition is designed in such a way that the
data are just the Fourier Transform of the object to be
imaged; or obtained mathematically, by applying a sort
of perturbation theory to the non-linear equation, as
in diffraction tomography, whose model comes from
the linearization of the scattering equation.

The second step toward image reconstruction is
given by the formulation of computational methods
for the reduction of the model equation. In the case of
linear ill-posed inverse problems, a well-established
regularization theory exists which attenuates the numerical
instability related to ill-posedness while maintaining
the biological reliability of the reconstructed image.
Regularization theory is at the basis of most linear
imaging modalities and regularization methods can be
formulated in both a probabilistic and a deterministic
setting. Unfortunately an analogously well-established
theory does not exist in the case of non-linear imaging
problems, which therefore are often addressed by means
of 'ad hoc' techniques.
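
The stabilizing effect of regularization on a linear ill-posed problem can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the linear imaging operator has already been diagonalized, so that each data component equals the corresponding object component multiplied by a singular value, plus noise. Tikhonov regularization is used here as one representative deterministic method; the names `naive_inverse`, `tikhonov_inverse`, the parameter `alpha` and all numerical values are hypothetical and do not describe any specific imaging device.

```python
def naive_inverse(singular_values, data):
    """Invert each component directly; small singular values amplify noise."""
    return [d / s for s, d in zip(singular_values, data)]


def tikhonov_inverse(singular_values, data, alpha):
    """Tikhonov-filtered inversion: s * d / (s^2 + alpha) for each component."""
    return [s * d / (s * s + alpha) for s, d in zip(singular_values, data)]


def squared_error(estimate, truth):
    """Sum of squared differences between an estimate and the true object."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


# Toy problem (assumed values): rapidly decaying singular values and small noise.
singular_values = [1.0, 0.1, 0.01, 0.001]
true_object = [1.0, 1.0, 1.0, 1.0]
noise = [0.01, -0.01, 0.01, -0.01]
data = [s * x + n for s, x, n in zip(singular_values, true_object, noise)]

naive_error = squared_error(naive_inverse(singular_values, data), true_object)
regularized_error = squared_error(
    tikhonov_inverse(singular_values, data, alpha=1e-4), true_object
)
print(naive_error, regularized_error)
```

With these toy values the naive inversion amplifies the noise on the components associated with small singular values, whereas the regularized estimate has a much smaller error at the price of a bias on those same components.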

Once an image has been reconstructed from the data,
a third step has to be considered, i.e. the processing of
the reconstructed images for the extraction and interpretation
of their information content. Three different
problems are typically addressed at this stage:

Edge detection (Trucco & Verri, 1998). Computer
vision techniques are applied in order to enhance
the regions of the image where the luminous
intensity changes sharply.

Image integration (Maintz & Viergever, 1998). In
the clinical workflow several images of a patient
are taken with different modalities and geometries.
These images can be fused in an integrated model
by recovering changes in their geometry.

Image segmentation (Acton & Ray, 2007). Partial
volume effects make the interfaces between the
different tissues extremely fuzzy, thus complicating
the clinical interpretation of the restored images.
An automatic procedure for the partitioning
of the image in homogeneous pixel sets and for
the classification of the segmented regions is at
the basis of any Computer Aided Diagnosis and
therapy (CAD) software.
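
The edge detection task in the list above can be illustrated by a minimal Python sketch based on central finite differences. It is only one simple way of highlighting sharp intensity changes, not the specific technique of the cited references; the function name `gradient_magnitude` and the toy image are assumptions made for illustration.

```python
def gradient_magnitude(image):
    """Return the gradient magnitude of a 2D image given as a list of rows.

    Border pixels are left at 0.0 because central differences need both
    neighbours along each axis.
    """
    rows, cols = len(image), len(image[0])
    edges = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0  # horizontal change
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0  # vertical change
            edges[r][c] = (gx * gx + gy * gy) ** 0.5
    return edges


# Toy image (assumed): a dark region on the left, a bright region on the right.
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
edges = gradient_magnitude(image)
print(edges[2])
```

The response is non-zero only in the two columns adjacent to the intensity step, which is where the luminous intensity changes sharply.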

AI algorithms and, above all, machine learning play
a crucial role in addressing these image processing issues.
In particular, as a subfield of machine learning,
pattern recognition provides a sophisticated description
of the data which, in medical imaging, makes it possible to
locate tumors and other pathologies, measure tissue
dimensions, support computer-aided surgery and study
anatomical structures. For example, supervised approaches
like backpropagation (Freeman & Skapura,
1991) or boosting (Schapire, 2003) accomplish classification
tasks of the different tissues from the knowledge
of previously interpreted images, while unsupervised
methods like Self-Organizing Maps (SOM) (Kohonen,
2001), fuzzy clustering (De Oliveira & Pedrycz, 2007)
and Expectation-Maximization (EM) (McLachlan &
Krishnan, 1996) infer probabilistic information or
identify clustering structures in sets of unlabeled images.
From a mathematical viewpoint, several of these
methods correspond more to heuristic recipes than
to rigorously formulated and motivated procedures.
However, over the last decade the theory of statistical
learning (Vapnik, 1998) has emerged as the best
candidate for a rigorous description of machine learning
within a functional analysis framework.
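
As an illustration of the unsupervised methods mentioned above, the following Python sketch applies a basic fuzzy c-means iteration to one-dimensional pixel intensities. It is a simplified sketch under stated assumptions: the function name `fuzzy_c_means`, the fuzziness value, the initial centers and the intensity values are all hypothetical, and real segmentation software would work on full images with additional safeguards.

```python
def fuzzy_c_means(values, centers, fuzziness=2.0, iterations=50):
    """Cluster scalar values; return (centers, memberships).

    memberships[i][j] is the degree to which values[i] belongs to cluster j.
    """
    exponent = 2.0 / (fuzziness - 1.0)
    centers = list(centers)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            distances = [abs(x - c) for c in centers]
            if 0.0 in distances:
                # A value lying exactly on a center belongs fully to it.
                row = [1.0 if d == 0.0 else 0.0 for d in distances]
                total = sum(row)
                row = [u / total for u in row]
            else:
                row = [
                    1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
                    for d_i in distances
                ]
            memberships.append(row)
        # Each center becomes the membership-weighted mean of the values.
        for j in range(len(centers)):
            weights = [row[j] ** fuzziness for row in memberships]
            centers[j] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships


# Toy intensities (assumed): a dark tissue near 11 and a bright tissue near 100.
intensities = [10, 12, 11, 100, 102, 98]
centers, memberships = fuzzy_c_means(intensities, centers=[0.0, 50.0])
print(centers)
```

Unlike a hard partition, each pixel keeps a graded membership in every cluster, which is one way of coping with the fuzzy tissue interfaces caused by partial volume effects.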



FUTURE TRENDS 

Among the main goals of recent biomedical imaging
we point out the realization of

microimaging techniques which allow the investigation
of biological tissues of micrometric size
for both diagnostic and research purposes;

hybrid systems combining information from
different modalities, possibly anatomical and
functional;

highly non-invasive diagnostic tools, where even
a modest discomfort is avoided.

These goals can be accomplished only by means
of an effective interplay of hardware development
and application of innovative image processing algorithms.
For example, microtomography for biological
samples requires the introduction of both new X-ray
tubes for data acquisition and computational methods
for the reduction of beam hardening effects; electrophysiological
and structural information on the brain
can be collected by performing an EEG recording
inside an MRI scanner, but also by using the structural
information from MRI as prior information in the
analysis of the EEG signal accomplished in a Bayesian
setting; finally, non-invasivity in colonoscopy can be
obtained by utilizing the most recent acquisition design
in X-ray tomography together with sophisticated
software which allows virtual navigation within the
bowel, electronic cleansing and automatic classification
of cancerous and healthy tissues.

From a purely computational viewpoint, two important
goals in machine learning applied to medical
imaging are the development of algorithms for semi-supervised
learning and for the automatic integration
of genetic data with information coming from the
acquired imagery.



CONCLUSION

Some aspects of recent biomedical imaging have been
described from a computational science perspective.
The biomedical image reconstruction problem has been
discussed as an ill-posed inverse problem where the
intrinsic numerical instability producing image artifacts
can be reduced by applying sophisticated regularization
methods. The role of image processing based
on machine learning techniques has been described
together with the main goals of recent biomedical
imaging applications.

KEY TERMS

Computer Aided Diagnosis (CAD): The use of
computers for the interpretation of medical images.
Automatic segmentation is one of the crucial tasks of
any CAD product.

Edge Detection: Image processing technique for 
enhancing the points of an image at which the luminous 
intensity changes sharply. 

Electroencephalography (EEG): Non-invasive 
diagnostic tool which records the cerebral electrical 
activity by means of surface electrodes placed on the 
skull. 

Ill-Posedness: Mathematical pathology of differen- 
tial or integral problems, whereby the solution of the 
problem does not exist for all data, or is not unique or 
does not depend continuously on the data. In computa- 
tion, the numerical effects of ill-posedness are reduced 
by means of regularization methods. 

Image Integration: In medical imaging, combination
of different images of the same patient acquired
with different modalities and/or according to different
geometries.

Magnetic Resonance Imaging (MRI): Imaging 
modality based on the principles of nuclear magnetic 
resonance (NMR), a spectroscopic technique used to 
obtain microscopic chemical and physical information 
about molecules. MRI can be applied in both functional 
and anatomical settings. 

Magnetoencephalography (MEG): Non-invasive 
diagnostic tool which records the cerebral magnetic 
activity by means of superconducting sensors placed 
on a helmet surrounding the brain. 



Segmentation: Image processing technique for 
distinguishing the different homogeneous regions in 
an image. 

Statistical Learning: Mathematical framework
which utilizes functional analysis and optimization tools
for studying the problem of inference.

Tomography: Imaging technique providing two- 
dimensional views of an object. The method is used 
in many disciplines and may utilize input radiation of 
different nature and wavelength. There exist X-ray, 
optical, microwave, diffraction and electrical imped- 
ance tomographies. 

INTRODUCTION

Applying biological concepts to create new models
in the computational field is not a revolutionary idea:
biology has already been the basis for the famous
artificial neuron models, the genetic algorithms, etc.
The cells of a biological organism are able to compose
very complex structures from a single cell, the zygote,
with no need for centralized control (Watson J.D. &
Crick F.H. 1953). The cells can perform such a process
thanks to the existence of a general plan, encoded in
the DNA, for the development and functioning of the
system. Another interesting characteristic of natural
cells is that they form systems that are tolerant to partial
failures: small errors do not induce a global collapse of
the system. Finally, the tissues that are composed of
biological cells present parallel information processing
for the coordination of tissue functioning in each and
every cell that composes the tissue.

All the above characteristics are very interesting
from a computational viewpoint. This paper presents
the development of a model that tries to emulate
biological cells and to take advantage of some of their
characteristics by adapting them to artificial cells.
The model is based on a set of techniques known as
Artificial Embryology (Stanley K. & Miikkulainen
R. 2003) or Embryology Computation (Kumar S. &
Bentley P.J. 2003).



BACKGROUND 

The Evolutionary Computation (EC) field has given
rise to a set of models that are grouped under the name
of Artificial Embryology (AE), first introduced by Stanley
and Miikkulainen (Stanley K. & Miikkulainen R.
2003). This group refers to all the models that try to apply
certain characteristics of biological embryonic cells to
computer problem solving, i.e. self-organisation, failure
tolerance, and parallel information processing.

The work on AE follows two points of view. On the one
hand, there are the grammatical models based on
L-systems (Lindenmayer A. 1968), which take a top-down
approach to the problem. On the other hand, there are
the chemical models based on Turing's ideas
(Turing A. 1952), which take a bottom-up approach.

For the latter, the starting point of this field can
be found in the modelling of gene regulatory networks
performed by Kauffman in 1969 (Kauffman S.A.
1969). After that, several works were carried out on
subjects such as the complex behaviour generated by
the fact that the differential expression of certain genes
has a cascade influence on the expression of others
(Mjolsness E., Sharp D.H., & Reinitz J. 1995).

The work performed by the scientific community can
be divided into two main branches. The more theoretical
branch uses the emulation of cell capabilities such
as cellular differentiation and metabolism (Kitano H.
1994; Kaneko K. 2006) to create a model that functions
as a natural cell. The purpose of this work is to carry out
an in-depth study of the biological model.

The more practical branch mainly focuses on the
development of a cell-inspired model that might be
applicable to other problems (Bentley, P.J., Kumar,
S. 1999; Kumar, S. 2004). According to this model,
every cell would not only have genetic information
that encodes the general performance of the system, but
would also act as a processor that communicates with
the other cells. This model is mainly applied to the
solution of simple 3D spatial problems, robot control,
generative encoding for the construction of artificial
organisms in simulated physical environments and real
robots, or to the development of the evolutionary design
of hardware and circuits (Endo K., Maeno T. & Kitano
H. 2003; Tufte G. & Haddow P.C. 2005).

Among the works on gene regulatory networks, the
most relevant models are the following: the Kumar and
Bentley model (Kumar S. & Bentley P.J. 2003), which
uses Bentley's theory of fractal proteins (Bentley,
P.J. 1999) for the calculation of protein concentration;
the Eggenberger model (Eggenberger P. 1996),
which uses the concepts of cellular differentiation and
cellular movement to determine cell connections; and
the work of Dellaert and Beer (Dellaert F. & Beer R.D.
1996), who propose a model that incorporates the idea
of biological operons to control the model expression,
where the function assumes the mathematical meaning
of a Boolean function.

All these models can be regarded as special cellular
automata. In cellular automata, a starting cell set
in a certain state will turn into a different set of cells
in different states when the same transition function
(Conway J.H. 1971) is applied to all the cells during a
given lapse of time in order to control the message
concurrence among them. The best known example of
cellular automata is Conway's "Game of Life", where
this behaviour can be observed perfectly. Whereas the
classical conception specifies the behaviour rules, the
evolutionary models establish the rules by searching
for a specific behaviour. This difference stems from the
mathematical origin of cellular automata, whereas
the models presented here are based on biology and
embryology.
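
The idea of applying the same transition function to all the cells can be illustrated with a minimal Python sketch of one generation of Conway's "Game of Life". The representation of the grid as a set of live cell coordinates and the function name `life_step` are choices made for this illustration only.

```python
from collections import Counter


def life_step(live_cells):
    """Apply the Game of Life transition function to every cell at once.

    live_cells is a set of (row, column) coordinates of the live cells.
    """
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live_cells
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive in the next generation if it has exactly three live
    # neighbours, or if it is currently alive and has exactly two.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }


# The "blinker" pattern oscillates between a horizontal and a vertical bar.
blinker = {(1, 0), (1, 1), (1, 2)}
next_generation = life_step(blinker)
print(sorted(next_generation))
```

Here the rule is fixed by hand, as in the classical conception; in the evolutionary models discussed in the text, the rules themselves are the object of the search.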

These models should not be confused with other
concepts that might seem similar, such as Gene Expression
Programming (GEP) (Ferreira C. 2006). Although
GEP codifies the solution in a string, similarly to what
is done in the present work, the solution program
is developed in a tree shape, as in classical genetic
programming (Koza, J. et al. 1999), which has little or
nothing in common with the presented models.



ARTIFICIAL EMBRYOGENY MODEL

The cells of a biological system are mainly determined
by the DNA strand, the genes, and the proteins contained
in the cytoplasm. The DNA is the structure that holds
the gene-encoded information that is needed for the
development of the system. The genes are activated or
transcribed thanks to the protein-shaped information
that exists in the cytoplasm, and consist of two main
parts: the sequence, which identifies the protein that
will be generated if the gene is transcribed, and the
promoter, which identifies the proteins that are needed
for gene transcription.

Another remarkable aspect of biological genes is the
difference between constitutive genes and regulating
genes. The latter are transcribed only when the proteins
identified in the promoter part are present. The constitutive
genes are always transcribed, unless inhibited by
the presence of the proteins identified in the promoter
part, which then act as gene repressors.

The present work has tried to partially model this
structure with the aim of fitting some of its abilities into
a computational model; in this way, the system has
a structure that is similar to the above and
is detailed in the next section.

Proposed Model

Various model variants were developed on the basis
of biological concepts. The proposed artificial cellular
system is based on the interaction of artificial cells
by means of messages that are called proteins. These
cells can divide themselves, die, or generate proteins
that will act as messages for themselves as well as for
neighbour cells.

The system is supposed to express a global behaviour
towards the generation of structures in 2D. Such
behaviour would emerge from the information encoded
in a set of variables of the cell that, in analogy with the
biological cells, will be named genes.

One promising application, on which we are working,
could be the compact encoding of adaptive shapes,
similar to the functioning of fractal growth or
fractal image compression.



Figure 1. Structure of a system gene



DNA 
The central element of our model is the artificial cell.
Every cell holds binary string-encoded information for
the regulation of its functioning. Following the biological
analogy, this string will be called DNA. The cell also
has a structure for the storage and management of the
proteins generated by the cell itself and those received
from neighbouring cells; following the biological
model, this structure is called cytoplasm.

The DNA of the artificial cell consists of functional
units that are called genes. Each gene encodes a protein
or message (produced by the gene). The structure of a
gene has four parts (see Figure 1):

Sequence: the binary string that corresponds to
the protein that the gene encodes.

Promoters: the gene area that indicates the
proteins that are needed for the gene's transcription.

Constituent: this bit identifies whether the gene is
constituent or regulating.

Activation percentage (binary value): the minimal
percentage concentration of promoter
proteins inside the cell that causes the transcription
of the gene.

The other fundamental element, which keeps and
manages the proteins that are received or produced by
the artificial cell, is the cytoplasm. The stored proteins
have a certain lifetime before they are erased. The
cytoplasm checks which and how many proteins are
needed for the cell to activate the DNA genes, and as
such responds to all the cellular requirements for the
concentration of a given type of protein. The cytoplasm
also extracts the proteins from the structure in case they
are needed for gene transcription.
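
The responsibilities of the cytoplasm described above can be summarised in a minimal Python sketch. It is not the authors' implementation: the class name `Cytoplasm`, its method names, the default lifetime and the definition of concentration as the share of stored proteins with a given sequence are all assumptions made for illustration.

```python
class Cytoplasm:
    """Toy protein store; lifetimes are counted in cellular cycles."""

    def __init__(self, lifetime=3):
        self.lifetime = lifetime  # assumed: same lifetime for every protein
        self._proteins = []  # each entry is [sequence, remaining_cycles]

    def add(self, sequence):
        """Store a protein produced by the cell or received from a neighbour."""
        self._proteins.append([sequence, self.lifetime])

    def age(self):
        """Decrease every lifetime by one cycle and erase expired proteins."""
        for protein in self._proteins:
            protein[1] -= 1
        self._proteins = [p for p in self._proteins if p[1] > 0]

    def concentration_percentage(self, sequence):
        """Percentage of the stored proteins that have the given sequence."""
        if not self._proteins:
            return 0.0
        matching = sum(1 for stored, _ in self._proteins if stored == sequence)
        return 100.0 * matching / len(self._proteins)

    def extract(self, sequence):
        """Remove one protein with the given sequence; report whether one existed."""
        for index, (stored, _) in enumerate(self._proteins):
            if stored == sequence:
                del self._proteins[index]
                return True
        return False


cytoplasm = Cytoplasm(lifetime=2)
for sequence in ("1010", "1010", "1010", "0101"):
    cytoplasm.add(sequence)
initial_concentration = cytoplasm.concentration_percentage("1010")
was_extracted = cytoplasm.extract("1010")
cytoplasm.age()
cytoplasm.age()
final_concentration = cytoplasm.concentration_percentage("1010")
print(initial_concentration, was_extracted, final_concentration)
```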

Model Functioning 

The functioning of genes is determined by their type,
which can be constituent or regulating. A non-constituent
(regulating) gene is transcribed when the proteins matching
its promoters appear at a certain concentration in
the cellular cytoplasm. The constituent genes, on the other
hand, are expressed during all the "cycles" until
such expression is inhibited by the concentration of the
proteins matching their promoters.

$$\text{Protein Concentration Percentage} \geq (\text{Distance} + 1) \times \text{Activation Percentage} \qquad (1)$$



The activation of the regulating genes or the inhibition
of the constituent genes is achieved if the condition
expressed by Eq. 1 is fulfilled, where Protein Concentration
Percentage represents the cytoplasm concentration
of the protein that is being considered; Distance stands
for the Hamming distance between one promoter and
the considered protein; and Activation Percentage is the
minimal percentage needed for the gene activation that
is encoded in the gene. This equation is tested on each
promoter and each protein. If the condition is fulfilled
for all the promoters, the gene is activated and therefore
transcribed. Accordingly, proteins that are merely similar
to a promoter can also induce transcription if they exist
in a concentration sufficiently higher than the encoded
one, similarly to what happens in biology, which provides
the model with higher flexibility.
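
The test expressed by Eq. 1 can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: the `Gene` class, the function names and the example values are hypothetical, promoters and proteins are assumed to be bit strings of equal length, and a gene without promoters is assumed never to fulfil the condition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gene:
    sequence: str                 # protein encoded by the gene
    promoters: List[str]          # proteins needed for transcription
    constituent: bool             # True: constituent, False: regulating
    activation_percentage: float  # minimal concentration, in percent


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def condition_fulfilled(gene: Gene, concentrations: Dict[str, float]) -> bool:
    """Eq. 1 must hold, for every promoter, for at least one stored protein."""
    def promoter_matched(promoter: str) -> bool:
        return any(
            percent >= (hamming_distance(promoter, protein) + 1)
            * gene.activation_percentage
            for protein, percent in concentrations.items()
        )
    return bool(gene.promoters) and all(
        promoter_matched(promoter) for promoter in gene.promoters
    )


def is_transcribed(gene: Gene, concentrations: Dict[str, float]) -> bool:
    """Regulating genes are activated by the condition, constituent ones inhibited."""
    fulfilled = condition_fulfilled(gene, concentrations)
    return not fulfilled if gene.constituent else fulfilled


regulating = Gene("0001", ["1010"], constituent=False, activation_percentage=10.0)
constituent = Gene("0001", ["1010"], constituent=True, activation_percentage=10.0)
print(is_transcribed(regulating, {"1010": 10.0}))  # exact match at threshold
print(is_transcribed(regulating, {"1011": 15.0}))  # distance 1 needs 20 percent
```

The factor (Distance + 1) means that a protein differing from the promoter in one bit needs twice the encoded concentration, one differing in two bits needs three times as much, and so on.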

After the activation of one of the genes, three things
can happen: the generated protein may be stored in the
cell cytoplasm, it may be communicated to the neighbour
cells, or it may induce cellular division (mitosis)
and/or death (apoptosis). The different events of a
tissue are managed in the cellular model by means of
"cellular cycles". Such "cycles" contain all the
actions that can be carried out by the cells, sometimes
restricting their occurrence. The "cellular cycles" can
be described as follows:

Update of the lifetime of the proteins in the
cytoplasm

Verification of the life status of the cell (cellular
death)

Calculation of the genes that react, and execution of
the special behaviour that may be associated with
them

Communication of proteins to neighbour cells

Solution Search 

A classical approach of EC proposes the use of Genetic 
Algorithms (GA) (Fogel L.J., Owens A. J. & Walsh 
M.A. 1966; Goldberg D.E. 1989; Holland J.H. 1975) 
for the optimisation, in this case, of the values of the 
DNA genes (binary strands). Each individual of the 
GA population will represent a possible DNA strand 
for problem solving. 

In order to calculate the fitness value for every individual
in the GA, i.e. every DNA, the strand is introduced
into an initial cell or zygote. After simulating during a
certain number of cycles, the contained information is
expressed and the characteristics of the resulting tissue
are evaluated by means of various criteria, according
to the goal that is to be achieved.

Figure 2. (Above) Three promoters and a PCS structure;
(Below) Example of GA genes association for the
encoding of cellular genes

The encoding of the individual genes follows a
structure that is similar to the one described in Figure
2 (Above), where the number of promoters of each
gene may vary, but the white and indivisible section
"Activation Percentage - Constituent - Sequence"
(PCS) must always be present. The PCS sections determine
the genes of the individual, and the promoter
sections are associated with the PCS sections, as shown
in Figure 2 (Below).

The search for a set of structures similar to those
shown in Figure 2 required the adaptation of the
crossover and mutation GA operations to this specific
problem. Since the length of the individuals is variable,
the crossover had to be performed according to these
lengths. When an individual is selected, a random percentage
is generated to determine the crossover point
of that individual. After selecting the section in that
position, a crossover point is chosen for the section
selected in the other parent. Once this has been done,
the crossover point selection process is repeated in the
second selected parent in the same position as in the
previous individual. From this stage on, the descendants
are composed in the traditional way, since they
are two strings of bits. A normal bit-string crossover
could be executed, but the previously mentioned steps
guarantee that the descendants are valid solutions for
the DNA strands transformation.

With regard to mutation, it should be mentioned
that the type of a section (promoter or PCS) is
identified by the value of its first bit.
Bearing that in mind, together with the variable length
of individuals, the mutation operation had to be adapted
so that it could modify not only the number of these
sections, but also the value of a given section.

The probability of executing the mutation is usually
low, but this time it also had to be divided among
the three possible mutation operations that the system
contemplates. Various tests showed that the most suitable
distribution of the different mutation
operations, after the selection of a position for mutation,
was the following: in 20% of the cases, a section
(either a promoter or a PCS) is added; in another
20%, the existing section is removed; and finally, in the
remaining 60% of the cases, the value of one of
the bits of the section is randomly changed. The latter
may provoke not only the change of one of the values,
but also the change of the section type: if the bit that
identifies the section type is changed, the information
of that section varies. For instance, if a promoter section
turns into a PCS section, the promoter sequence turns
into the gene sequence, and constitutive and activation
percentage values are generated.
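
The 20% / 20% / 60% split between the three mutation operations can be sketched in Python as follows. This is a simplified illustration, not the authors' operator: individuals are assumed to be lists of bit-string sections of one fixed length, the names `choose_operation`, `random_section` and `mutate` are hypothetical, the last section is never removed (an added safeguard), and the regeneration of fields after a change of section type is omitted.

```python
import random


def choose_operation(draw):
    """Map a uniform draw in [0, 1) to one of the three mutation operations."""
    if draw < 0.2:
        return "add"
    if draw < 0.4:
        return "remove"
    return "flip"


def random_section(rng, length):
    """Generate a random bit-string section of the given length."""
    return "".join(rng.choice("01") for _ in range(length))


def mutate(individual, rng, section_length=8):
    """Return a mutated copy of an individual given as a list of sections."""
    sections = list(individual)
    position = rng.randrange(len(sections))
    operation = choose_operation(rng.random())
    if operation == "add":
        sections.insert(position, random_section(rng, section_length))
    elif operation == "remove":
        if len(sections) > 1:  # assumed safeguard: keep at least one section
            del sections[position]
    else:
        section = sections[position]
        bit = rng.randrange(len(section))
        flipped = "1" if section[bit] == "0" else "0"
        # Flipping the first bit would change the section type.
        sections[position] = section[:bit] + flipped + section[bit + 1:]
    return sections


rng = random.Random(0)
parent = ["01010101", "11110000", "00001111"]
child = mutate(parent, rng)
print(child)
```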

After reaching this development level and presenting
the test set in (Fernandez-Blanco E., Dorado J., Rabunal
J.R., Gestal M. & Pedreira N. 2007), the authors concluded
that the bottleneck of the model turned out to
be the development of the evaluation functions, since
for every new figure the development of the function
was time-consuming and not reusable.

In order to solve this problem, the evaluation function
was developed according to the concept of a correction
template. From the tissue that is developed by the
DNA that is being evaluated, the centroid is calculated.
This point is taken as the center of the solution template,
which is merely a matrix of Boolean values representing
the figure that is aimed at. The template can be (and
usually is) smaller than the development environment
of the tissue, which means that every cell that is not
covered by the template contributes 1.0 to the tissue
error. The remaining tissue, covered by the
template, is compared with it through the NEXOR (XNOR)
Boolean operation in order to obtain the number of
differences between the template and the tissue. Each
difference contributes a value of 1.0 to the tissue error.

Figure 3. Tissue + Template. Example of template
use.

Figure 3 illustrates the use of this method. We can
observe that the error of this tissue with regard to the
template is 2, since a cell was generated that is not
foreseen by the template, whereas another cell that is
present in the template is missing.
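
The template-based evaluation can be sketched in Python as follows. It is a minimal sketch under stated assumptions, not the authors' implementation: the tissue is assumed to be a set of (row, column) cell positions, the centroid is rounded to the nearest grid position, the template is centred on it, and the name `tissue_error` and the example shapes are hypothetical.

```python
def tissue_error(tissue_cells, template):
    """Count mismatches between a tissue and a Boolean template.

    tissue_cells is an iterable of (row, column) positions of existing cells;
    template is a list of rows of Boolean values.
    """
    cells = set(tissue_cells)
    if not cells:
        # Assumed convention: with no tissue, every template cell is missing.
        return float(sum(1 for row in template for value in row if value))
    rows, cols = len(template), len(template[0])
    centroid_row = round(sum(r for r, _ in cells) / len(cells))
    centroid_col = round(sum(c for _, c in cells) / len(cells))
    top = centroid_row - rows // 2
    left = centroid_col - cols // 2
    error = 0.0
    # Cells lying outside the area covered by the template cost 1.0 each.
    for r, c in cells:
        if not (top <= r < top + rows and left <= c < left + cols):
            error += 1.0
    # Inside the template, every disagreement (XNOR is false) costs 1.0.
    for i in range(rows):
        for j in range(cols):
            occupied = (top + i, left + j) in cells
            if occupied != template[i][j]:
                error += 1.0
    return error


# Assumed target figure: a plus sign.
template = [
    [False, True, False],
    [True, True, True],
    [False, True, False],
]
perfect_tissue = {(4, 5), (5, 4), (5, 5), (5, 6), (6, 5)}
# One cell of the figure is missing and one unexpected cell was generated.
faulty_tissue = {(4, 5), (5, 4), (5, 5), (5, 6), (4, 4)}
print(tissue_error(perfect_tissue, template))
print(tissue_error(faulty_tissue, template))
```

Because the template is positioned relative to the centroid, the same evaluation function can be reused for any target figure by changing only the Boolean matrix.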



FUTURE TRENDS 

The model could also include new characteristics such
as the displacement of cells around their environment,
or a specialisation operator that blocks pieces of DNA
during the expression of its descendants, as happens
in the natural model.

Finally, this group is currently working on one of the
possible applications of this model: its use for image
compression, in a way similar to fractal compression.
Fractal compression searches for the parameters of a
fractal formula that itself encodes the starting image.
The present model searches for the gene sequence that
might result in the starting image. In this way, the
template-based method that has been presented in
this paper can be used for performing that search, using
the starting image as the template.



CONCLUSION 

Taking into account the model developed here, we
can say that the use of certain properties of biological
cellular systems is feasible for the creation of artificial
structures that might be used to solve certain
computational problems.

Some behaviours of the biological model have
also been observed in the artificial model: information
redundancy in DNA, stability after achieving the desired
shape, or variability in gene behaviour.

KEY TERMS

Artificial Cell: Each of the elements that process 
the orders codified into the DNA. 

Artificial Embryogeny: The term covers all the
processing models which use biological development
ideas as inspiration for their functioning.

Cellular Cycle: Cellular development time unit
which limits the number of occurrences of certain cellular
development actions.

Cytoplasm: Part of an artificial cell which is responsible
for managing the protein-shaped messages.

DNA: Set of rules which are responsible for the
cell behaviour.

Gene: Each of the rules which codifies one action
of the cell.

Protein: This term identifies every kind of message
that an artificial cell receives.

Zygote: The initial cell from which a tissue is
generated using the DNA information.

INTRODUCTION

During the past several decades, a number of attempts
have been made to contain oil slicks (or any surface
contaminants) in the open sea by means of a floating
barrier. Many of those attempts were not very successful,
especially in the presence of waves and currents. The
relative capabilities of these booms have not been properly
quantified for lack of a standard analysis or testing
procedure (Hudon, 1992). In this regard, more analysis
and experimental programs to identify important boom
effectiveness parameters are needed.

To achieve the desirable performance of floating 
booms in the open sea, it is necessary to investigate 
the static and dynamic responses of individual boom 
sections under the action of waves; this kind of test is 
usually carried out in a wave flume, where open sea 
conditions can be reproduced at a scale. 

Traditional methods use capacitance or conductivity
gauges (Hughes, 1993) to measure the waves. Each
of these gauges only provides the measurement at
one point; further, it is not able to detect the interface
between two or more fluids, such as water and a hydrocarbon.
An additional drawback of conventional
wave gauges is their cost.

Other wave flume experiments, such as velocity measurements,
sand concentration measurements, bed level measurements
or studies of breakwater behaviour, commonly rely on a set
of traditional methods and instruments, ranging from EMF
and ADV for velocity measurements to pressure sensors,
capacity wires, acoustic sensors and echo sounders for
measuring wave height and sand concentration. All these
instruments have an associated error (Van Rijn, Grasmeijer
& Ruessink, 2000) and an associated cost (most of them are
too expensive for many laboratories), they have certain
limitations, and some of them need a long calibration
period.



This paper presents another possibility for wave
flume experiments: computer vision. It uses cheap
and affordable technology (common video cameras
and PCs), it is calibrated automatically (once the
calibration task has been developed), it is a non-intrusive
technology, and its potential uses could cover all kinds
of experiments carried out in wave flumes. It is the
artificial vision programmers who can give computer vision
systems all the possibilities that lie within the visual
field of a video camera. Most experiments conducted in wave
flumes, as well as new ones, can be carried out by programming
computer vision systems. In fact, this paper presents a new
kind of wave flume experiment that could not be done
without artificial vision technology.



BACKGROUND 

Wave flume experiments are highly sensitive to any
perturbation; therefore, the use of non-invasive
measurement methodologies is mandatory if meaningful
measures are desired. In fact, the theoretical and
experimental efforts reported in the literature
have mainly focused
on the equilibrium conditions of the system (Niederoda
and Dalton, 1982; Kawata and Tsuchiya, 1988).

In contrast with most traditional methods used in
wave flume experiments, computer vision systems are
non-invasive, since the camera is situated outside
the tank; in addition, they provide better accuracy than
most traditional instruments.

The present work is part of a European Commission
research project, "Advanced tools to protect the Galician
and Northern Portuguese coast against oil spills
at sea", in which a number of measurements must be
conducted in a wave flume, such as the instantaneous
position of the water surface or the motions (Milgran,
1971) of a floating containment boom. To achieve these
objectives, a non-intrusive method is necessary (due to
the presence of objects inside the tank), and the method
has to be able to differentiate between at least two
different fluids, with the oil slick in view.

Other works using image analysis to measure the surface
wave profile have been developed over the past
ten years (e.g., Erikson and Hanson, 2005; Garcia, Herranz,
Negro, Varela & Flores, 2003; Javidi and Psaltis,
1999; Bonmarin, Rochefort & Bourguel, 1989; Zhang,
1996), but they were developed neither with a real-time
approach nor as non-intrusive methods. In some
of these techniques it is necessary to colour the water
with a fluorescent dye (Erikson and Hanson, 2005),
which is not convenient in most cases, and especially
when two fluids must be used (Flores, Andreatta, Llona
& Saavedra, 1998).



Figure 1. Template for image rectification. Crosses are
equidistant, with a 4 cm separation.




A FRAMEWORK FOR MEASURING 
WAVES LEVEL IN A WAVE FLUME 
WITH ARTIFICIAL VISION TECHNIQUES 

An artificial vision system (Ibanez, Rabunal, Castro,
Dorado, Iglesias & Pazos, 2007) is presented below which
obtains the free surface position at all points of the
image, from which the wave heights can be computed.
To this end we record a wave tank (see laboratory set-up
in section 2) while it is generating waves and currents
(a scale work frame); we then use the frames that make up
the video to obtain the crest of the water (using the
computer vision techniques described in section 3) and
translate the distances in the image into real distances
(taking image rectification into account, see section 1).

Image Rectification 

Lens distortion is an optical error in the lens that causes 
differences in magnification of the object at different 
points on the image; straight lines in the real world 
may appear curved on the image plane (Tsai, 1987). 
Since each lens element is radially symmetric, and the 
elements are typically placed with high precision on the 
same optical axis, this distortion is almost always radi- 
ally symmetric and is referred to as radial lens distortion 
(Ojanen, 1999). There are two kinds of lens distortion: 
barrel distortion and pincushion distortion. Most lenses 
exhibit both properties at different scales. 



To avoid lens distortion error and to provide a tool 
for transforming image distances (number of pixels) 
to real distances (mm) it is necessary to follow a rec- 
tification procedure. 

Most image rectification procedures involve a two-step
process (Ojanen, 1991), (Holland, Holman & Sallenger,
1991): calibration of the intrinsic camera parameters,
and correction for the camera's extrinsic parameters
(i.e., the location and rotation in space).

However, in our case we are only interested in 
transforming pixel measurements into real distances 
(mm). Transforming points from a real world surface 
to a non-coplanar image plane would imply an operator 
which, when applied to all frames, would considerably 
slow down the total process, which is not appropriate 
for our real-time approach. 

So a .NET routine was developed to create a map 
with the corresponding factor (between pixel and 
real distances) for each group of pixels (four nearest 
control points on the target). Inputs to the model are a 
photographed image of the target sheet (see fig. 1), and 
target dimensions (spacing between control points in 
the x- and y-directions). 
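
The idea of a per-cell conversion map can be sketched as follows. This is only an illustrative Python/NumPy sketch, not the authors' .NET routine: the function name `cell_scale_factors`, the array layout of the detected cross centres and the averaging of opposite cell edges are all assumptions made for the example. The 40 mm spacing matches the 4 cm separation stated in the caption of Figure 1.

```python
import numpy as np

def cell_scale_factors(centres_px, spacing_mm):
    """Hypothetical helper: mm-per-pixel factors for each cell of the target.

    centres_px : array of shape (rows, cols, 2) holding the (x, y) pixel
                 coordinates of the detected cross centres (assumed layout).
    spacing_mm : real distance between neighbouring crosses.
    Returns (fx, fy), each of shape (rows - 1, cols - 1): one horizontal and
    one vertical factor per cell bounded by four neighbouring control points.
    """
    c = np.asarray(centres_px, dtype=float)
    # Pixel distance between horizontally / vertically adjacent crosses.
    dx = np.linalg.norm(c[:, 1:] - c[:, :-1], axis=-1)   # (rows, cols - 1)
    dy = np.linalg.norm(c[1:, :] - c[:-1, :], axis=-1)   # (rows - 1, cols)
    # Average the two opposite edges of every cell, then convert to mm/pixel.
    fx = spacing_mm / ((dx[:-1, :] + dx[1:, :]) / 2.0)
    fy = spacing_mm / ((dy[:, :-1] + dy[:, 1:]) / 2.0)
    return fx, fy

# Synthetic, undistorted target: 3 x 4 crosses placed 40 pixels apart.
ys, xs = np.mgrid[0:3, 0:4]
centres = np.stack([xs * 40.0, ys * 40.0], axis=-1)
fx, fy = cell_scale_factors(centres, spacing_mm=40.0)
```

With a real, distorted image the factors would differ from cell to cell, which is what allows a pixel distance measured in a given cell to be converted with the factor of that cell.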

Laboratory Set-Up and Procedure 

The experiment was conducted in a 17.29-m long wave
flume at the Centre of Technological Innovation in
Construction and Civil Engineering (CITEEC), in the
University of A Coruña, Spain. The flume section is
77 cm (height) x 59.6 cm (width). Wave generation is
conducted by means of a piston-type paddle. A wave
absorber is located near its end wall to prevent wave
reflection. It consists of a perforated plate with a length
of 3.04 m, which can be placed at different slopes. The
experimental set-up is shown in fig. 2.

Figure 2. Laboratory set-up diagram (labels: wave generator, tank sidewall)

With the aim of validating the system, solitary waves
were generated and measured on the basis of images
recorded by a video camera mounted laterally, which
captured a flume length of 1 m. The waves were also
measured with one conductivity wave gauge located
within the flume area recorded by the video camera.
These gauges provide an accuracy of ±1 mm at a
maximum sampling frequency of 30 Hz.

A video camera, Sony DCR-HC35E, was used in 
turn to record the waves; it worked on the PAL Western 
Europe standard, with a resolution of 720 x 576 pixels, 
recording 25 frames per second. 

The camera was mounted on a standard tripod and 
positioned approximately 2 m from the sidewall of 
the tank (see fig. 2). It remained fixed throughout the 
duration of a test. The procedure is as follows: 



Place one mark on the glass sidewall of the flume,
on the bottom of the filmed area (see fig. 2).

Place a template with equidistant marks (crosses)
in a vertical plane parallel to the flume sidewall
(see fig. 1).

Position the camera at a distance from the target
plane (i.e., tank sidewall) depending on the desired
resolution.

Adjust the camera taking into account the template's
marks.

Provide uniform and frontal lighting for the template.

Film the template.

Provide uniform lighting on the target plane and
a uniformly colored background on the opposite
sidewall (to block any unwanted objects from the
field of view).

Start filming.

The mark was placed horizontally on the glass
sidewall of the flume, at the bottom of the filmed
area, so that the real distance between the bed
of the tank and this mark is known; this avoids
filming the bed of the tank and thus allows a smaller
area to be filmed (leading to a better resolution).

With regard to the lighting of the laboratory, direct
lighting must be avoided so that we can work without
gleams and glints.

To achieve this kind of lighting, all lights in the
laboratory were turned off and two halogen lamps of
200 W were placed on both sides of the filmed area,
one in front of the other (see fig. 2).



Video Image Post-Processing 

Image capture was carried out on a PC (Pentium 4, 3.00
GHz, 1.00 GB of RAM) running the Windows XP platform.
The footage was filmed with the Sony DCR-HC35E (endnote 1);
a high-speed interface card, IEEE 1394 FireWire,
was used to transfer digital data from the camcorder
to the computer, and the still images were kept in the
uncompressed bitmap format so that information would
not be lost. De-interlacing was not necessary because
of the quality of the obtained images, and everything
was done with a real-time approach.

An automatic tool for measuring waves from
consecutive images was developed. The tool was
developed under the .NET framework, using the C++ language
and the OpenCV library (Open Source Computer Vision
Library, developed by Intel; endnote 2). The computer
vision procedure is as follows:

Extract a frame from the video.

Using different computer vision algorithms, get the constant "pixel to mm" for each pixel.

Using different computer vision algorithms, obtain the crest of the wave.

Work out the corresponding height for all the pixels in the crest of the wave.

Supply results.

Repeat the process until the video finishes.
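
The steps above can be illustrated with a short Python/NumPy sketch. The text does not say which algorithms are used to find the crest, so the intensity threshold below is purely an assumption chosen for illustration, as are the function names `surface_rows` and `heights_mm`, the single constant `mm_per_px` and the synthetic frame. The reference mark on the sidewall is used to turn image rows into heights.

```python
import numpy as np

def surface_rows(gray, threshold):
    """For each image column, return the first row (counted from the top)
    whose intensity is below `threshold`; here that row is taken to be the
    free surface. Thresholding is an assumed stand-in for the detection step."""
    water = gray < threshold
    rows = np.argmax(water, axis=0).astype(float)
    rows[~water.any(axis=0)] = np.nan      # no water found in that column
    return rows

def heights_mm(rows, mark_row, mm_per_px, mark_height_mm):
    """Convert surface rows to heights, using a reference mark located at
    image row `mark_row` and at a known real height `mark_height_mm`."""
    return mark_height_mm + (mark_row - rows) * mm_per_px

# Synthetic frame: bright background, dark water from row 60 downwards.
frame = np.full((100, 5), 255, dtype=np.uint8)
frame[60:, :] = 50
rows = surface_rows(frame, threshold=128)
levels = heights_mm(rows, mark_row=99, mm_per_px=1.0, mark_height_mm=100.0)
```

In a complete tool this would run once per extracted frame, and `mm_per_px` would come from the rectification map rather than being a single constant.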

To obtain the constant "pixel to mm", a
template with equidistant marks (crosses) is placed right
up against the glass sidewall of the tank and is filmed. Then a
C++ routine recognizes the centres of the crosses.

Results 

A comparison of data extracted from video images
with data measured by conventional instruments was
done. The comparisons are not necessarily meant to
validate the procedure, as there are inherent errors
with conventional instruments as well; rather, the
comparisons aim to justify the use of video images as
an alternative method for measuring wave and profile
change data.

Different isolated measurements with the conductivity
gauge were taken while the video camera
was recording. The results from both methods were
then compared.

To measure with the conductivity gauge and the artificial
vision system at the same time, we first recognize the
point on the x-axis (in the recorded video) where the
gauge is situated (a colored mark was pasted around the
gauge to make this task easier). Once the measuring point
of the gauge is known, we create a file with the height
of the wave at this x point for each image in the video.
While the video camera is recording, a file with the
gauge measurements is created. Once we have both
measurement files, we have to determine manually the same
time point in both files (because it is difficult to
start both systems at the same time).
We can then compare both measurements.
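
Because the gauge samples at up to 30 Hz while the camera records 25 frames per second, the two files do not share time stamps even after the common start point has been chosen. One possible way to compare them, sketched here in Python/NumPy, is to interpolate the gauge record at the video frame times. The function name `mean_abs_difference`, the linear interpolation and the synthetic signals are assumptions for illustration; `offset_s` stands for the manually determined time shift.

```python
import numpy as np

def mean_abs_difference(gauge_h, video_h, offset_s, gauge_hz=30.0, video_fps=25.0):
    """Mean absolute difference between video heights and gauge heights.

    offset_s is the (manually chosen) time of the first video frame relative
    to the first gauge sample. Gauge values are linearly interpolated at the
    video frame times; frames outside the gauge record are ignored."""
    gauge_h = np.asarray(gauge_h, dtype=float)
    video_h = np.asarray(video_h, dtype=float)
    gauge_t = np.arange(gauge_h.size) / gauge_hz
    video_t = np.arange(video_h.size) / video_fps + offset_s
    inside = (video_t >= gauge_t[0]) & (video_t <= gauge_t[-1])
    gauge_at_video = np.interp(video_t[inside], gauge_t, gauge_h)
    return float(np.mean(np.abs(video_h[inside] - gauge_at_video)))

# Synthetic example: the same 1 s sine wave seen by both systems,
# with the video reading 1 mm higher than the gauge.
t_gauge = np.arange(60) / 30.0
t_video = np.arange(40) / 25.0
gauge = 50.0 * np.sin(2.0 * np.pi * t_gauge)
video = 50.0 * np.sin(2.0 * np.pi * t_video) + 1.0
error_mm = mean_abs_difference(gauge, video, offset_s=0.0)
```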

Many tests were done with different wave parameters
(wave period and wave height), using regular
(sinusoidal) and irregular waves. Tests were done with
waves between 40 mm and 200 mm in height.

Using the DCR-HC35E camera, figure 3 shows
one example of a test in which the wave was
irregular, with a maximum period of 1 s and a
maximum wave amplitude of 70 mm. Excellent
results were obtained, as can be seen in figure
4, where both measurements (conductivity sensor and
video analysis) are quite similar.

Figure 3. Temporal sequence of measurement by sensor and image analysis

The correlation between sensor and video image 
analysis measurements has an associated mean square 
error of 0.9948. 

In spite of these sources of error, after several tests
the average error between the conductivity sensor
measurements and the video analysis is 0.8 mm with the
DCR-HC35E camera, considerably better than the 5 mm
average error obtained in the best previous work
(Erikson and Hanson, 2005). This figure is not fully
indicative, however, because of the sources of error
mentioned in this study. The estimated real error of the
video analysis system is 1 mm, that is, the equivalence
between one pixel and a real distance: in our case (with
the video camera and distance from the tank described
above) one pixel is equivalent to nearly 1 mm. This error
could be improved with a camera that allows a better
resolution or by focusing on a smaller area.



FUTURE TRENDS 

This is the first part of a bigger system which is capable 
of measuring the motions of a containment boom section 
in the vertical axis and its slope angle (Kim, Muralid- 
haran, Kee, Jonson, & Seymour, 1998). Furthermore 
the system would be capable of making a distinction 
between the water and a contaminant, and thus would 
identify the area occupied by each fluid. 

Another challenge is to test this system in other 
work spaces with different light conditions (i.e., in a 
different wave flume). 



CONCLUSION 

An artificial vision system was developed for these
purposes because a non-intrusive method is necessary,
and such systems are non-intrusive and can separate
many different objects or fluids in the image (anything
that a human eye can differentiate).

Other interesting aspects that these systems provide
are:

Lower price than traditional measurement systems.

Easier and faster calibration.

No infrastructure needs to be mounted to know what
happens at different points of the tank (only one
camera instead of an array of sensors).

As the system is non-intrusive, it does not distort
the experiments or their measurements.

High accuracy.

Finally, this system is an innovative idea: applying
computer vision techniques to civil engineering and,
specifically, to the ports and coasts field. No similar
works have been developed.
KEY TERMS

Color Spaces: (Konstantinos & Anastasios, 2000)
supply a method to specify, sort and handle colors.
These representations correspond to n-dimensional
arrangements of the color sensations (n-component
vectors). Colors are represented by points in these
spaces. There are many color spaces, and all of them
start from the same concept: the trichromatic theory
of the primary colors red, green and blue.

Dilation: The dilation of an image by a structuring
element Y is defined as the maximum value of all the
pixels situated under the structuring element:

$$\delta_Y(f)(x,y) = \max_{(s,t) \in Y} f(x+s,\, y+t).$$

The basic effect of this morphological operator
on a binary image is to gradually enlarge the
boundaries of regions of foreground pixels (i.e., white
pixels, typically). Thus areas of foreground pixels
grow in size while holes within those regions become
smaller.

Erosion: The basic effect of the operator on a binary
image is to shrink the foreground objects.
The erosion at the point (x,y) is the minimum value
of all the points situated under the window defined
by the structuring element Y as it travels
around the image:

$$\varepsilon_Y(f)(x,y) = \min_{(s,t) \in Y} f(x+s,\, y+t).$$
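
The two definitions can be checked numerically with the following Python/NumPy sketch. It assumes a square structuring element of side 2*radius + 1 centred on the pixel and replicates the border values when the window leaves the image; both choices, and the function names, are assumptions of this example rather than part of the definitions.

```python
import numpy as np

def _shifted_stack(f, radius):
    """Stack all copies of f shifted by (s, t) with |s|, |t| <= radius."""
    padded = np.pad(f, radius, mode="edge")   # assumed border handling
    h, w = f.shape
    size = 2 * radius + 1
    return np.stack([padded[s:s + h, t:t + w]
                     for s in range(size) for t in range(size)])

def dilate(f, radius=1):
    """Maximum of f over the structuring element centred at each pixel."""
    return _shifted_stack(f, radius).max(axis=0)

def erode(f, radius=1):
    """Minimum of f over the structuring element centred at each pixel."""
    return _shifted_stack(f, radius).min(axis=0)

# A single white pixel grows into a 3 x 3 block under dilation and
# disappears under erosion.
image = np.zeros((5, 5), dtype=np.uint8)
image[2, 2] = 1
grown = dilate(image)
shrunk = erode(image)
```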

Harris Corner Detector: A popular interest point
detector (Harris and Stephens, 1988) due to its strong
invariance to rotation, scale, illumination variation
and image noise (Schmid, Mohr, & Bauckhage, 2000).
The Harris corner detector is based on the local
auto-correlation function of a signal, where the local
auto-correlation function measures the local changes
of the signal with patches shifted by a small amount
in different directions.

Image Moments: (Hu, 1963; Mukundan and Ramakrishman,
1998) Certain particular weighted
averages (moments) of the image pixels' intensities,
or functions of those moments, usually chosen to have
some attractive property or interpretation. They are
useful to describe objects after segmentation. Simple
properties of the image which are found via image
moments include its area (or total intensity), its
centroid, and information about its orientation.

Morphological Operators: (Haralick and Shapiro,
1992; Vernon, 1991) Mathematical morphology is a 
set-theoretical approach to multi-dimensional digital 
signal or image analysis, based on shape. The signals 
are locally compared with so-called structuring elements 
of arbitrary shape with a reference point. 

Videometrics: (Tsai, 1987) can loosely be defined 
as the use of imaging technology to perform precise 
and reliable measurements of the environment. 



ENDNOTES 

1 http://www.sony.es/view/ShowProduct. 
action?product=DCR-C35E&site=odw_es_ES 
&pageType=Overview&category=CAM+Mini 
DV 

2 http://www.intel.com/technology/computing/ 
opencv/ 
INTRODUCTION

Survival analysis is used when we wish to study the 
occurrence of some event in a population of subjects 
and the time until the event of interest. This time is 
called survival time or failure time. Survival analysis 
is often used in industrial life-testing experiments and 
in clinical follow-up studies. Examples of application 
include: time until failure of a light bulb, time until 
occurrence of an anomaly in an electronic circuit, time 
until relapse of cancer, time until pregnancy. 

In the literature we find many different modeling 
approaches to survival analysis. Conventional para- 
metric models may involve too strict assumptions on 
the distributions of failure times and on the form of 
the influence of the system features on the survival 
time, assumptions which usually extremely simplify 
the experimental evidence, particularly in the case of 
medical data (Cox & Oakes, 1984). In contrast, semi- 
parametric models do not make assumptions on the 
distributions of failures, but instead make assumptions 
on how the system features influence the survival time 
(the usual assumption is the proportionality of hazards); 
furthermore, these models do not usually allow for direct 
estimation of survival times. Finally, non-parametric 
models usually only allow for a qualitative description 
of the data on the population level. 

Neural networks have recently been used for survival
analysis; for a survey on the current use of neural
networks, and some previous attempts at neural network
survival modeling, we refer to (Bakker & Heskes, 1999),
(Biganzoli et al., 1998), (Eleuteri et al., 2003), (Lisboa
et al., 2003), (Neal, 2001), (Ripley & Ripley, 1998),
(Schwarzer et al., 2000).

Neural networks provide efficient parametric es- 
timates of survival functions, and, in principle, the 
capability to give personalised survival predictions. In 
a medical context, such information is valuable both 
to clinicians and patients. It helps clinicians to choose 
appropriate treatment and plan follow-up efficiently. 
Patients at high risk could be followed up more fre- 
quently than those at lower risk in order to channel 
valuable resources to those who need them most. For 
patients, obtaining information about their prognosis 
is also extremely valuable in terms of planning their 
lives and providing care for their dependents. 

In this article we describe a novel neural network 
model aimed at solving the survival analysis problem 
in a continuous time setting; we provide details about 
the Bayesian approach to modeling, and a sample ap- 
plication on real data is shown. 



BACKGROUND 

Let T denote an absolutely continuous positive random
variable, with distribution function P, representing the
time of occurrence of an event. The survival function,
S(t), is defined as:

$$S(t) = \Pr(T > t),$$

that is, the probability of surviving beyond time t. We
shall generally assume that the survival function also
depends on a set of covariates, represented by the vector
x (which can itself be assumed to be a random variable).
An important function related to the survival function
is the hazard rate (Cox & Oakes, 1984), defined as:

$$h_r(t) = \frac{P'(t)}{S(t)},$$

where P' is the density associated to P. The hazard
rate can be interpreted as the instantaneous force of
mortality.
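
As a brief worked illustration, which is not part of the original text, consider a constant-rate failure model with an assumed rate parameter $\lambda > 0$. Applying the definitions above step by step gives a hazard rate that does not depend on time:

```latex
\begin{aligned}
P(t)   &= 1 - e^{-\lambda t}, \\
P'(t)  &= \lambda e^{-\lambda t}, \\
S(t)   &= 1 - P(t) = e^{-\lambda t}, \\
h_r(t) &= \frac{P'(t)}{S(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda .
\end{aligned}
```

In this special case the instantaneous force of mortality is the same at every time t, which is why more flexible hazard models are of interest when the risk changes over time.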

In many survival analysis applications we do not 
directly observe realisations of the random variable T; 
therefore we must deal with a missing data problem. The 
most common form of missingness is right censoring, 
i.e., we observe realisations of the random variable: 

Z=min(T,C), 

where C is a random variable whose distribution is usually
unknown. We shall use a censoring indicator d to
denote whether we have observed an event (d=1) or not
(d=0). It can be shown that inference does not depend
on the distribution of C (Cox & Oakes, 1984).

With the above definitions in mind we can now 
formulate the log-likelihood function necessary for 
statistical inference. We shall omit the details, and only 
report the analytical form: 

$$L = \sum_i \left[ d_i \log h_r(t_i, x_i) - \int_0^{t_i} h_r(u, x_i)\, du \right].$$



For further details, we refer the reader to (Cox & 
Oakes, 1984). 
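
To make the structure of the log-likelihood concrete, the following Python/NumPy sketch evaluates it for right-censored data: each subject contributes the log hazard at its observed time when an event occurred, minus the hazard integrated from 0 to that time. The function names, the trapezoidal integration on a fixed grid and the constant hazard used in the example are assumptions for illustration only.

```python
import numpy as np

def censored_log_likelihood(hazard, times, events, covariates, n_grid=201):
    """Sum over subjects of d_i * log h(t_i, x_i) minus the integral of
    h(u, x_i) for u in [0, t_i], computed with the trapezoidal rule.

    hazard(u, x) must accept an array of times u and return an array."""
    total = 0.0
    for t_i, d_i, x_i in zip(times, events, covariates):
        u = np.linspace(0.0, float(t_i), n_grid)
        h = hazard(u, x_i)
        cumulative = float(np.sum((h[1:] + h[:-1]) * np.diff(u)) / 2.0)
        total += d_i * np.log(h[-1]) - cumulative   # h[-1] is h(t_i, x_i)
    return total

def constant_hazard(u, x):
    """Illustrative hazard that ignores the covariates (assumed rate 0.5)."""
    return np.full(np.shape(u), 0.5)

# Three subjects; the second one is censored (d = 0).
value = censored_log_likelihood(constant_hazard,
                                times=[1.0, 2.0, 3.0],
                                events=[1, 0, 1],
                                covariates=[None, None, None])
```

For a constant hazard the result can be checked by hand: two observed events contribute 2 log(0.5), and the integrated hazard is 0.5 times the total follow-up time of 6.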



CONDITIONAL HAZARD ESTIMATING 
NEURAL NETWORKS 

Neural Network Model 

The neural network model we used is the Multi-Layer 
Perceptron (MLP) (Bishop, 1995): 



$$a(t,x;w) = b_0 + \sum_k v_k\, g\left(u_k^T x + u_{0k}\, t + b_k\right)$$

where g() is a sigmoid function, and $w = \{b_0, v, u, u_0, b\}$
is the set of network parameters. The MLP output
defines an analytical model for the logarithm of the
hazard rate function:

$$a(t,x;w) = \log h_r(t,x)$$

We refer to this continuous time model as Conditional
Hazard Estimating Neural Network (CHENN).
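
A forward pass of a network of this form can be sketched in a few lines of Python/NumPy. The logistic function is used here as one possible sigmoid, and the argument names and array shapes are assumptions of this example, not a description of the authors' implementation.

```python
import numpy as np

def log_hazard(t, x, b0, v, U, u0, b):
    """One-hidden-layer network output a(t, x; w), read as the log hazard.

    U  : (K, d) weights from the d covariates to the K hidden units
    u0 : (K,)   weights from the time input to the hidden units
    b  : (K,)   hidden biases
    v  : (K,)   output weights, b0 : output bias
    """
    pre_activation = U @ x + u0 * t + b
    hidden = 1.0 / (1.0 + np.exp(-pre_activation))   # logistic sigmoid (assumed)
    return b0 + v @ hidden

# Tiny example with K = 2 hidden units and d = 3 covariates.
x = np.array([0.2, -1.0, 0.5])
a = log_hazard(t=1.5, x=x, b0=0.1, v=np.ones(2),
               U=np.zeros((2, 3)), u0=np.zeros(2), b=np.zeros(2))
hazard = np.exp(a)   # the hazard is recovered by exponentiating the output
```

Modelling the logarithm keeps the hazard positive for any value of the weights, since the exponential of a real number is always positive.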

Bayesian Learning of the Network 
Parameters 

The Bayesian learning framework offers several 
advantages over maximum likelihood methods com- 
monly used in neural network learning (Bishop, 1995), 
(MacKay, 1992), among which the most important are 
automatic regularization and estimation of error bars 
on predictions. 

In the conventional maximum likelihood approach
to training, a single weight vector is found, which
minimizes the error function; in contrast, the Bayesian
scheme considers a probability distribution over weights
w. This is described by a prior distribution p(w) which
is modified when we observe a dataset D. This process
can be expressed by Bayes' theorem:

$$p(w|D) = \frac{p(D|w)\, p(w)}{p(D)}$$



To evaluate the posterior distribution, we need
expressions for the likelihood p(D|w) (which we have
already shown) and for the prior p(w).

The prior over weights should reflect the knowledge, 
if any, we have about the mapping we want to build. 
In our case, we expect the function to be very smooth, 
so an appropriate prior might be: 



$$p(w) \propto \exp\left(-\frac{1}{2}\sum_k \alpha_k\, w_k^T w_k\right)$$

which is a multivariate normal density with zero mean and
diagonal covariance matrix with elements $1/\alpha_k$. In this
way, weights centered on zero have higher probability,
a fact which encourages very smooth functions.

Note that the prior is parametric, and the regularization
parameters $\alpha_k$ (which are inverse variances)
are called hyperparameters, because they control the
distribution of the network parameters.

Note also that the prior is specialized for different 
groups of weights by using different regularization 
parameters for each group; this is done to preserve 
the scaling properties of network mappings (Bishop, 
1995), (MacKay, 1992). This prior is called Automatic 
Relevance Determination (ARD). This scheme defines 
a model whose prior over the parameters embodies the 
concept of relevance, so that the model is effectively 
able to infer which parameters are relevant based on the 
training data and then switch the others off (or at least 
reduce their influence on the overall mapping). 

The ARD modeling scheme in the case of the 
CHENN model defines weight groups for the inputs, 
the output layer weights, and the biases. 

Once the expressions for the prior and the noise
model are given, we can evaluate the posterior:

$$p(w|D) = \frac{1}{Z} \exp\left(L - \frac{1}{2}\sum_k \alpha_k\, w_k^T w_k\right)$$



This distribution is usually very complex and mul- 
timodal (reflecting the nature of the underlying error 
function, the term -L); and the determination of the 
normalization factor (also called the evidence) is very 
difficult. Furthermore, the hyperparameters must be 
integrated out, since they are only used to determine 
the form of the distributions. 

A solution is to integrate out the parameters sepa- 
rately from the hyperparameters, by making a Gaussian 
approximation; then, searching for the mode with re- 
spect to the hyperparameters (Bishop, 1995), (MacKay, 
1992). This procedure gives a good estimation of the 
probability mass attached to the posterior, in particular 
for distributions over high-dimensional spaces, which 
is the case for large networks. 

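The Gaussian approximation is in practice derived
by finding a maximum of the posterior distribution,
and then evaluating the curvature of the distribution
around the maximum:

$$w_{MP} = \arg\max_w \left[ L - \frac{1}{2}\sum_k \alpha_k\, w_k^T w_k \right],$$

$$A = \nabla\nabla \left( -L + \frac{1}{2}\sum_k \alpha_k\, w_k^T w_k \right)\Big|_{w = w_{MP}}.$$

The resulting approximation is:

$$p(w|D) \approx \frac{1}{Z_{MP}} \exp\left( -\frac{1}{2} (w - w_{MP})^T A\, (w - w_{MP}) \right)$$

where the normalisation constant is simply evaluated
from usual multivariate normal formulas.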

The hyperparameters are calculated by finding the
maximum of the approximate evidence $Z_{MP}$. Alternate
maximization (by using a nonlinear optimization
algorithm) of the posterior and evidence is repeated until a
self-consistent solution $\{w_{MP}, \alpha_k\}$ is found.

The full Bayesian treatment of inference implies that
we do not simply get a pointwise prediction for
functions f(x,t;w) of a model output, but a full distribution.
Such predictive distributions have the form:

$$p(f(x,t)\,|\,D) = \int f(x,t\,|\,w)\, p(w\,|\,D)\, dw.$$

The above integrals are in general not analytically
tractable, even when the posterior distribution over the
parameters is Gaussian. However, it is usually enough
to find the moments of the predictive distribution, in
particular its mean and variance. A useful approximation
is given by the delta method. Let f(w) be the function
(of w) we wish to approximate. By Taylor expanding
to first order around $w_{MP}$, we can write:

$$f(w) \approx f(w_{MP}) + (w - w_{MP})^T\, \nabla_w f(w)\big|_{w = w_{MP}}.$$

Since this is a linear function of w, it will still be
normally distributed under the Gaussian posterior, with
mean and variance:

$$E[f(w)] = f(w_{MP}), \qquad \mathrm{Var}[f(w)] = \nabla_w f^T A^{-1}\, \nabla_w f.$$

Error bars are simply obtained by taking the square 
root of the variance. We emphasize that it is impor- 
tant to evaluate first and second order information 
to understand the overall quality and reliability of a 
model's predictions. Error bars also provide hints on 
the distribution of the patterns (Williams et al., 1995) 
and can therefore be useful to understand whether a 
model is extrapolating its predictions. Furthermore, 
they can offer suggestions for the collection of future 
data (Williams et al., 1995; MacKay, 1992). 
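
The delta-method calculation of error bars can be sketched as follows in Python/NumPy. The gradient is approximated here by central finite differences, which is an assumption of this example (an analytical gradient would normally be preferred), and the function name, the step size and the toy values of the mode and of the matrix A are hypothetical.

```python
import numpy as np

def delta_method(f, w_mp, A, eps=1e-6):
    """Mean and variance of f(w) under a Gaussian posterior with mode w_mp
    and precision (Hessian) matrix A, using a first-order expansion."""
    w_mp = np.asarray(w_mp, dtype=float)
    grad = np.empty_like(w_mp)
    for i in range(w_mp.size):
        step = np.zeros_like(w_mp)
        step[i] = eps
        grad[i] = (f(w_mp + step) - f(w_mp - step)) / (2.0 * eps)
    mean = f(w_mp)
    # grad^T A^{-1} grad, computed without forming the inverse explicitly.
    variance = float(grad @ np.linalg.solve(A, grad))
    return mean, variance

# Toy example with two parameters and a diagonal Hessian.
f = lambda w: 2.0 * w[0] + 3.0 * w[1]
mean, variance = delta_method(f, w_mp=[1.0, 1.0], A=np.diag([4.0, 9.0]))
error_bar = np.sqrt(variance)
```

Here the gradient is (2, 3) and the inverse of A is diag(1/4, 1/9), so the variance is 4/4 + 9/9 = 2.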
A Case Study: Ocular Melanoma

We show now an application of the CHENN model to the 
prognosis of all-cause mortality in ocular melanoma. 

Intraocular melanoma occurs in a pigmented tis- 
sue called the uvea, with more than 90% of tumours 
involving the choroid, beneath the retina. About 50% 
of patients die of metastatic disease, which usually 
involves the liver. 

Estimates for survival after treatment of uveal
melanoma are mostly derived and reported using Cox
analysis and Kaplan-Meier (KM) survival curves (Cox
& Oakes, 1984). As a semiparametric model, the Cox method,
however, usually utilizes linear relationships between
variables, and the proportionality of risks is always
assumed. It is therefore worth exploring the capability of
nonlinear models, which do not make any assumptions
about the proportionality of risks.

The data used to test the model were selected from
the database of the Liverpool Ocular Oncology Centre
(Taktak et al., 2004). The dataset was split into two parts,
one for training (1823 patterns), the other one for test
(781 patterns). Nine prognostic factors were used: Sex,
Tumour margin, Largest Ultrasound Basal Diameter,
Extraocular extension, Presence of epithelioid cells,
Presence of closed loops, Mitotic rate, Monosomy of
chromosome 3.

The performance of survival analysis models can in 
general be assessed according to their discrimination 
and calibration aspects. Discrimination is the ability of 
the model to separate correctly the subjects into differ- 
ent groups. Calibration is the degree of correspondence 
between the estimated probability produced by the 
model and the actual observed probability (Dreiseitl & 
Ohno-Machado, 2002). One of the most widely used 
methods for assessing discrimination in survival analy- 
sis is Harrell's C index (Dreiseitl & Ohno-Machado, 
2002), (Harrell et al. 1982), an extension to survival 
analysis of the Area Under the Receiver Operator 
Characteristic (AUROC). Calibration is assessed by a 
Kolmogorov-Smirnov (KS) goodness-of-fit test with 
corrections for censoring (Koziol, 1980). 
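
The basic idea behind Harrell's C index can be sketched in plain Python: among all pairs of subjects whose ordering in time is known despite censoring, count the fraction for which the subject who failed earlier was assigned the higher predicted risk. This is a minimal illustration with an assumed function name and toy data; it is not the time-specific evaluation used in the case study.

```python
def harrell_c(times, events, risk):
    """Fraction of usable pairs that are concordant (ties in risk count 0.5).

    A pair (i, j) is usable when subject i had an observed event
    (events[i] == 1) at a time strictly earlier than times[j]."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs: the index is undefined")
    return concordant / usable

# Toy data: the third subject is censored; risk decreases with survival time.
c_perfect = harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [4.0, 3.0, 2.0, 1.0])
c_reversed = harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [1.0, 2.0, 3.0, 4.0])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect discrimination, which gives a scale for reading values such as those reported below.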

The C index was evaluated for a set of years that are
of interest to applications, from 1 to 7. The minimum
was achieved at 7 years (0.75), the maximum at 1
year (0.8). The KS test with corrections for censoring
was applied for the above set of years, and up to the
maximum uncensored time (16.8 years); the significance
level was set as usual at 0.05. The null hypothesis that
the modeled distributions follow the empirical estimate
cannot be rejected for years 1 to 7, whereas it is rejected
if we compare the distributions up to 16.8 years; the
null hypothesis is always rejected for the Cox model.



FUTURE TRENDS 

Neural networks are very flexible modelling tools, 
and in the context of survival analysis they can offer 
advantages with respect to the (usually linear) model- 
ling approaches commonly found in literature. This 
flexibility, however, comes at a cost: computational 
time and difficulty of interpretation of the model. The 
first aspect is due to the typically large number of 
parameters which characterise moderately complex 
networks, and the fact that the learning process results 
in a nonconvex, nonlinear optimization problem. 

The second aspect is in some way a result of the 
nonlinearity and nonconvexity of the model. Address- 
ing the issue of nonconvexity may be the first step to 
obtain models which can be easily interpreted in terms 
of their parameters, and easier to train; and in this re- 
spect, kernel machines (like Support Vector Machines) 
might be considered as the next step in flexible non- 
linear modelling, although the formulation of learning 
algorithms for these models follows a paradigm which 
is not based on likelihood functions, and therefore their 
application to survival data is not immediate. 



CONCLUSION 

This article proposes a new neural network model for 
survival analysis in a continuous time setting, which 
approximates the logarithm of the hazard rate function. 
The model formulation allows an easy derivation of error 
bars on both hazard rate and survival predictions. The 
model is trained in the Bayesian framework to increase 
its robustness and to reduce the risk of overfitting. The 
model has been tested on real data, to predict survival 
from intraocular melanoma. 

Formal discrimination and calibration tests have 
been performed, and the model shows good performance 
within a time horizon of 7 years, which is found useful 
for the application at hand. 

This project has been funded by the Biopattern
Network of Excellence FP6/2002/IST/1; proposal N.
IST-2002-508803; Project full title: Computational
eHealthcare; URL: www.biopattern.org
KEY TERMS

Bayesian Inference: Inference rules which are 
based on application of Bayes' theorem and the basic 
laws of probability calculus. 

Censoring: Mechanism which precludes observa- 
tion of an event. A form of missing data. 

Hyperparameter: Parameter in a hierarchical 
problem formulation. In Bayesian inference, the pa- 
rameters of a prior. 

Neural Networks: A graphical representation of a 
nonlinear function. Usually represented as a directed 
acyclic graph. Neural networks can be trained to find 
nonlinear relationships in data, and are used in ap- 
plications such as robotics, speech recognition, signal 
processing or medical diagnosis. 

Posterior Distribution: Probabilistic representa- 
tion of knowledge, resulting from combination of prior 
knowledge and observation of data. 

Prior Distribution: Probabilistic representation of 
prior knowledge. 

Random Variable: Measurable function from 
a sample space to the measurable space of possible 
values of the variable. 
Survival Analysis: Statistical analysis of data
represented in terms of realisation of point events. In 
medical applications usually the point event is the death 
of an individual, or recurrence of a disease. 

INTRODUCTION

Configuring means selecting and bringing together a 
set of given components to produce an aggregate (or 
a set of aggregates) satisfying some requirements. 
All the component types are predefined and no new 
component type may be created during the configura- 
tion process. 

The result of the configuration can be physical ob- 
jects (such as cars or elevators), non-physical entities 
(such as compound services or processes) or heteroge- 
neous wholes made of both physical and non-physical 
parts (such as computer systems with their hardware 
and software components). 

The configuration process has to take into consider- 
ation both endogenous and exogenous constraints: the 
former pertain to the type of the assembled object(s) 
(therefore they hold for all the individuals of that 
type) and mainly come from the interactions among 
components, whereas the latter usually represent re- 
quirements that the final aggregate(s) should satisfy. 
All these constraints can be very complex and make 
the manual solution of configuration problems a very 
hard task in many cases. 

The complexity of configuration and its relevance 
in several application domains have stimulated the in- 
terest in its automation. Since the beginning, Artificial 
Intelligence has provided various effective techniques 
to achieve this goal. One of the first configurators was 
also one of the first commercially successful expert 
systems: a production-rule-based system called R1
(McDermott, 1982, 1993). R1 was developed in the
early Eighties to configure VAX computer systems, and
it was used for several years by Digital Equipment
Corporation.

Since then, configuration has gained importance
both in industry and in marketing, also due to both the
support that it offers to the mass customization business
strategy and the new commercial opportunities
provided by the Web. Configuration is currently an
important application field for many Artificial Intelligence
techniques and it is still posing many interesting
problems to scientific research.



BACKGROUND 

The increasing complexity and size of configurable
products made it clear that production-rule-based
configurators such as R1 are not effective, particularly
when the knowledge base has to be maintained. In
fact, changing a rule may require, as a side effect,
changing several other rules, and so on; moreover,
for some products, the component library may change
frequently.

To partly address this problem, in current configura- 
tor systems, domain knowledge and control knowledge 
for problem solving are separate. The domain knowl- 
edge is represented in a declarative language, and the 
control knowledge (i.e., inferential mechanisms) is 
general (i.e., not depending on the particular problem 
to be solved). This is a common approach in modern 
knowledge-based systems. A configurator is based on 
an explicit representation of the general model of the 
configurable entities, which implicitly represents all the 
valid product individuals. The reasoning mechanisms 
implement the control knowledge and they use the 
domain knowledge to draw inferences and to compute 
configurations. 

Regarding domain knowledge, there is a general 
agreement about what the concepts to represent are. In 
(Soininen, Tiihonen, Mannisto & Sulonen, 1998) the 
authors introduce a widely accepted conceptualization 
for configuration problems. This conceptualization 
includes the concepts of 

- components, which are the constituents of configurations;
- parts, to describe the compositional structure;
- ports, to model connections and compatibilities between components;
- resources, which are produced, used or consumed by components;
- functions, to represent functionalities;
- attributes, used to describe components, ports, resources and functions;
- taxonomies, in which component, port, resource and function types may be organized;
- constraints, to specify conditions that configurations must satisfy.

Figure 1. A fragment of a PC configuration knowledge base

Constraints

- In any PC, if there is any SCSI device, then there must be either a SCSI Main Printed Circuit Board or
a SCSI Controller

- In any PC, there must be no more than four EIDE devices and no more than fifteen SCSI devices

- In any PC, the total hard disk space required by all the Operating Systems must be less than the size
of hard disks

- In any PC, the RAM required by each Operating System must be less than the available RAM amount

- In any Motherboard, there cannot be both a SCSI Main Printed Circuit Board and a SCSI Controller

Figure 1 depicts a simplified fragment of the domain
knowledge for PC configuration. It describes all the
PC variants valid for the domain. Has-part relations
model the compositional structure of PCs (e.g., each
PC has one or two monitors, a motherboard, etc.).
Each component of a PC can be either of a basic
(non-configurable) type (e.g., the monitor) or of an aggregate
(possibly configurable) type (e.g., the motherboard).
Some relevant taxonomic relations are reported (e.g.,
the hard disks are either SCSI or EIDE). The basic
components can be connected through ports (only a
few ports are reported): each port connects with at
most one other port; for some ports the connection is
optional (e.g., for eide_port), for others it is mandatory
(e.g., for device_eide_port). Some attributes (e.g., the
price) describe the components. A set of constraints
models the interactions among the components: e.g.,
the third constraint specifies that hard disks must provide
enough space, which is a resource consumed by
operating systems.
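As an illustration only, the first three constraints of Figure 1 can be written as executable checks over a candidate configuration. The sketch below is in Python; the dictionary keys (`scsi_devices`, `eide_devices`, `has_scsi_board_or_controller`, `os_disk_gb`, `hard_disk_gb`) and the sample values are hypothetical names chosen for this example and do not come from the knowledge base itself.

```python
def violated_constraints(pc):
    """Return the Figure 1 constraints (first three) violated by a PC description.

    `pc` is a plain dict; its keys are assumptions made for this sketch.
    """
    violations = []
    # Constraint 1: any SCSI device needs a SCSI board or a SCSI controller.
    if pc["scsi_devices"] > 0 and not pc["has_scsi_board_or_controller"]:
        violations.append("SCSI device without SCSI board or controller")
    # Constraint 2: at most four EIDE devices and at most fifteen SCSI devices.
    if pc["eide_devices"] > 4:
        violations.append("more than four EIDE devices")
    if pc["scsi_devices"] > 15:
        violations.append("more than fifteen SCSI devices")
    # Constraint 3: disk space is a resource consumed by operating systems
    # and produced by hard disks; the demand must stay below the supply.
    if sum(pc["os_disk_gb"]) >= sum(pc["hard_disk_gb"]):
        violations.append("not enough hard disk space")
    return violations


valid_pc = {
    "scsi_devices": 2,
    "eide_devices": 1,
    "has_scsi_board_or_controller": True,
    "os_disk_gb": [20, 30],
    "hard_disk_gb": [80, 80],
}
invalid_pc = dict(valid_pc, has_scsi_board_or_controller=False, eide_devices=5)
```

A real configurator would not check a finished configuration in this way; it would use such constraints during search to prune invalid choices early.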

Figure 3 describes a particular PC variant, meeting 
the requirements stated in Figure 2 (containing also an 
optimization criterion on price). 



Figure 2. An example of user requirements for a PC 



Requirements 



The PC should have: 

- two SCSI Hard Disks and at least 160 GB of Hard Disk 

- at least 1 GB RAM

- a Wireless Keyboard 

- the cheapest price 



Figure 3. A configured PC, compliant with the domain knowledge in Figure 1 and meeting the requirements in 
Figure 2 

AUTOMATIC CONFIGURATION

Despite the consensus over the conceptualization of 
the problem, there is a wide range of approaches to 
configuration, with reference to different paradigms 
of Artificial Intelligence. It is possible to identify two 
mainstreams in current approaches to configuration: 
namely, constraint-based frameworks and logic-based 
frameworks. Constraint-based frameworks emphasize 
combinatorial aspects of configuration problems which 
have large search spaces and few solutions, while logic- 
based frameworks, in general, stress the description of 
the compositional structure of the product. 

As regards constraint-based frameworks, ap- 
proaches based on Constraint Satisfaction Problem 
(CSP) (Dechter, 2003) and its extensions are widely 
adopted. In particular, the classical CSP paradigm has 
been extended to overcome some of its limitations. In 
fact, on the one hand, in classical CSP, the set of vari- 
ables is fixed and they will all be assigned values in 
every solution. On the other hand, in the configuration 
task the number and the types of the components that 
will be part of the final valid configuration are usually 
not known in advance, since they are selected by the 
configurator during the configuration process. 

This fact motivated the introduction of the Dynamic CSP
(DCSP) paradigm (Mittal & Falkenhainer, 1990) (Gelle & Sabin,
2006), also known as Conditional CSP. A
DCSP is defined, like a classical CSP, on a fixed set of
variables but, differently from a classical CSP, during
the problem solving phase it only takes into account
the subset of variables relevant to the solution (i.e., 
the active variables). DCSP formalizes the notion of a 
particular type of constraint, i.e., activity constraints, 
which can add or remove variables from a potential 
solution depending on conditions imposed on already 
active variables. The search process starts with an ini- 
tial set of active variables, and additional variables are 
introduced (or explicitly left out) as search progresses, 
depending on satisfied activity constraints. 
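The role of activity constraints can be sketched in a few lines of Python. This is a simplified illustration, not the formalization of the cited papers: it models only constraints that add variables, and the triple layout `(trigger variable, condition, activated variable)` as well as the variable names are assumptions made for this example.

```python
def compute_active_variables(initial_active, activity_constraints, assignment):
    """Return the variables that are active under a partial assignment.

    Each activity constraint is a triple (trigger_var, condition, activated_var):
    when trigger_var is active, assigned, and its value satisfies `condition`,
    activated_var becomes part of the problem.
    """
    active = set(initial_active)
    changed = True
    while changed:  # repeat until no further variable gets activated
        changed = False
        for trigger_var, condition, activated_var in activity_constraints:
            if (trigger_var in active and trigger_var in assignment
                    and condition(assignment[trigger_var])
                    and activated_var not in active):
                active.add(activated_var)
                changed = True
    return active


# Hypothetical PC example: a SCSI disk activates the controller variable,
# and an external controller in turn activates a slot variable.
activity_constraints = [
    ("disk_type", lambda value: value == "SCSI", "scsi_controller"),
    ("scsi_controller", lambda value: value == "external", "controller_slot"),
]
```

With `{"disk_type": "EIDE"}` only the initial variable stays active, whereas `{"disk_type": "SCSI"}` brings `scsi_controller` into the problem, so the solver has to assign it as well.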

A generalization of the original DCSP proposal, as 
well as some results on complexity and expressiveness, 
are presented in (Soininen, Gelle & Niemela, 1999). 
Several solving methods for DCSP are described and 
discussed in (Gelle & Sabin, 2003). 

In (Stumptner & Haselbock, 1993) the authors 
further extend DCSP by introducing the Generative 
CSP. In Generative CSP the types of the components 
may be compactly described and managed; moreover, 



generic constraints are defined: these are constraint 
schemata which can be instantiated on the specific 
variables activated at a particular point in the configu- 
ration process. 

In (Sabin & Freuder, 1996) the authors overcome
a second major limitation of CSP with regard to configuration
problems: in fact, CSP is "flat", i.e., it does
not allow the structure of a configurable product to be
represented in a straightforward way. To overcome this
limitation, Sabin and Freuder propose Composite CSP, 
an extension to CSP which allows one to take into ac- 
count not only changing sets of components, but also 
the hierarchical structure of the final configurations. 
In Composite CSP variables take not only atomic val- 
ues but also values representing entire subproblems. 
Whenever a variable is assigned with a subproblem 
value, the subproblem is "expanded" and the problem 
is dynamically modified: specifically, it is "refined" by 
considering also the variables and the constraints in the 
subproblem. In such a way, it is easy to adapt the CSP's 
inferential mechanisms to Composite CSP. 

Classical CSP itself also plays an important role
in configuration. In fact, in (Aldanondo, Moynard &
Hamou, 2000) an approach is presented that uses standard
CSP techniques to solve configuration problems.
Moreover, several results in the configuration research
field refer to the standard CSP framework. For
example, in (Freuder, Likitvivatanavong & Wallace,
2001) the authors explore the problem of generating
explanations for configuration problems expressed
as CSP. In (Freuder & O'Sullivan, 2001) the authors
propose an approach for dealing with configuration 
problems expressed as CSP where it is not possible to 
satisfy all user requirements at the same time, and it is 
necessary to establish a satisfactory trade-off between 
them. (Amilhastre, Fargier & Marquis, 2002) extends 
CSP to offer support for interactive problem solving as 
in the case of interactive product configuration, where 
the interactivity refers to the user making choices during 
the configuration process. Specifically, the approach 
provides the user with features such as consistency 
maintenance (i.e., inconsistencies are discovered as soon 
as possible), consistency restoration (i.e., guidance for 
relaxing inconsistent choices) and explanations (i.e., 
minimal sets of inconsistent choices are identified). 
Finally, (Freuder, Carchrae & Beck, 2003) describes 
an approach for removing values of variables in a CSP 
that would lead to a dead-end in solving the CSP. 







As regards logic-based frameworks, (McGuinness,
2002) analyzes Description Logics (DL) (Baader,
Calvanese, McGuinness, Nardi & Patel-Schneider,
2003) as a convenient modeling tool for configurable
products. DL make possible a description of the configu-
ration knowledge by means of expressive conceptual 
languages with a rigorous semantics, thus enhancing 
knowledge comprehensibility and facilitating knowl- 
edge re-use. Furthermore, the powerful inference 
mechanisms currently available can be exploited both 
off-line by the knowledge engineers and on-line by the 
configuration system. Moreover, the paper describes a 
commercial DL-based family of configurators devel- 
oped by AT&T. 

(Soininen, Niemela, Tiihonen & Sulonen, 2000) 
describes an approach in which the domain knowledge 
is represented with a high-level language and then 
mapped to a set of weight constraint rules, a form of 
logic programs offering support for expressing choices 
and both cardinality and resource constraints. Con- 
figurations are computed by finding stable Herbrand 
models of such a logic program. 

(Sinz, Kaiser & Küchlin, 2003) presents an approach
particularly geared to industrial contexts (in fact, it has
been developed to be used by DaimlerChrysler for 
the configuration of their Mercedes lines). In Sinz et 
al.'s approach the domain knowledge is expressed as 
formulae in propositional logic; then, it is validated by 
running a satisfiability checker, which can also provide 
explanations in case of failure. However, this work aims 
at validating the knowledge base, rather than solving 
configuration problems. 

There are also hybrid approaches that recon- 
cile constraint-based frameworks and logic-based 
frameworks. For example, both (Magro & Torasso, 
2003) and (Junker & Mailharro, 2003) describe hybrid 
frameworks based on a logic-based description of the 
structure of the configurable product (taking inspira- 
tion from logical languages derived from frame-based 
languages such as the DL) and on a constraint-based 
description of the possible ways of interaction between 
components. 

In (Junker & Mailharro, 2003) constructs of DL are
translated into concepts of constraint programming in
order to solve a configuration problem. In contrast,
(Magro & Torasso, 2003) adopts an inference
mechanism specific for configuration, which, basically,
searches for tree-structured models on finite domains
for conceptual descriptions, and adapts some constraint-propagation
techniques to the logical framework.

In most formalizations, the configuration task is 
theoretically intractable (at least NP-hard, in the worst 
case) and in some cases the intractability does appear 
also in practice and solving configuration problems 
can require a huge amount of CPU time. There are 
several ways that can be explored to cope with these 
situations: providing the configurator with a set of 
domain-specific heuristics, defining general focusing 
mechanisms (Magro & Torasso, 2001), making use of 
compilation techniques (Sinz, 2002) (Narodytska & 
Walsh, 2006), re-using past solutions (Geneste & Ruet, 
2002), defining techniques to decompose a problem 
into a set of simpler subproblems (Magro, Torasso & 
Anselma, 2002) (Anselma & Magro, 2003). 

Configuration has a growing commercial market. In 
fact, several configurator systems have been developed 
and some commercial tools are currently available (e.g., 
ILOG (Junker & Mailharro, 2003), Koalog, OfferIt!
(Bergenti, 2004), Oracle, SAP (Haag, 2005), TACTON 
(Orsvarn, 2005)). 

Furthermore, some Web sites have been equipped 
with configuration capabilities to support custom- 
ers in selecting a suitable product in a wide range of 
domains such as cars (e.g., Porsche, Renault, Volvo), 
bikes (e.g., Pro-M Bike Configurator) and computers 
(e.g., Dell, Cisco). 



FUTURE TRENDS 

Many current configuration approaches and software 
configuration systems concern the configuration of 
mechanical or electronic devices/products and are 
conceived in order to be employed by domain experts, 
such as production or sales engineers. 

Nowadays, the scope of configuration is growing and 
the application of automatic configuration techniques 
to non-physical entities is gaining more and more 
importance. The configuration of software products 
and complex services built on simpler ones are two 
research areas and application domains that are cur- 
rently attracting the attention of researchers. 

The ability to produce understandable explanations
for their choices or for the inconsistencies that they
encounter, and to suggest consistency restorations,
are needs that configuration systems share with
many knowledge-based or expert systems. However,
the aim of making configuration systems profitably
usable by non-expert users too and the deployment of
configurators on the Web both contribute to strengthening
the importance of these issues.

Explanations and restorations are also related to the 
topic of interactive configuration, which is still posing 
some challenging problems to researchers. Indeed, 
besides these capabilities, interactive configuration 
requires also effective mechanisms to deal with incom- 
plete and/or incremental requirements specification 
(and with their retraction) and it is also demanding in 
terms of efficiency of the algorithms. 

Real-world configuration knowledge bases can be 
very large and they usually are continually modified dur- 
ing their life cycle. Some research efforts are currently 
devoted to define powerful techniques and to design and 
implement tools that support knowledge acquisition, 
knowledge-base verification and maintenance. 

Furthermore, a closer integration of configuration 
into the business models and of configurators into 
enterprise software systems is an important goal for 
several companies (as well as for enterprise software 
providers). 

Distributed configuration is another important topic, 
especially in an environment where a specific complex 
product/service is provided by different suppliers that 
have to cooperate in order to produce it. 

Finally, it is worth mentioning the reconfiguration 
of existing systems, which is still mainly an open 
problem. 



CONCLUSION 

Configuration has been a prominent area of Artificial 
Intelligence since the early Eighties, when it started to 
arouse interest among researchers working in academia 
and industry. 

This article provides a general overview of the area 
of configuration by introducing the problem of configu- 
ration, briefly presenting a general conceptualization 
of configuration tasks, and succinctly describing some 
representative proposals in literature to deal with con- 
figuration problems. 

As we have illustrated, during the last few years
several approaches involving configuration techniques
have been successfully applied in order to deal with
issues pertaining to a wide range of real-world application
domains, ranging from cars to computer systems,
from software to travel plans.

Theoretical results achieved by the academic en- 
vironment have found effective, tangible applications 
in industrial settings, thus contributing to the diffusion 
of both industrial and commercial configurators. Such 
applications - in their turn - gave rise to new challenges, 
engendering a significant cross-fertilization of ideas 
among researchers in academia and in industry. 

KEY TERMS

Constraint Satisfaction Problem (CSP): A CSP is
defined by a finite set of variables, where each variable is
associated with a domain, and a set of constraints over a 
subset of variables, restricting the possible combinations 
of values that the variables in the subset may assume. 
A solution of a CSP is an assignment of a value to each 
variable that is consistent with the constraints. 

Description Logics (DL): Logics that are designed 
to describe concepts and individuals in knowledge 
bases. They were initially developed to provide a precise 
semantics for the frame systems and the semantic
networks. The typical inference for concepts is checking if
a concept is more general than (i.e., subsumes) another 
one. The typical inference for individuals is checking 
if an individual is an instance of a concept. Many DL 
are fragments of first-order logic, while some of them 
go beyond first order. 

Logic Program: A logic theory (possibly contain- 
ing some extra-logic operators) that can be given a 
procedural meaning such that the process of checking 
if a formula is derivable in the theory can be viewed 
as a program execution. 

Mass Customization: A business strategy that 
combines the mass production paradigm with product 
personalization. It is closely related to the modularity 
in product design. This design strategy makes it pos- 
sible to adopt the mass production model for standard 
modules, facilitates the management of product families 
and variants and it leaves room for (various kinds and 
degrees of) personalization. 

Production-Rule-Based System: A system where 
knowledge is represented by means of production rules. 
A production rule is a statement composed of condi- 
tions and actions. If data in working memory satisfy 
the conditions, the related actions can be executed, 
resulting in an update of the working memory. 



Propositional Logic Formula Satisfiability: The
task of checking whether it is possible to assign a truth
value to every variable that occurs in a propositional 
formula, such that the truth value of the whole formula 
equals true. 

Stable Herbrand Model: A minimal set of facts 
satisfying a logic program (theory). Each fact in the 
model is a variable-free atom whose arguments are 
terms exclusively built through function and constant 
symbols occurring in the program and whose predicate 
symbols occur in the program as well. Facts not ap- 
pearing in the model are regarded as false. 

INTRODUCTION

Constraints appear in many areas of human endeavour,
from puzzles such as crosswords (the words can
only overlap at the same letter) and the recently popular
Sudoku (no number appears twice in a row), through
everyday problems such as planning a meeting (the
meeting room must accommodate all participants), to
hard optimization problems, for example in
manufacturing scheduling (a job must finish before
another job). Though these problems seem to come
from completely different worlds, they all share a similar
base: the task is to find values of decision variables,
such as the start time of the job or the position of the
number on a board, respecting given constraints. This
problem is called a Constraint Satisfaction Problem
(CSP).

Constraint processing emerged from AI research
in the 1970s (Montanari, 1974) when problems such as
scene labelling were studied (Waltz, 1975). The goal
of scene labelling was to recognize a type of line (and
then a type of object) in the 2D picture of a 3D scene.
The possible types were convex, concave, and occluding
lines and the combination of types was restricted
at junctions of lines to be physically feasible. This
scene labelling problem is probably the first problem
formalised as a CSP and some techniques developed
for solving this problem, namely arc consistency, are
still at the core of constraint processing. Systematic
use of constraints in programming systems started
in the 1980s when researchers identified a similarity
between unification in logic programming and constraint
satisfaction (Gallaire, 1985) (Jaffar & Lassez, 1987).
Constraint Logic Programming was born. Today Constraint
Programming is a separate subject independent
of the underlying programming language, though
constraint logic programming still plays a prominent
role thanks to the natural integration of constraints into a
logic programming framework.

This article presents mainstream techniques for
solving constraint satisfaction problems. These techniques
underlie the existing constraint solvers, and
understanding them is important for fully exploiting the
available technology.



BACKGROUND 

A Constraint Satisfaction Problem is formally defined as
a triple: a finite set of decision variables, a domain of
possible values, and a finite set of constraints restricting
possible combinations of values to be assigned
to variables. Although the domain can be infinite, for
example real numbers, a finite domain is frequently
assumed. Without loss of generality, the finite domain
can be mapped to a set of integers, which is the usual
case in constraint solvers. This article covers finite
domains only. In many problems, each variable has
its own domain, which is a subset of the domain from
the problem definition. Such a domain can be formally
defined by a unary constraint. We already mentioned
that constraints restrict possible combinations of
values that the decision variables can take. Typically,
the constraint is defined over a subset of variables, its
scope, and it is specified either extensionally, as a set of
value tuples satisfying the constraint, or intensionally,
using a logical or arithmetical formula. This formula,
for example A < B, then describes which value tuples
satisfy the constraint. A small example of a CSP is
({A, B, C}, {1, 2, 3}, {A < B, B < C}).
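The small CSP above can be written down directly. The following Python sketch is only an illustration: the `(scope, predicate)` representation of a constraint and the function names are assumptions of this example, and the brute-force enumeration is shown purely to make the definition concrete.

```python
from itertools import product

# The CSP ({A, B, C}, {1, 2, 3}, {A < B, B < C}).
variables = ["A", "B", "C"]
domain = [1, 2, 3]
# Intensional form: each constraint is a scope plus a formula over that scope.
constraints = [
    (("A", "B"), lambda a, b: a < b),
    (("B", "C"), lambda b, c: b < c),
]

# Extensional form of A < B: the set of value tuples satisfying it.
a_less_b_tuples = {(a, b) for a in domain for b in domain if a < b}


def is_feasible(assignment):
    """True if a complete assignment satisfies every constraint."""
    return all(pred(*(assignment[v] for v in scope)) for scope, pred in constraints)


def enumerate_solutions():
    """Try every complete assignment (exponential; for illustration only)."""
    solutions = []
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if is_feasible(assignment):
            solutions.append(assignment)
    return solutions


solutions = enumerate_solutions()
```

Of the 27 complete assignments, only A = 1, B = 2, C = 3 satisfies both constraints.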

The task of constraint processing is to instantiate
each decision variable by a value from the domain in
such a way that all constraints are satisfied. This instantiation
is called a feasible assignment. The
problem of deciding whether a feasible assignment exists for
a CSP is NP-complete: problems such as 3SAT or the knapsack
problem (Garey & Johnson, 1979) can be directly
encoded as CSPs. Sometimes, the core constraint satisfaction
problem is accompanied by a so-called objective
function defined over (some) decision variables and we
get a Constrained Optimisation Problem. Then the task
is to select among the feasible assignments the assignment
that minimizes (or maximizes) the value of the
objective function. This article focuses on techniques
for finding a feasible assignment but these techniques
can be naturally extended to optimization problems
via the well-known branch-and-bound technique (Van
Hentenryck, 1989).

There are several comprehensive sources of information
about constraint satisfaction, from journal
surveys (Kumar, 1992) (Jaffar & Maher, 1996) through
on-line tutorials (Bartak, 1998) to several books. Van
Hentenryck's book (1989) was a pioneering work
showing constraint satisfaction in the context of logic
programming. Tsang's later book (1993) focuses on
constraint satisfaction techniques independently of the
programming framework and it provides full technical
details of most algorithms described later in this article.
Recent books cover both theoretical (Apt, 2003) and
practical aspects (Marriott & Stuckey, 1998), provide
good teaching material (Dechter, 2003) or in-depth
surveys of individual topics (Rossi et al., 2006). We
should not forget about books showing how constraint
satisfaction technology is applied in particular areas;
scheduling problems play a prominent role here
(Baptiste et al., 2001) because constraint processing
is exceptionally successful in this area.



CONSTRAINT SATISFACTION 
TECHNIQUES 

Constraint satisfaction problems over finite domains 
are basically combinatorial problems so they can be 
solved by exploring the space of possible (partial or 
complete) instantiations of decision variables. Later in 
this section we will present the typical search algorithms 
used in constraint processing. However, it should be 
highlighted that constraint processing is not simple 
enumeration and we will also show how so called 
consistency techniques contribute to solving CSPs. 

Systematic Search 

Search is a core technology of artificial intelligence and
many search algorithms have been developed to solve
various problems. In the case of constraint processing we
are searching for a feasible assignment of values to variables
where the feasibility is defined by the constraints.
This can be done in a backtracking manner where we
assign a value to a selected variable and check whether
the constraints whose scope is already instantiated are
satisfied. In the positive case, we proceed to the next
variable. In the negative case, we try another value
for the current variable or, if there are no more values,
we backtrack to the last instantiated variable and try
alternative values there. The following code shows the
skeleton of this procedure, historically called labelling
(Waltz, 1975). Notice that the consistency check may
prune domains of individual variables, which will be
discussed in the next section.

procedure labelling(V, D, C)
    if all variables from V are assigned then return V
    select not-yet assigned variable x from V
    for each value v from Dx do
        (TestOK, D') <- consistent(V, D, C ∪ {x=v})
        if TestOK = true then
            R <- labelling(V, D', C)
            if R ≠ fail then return R
    end for
    return fail
end labelling
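A minimal Python sketch of this backtracking skeleton follows. It is a simplification rather than a transcription of the procedure above: the consistency check only tests the constraints whose scope is fully instantiated and does not prune domains, variables are taken in their given order, and the `(scope, predicate)` constraint representation is an assumption of this example.

```python
def consistent(assignment, constraints):
    """Check every constraint whose whole scope is already instantiated."""
    for scope, pred in constraints:
        if all(v in assignment for v in scope):
            if not pred(*(assignment[v] for v in scope)):
                return False
    return True


def labelling(variables, domains, constraints, assignment=None):
    """Chronological backtracking; returns a feasible assignment or None."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)  # all variables are assigned
    # Select a not-yet assigned variable (here simply the first one).
    x = next(v for v in variables if v not in assignment)
    for value in domains[x]:
        assignment[x] = value
        if consistent(assignment, constraints):
            result = labelling(variables, domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[x]  # undo the choice and try the next value
    return None  # no value works: the caller backtracks


variables = ["A", "B", "C"]
domains = {v: [1, 2, 3] for v in variables}
constraints = [
    (("A", "B"), lambda a, b: a < b),
    (("B", "C"), lambda b, c: b < c),
]
solution = labelling(variables, domains, constraints)
```

Returning `None` plays the role of `fail` in the pseudocode.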

The above backtracking mechanism is parameterized
by variable and value selection heuristics that
decide about the order of variables for instantiation
and about the order in which the values are tried.
While value ordering is usually problem dependent
and problem-independent heuristics are not frequently
used due to their computational complexity, there are
popular problem-independent variable ordering heuristics.
Variable ordering is based on the so-called first-fail
principle formulated by Haralick and Elliott (1980),
which says that the variable whose instantiation will
lead to a failure with the highest probability should
be tried first. A typical instance of this principle is the
dom heuristic, which prefers variables with the smallest
domain for instantiation. There exist other popular
variable ordering heuristics (Rossi et al., 2006) such
as dom+deg or dom/deg, but their detailed description is
beyond the scope of this short article.
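The dom heuristic is simple enough to state as code. The Python sketch below assumes that current domains are kept in a dictionary of sets and that ties are broken by the order of the variable list; both choices are assumptions of this example, not part of the heuristic's definition.

```python
def select_variable_dom(variables, domains, assignment):
    """First-fail 'dom' heuristic: pick the unassigned variable
    with the smallest current domain (ties: first in `variables`)."""
    unassigned = [v for v in variables if v not in assignment]
    return min(unassigned, key=lambda v: len(domains[v]))


variables = ["A", "B", "C"]
domains = {"A": {1, 2, 3}, "B": {2}, "C": {1, 2}}
first_choice = select_variable_dom(variables, domains, {})
second_choice = select_variable_dom(variables, domains, {"B": 2})
```

Here `B` is selected first because only one value is left for it, then `C`. The heuristic pays off mainly when domains shrink during search, for example through the domain filtering described in the next section.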

Though the heuristics improve the efficiency
of search, they cannot resolve all drawbacks of
backtracking. Probably the main drawback is ignoring
the information about the reason for constraint infeasibility.
If the algorithm discovers that no value can be
assigned to a variable, it blindly backtracks to the last
instantiated variable though the reason for the conflict
may be elsewhere. There exist techniques like backjumping
that can detect the variable whose instantiation
caused the problem and backtrack (backjump) to this
variable (Dechter, 2003). These techniques belong to
a broader class of intelligent backtracking that shares
the idea of intelligent recovery from infeasibility.
Though these techniques are interesting and go far beyond
simple enumeration, it seems better to prevent
infeasibility rather than to recover from it (even in an
intelligent way).

Domain Filtering and Maintaining 
Consistency 

Assume variables A and B with domain {1, 2, 3} and
a simple constraint A < B. Clearly, value 3 can never
be assigned to A because there is no way to satisfy the
constraint A < B if this value is used for A. Hence,
this value can be safely removed from the domain of
variable A and it does not need to be considered during
search. Similarly, value 1 can be removed from the
domain of B. This process is called domain filtering
and it is realised by a special procedure assigned to
each constraint. Domain filtering is closely related to
consistency of the constraint. We say that constraint C
is (arc) consistent if for any value x in the domain of
any variable in the scope of C there exist values in the
domains of the other variables in the scope of C such that
the value tuple satisfies C. Such a value tuple is called
a support for x. Domain filtering attempts to make the
constraint consistent by removing values which have
no support.
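
The definition of support can be turned into a naive filtering procedure almost literally. The following Python sketch is only an illustration: the function name filter_constraint and the representation of domains as a dictionary of sets are assumptions made here, not part of any particular solver.

```python
from itertools import product

def filter_constraint(domains, scope, predicate):
    """Remove every value that has no support in one constraint.

    domains   -- dict mapping a variable name to a set of values
    scope     -- tuple of variable names the constraint ranges over
    predicate -- function returning True if a value tuple satisfies it
    Returns new domains; the input dictionary is left unchanged.
    """
    # All value tuples over the current domains that satisfy the constraint.
    satisfying = [t for t in product(*(sorted(domains[v]) for v in scope))
                  if predicate(*t)]
    filtered = dict(domains)
    for position, var in enumerate(scope):
        # A value survives only if it occurs in some satisfying tuple.
        filtered[var] = {t[position] for t in satisfying}
    return filtered

domains = {"A": {1, 2, 3}, "B": {1, 2, 3}}
result = filter_constraint(domains, ("A", "B"), lambda a, b: a < b)
print(result)  # A keeps {1, 2}, B keeps {2, 3}
```

Enumerating all tuples is exponential in the number of variables in the scope, so this form is only reasonable for small constraints.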

Domain filtering can be applied to all constraints
in the problem to remove unsupported values from the
domains of variables and to make the whole problem
consistent. Because the constraints are interconnected,
it may be necessary to repeat the domain filtering of a
constraint C if another constraint pruned the domain of
a variable in the scope of C. Basically, domain filtering
is repeated until a fixed point is reached, which removes
the largest number of unsupported values. There exist
several procedures to realise this idea (Mackworth,
1977); the AC-3 schema is the most popular one:

procedure AC-3(V,D,C)
   Q ← C
   while non-empty Q do
      select c from Q
      D' ← c.FILTER(D)
      if any domain in D' is empty then return (fail,D')
      Q ← Q ∪ {c' ∈ C | ∃x ∈ var(c') D'x ≠ Dx} − {c}
      D ← D'
   end while
   return (true,D)
end AC-3



We did not cover the details of the filtering procedure
here. In the simplest form, it may explore the consistent
tuples in the constraint to find a support for each value.
There exist more advanced techniques that keep some
information between the repeated calls to the filter and
hence achieve better time efficiency (Bessiere, 1994).
Frequently, the filtering procedure exploits the semantics
of the constraint to realise filtering faster. For example,
filtering for constraint A < B can be realised by removing
from the domain of A all values greater than or equal to the
maximal value of B (and similarly, removing from the domain
of B all values less than or equal to the minimal value of A).
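
The AC-3 schema can be sketched in Python as follows. This is a minimal illustration under stated assumptions: constraints are given as (scope, filter) pairs, and make_filter is a hypothetical helper that builds a naive support-based filter from a predicate. A production solver would use specialised filters instead.

```python
from itertools import product

def make_filter(scope, predicate):
    """Build a naive support-based filter for one constraint (hypothetical helper)."""
    def filt(domains):
        tuples = [t for t in product(*(sorted(domains[v]) for v in scope))
                  if predicate(*t)]
        new = dict(domains)
        for pos, var in enumerate(scope):
            new[var] = {t[pos] for t in tuples}
        return new
    return filt

def ac3(domains, constraints):
    """constraints: list of (scope, filter) pairs. Returns (ok, domains)."""
    queue = list(range(len(constraints)))          # Q <- C
    while queue:
        c = queue.pop()                            # select c from Q
        scope, filt = constraints[c]
        new = filt(domains)                        # D' <- c.FILTER(D)
        if any(not new[v] for v in scope):
            return False, new                      # some domain is empty
        changed = {v for v in scope if new[v] != domains[v]}
        # Re-queue every other constraint that shares a pruned variable.
        for other, (other_scope, _) in enumerate(constraints):
            if other != c and other not in queue and changed & set(other_scope):
                queue.append(other)
        domains = new                              # D <- D'
    return True, domains

start = {"A": {1, 2, 3}, "B": {1, 2, 3}, "C": {1, 2, 3}}
chain = [(("A", "B"), make_filter(("A", "B"), lambda a, b: a < b)),
         (("B", "C"), make_filter(("B", "C"), lambda b, c: b < c))]
ok, pruned = ac3(start, chain)

cycle = [(("A", "B"), make_filter(("A", "B"), lambda a, b: a < b)),
         (("B", "A"), make_filter(("B", "A"), lambda b, a: b < a))]
ok_cycle, _ = ac3(start, cycle)
```

For the chain A < B, B < C the propagation alone fixes A = 1, B = 2, C = 3, while the contradictory pair A < B, B < A is detected as a failure without any search.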

Let us return our attention to search. Even if we make
all the constraints consistent, it does not mean that we
have obtained a solution. For example, the problem
({A, B, C}, {1, 2}, {A ≠ B, B ≠ C, A ≠ C}) is consistent
in the above-described sense, but it has no solution.
Hence consistency techniques need to be combined
with backtracking search to obtain a complete constraint
solver. First, we make the constraints consistent. Then
we start the backtracking search as described in the
previous section and after each variable instantiation,
we make the constraints consistent again. It may happen
that during the consistency procedure some domain
becomes empty. This indicates inconsistency and we
can backtrack immediately. Because the consistency
procedure removes inconsistencies from the not yet
instantiated variables, it prevents future conflicts during
search. Hence this principle is called look ahead, as
opposed to look back techniques that focus on recovery
from discovered conflicts. The whole process is also
called maintaining consistency during search and it
can be realised by replacing the consistent procedure
in labelling with the procedure AC-3. Figure 1 shows
the difference between simple backtracking (top) and
the look-ahead technique (bottom) when solving the
well known 4-queens problem. The task is to allocate
a queen to each column of the chessboard in such a
way that no two queens attack each other. Notice that
the look-ahead method solved the problem after four
attempts while simple backtracking is still allocating the
first two queens.
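
Maintaining consistency during search can be illustrated on the 4-queens problem with a short Python sketch. The names propagate and solve and the column-to-rows domain representation are assumptions made for this illustration; the propagation loop is a simple fixed-point iteration rather than an optimised AC-3 queue, and its count of attempts need not match the counting used in Figure 1.

```python
def consistent_pair(ci, ri, cj, rj):
    """Queens at (column ci, row ri) and (column cj, row rj) do not attack."""
    return ri != rj and abs(ri - rj) != abs(ci - cj)

def propagate(domains):
    """Make all binary no-attack constraints arc consistent.

    Returns the filtered domains, or None if some domain becomes empty.
    """
    domains = {c: set(rows) for c, rows in domains.items()}
    changed = True
    while changed:
        changed = False
        for ci in domains:
            for cj in domains:
                if ci == cj:
                    continue
                # Keep only rows of ci that have a support in column cj.
                supported = {ri for ri in domains[ci]
                             if any(consistent_pair(ci, ri, cj, rj)
                                    for rj in domains[cj])}
                if supported != domains[ci]:
                    domains[ci] = supported
                    changed = True
                    if not supported:
                        return None
    return domains

def solve(domains, stats):
    """Depth-first search that restores consistency after each decision."""
    domains = propagate(domains)
    if domains is None:
        return None                      # inconsistency: backtrack at once
    if all(len(rows) == 1 for rows in domains.values()):
        return {c: next(iter(rows)) for c, rows in domains.items()}
    # Pick the first column that is not yet instantiated.
    col = next(c for c in sorted(domains) if len(domains[c]) > 1)
    for row in sorted(domains[col]):
        stats["attempts"] += 1
        trial = dict(domains)
        trial[col] = {row}
        solution = solve(trial, stats)
        if solution is not None:
            return solution
    return None

n = 4
stats = {"attempts": 0}
solution = solve({c: set(range(n)) for c in range(n)}, stats)
print(solution, stats)
```

When all domains are singletons and arc consistent, every pair of queens is compatible, so the assignment is a solution and no further check is needed.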

Figure 1. Solving 4-queens problem using backtracking (top) and look-ahead (bottom) techniques; the crosses
indicate positions forbidden by the current allocation of queens in the look-ahead method (values pruned by
AC).

Clearly, the more inconsistencies one can remove,
the smaller the search tree that needs to be explored.
There exist stronger consistency techniques that consider
several constraints together (rather than filtering each
constraint separately, as we described above), but they are
usually too computationally expensive and hence they are
not used in each node of the search tree. Nevertheless, there
also exists a compromise between stronger and efficient
domain filtering called a global constraint. The idea is
to encapsulate some well defined sub-problem into a
single constraint (rather than a set of constraints) and
then design a fast filtering algorithm for this constraint.
A typical example of such a global constraint is all-different,
which encapsulates a set of binary inequalities between all
pairs of variables and, by using filtering based on matching
in bipartite graphs, achieves stronger pruning (Regin, 1994).
Figure 2 demonstrates how a CSP with binary inequalities is
converted into a bipartite graph, where a matching indicates
a consistent instantiation of variables.
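
The connection between all-different and bipartite matching can be sketched as follows. This is a deliberately naive illustration and not the algorithm of Regin (1994): it keeps a value for a variable only if a complete matching still exists once the variable is fixed to that value, recomputing the matching from scratch each time. The function names are assumptions made here.

```python
def has_full_matching(domains):
    """True if every variable can receive a distinct value.

    Uses the classic augmenting-path method for bipartite matching
    between variables and values.
    """
    match = {}  # value -> variable currently using it

    def assign(var, seen):
        for value in sorted(domains[var]):
            if value in seen:
                continue
            seen.add(value)
            # Take the value if it is free or its owner can move elsewhere.
            if value not in match or assign(match[value], seen):
                match[value] = var
                return True
        return False

    return all(assign(var, set()) for var in domains)

def all_different_filter(domains):
    """Keep value v of variable x only if some matching uses x = v."""
    return {var: {v for v in values
                  if has_full_matching({**domains, var: {v}})}
            for var, values in domains.items()}

# Three variables over two values: every binary inequality is arc
# consistent, yet no matching (and hence no solution) exists.
print(has_full_matching({"A": {1, 2}, "B": {1, 2}, "C": {1, 2}}))

# A and B occupy values 1 and 2 between them, so C must take 3.
print(all_different_filter({"A": {1, 2}, "B": {1, 2}, "C": {1, 2, 3}}))
```

In the second example, filtering the three binary inequalities separately would prune nothing, whereas the matching-based view removes values 1 and 2 from the domain of C.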

Global constraints represent a powerful mechanism
for integrating efficient solving algorithms into the general
framework of constraint satisfaction. There exist dozens
of global constraints designed for particular application
areas (Baptiste et al., 2001) as well as general global
constraints (Beldiceanu et al., 2005).



Figure 2. A graph representation of a constraint satisfaction problem with binary inequalities (left) and a bipartite
graph representing the same problem in the all-different constraint (right).



FUTURE TRENDS

Constraint processing is a mature technology that
goes beyond artificial intelligence and co-operates
(and competes) with techniques from areas such as
operations research and discrete mathematics. Many
constraint satisfaction techniques, including dozens of
specialized as well as generic global constraints, have
been developed in recent years (Beldiceanu et al., 2005)
and new techniques are coming. The technology trend
is to integrate the techniques from different areas for
co-operative and hybrid problem solving. Constraint
processing may serve as a good base for such integration
(as global constraints showed) but it can also provide
solving techniques to be integrated in other frameworks
such as SAT (satisfaction of logical formulas in
conjunctive normal form). This "hybridization and
integration" trend is reflected in new conferences, for example
CP-AI-OR (International Conference on Integration of
AI and OR Techniques in Constraint Programming for
Combinatorial Optimization Problems).

The paradox of fast technology development is
that the technology becomes harder to use for non-expert
users. There always exist several ways to model
a problem using constraints and, though the models
are equivalent concerning their soundness, they are
frequently not equivalent concerning their efficiency.
Although there are several rules of "good constraint
modelling" (Marriott & Stuckey, 1998) (Bartak, 2005),
there do not exist generally applicable guidelines for
constraint modelling. Hence it is sometimes not that
easy to design a model that is solvable (in a reasonable
time) by available constraint solvers. So one of
the most important challenges of constraint processing
for the upcoming years is to bring the technology back to
the masses by providing automated modelling and problem
reformulation tools that will form a middleware between
the constraint solvers and non-expert users and make
the holy grail of programming (the user states the
problem and the computer solves it) a reality.



CONCLUSION 

This article surveyed mainstream constraint satisfaction
techniques with the goal of giving a compact background
of the technology to people who would like to use these
techniques for solving combinatorial optimisation
problems. We simplified the techniques and terminology
a bit to fit the scope of the article while keeping
the core principles. It is important to understand that
the presented techniques (and even more) are already
available in existing constraint solvers such as ILOG CP
library (www.ilog.com/products/cp), SICStus Prolog
(www.sics.se/sicstus), ECLiPSe (eclipse.crosscoreop.com),
Mozart (www.mozart-oz.org), Choco (choco-solver.net)
and other systems, so the users are not
required to program them from scratch. Nevertheless,
understanding the underlying principles is important
for the design of efficient constraint models that can be
solved by these systems. Constraint processing has not
reached the holy grail of programming yet, but it is moving
fast towards this goal.



KEY TERMS

Constrained Optimisation Problem (COP): A
Constraint Satisfaction Problem extended by an objective
function over the (subset of) decision variables. The
task is to find a solution to the CSP which minimizes
or maximizes the value of the objective function.

Constraint: Any relation between a subset of decision
variables. Can be expressed extensionally, as a set
of value tuples satisfying the constraint, or intensionally,
using an arithmetical or logical formula between the
variables, for example A+B < C.

Constraint Satisfaction Problem (CSP): A prob- 
lem formulated using a set of decision variables, their 
domains, and constraints between the variables. The 
task is to find an instantiation of decision variables by 
values from their domains in such a way that all the 
constraints are satisfied. 



Consistency Techniques: Techniques that remove 
inconsistent values (from variables' domains) or value 
tuples, that is, the values that cannot be assigned to a 
given variable in any solution. Arc consistency is the 
most widely used consistency technique. 

Decision Variable: A variable modelling some 
feature of the problem, for example a start time of 
activity, whose value we are looking for in such a way 
that specified constraints are satisfied. 

Domain of Variable: A set of possible values that 
can be assigned to a decision variable, for example a 
set of times when some activity can start. Constraint 
processing usually assumes finite domains only. 

Domain Pruning (Filtering): A process of remov- 
ing values from domains of variables that cannot take 
part in any solution. Usually, due to efficiency issues 
only the values locally violating some constraint are 
pruned. It is the most common type of consistency 
technique. 

Global Constraint: An n-ary constraint model- 
ling a subset of simpler constraints by providing a 
dedicated filtering algorithm that achieves stronger or 
faster domain pruning in comparison to making the 
simpler constraints (locally) consistent. All-different 
is an example of a global constraint. 

Look Ahead: The most common technique for integrating
depth-first search with maintaining consistency.
Each time a search decision is made, it is propagated in
the problem model by making the model consistent.

Search Algorithms: Algorithms that explore the 
space of possible (partial or complete) instantiations 
of decision variables with the goal to find an instantia- 
tion satisfying all the constraints (and optimizing the 
objective function in case of COP).



INTRODUCTION

The effective capacity of inter-urban motorway networks
is an essential component of traffic control and
information systems, particularly during periods of
daily peak flow. However, even slightly inaccurate capacity
predictions can lead to congestion that has huge social
costs in terms of travel time, fuel costs and environmental
pollution. Accurate forecasting of the traffic flow during
peak periods can therefore help to avoid, or at least
reduce, congestion and its associated costs.

However, inter-urban traffic information presents a
challenging situation: traffic flow forecasting involves
a rather complex nonlinear data pattern and unforeseen
physical factors associated with road traffic situations.
Artificial neural networks (ANNs) are attracting attention
for traffic flow forecasting due to their general nonlinear
mapping capabilities. Unlike most conventional neural
network models, which are based on the empirical risk
minimization principle, support vector regression (SVR)
applies the structural risk minimization principle to
minimize an upper bound of the generalization error, rather
than minimizing the training errors. SVR has been used to
deal with nonlinear regression and time series problems.
This investigation presents a short-term traffic
forecasting model which combines the SVR model with
continuous ant colony optimization (SVRCACO) to
forecast inter-urban traffic flow. A numerical example of
traffic flow values from northern Taiwan is employed to
elucidate the forecasting performance of the proposed
model. The simulation results indicate that the proposed
model yields more accurate forecasting results than
the seasonal autoregressive integrated moving average
(SARIMA) time-series model.



BACKGROUND

Traditionally, a wide variety of forecasting approaches
has been applied to forecast the traffic flow of inter-urban
motorway networks. Those approaches can be classified
according to the type of data, forecast horizon, and
potential end-use (Dougherty, 1996); they include
historical profiling (Okutani & Stephanedes,
1984), state space models (Stathopoulos & Karlaftis,
2003), Kalman filters (Whittaker, Garside & Lindveld,
1994), and system identification models (Vythoulkas,
1993). However, traffic flow data are in the form of
spatial time series and are collected at specific locations
at constant intervals of time. The above-mentioned
studies and their empirical results have indicated that
the problem of forecasting inter-urban motorway traffic
flow is multi-dimensional, including relationships
among measurements made at different times and
geographical sites. In addition, these methods have
difficulty coping with observation noise and missing
values while modeling. Therefore, Danech-Pajouh and
Aron (1991) employed a layered statistical approach
with a mathematical clustering technique to group the
traffic flow data and a separately tuned linear regression
model for each cluster. Based on the multi-dimensional
pattern recognition requests, such as intervals of time
and geographical sites, non-parametric regression
models (Smith, Williams & Oswald, 2002) have also
successfully been employed to forecast motorway traffic
flow. The ARIMA model and its extensions are
the most popular approaches in traffic flow forecasting
(Kamarianakis & Prastacos, 2005) (Smith et al., 2002).
Due to the stochastic nature and the strongly nonlinear
characteristics of inter-urban traffic flow data, artificial
neural network (ANN) models have received
much attention and have been considered as alternatives
for traffic flow forecasting (Ledoux, 1997) (Yin,
Wong, Xu & Wong, 2002). However, the training
procedure of ANN models is not only time-consuming
but also prone to getting trapped in local minima, and
the selection of the model architecture is subjective.

SVR has therefore been successfully employed to
solve forecasting problems in many fields, such as
financial time series (stock index and exchange rate)
forecasting (Pai & Lin, 2005) (Pai, Lin, Hong & Chen,
2006), engineering and software (production values
and reliability) forecasting (Hong & Pai, 2006) (Pai &
Hong, 2006), atmospheric science forecasting (Hong &
Pai, 2007) (Mohandes, Halawani, Rehman & Hussain,
2004), and so on. Meanwhile, the SVR model has also
been successfully applied to forecast electric load (Pai
& Hong, 2005a) (Pai & Hong, 2005b). The practical
results indicated that poor forecasting accuracy results
from a lack of knowledge about how to select the
three parameters (σ, C, and ε) in an SVR model.

In this investigation, one of the evolutionary algorithms,
ant colony optimization (ACO), is used to determine
the values of the three parameters of an SVR traffic flow
model for Panchiao City, Taipei County, Taiwan.
Because ACO was developed for discrete optimization,
applying it to continuous optimization problems requires
transforming the continuous search space into a discrete
one by discretizing the continuous decision variables;
this procedure is called CACO.



MAIN FOCUS OF THE CHAPTER

In this article, two models, the seasonal ARIMA
(SARIMA) model and the SVRCACO model, are
used to compare the forecasting performance of traffic
flow.

Support Vector Regression (SVR) Model

The basic concept of the SVR is to map the original
data x nonlinearly into a higher dimensional feature
space. Hence, given a set of data $G = \{(x_i, a_i)\}_{i=1}^{N}$
(where $x_i$ is the input vector, $a_i$ is the actual value,
and N is the total number of data patterns), the SVM
regression function is:

$$f = g(x) = w\,\phi(x_i) + b \tag{1}$$

where $\phi(x_i)$ is the feature of inputs (to map the input
data into a so-called high dimensional feature space, see
Fig. 1(a) and (b)), and both w and b are coefficients.
The coefficients (w and b) are estimated by minimizing
the following regularized risk function:

$$R(C) = C\,\frac{1}{N}\sum_{i=1}^{N} L_\varepsilon(a_i, f_i) + \frac{1}{2}\|w\|^2 \tag{2}$$

where

$$L_\varepsilon(a, f) = \begin{cases} 0, & \text{if } |a - f| \le \varepsilon \\ |a - f| - \varepsilon, & \text{otherwise} \end{cases} \tag{3}$$

In addition, $L_\varepsilon(a, f)$ is employed to find an
optimum hyperplane in the high dimensional feature
space that maximizes the distance separating the training
data into two subsets. Thus, the SVR focuses on finding
the optimum hyperplane and minimizing the training
error between the training data and the ε-insensitive
loss function (shown as the thick line in Fig. 1(c)).

Minimize:

$$R(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*) \tag{4}$$

with the constraints

$$w\,\phi(x_i) + b - a_i \le \varepsilon + \xi_i^*$$
$$a_i - w\,\phi(x_i) - b \le \varepsilon + \xi_i$$
$$\xi_i,\ \xi_i^* \ge 0, \quad i = 1, 2, \ldots, N$$

The first term of Eq. (4), which employs the concept of
maximizing the distance between the two separated sets of
training data, is used to regularize weight sizes, to
penalize large weights, and to maintain regression function
flatness. The second term penalizes training errors between
forecast values and actual values by using the ε-insensitive
loss function. C is a parameter to trade off these two terms.
Training errors above ε are denoted as $\xi_i^*$, whereas
training errors below −ε are denoted as $\xi_i$.

After the quadratic optimization problem with
inequality constraints is solved, the weight w in Eq.
(2) is obtained,

$$w^* = \sum_{i=1}^{N}(\beta_i - \beta_i^*)\,\phi(x_i) \tag{5}$$

where $\beta_i$ and $\beta_i^*$ are the Lagrange multipliers of
the constraints.

Figure 1. Transformation process illustration of a SVR model: (a) input space, (b) feature space, (c) ε-insensitive loss function
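
The ε-insensitive loss of Eq. (3) and the regularized risk of Eq. (2) can be written in a few lines of Python. This is a minimal sketch for illustration only; the function names and the representation of w as a plain list of weights are assumptions made here.

```python
def eps_insensitive_loss(actual, forecast, eps):
    """Eq. (3): errors inside the eps-tube cost nothing, others cost the excess."""
    return max(0.0, abs(actual - forecast) - eps)

def regularized_risk(w, actuals, forecasts, C, eps):
    """Eq. (2): C times the mean eps-insensitive loss plus half the squared norm of w."""
    n = len(actuals)
    empirical = sum(eps_insensitive_loss(a, f, eps)
                    for a, f in zip(actuals, forecasts)) / n
    return C * empirical + 0.5 * sum(wi * wi for wi in w)

# A forecast 0.3 away from the actual value lies inside a tube of width 0.5,
# while a forecast 2.0 away is penalised by 2.0 - 0.5 = 1.5.
inside = eps_insensitive_loss(10.0, 10.3, 0.5)
outside = eps_insensitive_loss(10.0, 12.0, 0.5)
risk = regularized_risk([3.0, 4.0], [10.0, 10.0], [10.3, 12.0], C=2.0, eps=0.5)
print(inside, outside, risk)
```

A larger C puts more weight on the training errors, whereas a larger ε widens the tube within which errors are ignored.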



Hence, the regression function is Eq. (6):

$$f(x, \beta, \beta^*) = \sum_{i=1}^{N}(\beta_i - \beta_i^*)\,K(x, x_i) + b \tag{6}$$

Here, $K(x_i, x_j)$ is called the kernel function. The
value of the kernel equals the inner product of two
vectors, $x_i$ and $x_j$, in the feature space $\phi(x_i)$ and
$\phi(x_j)$; that is, $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. The
Gaussian RBF kernel is not only easier to implement, but
also capable of nonlinearly mapping the training data into
an infinite dimensional space; thus, it is suitable for
nonlinear relationship problems. In this work, the
Gaussian function, $\exp(-\|x - x_i\|^2 / (2\sigma^2))$, is used
in the SVR model.
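
Evaluating the regression function of Eq. (6) with the Gaussian kernel is straightforward once the multipliers are known. The sketch below assumes that the training vectors, the multipliers and the offset b have already been obtained from the quadratic program; the function names and the toy numbers are hypothetical.

```python
import math

def rbf_kernel(x, xi, sigma):
    """Gaussian kernel exp(-||x - xi||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def svr_predict(x, training_inputs, beta, beta_star, b, sigma):
    """Eq. (6): weighted sum of kernel values plus the offset b."""
    return sum((bi - bsi) * rbf_kernel(x, xi, sigma)
               for bi, bsi, xi in zip(beta, beta_star, training_inputs)) + b

# Toy values, chosen only to show the call pattern.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.7)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
forecast = svr_predict([1.0, 2.0], [[1.0, 2.0]], beta=[0.5],
                       beta_star=[0.1], b=2.0, sigma=0.7)
print(same, apart, forecast)
```

The kernel equals 1 for identical vectors and decays with distance at a rate set by σ, which is why σ is one of the three parameters that must be tuned.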

CACO in Selecting Parameters of the 
SVR Model 

Ant colony optimization algorithms (Dorigo, 1992)
have been successfully used to deal with combinatorial
optimization problems such as job-shop scheduling
(Colorni, Dorigo, Maniezzo & Trubian, 1994), the
traveling salesman problem (Dorigo & Gambardella, 1997),
space-planning (Bland, 1999), quadratic assignment
problems (Maniezzo & Colorni, 1999), and data mining
(Parpinelli, Lopes & Freitas, 2002). ACO imitates the
behavior of real ant colonies as they forage for food,
wherein each ant lays down pheromone on the path
to the food sources or back to the nest. The paths with
more pheromone are more likely to be selected by
other ants. Over time, a colony of ants will select the
shortest path to the food source and back to the nest.
Therefore, the pheromone trail is the most important
cue by which an individual ant senses and selects its route.



The probability, $P_k(i,j)$, that an ant k moves from city
i to city j is expressed as Eq. (7),

$$P_k(i,j) = \begin{cases} \arg\max_{S \in M_k}\{[\tau(i,S)]^{\alpha}[\eta(i,S)]^{\beta}\}, & \text{if } q \le q_0 \\ \text{Eq. (8)}, & \text{otherwise} \end{cases} \tag{7}$$

$$P_k(i,j) = \begin{cases} \dfrac{[\tau(i,j)]^{\alpha}[\eta(i,j)]^{\beta}}{\sum_{S \in M_k}[\tau(i,S)]^{\alpha}[\eta(i,S)]^{\beta}}, & j \in M_k \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

where $\tau(i,j)$ is the pheromone level between city i and
city j, and $\eta(i,j)$ is the inverse of the distance between
cities i and j. In this study, the forecasting error represents
the distance between cities. The α and β are parameters
determining the relative importance of pheromone level
and distance, and $M_k$ is the set of cities in the next
column of the city matrix for ant k. q is a uniform random
variable in [0,1] and the value $q_0$ is a parameter. The
values of α, β and $q_0$ are set to be 8, 5 and 0.2,
respectively.
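
The transition rule of Eqs. (7) and (8) combines a greedy choice with a roulette-wheel choice. The following Python sketch is an illustration under assumptions: pheromone levels and inverse distances are stored in dictionaries keyed by (i, j) pairs, and the function name choose_next_city is introduced here for the example.

```python
import random

def choose_next_city(i, candidates, tau, eta, alpha, beta, q0, rng=random):
    """Pick the next city for an ant located at city i.

    With probability q0 the best-scoring city is taken (Eq. 7);
    otherwise a city is drawn in proportion to its score (Eq. 8).
    """
    scores = {j: (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta)
              for j in candidates}
    if rng.random() <= q0:
        return max(scores, key=scores.get)        # exploitation
    # Exploration: roulette wheel over the normalised scores.
    threshold = rng.random() * sum(scores.values())
    cumulative = 0.0
    for j, score in scores.items():
        cumulative += score
        if cumulative >= threshold:
            return j
    return j  # guard against floating-point rounding

# Parameter values alpha = 8, beta = 5 and q0 = 0.2 follow the text;
# the pheromone and distance values are made up for the example.
tau = {(0, 1): 1.0, (0, 2): 2.0}
eta = {(0, 1): 1.0, (0, 2): 1.0}
greedy = choose_next_city(0, [1, 2], tau, eta, 8, 5, 1.0, random.Random(0))
mixed = choose_next_city(0, [1, 2], tau, eta, 8, 5, 0.2, random.Random(0))
print(greedy, mixed)
```

Passing a seeded random.Random instance makes runs reproducible, which is convenient when comparing parameter settings.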

Once the ants have completed their tours, the paths
carrying the most pheromone are taken as the
information regarding the best paths from the nest
to the food sources. Therefore, dynamic pheromone
updating plays the main role in the searching behavior
of real ant colonies. The local and global updating
rules of pheromone are expressed as Eq. (9) and
Eq. (10), respectively.

$$\tau(i,j) = (1-\rho)\,\tau(i,j) + \rho\,\tau_0 \tag{9}$$

$$\tau(i,j) = (1-\delta)\,\tau(i,j) + \delta\,\Delta\tau(i,j) \tag{10}$$

where ρ is the local evaporation rate of pheromone, 0 <
ρ < 1; $\tau_0$ is the initial amount of pheromone deposited
on each of the paths. In this work, the values of ρ and $\tau_0$
are set to be 0.01 and 1, respectively. In addition, the
approach proposed by Dorigo and Gambardella (1994)
was employed here for generating the initial amount
of pheromone. Global trail updating is accomplished
according to Eq. (10). The δ is the global pheromone decay
parameter, 0 < δ < 1, and is set to be 0.2 for this study.
The $\Delta\tau(i,j)$, expressed as Eq. (11), is used to increase
the pheromone on the path of the solution.

$$\Delta\tau(i,j) = \begin{cases} 1/L, & \text{if } (i,j) \in \text{global best route} \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where L is the length of the shortest route.
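
The two pheromone updating rules of Eqs. (9) to (11) can be sketched as follows. The dictionary representation of the pheromone table and the function names are assumptions made for this illustration; the global rule is applied to every edge, with a zero deposit for edges outside the best route, as Eq. (11) is written.

```python
def local_update(tau, edge, rho, tau0):
    """Eq. (9): pull the pheromone on a just-visited edge towards tau0."""
    tau[edge] = (1.0 - rho) * tau[edge] + rho * tau0

def global_update(tau, best_route_edges, best_length, delta):
    """Eqs. (10)-(11): decay all edges, reinforce those on the best route."""
    for edge in tau:
        deposit = 1.0 / best_length if edge in best_route_edges else 0.0
        tau[edge] = (1.0 - delta) * tau[edge] + delta * deposit

# rho = 0.01, tau0 = 1 and delta = 0.2 follow the text; the initial
# pheromone levels and the route length are made up for the example.
tau = {(0, 1): 2.0, (0, 2): 2.0}
local_update(tau, (0, 1), rho=0.01, tau0=1.0)
global_update(tau, best_route_edges={(0, 1)}, best_length=4.0, delta=0.2)
print(tau)
```

Because L is the route length, shorter best routes deposit more pheromone, which biases later ants towards them.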

A Numerical Example and Experimental 
Results 

The traffic flow data sets originated from three
Civil Motorway detector sites. The Civil Motorway is
the busiest inter-urban motorway network in Panchiao
City, the capital of Taipei County, Taiwan. The major
site was located at the center of Panchiao City, where
the flow intersects an urban local street system, and
it provided one-way traffic volume for each hour on
weekdays. Therefore, one-way flow data for peak traffic
are employed in this investigation, covering the
morning peak period (MPP; from 6:00 to 10:00) and
the evening peak period (EPP; from 16:00 to 20:00).
The data were collected from February 2005 to
March 2005; the numbers of traffic flow data available
for MPP and EPP are 45 and 90 hours, respectively.
For convenience, the traffic flow data are converted to
equivalent of passengers (EOP), and both peak periods
show the seasonality of traffic data. In
addition, traffic flow data are divided into three parts:
training data (MPP 25 hours; EPP 60 hours), validation
data (MPP 10 hours; EPP 15 hours) and testing
data (MPP 10 hours; EPP 15 hours). The accuracy of
forecasting models is measured by the normalized root
mean square error (NRMSE), as given by Eq. (12).



$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{i=1}^{n}(a_i - f_i)^2}{\sum_{i=1}^{n} a_i^2}} \tag{12}$$

where n is the number of forecasting periods; $a_i$ is the
actual traffic flow value at period i; and $f_i$ is the
forecast traffic flow value at period i.
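
As a check on Eq. (12), the NRMSE can be recomputed from the morning peak values listed in Table 2. The Python sketch below is only an illustration; the function name nrmse is an assumption, and the three lists copy the Actual, SARIMA and SVRCACO columns of the morning peak period.

```python
import math

def nrmse(actual, forecast):
    """Eq. (12): root of summed squared errors over summed squared actuals."""
    squared_errors = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    squared_actuals = sum(a ** 2 for a in actual)
    return math.sqrt(squared_errors / squared_actuals)

# Morning peak period values from Table 2.
actual = [1317.5, 2522.0, 2342.0, 2072.0, 1841.5,
          995.5, 1457.0, 1899.0, 1870.5, 2151.5]
sarima = [1363.77, 2440.11, 2593.91, 2422.09, 2459.87,
          1578.34, 2569.92, 2690.35, 2505.38, 2537.98]
svrcaco = [2190.9, 2027.4, 2140.4, 2313.7, 2053.4,
           1980.6, 1704.1, 1548.3, 1521.0, 1881.1]

print(round(nrmse(actual, sarima), 4), round(nrmse(actual, svrcaco), 4))
```

Up to rounding, the two results agree with the NRMSE row of Table 2 (0.3039 for SARIMA and 0.2632 for SVRCACO).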

The parameter selection of forecasting models is
important for obtaining good forecasting performance.
For the SARIMA model, the parameters are determined
by taking the first-order regular difference and first
seasonal difference to remove non-stationary and
seasonality characteristics. Using statistical packages, with
no autocorrelated residuals and approximately white
noise residuals, the most suitable models for the
morning and evening peak periods of the traffic data are
SARIMA(1,0,1)×(0,1,1)_5 without a constant term and
SARIMA(1,0,1)×(1,1,1)_5 with a constant term,
respectively. The equations used for the SARIMA models
are presented as Eqs. (13) and (14), respectively.

$$(1 - 0.5167B)(1 - B^5)X_t = (1 + 0.3306B)(1 - 0.9359B^5)\varepsilon_t \tag{13}$$

$$(1 - 0.5918B)(1 - B^5)X_t = 2.305 + (1 - 0.9003B^5)\varepsilon_t \tag{14}$$

For the SVRCACO model, a rolling-based forecasting
procedure was conducted and a one-hour-ahead
forecasting policy adopted. Several types of
data-rolling are then considered to forecast traffic flow
in the next hour. In this investigation, CACO is
employed to determine a suitable combination of the
three parameters in an SVR model. The parameters of the
SVRCACO models with the minimum testing NRMSE
values were selected as the most suitable model for this
investigation. Table 1 indicates that the SVRCACO models
perform best when 15 and 35 input data are used for the
morning and evening traffic forecasts, respectively. Table 2
compares the forecasting accuracy of the SARIMA and
SVRCACO models in terms of NRMSE. It shows
that the SVRCACO models have better forecasting results
than the SARIMA models.
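
One common way to organise a rolling, one-step-ahead forecast is to slide a fixed-length window over the series so that each window of past values is paired with the value of the following hour. The sketch below shows this idea only; the exact rolling scheme used in the study is not detailed in the text, and the function name and toy series are assumptions.

```python
def rolling_windows(series, n_inputs):
    """Pair each window of n_inputs past values with the next value."""
    return [(series[t - n_inputs:t], series[t])
            for t in range(n_inputs, len(series))]

# Toy hourly series; with n_inputs = 3 the model would see three past
# hours and be asked to forecast the fourth.
pairs = rolling_windows([1, 2, 3, 4, 5], n_inputs=3)
print(pairs)
```

Under this reading, the "Nos. of input data" column of Table 1 would correspond to n_inputs.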



Table 1. Forecasting results and associated parameters of the SVRCACO models

Morning peak period

Nos. of input data   σ        C        ε        NRMSE of testing
 5                   0.7286   2149.2   0.8249   0.3965
10                   0.7138   1199.0   0.1460   0.3464
15                   0.7561   2036.5   0.9813   0.2632
20                   0.6858   2141.6   0.4724   0.2754

Evening peak period

Nos. of input data   σ        C        ε        NRMSE of testing
25                   0.7277   9658.9   0.4176   0.1112
30                   0.9568   9337.7   0.7741   0.1037
35                   0.8739   6190.2   0.7619   0.1033
40                   0.1528   6300.5   0.8293   0.1147
45                   0.5093   3069.9   0.7697   0.1077
50                   0.1447   8835.7   0.8616   0.1247
55                   0.5798   6299.1   0.5796   0.1041


Table 2. Forecasting results (unit: EOT)

Morning peak period

Peak periods*   Actual     SARIMA     SVRCACO
031106          1,317.5    1,363.77   2,190.9
031107          2,522.0    2,440.11   2,027.4
031108          2,342.0    2,593.91   2,140.4
031109          2,072.0    2,422.09   2,313.7
031110          1,841.5    2,459.87   2,053.4
031206            995.5    1,578.34   1,980.6
031207          1,457.0    2,569.92   1,704.1
031208          1,899.0    2,690.35   1,548.3
031209          1,870.5    2,505.38   1,521.0
031210          2,151.5    2,537.98   1,881.1
NRMSE                      0.3039     0.2632

Evening peak period

Peak periods*   Actual     SARIMA     SVRCACO
031016          2,310.5    2,573.84   2,229.1
031017          2,618.0    2,821.57   2,319.9
031018          2,562.0    3,107.01   2,300.8
031019          2,451.5    3,103.66   2,571.6
031020          2,216.5    3,011.80   2,447.2
031116          2,175.5    2,611.58   2,432.4
031117          2,577.0    2,859.31   2,169.4
031118          2,879.5    3,144.75   2,450.4
031119          2,693.0    3,141.40   2,598.4
031120          2,640.0    3,049.54   2,671.9
031216          2,146.5    2,649.32   2,628.5
031217          2,544.5    2,897.05   2,633.1
031218          2,873.0    3,182.49   2,538.0
031219          2,567.5    3,179.13   2,670.8
031220          2,660.5    3,087.28   2,562.7
NRMSE                      0.1821     0.1033

*: "031106" denotes 6 o'clock on 11 March 2005, and so on.



FUTURE TRENDS

In this investigation, the SVRCACO model provides
a convenient and valid alternative for traffic flow
forecasting. The SVRCACO model directly uses historical
observations from traffic control systems and then
determines suitable parameters by efficient optimization
algorithms. In future research, other factors and
meteorological control variables during peak periods,
such as driving speed limitation, important social
events, the percentage of heavy vehicles, bottleneck
service level and waiting time during intersection
traffic signals can be included in the traffic forecasting
model. In addition, some other advanced optimization
algorithms for parameters selection can be applied for
the SVR model to satisfy the requirement of real-time
traffic control systems.



CONCLUSION 

Accurate traffic forecasting is crucial for an inter-urban
traffic control system, particularly for avoiding
congestion and for increasing the efficiency of limited traffic
resources during peak periods. The historical traffic data
of Panchiao City in northern Taiwan show a seasonal
fluctuation trend which occurs in many inter-urban
traffic systems. Over-prediction or under-prediction
of traffic flow therefore influences the transportation
capability of an inter-urban system. This study introduces
a forecasting technique, SVRCACO, and investigates
its feasibility for forecasting inter-urban
motorway traffic. This article indicates that the SVRCACO
model has better forecasting performance than
the SARIMA model. The superior performance of the
SVRCACO model is due to the generalization ability
of the SVR model for forecasting and the proper selection
of SVR parameters by CACO.



KEY TERMS

Ant Colony Optimization Algorithm (ACO):
Inspired by the behavior of ants in finding paths from
the colony to food, ACO is a probabilistic technique for
solving computational problems which can be reduced to
finding good paths through graphs. A short path gets
marched over faster, and thus the pheromone density
remains high as it is laid on the path as fast as it can
evaporate.

Artificial Neural Networks (ANNs): A network 
of many simple processors ("units" or "neurons") that 
imitates a biological neural network. The units are 
connected by unidirectional communication channels, 
which carry numeric data. 

Autoregressive Integrated Moving Average
(ARIMA): A generalization of an autoregressive moving
average (ARMA) model. These models are fitted
to time series data either to better understand the data
or to predict future points in the series. The model is
generally referred to as an ARIMA(p,d,q) model where
p, d, and q are integers greater than or equal to zero and
refer to the order of the autoregressive, integrated, and
moving average parts of the model respectively.

Evolutionary Algorithm (EA): A generic
population-based meta-heuristic optimization algorithm.
An EA uses some mechanisms inspired by biological
evolution: reproduction, mutation, recombination,
natural selection and survival of the fittest. Evolutionary
algorithms often perform well at approximating
solutions to many types of problems because they do not
make any assumption about the underlying fitness
landscape.

Pheromone: A pheromone is a chemical that trig- 
gers an innate behavioral response in another member 
of the same species. There are alarm pheromones, food 
trail pheromones, sex pheromones, and many others 
that affect behavior or physiology. In this article, food 
trail pheromones are employed, which are common in 
social insects. 

Seasonal Autoregressive Integrated Moving
Average (SARIMA): A kind of ARIMA model used for
forecasting problems in which a seasonal effect
is suspected. For example, consider a model of daily
road traffic volumes. Weekends clearly exhibit different
behavior from weekdays. In this case it is often
considered better to use a SARIMA (seasonal ARIMA)
model than to increase the order of the AR or MA parts
of the model.

Support Vector Machines (SVMs): Support vector
machines (SVMs) were originally developed to solve
pattern recognition and classification problems. With
the introduction of Vapnik's ε-insensitive loss function,
SVMs have been extended to solve nonlinear regression
estimation problems, an approach known as support
vector regression (SVR). SVR applies the structural risk
minimization principle to minimize an upper bound of
the generalization error. SVR has been used to deal with
nonlinear regression and time series problems.



INTRODUCTION

Data mining is the process of extracting previously 
unknown information from large databases or data 
warehouses and using it to make crucial business deci- 
sions. Data mining tools find patterns in the data and 
infer rules from them. The extracted information can 
be used to form a prediction or classification model, 
identify relations between database records, or provide a 
summary of the databases being mined. Those patterns 
and rules can be used to guide decision making and 
forecast the effect of those decisions, and data mining 
can speed analysis by focusing attention on the most 
important variables. 



BACKGROUND 

We are drowning in data, but starving for knowledge. 
In recent years the volume of information 
has increased significantly. Some researchers suggest 
that the volume of information stored doubles every 
year. Disk storage per person (DSP) is a way to measure 
the growth in personal data. Edelstein (2003) estimated 
that the number has dramatically grown from 28MB 
in 1996 to 472MB in 2000. 

Data mining seems to be the most promising solution 
to the dilemma of having too much data but 
very little knowledge. By using pattern recognition 
technologies and statistical and mathematical techniques 
to sift through warehoused information, data mining 
helps analysts recognize significant facts, relationships, 
trends, patterns, exceptions and anomalies. The 
use of data mining can advance a company's position 
by creating a sustainable competitive advantage. Data 
warehousing and mining is the science of managing 
and analyzing large datasets and discovering novel 
patterns (Davenport & Harris, 2007; Wang, 2006; 
Olafsson, 2006). 

Data mining is taking off for several reasons: organi- 
zations are gathering more data about their businesses, 
the enormous drop in storage costs, competitive busi- 
ness pressures, a desire to leverage existing information 
technology investments, and the dramatic drop in the 
cost/performance ratio of computer systems. Another 
reason is the rise of data warehousing. In the past, it 
was often necessary to gather the data, cleanse it, and 
merge it. Now, in many cases, the data are already 
sitting in a data warehouse ready to be used. 

Over the last 40 years, the tools and techniques to 
process data and information have continued to evolve 
from databases to data warehousing and further to data 
mining. Data warehousing applications have become 
business-critical. Data mining can extract even more 
value from these huge repositories of information. 
Data mining is a multidisciplinary field that draws on 
disciplines such as databases, statistics, artificial 
intelligence, pattern recognition, machine learning, 
information theory, control theory, operations research, 
information retrieval, data visualization, and 
high-performance, parallel and distributed computing 
(Zhou, 2003; Hand, Mannila, & Smyth, 2001). 

Certainly, many statistical models had emerged a 
long time ago. Machine learning has marked a mile- 
stone in the evolution of computer science. Although 
data mining is still in its infancy, it is now being used 
in a wide range of industries and for a range of tasks in 
a variety of contexts (Wang, 2003; Lavoie, Dempsey, 
& Connaway, 2006). Data mining is synonymous with 
knowledge discovery in databases, knowledge extraction, 
data/pattern analysis, data archeology, data dredging, 
data snooping, data fishing, information harvesting, 
and business intelligence (Han and Kamber, 2001). 



MAIN FOCUS 
Functionalities and Tasks 

The common types of information that can be de- 
rived from data mining operations are associations, 
sequences, classifications, clusters, and forecasting. 
Associations happen when occurrences are linked in 
a single event. One of the most popular association 
applications deals with market basket analysis. This 
technique incorporates the use of frequency and prob- 
ability functions to estimate the percentage chance of 
occurrences. Business strategists can leverage 
market basket analysis by applying such techniques 
as cross-selling and up-selling. In sequences, events 
are linked over time. This is particularly applicable 
in e-business for Website analysis. 
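
The frequency and probability functions mentioned for market basket analysis can be sketched as follows. This is only an illustration under stated assumptions: the transactions and item names are hypothetical, and `support` and `confidence` are the usual textbook measures rather than functions of any specific tool.

```python
# Hypothetical purchase baskets; the item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated chance of `consequent` given `antecedent` was bought.

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, transactions))       # share of baskets with both
print(confidence({"bread"}, {"milk"}, transactions))  # milk given bread
```

A rule such as "bread implies milk" with high support and confidence is the kind of association a cross-selling campaign might act on.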

Classification is probably the most common data 
mining activity today. It recognizes patterns that 
describe the group to which an item belongs. It does 
this by examining existing items that already have 
been classified and inferring a set of rules from them. 
Clustering is related to classification, but differs in that 
no groups have yet been defined. Using clustering, 
the data-mining tool discovers different groupings 
within the data. The resulting groups or clusters help 
the end user make some sense out of vast amounts of 
data (Kudyba & Hoptroff, 2001). All of these applications 
may involve predictions. The fifth application 
type, forecasting, is a different form of prediction. It 
estimates the future value of continuous variables based 
on patterns within the data. 
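
To make the contrast with classification concrete, the sketch below groups values without any predefined classes. It uses plain k-means on one-dimensional data, which is one common clustering technique and not necessarily the one the cited authors have in mind; the function name, the starting centers, and the spending figures are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers (plain k-means)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)),
                          key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its members (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers)
print(clusters)
```

No group labels were supplied in advance; the two groupings emerge from the data, and it is left to the end user to decide whether they are meaningful.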

Algorithms and Methodologies 

Neural Networks 

Neural networks, a technique from the field of artificial 
intelligence (AI), utilize predictive algorithms. This technology 
has many characteristics in common with regression 
because the application generally examines historical 
data and utilizes a functional form that best relates 
explanatory variables to the target variable in a manner 
that minimizes the error between what the model 
produced and what actually occurred in the past, 
and then applies this function to future data. Neural 
networks are somewhat more complex, as they incorporate 
intensive program architectures in attempting to identify 
linear, non-linear and patterned relationships in 
historical data. 
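
The idea of minimizing the error on historical data and then applying the fitted function to new data can be sketched with the simplest possible case: a single linear neuron trained by gradient descent. Real neural networks stack many nonlinear units, so this is a deliberately reduced illustration; the function name, learning rate, and data are assumptions.

```python
def train_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error between what the model produces and what actually occurred.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical historical data generated by y = 2x + 1.
history_x = [1, 2, 3, 4]
history_y = [3, 5, 7, 9]
w, b = train_linear_neuron(history_x, history_y)
forecast = w * 5 + b  # apply the learned function to a future input
print(round(w, 3), round(b, 3), round(forecast, 3))
```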

Decision Trees 

Megaputer (2006) mentioned that this method can be 
applied to the solution of classification tasks only. As 
a result of applying this method to a training set, a 
hierarchical structure of classifying rules of the type 
"if... then..." is created. This structure has the form of 
a tree. To decide to which class an object or a 
situation should be assigned, one has to answer the questions 
located at the tree nodes, starting from the root. 
Following this procedure, one eventually comes to one 
of the final nodes (called leaves), where the analyst 
finds a conclusion about the class to which the considered 
object should be assigned. 
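
The walk from the root to a leaf described above can be sketched as follows. The tree is written by hand here, whereas a decision tree method would derive it from a training set; the feature names, thresholds, and class labels are hypothetical.

```python
# Hypothetical tree: internal nodes ask a question, leaves hold a class.
tree = {
    "question": ("income", 50000),            # is income >= 50000?
    "yes": {
        "question": ("years_customer", 2),    # is years_customer >= 2?
        "yes": {"leaf": "low risk"},
        "no": {"leaf": "medium risk"},
    },
    "no": {"leaf": "high risk"},
}

def classify(node, record):
    """Answer the question at each node, starting from the root."""
    while "leaf" not in node:
        feature, threshold = node["question"]
        node = node["yes"] if record[feature] >= threshold else node["no"]
    return node["leaf"]

print(classify(tree, {"income": 60000, "years_customer": 1}))
print(classify(tree, {"income": 30000, "years_customer": 8}))
```

Each path from the root to a leaf corresponds to one "if... then..." rule, which is why tree models are often considered easy to explain.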

Genetic Algorithms (or Evolutionary 
Programming) 

Genetic algorithms, a biologically inspired search 
method, borrow mechanisms of inheritance to find 
solutions. Biological systems demonstrate flexibility, 
robustness and efficiency, and many of them are 
good at adapting to their environments. Some biological 
mechanisms (such as reproduction, crossover and mutation) 
can be used as an approach to computer-based 
problem solving. An initial population of solutions is 
created randomly. Only a fixed number of candidate 
solutions are kept from one generation to the next. Those 
solutions that are less fit tend to die off, similar to the 
biological notion of "survival of the fittest". 
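
The loop described above (random initial population, a fixed number of survivors, crossover and mutation) can be sketched on a toy problem. The problem chosen here, maximizing the number of 1 bits in a string, and every parameter value are assumptions made purely for illustration.

```python
import random

def evolve(bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Maximize the number of 1 bits in a bit list (a toy fitness function)."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    fitness = sum               # fitness of a bit list = count of 1s
    population = [[rng.randint(0, 1) for _ in range(bits)]
                  for _ in range(pop_size)]
    history = []                # best fitness seen at each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # the less fit half dies off
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, bits)          # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]           # occasional mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history

best, history = evolve()
print(sum(best), history[0], history[-1])
```

Because the survivors are carried over unchanged, the best fitness in this sketch can never decrease from one generation to the next.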

Regression Analysis 

This technique involves specifying a functional form 
that best describes the relationship between the explanatory, 
driving or independent variables and the target 
or dependent variable the decision maker is looking to 
explain. Business analysts typically utilize regression to 
identify the quantitative relationships that exist between 
variables and enable them to forecast into the future. 
Regression models also enable analysts to perform 
"what if" or sensitivity analysis. Examples include 
how response rates change if a particular marketing 
or promotional campaign is launched, or how certain 
compensation policies affect employee performance. 
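
A minimal sketch of a "what if" analysis with a straight-line functional form, fitted by ordinary least squares, is given below. The variable names and figures are hypothetical, and a real analysis would normally involve several explanatory variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical campaign data: spend (explanatory) vs. responses (target).
ad_spend = [10, 20, 30, 40]
responses = [25, 45, 65, 85]
slope, intercept = fit_line(ad_spend, responses)

# "What if" question: how many responses if spend is raised to 50?
what_if = slope * 50 + intercept
print(slope, intercept, what_if)
```

The slope is the sensitivity of the target to the explanatory variable, which is what makes this kind of model suitable for sensitivity analysis.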

Logistic Regression 

Logistic regression should be used when you want to 
predict the outcome of a dichotomous (e.g., yes/no) 
variable. This method is used for data that are not normally 
distributed (bell-shaped curve), i.e., categorical 
(coded) data. When a dependent variable can only have 
one of two answers, such as "will graduate" or "will 
not graduate", you cannot get a normal distribution as 
previously discussed. 
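
The following sketch shows how a two-valued outcome can be modeled as a probability between 0 and 1. It fits a one-variable logistic regression by gradient descent; the grade-point data, the function names, and the training settings are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, lr=0.1, epochs=5000):
    """Fit P(label = 1 | x) = sigmoid(w * x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical grade averages; 1 = "will graduate", 0 = "will not graduate".
grades = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
graduated = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(grades, graduated)

p_low = sigmoid(w * 1.0 + b)    # estimated chance for a low average
p_high = sigmoid(w * 3.5 + b)   # estimated chance for a high average
print(round(p_low, 2), round(p_high, 2))
```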

Memory Based Reasoning (MBR) or the 
Nearest Neighbor Method 

To forecast a future situation, or to make a correct 
decision, such systems find the closest past analogs 
of the present situation and choose the solution 
that was the right one in those past situations. The 
drawback of this application is that there is no guarantee 
that the resulting clusters provide any value to the end user. 
The resulting clusters may simply not make any sense with 
regard to the overall business environment. Because 
of the limitations of this technique, no predictive, "what if" 
or variable/target connection can be implemented. 
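
Finding the closest past analog of the present situation can be sketched with a one-nearest-neighbor lookup. The stored cases, their two numeric features (assumed to be already scaled to a comparable range), and the decisions attached to them are hypothetical.

```python
import math

# Hypothetical past situations: (scaled features, decision that proved right).
past_cases = [
    ((0.2, 0.1), "decline offer"),
    ((0.8, 0.9), "accept offer"),
    ((0.7, 0.4), "request review"),
]

def nearest_case(query, cases):
    """Return the stored case whose features are closest to the query."""
    return min(cases, key=lambda case: math.dist(query, case[0]))

features, decision = nearest_case((0.75, 0.85), past_cases)
print(features, decision)
```

The method stores cases rather than fitting a function, which is consistent with the limitation noted above: it offers no explicit link between explanatory variables and the target.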

The key difference between classification and 
segmentation on the one hand and the regression and neural 
network technology mentioned above on the other is the 
inability of the former to perform sensitivity analysis or forecasting. 

Applications and Benefits 

Data mining can be used widely in science and busi- 
ness areas for analyzing databases, gathering data and 
solving problems. In line with Berry and Linoff (2004), 
the benefits data mining can provide for businesses are 
limitless. Here are just a few examples: 

• Identify best prospects and then retain them as 
customers. 

By concentrating marketing efforts only on the best 
prospects, companies will save time and money, 
thus increasing effectiveness of their marketing 
operation. 

• Predict cross-sell opportunities and make recommendations. 

Both traditional and Web-based operations can 
help customers quickly locate products of interest 
to them and simultaneously increase the value of 
each communication with the customers. 

• Learn parameters influencing trends in sales and 
margins. 

In the majority of cases we have no clue on what 
combination of parameters influences operation 
(black box). In these situations data mining is 
the only real option. 

• Segment markets and personalize communications. 

There might be distinct groups of customers, 
patients, or natural phenomena that require different 
approaches in their handling. 

The importance of collecting data that reflect specific 
business or scientific activities to achieve competitive 
advantage is widely recognized. Powerful systems for 
collecting data and managing it in large databases are in 
place in all large and mid-range companies. However, 
the bottleneck of turning this data into information is 
the difficulty of extracting knowledge about the system 
being studied from the collected data. Human analysts 
without special tools can no longer make sense of 
enormous volumes of data that require processing in 
order to make informed business decisions (Kudyba 
& Hoptroff, 2001). 

The applications of data mining are everywhere: 
from biomedical data (Hu and Xu, 2005) to mobile user 
data (Goh and Taniar, 2005); from data warehousing 
(Tjioe and Taniar, 2005) to intelligent web personal- 
ization (Zhou, Cheung, & Fong, 2005); from analyz- 
ing clinical outcome (Hu, Song, Han, Yoo, Prestrud, 
Brennan, & Brooks, 2005) to mining crime patterns 
(Bagui, 2006). 

Potential Pitfalls 

Data Quality 

Data quality means the accuracy and completeness of 
the data. Data quality is a multifaceted issue that represents 
one of the biggest challenges for data mining. 
The data quality problem is of great importance due to 
the emergence of large volumes of data. Many business 
and industrial applications critically rely on the 
quality of information stored in diverse databases 
and data warehouses. As Seifert (2004) emphasized, 
data quality can be affected by the structure and 
consistency of the data being analyzed. Other factors 
like the presence of duplicate records, the lack of data 
standards, the timeliness of updates and human errors 
can significantly impact the effectiveness of complex 
data mining techniques, which are sensitive to subtle 
differences in data. To improve the quality of data it 
is sometimes necessary to clean data by removing 
duplicate records, standardizing the values or symbols 
used in the database to represent certain information, 
accounting for missing data points, removing unneeded 
data fields, and identifying abnormal data points. 
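
Several of the cleaning steps listed above (removing duplicates, standardizing symbols, accounting for missing values) can be sketched in a few lines. The records, the field names, and the `STATE_CODES` lookup table are hypothetical, and a production cleaning routine would need far more rules.

```python
# Hypothetical raw records with a duplicate, mixed codes and a missing age.
raw = [
    {"name": "Ann Lee", "state": "calif.", "age": "34"},
    {"name": "ann lee ", "state": "CA", "age": "34"},
    {"name": "Bob Roy", "state": "N.Y.", "age": ""},
]

# Assumed lookup table mapping variant spellings to one standard symbol.
STATE_CODES = {"calif.": "CA", "ca": "CA", "n.y.": "NY", "ny": "NY"}

def clean(records):
    """Standardize values, mark missing data, and drop duplicate records."""
    seen, cleaned = set(), []
    for record in records:
        name = " ".join(record["name"].split()).title()
        state_raw = record["state"].strip()
        state = STATE_CODES.get(state_raw.lower(), state_raw)
        age = int(record["age"]) if record["age"].strip() else None
        key = (name, state)
        if key in seen:          # same person already kept
            continue
        seen.add(key)
        cleaned.append({"name": name, "state": state, "age": age})
    return cleaned

for row in clean(raw):
    print(row)
```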

Interoperability 

Interoperability refers to the ability of a computer system 
and/or data to work with other systems or data using 
common standards or processes. Until recently, some 
government agencies elected not to gamble with any 
level of open access and operated isolated information 
systems. But isolated data is in many ways useless 
data; bits of valuable information on the Sept. 
11, 2001 hijackers' activities may have been stored 
in a variety of databases at the federal, state, and local 
government levels, but that information was not 
collated and available to those who needed to see it 
to glimpse a complete picture of the growing threat. 
Seifert (2004) therefore suggested that interoperability is a 
critical part of the larger efforts to improve interagency 
collaboration and information sharing. For public data mining, 
interoperability of databases and software is important 
to enable the search and analysis of multiple databases 
simultaneously. This also ensures the compatibility of 
data mining activities of different agencies. 

Standardization 

Standardization allows you to arrange customer information in 
a consistent format. Among the biggest challenges 
are inconsistent abbreviations, misspellings and 
variant spellings. Among the types of data that can be 
appended are demographic, geographic, psychographic, 
behavioristic, event-driven and computed. Matching 
allows you to identify similar data within and across 
your data sources. One of the greatest challenges of 
matching is creating a system that incorporates your 
"business rules," or criteria for determining what 
constitutes a match. 
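
One way to picture standardization followed by matching is sketched below. The abbreviation table and the similarity threshold stand in for the "business rules" and are assumptions; the similarity score comes from the `difflib.SequenceMatcher` class in Python's standard library.

```python
from difflib import SequenceMatcher

# Assumed abbreviation table; a real one would be much longer.
ABBREVIATIONS = {"st": "street", "st.": "street",
                 "ave": "avenue", "ave.": "avenue"}

def standardize(address):
    """Lowercase the address and expand known abbreviations."""
    words = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def is_match(first, second, threshold=0.9):
    """Business rule (assumed): standardized texts at least 90% similar."""
    ratio = SequenceMatcher(None, standardize(first),
                            standardize(second)).ratio()
    return ratio >= threshold

print(standardize("12 Main St."))
print(is_match("12 Main St.", "12 main street"))
print(is_match("12 Main St.", "98 Oak Ave"))
```

Raising the threshold reduces false matches but lets more misspelled duplicates slip through, so the value is itself a business decision.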

Preventing Decay 

The worst enemy of information is time. And informa- 
tion decays at different rates (Berry & Linoff, 2004). 
Cleaning your database is a large accomplishment, but 
it will be short-lived if you fail to implement procedures 
for keeping it clean at the source. According to the 
second law of thermodynamics, ordered systems tend 
to disorder, and a database is a very ordered system. 
Contacts move. Companies grow. Knowledge workers 
enter new customer information incorrectly. 

Some information simply starts out wrong, the result 
of data input errors such as typos, transpositions, 
omissions and other mistakes. These are often easy to 
avoid. Finding ways to successfully implement these 
new technologies into a comprehensive data quality 
program not only increases the quality of your customer 
information, but also saves time, reduces frustration, 
improves customer relations, and ultimately increases 
revenue. Without constant attention to quality, your 
information quality will disintegrate. 

No Generalizations to a Population 

In statistics a population is defined, and then a sample 
is collected to make inferences about the population. 
This means that data cannot be re-used. Statisticians define a 
model before looking at the data. Data mining does not 
attempt generalizations to a population. The database 
is considered as the population. With the computing 
power of modern computers data miners can use the 
whole database, making sampling redundant. Data can 
be re-used. In data mining it is a common practice to 
try hundreds of models and find the one that fits best. 
This makes the interpretation of the significance dif- 
ficult. Machine learning is the data mining equivalent 
to regression. In machine learning we use a training set 
to train the system to find the dependent variable. 

FUTURE TRENDS 
Predictive Analysis 

Augusta (2004) suggested that predictive analysis is one 
of the major future trends for data mining. Rather than 
being just about mining large amounts of data, predictive 
analytics seeks to actually understand the data content 
and to forecast based on the contents of the data. 
However, this requires complex programming and a 
great amount of business acumen. Practitioners are looking to 
do more than simply archive data, which is what data 
mining is currently known for. They want not just to 
process the data but to understand it more clearly, which will 
in turn allow them to make better predictions about 
future behavior. With predictive analytics, the 
program scours the data and tries to form, or help form, 
new hypotheses itself. This shows great promise, and 
would be a boon for industries everywhere. 

Diversity of Application Domains 

Tuzhilin (2006) coined the term "data mining and X" 
phenomenon, where X constitutes a broad range of fields in 
which data mining is used for analyzing the data. This 
has resulted in a process of cross-fertilization of ideas 
generated within this diverse population of researchers 
interacting across the traditional boundaries of their 
disciplines. The next generation of data mining applications 
covers a large number of different fields, from 
traditional businesses to advanced scientific research. 
Kantardzic & Zurada (2005) observed that with new 
tools, methodologies, and infrastructure, this trend of 
diversification will continue each year. 



CONCLUSION 

The emergence of new information technologies has 
given us much more data and many more options for how 
to use it. Yet managing that flood of data, and making 
it useful and available to decision makers, has been a 
major organizational challenge. Data mining allows 
the extraction of diamonds of knowledge from huge 
historical mines of data. It helps to predict outcomes 
of future situations, to optimize business decisions, to 
increase the value of each customer and communica- 
tion, and to improve customer satisfaction. 



The management of data requires understanding and 
a skill set far beyond mere programming. Managing 
data mining is a new revelation as analysts will have to 
sift through more and more information daily due to the 
ever increasing size of the Web and consumer purchases. 
Data mining can have enormous rewards if properly 
used. We have an unprecedented opportunity for the 
future if we can avoid data mining's pitfalls. 

KEY TERMS 

Data Mining: The process of automatically search- 
ing large volumes of data for patterns. Data mining is a 
fairly recent and contemporary topic in computing. 

Data Visualization: A technology for helping users 
to see patterns and relationships in large amounts of 
data by presenting the data in graphical form. 

Explanatory Variables: Variables that explain the variation 
of a particular target variable. Also called driving, 
descriptive, or independent variables; these terms are 
used interchangeably. 

Information Quality Decay: Quality of some 
data goes down when facts about real world objects 
change over time, but those facts are not updated in 
the database. 

Information Retrieval: The art and science of 
searching for information in documents, searching for 
documents themselves, searching for metadata which 
describe documents, or searching within databases, 
whether relational stand-alone databases or hypertext 
networked databases such as the Internet or intranets, 
for text, sound, images or data. 

Machine Learning: Concerned with the devel- 
opment of algorithms and techniques, which allow 
computers to "learn". 

Neural Networks: A technique from the field of artificial 
intelligence (AI) that utilizes predictive algorithms. 

Pattern Recognition: The act of taking in raw data 
and taking an action based on the category of the data. 
It is a field within the area of machine learning. 

Predictive Analysis: Use of data mining techniques, 
historical data, and assumptions about future conditions 
to predict outcomes of events. 

Segmentation: Another major group of techniques within 
the world of data mining, involving technology that 
identifies not only statistically significant relationships 
between explanatory and target variables, but 
also noteworthy segments within variable categories 
that illustrate prevalent impacts on the target 
variable. 

INTRODUCTION 

Information systems were developed in the early 1960s to 
process orders, billings, inventory controls, payrolls, 
and accounts payable. Soon information systems 
research began. Harry Stern started the "Information 
Systems in Management Science" column in Manage- 
ment Science journal to provide a forum for discussion 
beyond just research papers (Banker & Kauffman, 
2004). Ackoff (1967) led the earliest research on man- 
agement information systems for decision-making pur- 
poses and published it in Management Science. Gorry 
and Scott Morton (1971) first used the term 'decision 
support systems' (DSS) in a paper and constructed a 
framework for improving management information 
systems. The topics of information systems and DSS 
research have since diversified. One of the major topics has been 
how to get systems design right. 

As an active component of DSS, which is part of 
today's business intelligence systems, data warehous- 
ing became one of the most important developments 
in the information systems field during the mid-to-late 
1990s. Since the business environment has become more 
global, competitive, complex, and volatile, customer 
relationship management (CRM) and e-commerce ini- 
tiatives are creating requirements for large, integrated 
data repositories and advanced analytical capabili- 
ties. By using a data warehouse, companies can make 
decisions about customer-specific strategies such as 
customer profiling, customer segmentation, and cross-selling 
analysis (Cunningham et al., 2006). Thus how 
to design and develop a data warehouse has become 
an important issue for information systems designers 
and developers. 

This paper presents some of the currently discussed 
development and design methodologies in data 
warehousing, such as the multidimensional model vs. the 
relational ER model, CIF vs. multidimensional methodologies, 
data-driven vs. metric-driven approaches, 
top-down vs. bottom-up design approaches, and data 
partitioning and parallel processing. 



BACKGROUND 

Data warehouse design is a lengthy, time-consuming, 
and costly process. Any miscalculated step can 
lead to a failure. Therefore, researchers have devoted 
considerable effort to the study of design- and 
development-related issues and methodologies. 

Data modeling for a data warehouse is different from 
operational database data modeling. An operational 
system, e.g., online transaction processing (OLTP), is a 
system that is used to run a business in real time, based 
on current data. An OLTP system usually adopts Entity- 
relationship (ER) modeling and application-oriented 
database design (Han & Kamber, 2006). An information 
system, like a data warehouse, is designed to support 
decision making based on historical point-in-time and 
prediction data for complex queries or data mining 
applications (Hoffer et al., 2007). A data warehouse 
schema is viewed as a dimensional model (Ahmad et 
al., 2004; Han & Kamber, 2006; Levene & Loizou, 
2003). It typically adopts either a star or snowflake 
schema and a subject-oriented database design (Han & 
Kamber, 2006). Schema design is the most critical 
part of the design of a data warehouse. 

Many approaches and methodologies have been 
proposed for the design and development of data 
warehouses. Two major data warehouse design methodologies 
have received the most attention. Inmon et al. 
(2000) proposed the Corporate Information Factory 
(CIF) architecture. This architecture, in the design of 
the atomic-level data marts, uses a denormalized 
entity-relationship diagram (ERD) schema. Kimball (1996, 
1997) proposed the multidimensional (MD) architecture. 
This architecture uses a star schema for the atomic-level data 
marts. Which architecture should an enterprise follow? 
Is one better than the other? 

Currently, the most popular data model for data 
warehouse design is the dimensional model (Han & 
Kamber, 2006; Bellatreche, 2006). Some researchers 
call this model the data-driven design model. Artz 
(2006), nevertheless, advocates the metric-driven 
model, which, as another view of data warehouse 
design, begins by identifying key business processes 
that need to be measured and tracked over time in order 
for the organization to function more efficiently. There 
has always been the issue of top-down vs. bottom-up 
approaches in the design of information systems. The 
same is true of data warehouse design. These have been 
puzzling questions for business intelligence architects and 
data warehouse designers and developers. The next section 
extends the discussion of issues related to data 
warehouse design and development methodologies. 



DESIGN AND DEVELOPMENT 
METHODOLOGIES 

Data Warehouse Data Modeling 

Database design is typically divided into a four-stage 
process (Raisinghani, 2000). After requirements are 
collected, conceptual design, logical design, and physi- 
cal design follow. Of the four stages, logical design 
is the key focal point of the database design process 
and most critical to the design of a database. In terms 
of an OLTP system design, it usually adopts an ER 
data model and an application-oriented database de- 
sign (Han & Kamber, 2006). The majority of modern 
enterprise information systems are built using the ER 
model (Raisinghani, 2000). The ER data model is 
commonly used in relational database design, where 
a database schema consists of a set of entities and the 
relationships between them. The ER model is used to 
demonstrate detailed relationships between the data 
elements. It focuses on removing redundancy of data 
elements in the database. The schema is a database 
design containing the logic and showing relationships 
between the data organized in different relations (Ahmad 
et al., 2004). Conversely, a data warehouse requires a 
concise, subject-oriented schema that facilitates online 
data analysis. A data warehouse schema is viewed as a 
dimensional model which is composed of a central fact 
table and a set of surrounding dimension tables, each 
corresponding to one of the components or dimensions 
of the fact table (Levene & Loizou, 2003). Dimensional 
models are oriented toward a specific business process 
or subject. This approach keeps the data elements associated 
with the business process only one join away. 
The most popular data model for a data warehouse is 
the multidimensional model. Such a model can exist in 
the form of a star schema, a snowflake schema, or a 
starflake schema. 

The star schema (see Figure 1) is the simplest database 
structure, containing a fact table in the center, with no 
redundancy, which is surrounded by a set of smaller 
dimension tables (Ahmad et al., 2004; Han & Kamber, 
2006). The fact table is connected with the dimension 
tables using many-to-one relationships to ensure their 
hierarchy. The star schema can provide fast response 
times, allowing database optimizers to work with simple 
database structures in order to yield better execution 
plans. 
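
A small star schema can be sketched with Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical and are not taken from Figure 1; the point is only that every dimension is one join away from the central fact table.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
-- Central fact table: many fact rows point to one row per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_store   VALUES (1, 'Oslo'), (2, 'Lima');
INSERT INTO fact_sales  VALUES (1, 1, 10, 15.0), (1, 2, 4, 6.0),
                               (2, 1, 1, 30.0);
""")

# Revenue by category needs a single join from the fact table.
star_rows = star.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(star_rows)
```

Note that the category text is repeated on every product row; this denormalized dimension is what keeps the query to one join.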

The snowflake schema (see Figure 2) is a variation 
of the star schema model, in which all dimensional 
information is stored in the third normal form, thereby 
further splitting the data into additional tables, while 
keeping the fact table structure the same. To take care of 
hierarchy, the dimension tables are connected with 
sub-dimension tables using many-to-one relationships. 
The resulting schema graph forms a shape similar to a 
snowflake (Ahmad et al., 2004; Han & Kamber, 2006). 
The snowflake schema can reduce redundancy and save 
storage space. However, it can also reduce the effectiveness 
of browsing, and the system performance may be 
adversely impacted. Hence, the snowflake schema is 
not as popular as the star schema in data warehouse design 
(Han & Kamber, 2006). In general, the star schema 
requires greater storage, but it is faster to process than 
the snowflake schema (Kroenke, 2004). 
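
The trade-off described above can be sketched by normalizing one dimension into a sub-dimension table, again using Python's built-in `sqlite3` module. The tables and rows are hypothetical and are not taken from Figure 2.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- Sub-dimension: each category name is stored exactly once.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
-- The fact table keeps the same structure as in a star schema.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue    REAL
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO dim_product  VALUES (1, 'Pen', 1), (2, 'Pencil', 1),
                                (3, 'Lamp', 2);
INSERT INTO fact_sales   VALUES (1, 15.0), (2, 5.0), (3, 30.0);
""")

# Revenue by category now needs two joins instead of one.
snow_rows = snow.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(snow_rows)
```

The category text is no longer repeated per product, which saves space, but the extra join is the kind of overhead that can make browsing and query processing slower.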

The starflake schema (Ahmad et al., 2004), also 
called galaxy schema or fact constellation schema 
(Han & Kamber, 2006), is a combination of the de- 
normalized star schema and the normalized snowflake 
schema (see Figure 3). The starflake schema is used in 
situations where it is difficult to restructure all entities 
into a set of distinct dimensions. It allows a degree 
of crossover between dimensions to answer distinct 
queries (Ahmad et al., 2004). Figure 3 illustrates the 
starflake schema. 

What needs to be differentiated is that the three 
schemas are normally adopted according to differences 
in design requirements. A data warehouse collects 
information about subjects that span the entire organization, 
such as customers, items, sales, etc. Its scope 
is enterprise-wide (Han & Kamber, 2006). The starflake 
schema can model multiple and interrelated subjects. 
Therefore, it is usually used to model an enterprise-wide 
data warehouse. A data mart, on the other hand, 
is similar to a data warehouse but limits its focus to 
a departmental subject of the data warehouse. Its scope 
is department-wide. The star schema and snowflake 
schema are geared towards modeling single subjects. 
Consequently, the star schema or snowflake schema 
is commonly used for data mart modeling, although 
the star schema is more popular and efficient (Han & 
Kamber, 2006). 

Figure 1. Example of a star schema (adapted from Kroenke, 2004) 

Figure 2. Example of a snowflake schema (adapted from Kroenke, 2004) 

Figure 3. Example of a starflake schema (galaxy schema or fact constellation) (adapted from Han & Kamber, 
2006) 

CIF vs. Multidimensional 

Two major design methodologies have received the most 
attention in the design and development of data 
warehouses. Kimball (1996, 1997) proposed multidi- 
mensional (MD) architecture. Inmon, Galemmco, and 
Geiger (2000) proposed the Corporate Information 
Factory (CIF) architecture. Imhoff et al. (2004) made a 
comparison between the two by using important criteria, 
such as scope, perspective, data flow, etc. One of the 
most significant differences between the CIF and MD 
architectures is the definition of data mart. For MD 
architecture, the design of the atomic-level data marts 
is significantly different from the design of the CIF data 
warehouse, while its aggregated data mart schema is 
approximately the same as the data mart in the CIF ar- 
chitecture. MD architecture uses star schemas, whereas 
CIF architecture uses denormalized ERD schema. This 
data modeling difference constitutes the main design 
difference in the two architectures (Imhoff et al., 2004). 
A data warehouse may need both types of data marts 
in the data warehouse bus architecture depending on 
the business requirements. Unlike the CIF architecture, 
there is no physical repository equivalent to the data 
warehouse in the MD architecture. 

The design of the data marts is predominantly
multidimensional in both architectures, but the CIF
architecture is not limited to this design and can support
a much broader set of data mart design techniques.
In terms of scope, both architectures deal with enterprise
scope and business unit scope, with the CIF architecture
putting a higher priority on enterprise scope and the MD
architecture placing a higher priority on business unit
scope. Imhoff et al. (2004) encourage combining
the data modeling techniques of
the two architectural approaches, namely, ERD or
normalization techniques for the data warehouse and
the star schema data model for multidimensional data
marts. A CIF architecture with only a data warehouse
and no multidimensional marts is almost useless, and
a multidimensional data-mart-only environment risks
lacking enterprise integration and support for
other forms of business intelligence analysis.

Data-Driven vs. Metric-Driven 

Currently, the most popular data model for data warehouse
design is the dimensional model (Han & Kamber,
2006; Bellatreche, 2006). In this model, data from
OLTP systems are collected to populate the dimensional
model. Researchers term a data warehouse design
based on this model a data-driven design,
since the information acquisition processes in the data
warehouse are driven by the data made available in the
underlying operational information systems. Another
view of data warehouse design is the metric-driven
view (Artz, 2006), which begins by identifying
key business processes that need to be measured
and tracked over time in order for the organization to
function more efficiently. The advantages of the data-driven
model are that it is more concrete, is evolutionary,
and uses derived summary data. Yet the information
generated from the data warehouse may be meaningless
to the user because the nature of the
derived summary data from OLTP systems may not be
clear. The metric-driven design approach, on the other
hand, begins by defining the key business processes
that need to be measured and tracked over time. After
these key business processes are identified, they are
modeled in a dimensional data model. Further analysis
follows to determine how the dimensional model will
be populated (Artz, 2006).

According to Artz (2006), the data-driven approach to
data warehouse design has little future, since information
derived from a data-driven model is information
about the data set. The metric-driven approach, conversely,
may have some key impacts and implications,
because information derived from a metric-driven
model is information about the organization. The data-driven
approach currently dominates data warehouse design
in organizations. The metric-driven approach, on the other
hand, is still at the research stage and needs practical
application to test its speculated, potentially dramatic
implications.

Top-Down vs. Bottom-Up 

In general, there are two approaches to building a data
warehouse, including its data marts, and one must be
chosen before construction begins: the top-down
approach and the bottom-up approach (Han & Kamber,
2006; Imhoff et al., 2004; Marakas, 2003). The top-down
approach starts with a big picture of the overall,
enterprise-wide design. The data warehouse to be built
is large and integrated, with a focus on integrating the
enterprise data for usage in any data mart from the very
first project (Imhoff et al., 2004). It implies a strategic
rather than an operational perspective of the data. It
serves the proper alignment of an organization's
information systems with its business goals and objectives
(Marakas, 2003). However, this approach is risky
(Ponniah, 2001). In contrast, the bottom-up approach
designs the warehouse around business-unit needs for
operational systems. It starts with experiments and
prototypes (Han & Kamber, 2006). With the bottom-up
approach, departmental data marts are built first, one by one. It
offers faster and easier implementation, favorable return
on investment, and less risk of failure, but has the
drawback of data fragmentation and redundancy. The
focus of the bottom-up approach is to meet unit-specific
needs with minimal regard to the overall enterprise-wide
data requirements (Imhoff et al., 2004).



An alternative to the two approaches discussed above
is a combined approach (Han &
Kamber, 2006), with which "an organization can exploit
the planned and strategic nature of the top-down approach
while retaining the rapid implementation and
opportunistic application of the bottom-up approach"
(p. 129), when the organizational and business
circumstances call for such an approach.

Data Partitioning and Parallel Processing

Data partitioning is the process of decomposing large
tables (fact tables, materialized views, indexes) into
multiple small tables by applying selection operators
(Bellatreche, 2006). A good partitioning scheme
is an essential part of designing a database that will
benefit from parallelism (Singh, 1998). With well-designed
partitioning, significant improvements in
availability, administration, and table scan performance
can be achieved.

Parallel processing is based on a parallel database,
in which multiple processors are in place. Parallel databases
link multiple smaller machines to achieve the
same throughput as a single, larger machine, often with
greater scalability and reliability than single-processor
databases (Singh, 1998). In the context of relational
online analytical processing (ROLAP), by partitioning
the data of a ROLAP schema (star schema or snowflake
schema) among a set of processors, OLAP queries
can be executed in parallel, potentially achieving a
linear speedup and thus significantly improving query
response time (Datta et al., 1998; Tan, 2006). Given the
size of contemporary data warehousing repositories,
multiprocessor solutions are crucial for the massive
computational demands of current and future OLAP
systems (Dehne et al., 2006). Most of the fast computation
algorithms assume that they
can be applied in a parallel processing system
(Dehne, 2006; Tan, 2006). It is also sometimes
necessary to use parallel processing for data mining,
because large amounts of data and massive search efforts
are involved (Turban et al., 2005).
Data partitioning and parallel processing are therefore
two complementary techniques for reducing
query processing cost in data warehousing design
and development (Bellatreche, 2006).
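The way partitioning and parallel processing complement each other can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the function names, and the choice of region as the partitioning key are hypothetical, and a thread pool stands in for the separate processors or machines a real parallel database would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows: (region, product, sales_amount).
FACT_SALES = [
    ("north", "widget", 120.0),
    ("south", "widget", 80.0),
    ("north", "gadget", 200.0),
    ("east", "widget", 50.0),
    ("south", "gadget", 70.0),
    ("east", "gadget", 30.0),
]


def partition_by(rows, key_index):
    """Horizontally partition rows: one selection predicate per key value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)


def total_sales_by_product(rows):
    """Aggregate a single partition independently of all the others."""
    totals = defaultdict(float)
    for _region, product, amount in rows:
        totals[product] += amount
    return dict(totals)


def parallel_total_sales(rows, workers=3):
    """Scan each partition on its own worker, then merge the partial results."""
    partitions = partition_by(rows, key_index=0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(total_sales_by_product, partitions.values()))
    merged = defaultdict(float)
    for partial in partials:
        for product, amount in partial.items():
            merged[product] += amount
    return dict(merged)


result = parallel_total_sales(FACT_SALES)
```

Because each partition is scanned without reference to the others, the only sequential step is the final merge, which is the property that makes a near-linear speedup plausible when partitions are of similar size.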
FUTURE TRENDS

Currently, data warehousing is largely applied in customer
relationship management (CRM). However, there
are to date no agreed-upon standardized rules for
how to design a data warehouse to support CRM, and a
taxonomy of CRM analyses needs to be developed to
determine the factors that affect design decisions for a CRM
data warehouse (Cunningham et al., 2006).

In the data modeling area, developing a more general
solution for modeling data warehouses requires that the
current ER model and dimensional model be extended to the next
level, combining the simplicity of the dimensional model
and the efficiency of the ER model with support for
object-oriented concepts.



CONCLUSION 

Several data warehousing development and design
methodologies have been reviewed and discussed.
The data warehouse data model differentiates itself from
the ER model by its orientation toward specific business
purposes. An enterprise benefits more if the
CIF and MD architectures are both considered in the
design of a data warehouse. Some of the methodologies
have been practiced in the real world and accepted by
today's businesses. Yet newer, more challenging methodologies,
particularly in data modeling and models for physical
data warehousing design, such as the metric-driven
methodology, need to be further researched and developed.
KEY TERMS

Dimensions: The perspectives or entities
with respect to which an organization wants to keep
records (Han & Kamber, 2006, p. 110).

Dimensional Model: A model containing a central 
fact table and a set of surrounding dimension tables, 
each corresponding to one of the components or dimen- 
sions of the fact table. 

Entity-Relationship Data Model: A model that 
represents database schema as a set of entities and the 
relationships among them. 

Fact Table: The central table in a star schema, 
containing the names of the facts, or measures, as well 
as keys to each of the related dimension tables. 

Metric-Driven Design: A data warehousing design
approach that begins by defining the key business processes
that need to be measured and tracked over time.
These processes are then modeled in a dimensional model.

Parallel Processing: The allocation of the operat- 
ing system's processing load across several processors 
(Singh, 1998, p. 209). 

Star Schema: A modeling diagram that contains
a large central table (fact table) and a set of smaller
attendant tables (dimension tables), with each dimension
represented by only one table with a set of attributes.
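The Fact Table and Star Schema definitions can be made concrete with a small sketch using Python's built-in sqlite3 module. All table names, column names, and rows are hypothetical and chosen only for illustration: one central fact table holds the measures and a key to each dimension table, and each dimension is a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension, each with its own attributes.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Fact table: measures plus one key per related dimension table.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_item VALUES (10, 'laptop'), (20, 'phone');
INSERT INTO fact_sales VALUES
    (1, 10, 1, 900.0), (2, 10, 2, 1800.0), (1, 20, 1, 400.0);
""")

# A typical star join: aggregate a measure grouped by a dimension attribute.
rows = conn.execute("""
    SELECT i.item_name, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
```

A snowflake schema would differ only in that a dimension such as dim_item could itself be normalized into further tables, at the cost of extra joins.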
BACKGROUND



There are several ways of building complex distributed
software systems, for example in the form of software
agents. Regardless of the form, there are some common
problems concerning specification versus
execution. One of the problems is the inherent dynamics
of the environment to which many systems are exposed.
The properties of the environment are not known with
any precision at the time of construction. This renders
a specification of the system incomplete by definition.
A traditional software agent is only prepared
to handle situations conceived of and implemented at
compile-time. Even though it can operate in varying
contexts, its decision making abilities are static. One
remedy is to prepare the distributed components for a
truly dynamic environment, i.e. an environment with
changing and somewhat unpredictable conditions. A
rational software agent needs both a representation
of the decision problem at hand and a means of evaluation.
AI has traditionally addressed some parts of this
problem, such as representation and reasoning, but
has hitherto addressed to a lesser degree the decision
making abilities of independent distributed software
components (Ekenberg, 2000a, 2000b). Such decision
making often has to be carried out under severe uncertainty
regarding several parameters. Thus, methods
for independent decision making components should
be able to handle uncertainties in the probabilities and
utilities involved. Such methods have mostly been studied as
means of representation, but are now being developed
into functional theories of decision making suitable for
dynamic use by software agents and other dynamic
distributed components. Such a functional theory will
also benefit analytical decision support systems intended
to aid humans in their decision making. Thus, the generic
term agent below stands for a dynamic software
component as well as a human or a group of humans
assisted by intelligent software.



Ramsey (1926/78) was the first to suggest a theory that
integrated ideas on subjective probability and utility
in presenting (informally) a general set of axioms for
preference comparisons between acts with uncertain
outcomes (probabilistic decisions). Von Neumann and
Morgenstern (1947) established the foundations for a
modern theory of utility. They stated a set of axioms 
that they deemed reasonable to a rational decision- 
maker (such as an agent), and demonstrated that the 
agent should prefer the alternative with the highest 
expected utility, given that she acted in accordance 
with the axioms. This is the principle of maximizing 
the expected utility. Savage (1954/72) published a 
thorough treatment of a complete theory of subjective 
expected utility. Savage, von Neumann, and others 
structured decision analysis by proposing reasonable 
principles governing decisions and by constructing a 
theory out of them. In other words, they (and later many 
others) formulated a set of axioms meant to justify their 
particular attitude towards the utility principle, cf., e.g., 
Herstein and Milnor (1953), Suppes (1956), Jeffrey 
(1965/83), and Luce and Krantz (1971). In classical 
decision analysis, of the types suggested by Savage 
and others, a widespread opinion is that utility theory 
captures the concept of rationality. 

Following Raiffa (1968), probabilistic decision models
are nowadays often given a tree representation (see Fig.
1). A decision tree consists of a root, representing a
decision; a set of event nodes, representing some kind
of uncertainty; and consequence nodes, representing
possible final outcomes. In the figure, the decision is
a square, the events are circles, and the final consequences
are triangles. Events unfold from left to right until final
consequences are reached. There may also be more than
one decision to make, in which case the sub-decisions
are made before the main decision.
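As a minimal sketch of how such a tree is evaluated under the principle of maximizing the expected utility, the following Python fragment rolls a tree back from the consequences to the root. The dictionary layout, the node kinds, and the example numbers are assumptions made for illustration; they are not taken from the cited works.

```python
def expected_utility(node):
    """Roll back a tree: a consequence yields its utility, an event yields the
    probability-weighted mean of its branches, a decision yields its best option."""
    kind = node["kind"]
    if kind == "consequence":
        return node["utility"]
    if kind == "event":
        return sum(p * expected_utility(child) for p, child in node["branches"])
    if kind == "decision":
        return max(expected_utility(child) for child in node["options"].values())
    raise ValueError(f"unknown node kind: {kind!r}")


def best_option(decision):
    """Name of the alternative with the highest expected utility."""
    options = decision["options"]
    return max(options, key=lambda name: expected_utility(options[name]))


# Hypothetical tree: one decision (square), one event (circle),
# three consequences (triangles).
TREE = {
    "kind": "decision",
    "options": {
        "launch": {
            "kind": "event",
            "branches": [
                (0.6, {"kind": "consequence", "utility": 100.0}),
                (0.4, {"kind": "consequence", "utility": -50.0}),
            ],
        },
        "wait": {"kind": "consequence", "utility": 30.0},
    },
}

choice = best_option(TREE)
value = expected_utility(TREE)
```

Here "launch" has expected utility 0.6 * 100 + 0.4 * (-50) = 40, which exceeds the 30 obtained by waiting. Nested decision nodes are resolved first by the recursion, matching the remark that sub-decisions are made before the main decision.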
Figure 1. Decision tree

In decision trees, probability distributions are assigned
in the form of weights (numbers) in the probability
nodes as measures of the uncertainties involved.
Obviously, such a numerically precise approach puts
heavy demands on the input capability of the agent.
The shortcomings of this representation are many
and have to be compensated for; see, e.g., Ekenberg
(2000a). Among other things, the question has been
raised whether people are capable of providing the
input information that utility theory requires (cf., e.g.,
Fischhoff et al., 1983). For instance, most people
cannot clearly distinguish between probabilities ranging
roughly from 0.3 to 0.7 (Shapira, 1995). Similar
problems arise in the case of artificial agents, since
utility-based artificial agents usually base their reasoning
on human assessments, for instance in the form of
induced preference functions. The so-called reactive
agents, for which this does not hold true, have not been
put to use in dynamic domains involving uncertainty
(cf., e.g., Russell & Norvig, 1995). Furthermore,
even if an agent were able to discriminate between
different probabilities, complete, adequate,
and precise information is very often missing.

Consequently, during recent years of rather intense
research activity, several alternative approaches have
emerged. In particular, first-order approaches, i.e., those based
on sets of probability measures, upper and lower probabilities,
and interval probabilities, have prevailed.
A main class of such models has focused on
expressing probabilities in terms of intervals. In 1953,
the concept of capacities was introduced (Choquet,
1953/54). This representation approach was further
developed by Huber (1973) and Huber and Strassen (1973).
Capacities have subsequently been used for modelling
imprecise probabilities as intervals (capacities of order
2; Denneberg, 1994). Since the beginning of the 1960s,
the use of first-order (interval-valued) probability functions,
by means of classes of probability measures, has
been integrated into classical probability theory by, e.g.,
Smith (1961) and Good (1962). Similarly, Dempster
(1967) investigated a framework for modelling upper
and lower probabilities, which was further developed
by Shafer (1976), where a representation of belief in
states or events was provided. Within the AI community,
the Dempster-Shafer approach has received a good
deal of attention. However, the formalism seems to
be too strong to be an adequate representation of belief
(Weichselberger & Pohlman, 1990).
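To illustrate how a Dempster-Shafer representation yields lower and upper probabilities, the sketch below computes the standard belief and plausibility of an event from a mass assignment over subsets of a frame. The frame, the masses, and the function names are hypothetical examples, not material from the cited sources.

```python
def belief(mass, event):
    """Lower probability: total mass of the focal sets contained in the event."""
    return sum(m for focal, m in mass.items() if focal <= event)


def plausibility(mass, event):
    """Upper probability: total mass of the focal sets that intersect the event."""
    return sum(m for focal, m in mass.items() if focal & event)


# Hypothetical mass assignment over the frame {"rain", "sun"}; the mass on the
# whole frame expresses ignorance that is not committed to either outcome.
MASS = {
    frozenset({"rain"}): 0.5,
    frozenset({"sun"}): 0.2,
    frozenset({"rain", "sun"}): 0.3,
}

rain = frozenset({"rain"})
rain_interval = (belief(MASS, rain), plausibility(MASS, rain))
```

For this assignment the event "rain" receives the interval [0.5, 0.8]: the uncommitted mass of 0.3 is exactly the width of the interval.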

Other representations in terms of upper and lower
probabilities have been proposed by, i.a., Hodges and
Lehmann (1952), Hurwicz (1951), Wald (1950), Kyburg
(1961), Levi (1974, 1980), Walley (1991), Danielson
and Ekenberg (1998, 2007), and Ekenberg et al. (2001).
Upper and lower previsions have also been investigated
by various authors. For instance, Shafer et al. (2003)
suggest a theory for how to understand subjective
probability estimates based on Walley (1991). A few
approaches have also been based on logic, e.g., Nilsson
(1986), who develops methods for dealing with sentences
involving upper and lower probabilities. This kind of
approach has been pursued further by, among others,
Wilson (1999).
SECOND-ORDER REPRESENTATIONS

A common characteristic of the first-order representations
above is that they typically do not include all of
the strong axioms of probability theory and thus do
not require an agent to model and evaluate a decision
situation using precise probability (and, in some cases,
value) estimates. An advantage of representations using
upper and lower probabilities is that they do not require
taking probability distributions into consideration.
On the other hand, it is then often difficult to devise a
reasonable decision rule that finds an admissible alternative
out of a set of alternatives and at the same time
fully reflects the intentions of an agent (or its owner).
Since the probabilities and values are represented by
intervals, the expected value range of an alternative will
also be an interval. In effect, the procedure retains all
alternatives with overlapping expected utility intervals,
even if the overlap is very small. Furthermore, such
representations do not admit discrimination between different beliefs
in different values within the intervals.
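The overlap problem can be reproduced with a small sketch. It assumes a single event node per alternative, interval bounds on each probability and value, and the usual constraint that the probabilities sum to one; the greedy allocation used to find the extreme expected values is a standard technique for this simple case and is not claimed to be the algorithm of any cited paper. All names and numbers are hypothetical.

```python
def _extreme(prob_intervals, values, worst):
    """Extreme expected value over probability vectors inside the intervals
    that sum to one: start at the lower bounds, then hand the remaining mass
    to the lowest values first (worst case) or the highest first (best case)."""
    lows = [lo for lo, _ in prob_intervals]
    highs = [hi for _, hi in prob_intervals]
    if sum(lows) > 1 + 1e-12 or sum(highs) < 1 - 1e-12:
        raise ValueError("probability intervals are inconsistent")
    probs = list(lows)
    remaining = 1.0 - sum(lows)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=not worst)
    for i in order:
        extra = min(highs[i] - lows[i], remaining)
        probs[i] += extra
        remaining -= extra
    return sum(p * v for p, v in zip(probs, values))


def expected_value_bounds(prob_intervals, value_intervals):
    """Return the (lower, upper) expected value of one alternative."""
    lower = _extreme(prob_intervals, [lo for lo, _ in value_intervals], worst=True)
    upper = _extreme(prob_intervals, [hi for _, hi in value_intervals], worst=False)
    return lower, upper


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two hypothetical alternatives, each with two outcomes.
alt_1 = expected_value_bounds([(0.2, 0.5), (0.5, 0.8)], [(0, 10), (40, 60)])
alt_2 = expected_value_bounds([(0.4, 0.6), (0.4, 0.6)], [(30, 35), (45, 55)])
undecided = overlaps(alt_1, alt_2)
```

With these inputs the first alternative evaluates to roughly [20, 50] and the second to roughly [36, 47], so the intervals overlap and a purely first-order rule cannot rank the two.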

All of these representations face the same trade-off.
Zero-order approaches (i.e. fixed numbers representing
probability and utility assessments) require
unreasonable precision in the representation of input
data. Even though the evaluation of and discrimination
between alternatives becomes simple, the results are
often not a good representation of the problem, and
sensitivity analyses are hard to carry out for more than
a few parameters at a time. First-order approaches
(e.g. intervals) offer a remedy to the representation
problem by allowing imprecision in the representation
of probability and utility assessments,
reflecting the uncertainty inherent in most real-life decision
problems faced by agents. But this permissiveness
makes evaluation and discrimination much harder because of
overlap in the evaluation results of different options
for the agent, i.e. the worst case for one alternative is
no better than the best case for another alternative or
vice versa, rendering a total ranking order between the
alternatives impossible to achieve. The trade-off between
realistic representation and discriminative power
has not been solved within the above paradigms. For
a solution, one must look at second-order approaches,
which allow both imprecision in representation and power
of admissible discrimination.

Approaches for extending the interval representation
using distributions over classes of probability
and value measures have been developed into various
hierarchical models, such as second-order probability
theory (Gärdenfors & Sahlin, 1982, 1983; Ekenberg &
Thorbiörnson, 2001; Ekenberg et al., 2005). Gärdenfors
and Sahlin consider global distributions of beliefs,
but restrict themselves to interval representations and
only to probabilities, not utilities. Other limitations
are that they neither investigate the relation between
global and local distributions nor introduce
methods for determining the consistency of user-asserted
sentences. The same applies to Hodges and
Lehmann (1952), Hurwicz (1951), and Wald (1950).
Some more specialized approaches have recently been
suggested, such as Jaffray (1999), Nau (2002), and
Utkin and Augustin (2003). In general, very few have
addressed the problems of computational complexity
when solving decision problems involving such
estimates. Needless to say, it is important for dynamic
agents to be able to determine, in a reasonably short
time, how various evaluative principles rank the given
options in a decision situation.

Ekenberg et al. (2006) and Danielson et al. (2007) 
provide a framework for how second-order repre- 
sentation can be systematically utilized to put belief 
information into use in order to efficiently discriminate 
between alternatives that evaluate into overlapping 
expected utility intervals when using first-order interval 
evaluations. The belief information is in the form of 
a joint belief distribution, specified as marginal belief 
distributions projected on each parameter. It is shown 
that regardless of the form of belief distributions over 
the originating intervals, the distributions resulting 
from multiplications and additions have forms very 
different from their components. This warp of result- 
ing belief demonstrates that analyses using only first- 
order information such as upper and lower bounds are 
not taking all available information into account. The 
method is based on the agent's belief in different parts 
of the intervals, expressed or implied, being taken into 
consideration. It can be said to represent the beliefs in 
various sub-parts of the feasible intervals. As a result, 
total lack of overlap is not required for successful dis- 
crimination between alternatives. Rather, an overlap by 
interval parts carrying little belief mass, i.e. representing 
a small part of the agent's belief, is allowed. Then, the 
non-overlapping parts can be thought of as being the 
core of the agent's appreciation of the decision situa- 
tion, thus allowing discrimination. 
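The warp effect can be observed with a simple Monte Carlo sketch. It assumes, purely for illustration, uniform belief over a probability interval [0, 1] and over a value interval [0, 1], samples the two independently, and ignores the normalization constraints that a full decision tree would impose; the function name and parameters are hypothetical.

```python
import random


def product_histogram(samples=100_000, bins=4, seed=0):
    """Share of belief mass falling in each equal-width bin of [0, 1] for the
    product of two independent, uniformly distributed quantities."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(samples):
        product = rng.random() * rng.random()
        counts[min(int(product * bins), bins - 1)] += 1
    return [count / samples for count in counts]


shares = product_histogram()
```

Although both inputs are flat, the resulting belief is far from flat: analytically, the mass below t is t - t ln t, so about 60 percent lies in the lowest quarter of the interval and only a few percent in the highest. The interval bounds of the product, [0, 1], carry none of this information.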
There are essentially three ways of evaluating (i.e.
making a decision in) a second-order agent decision
problem. The first way (centroid analysis) is to use
the centroid as the best single-point representative of
the distributions. The centroid is additive and multiplicative.
Thus, the centroid of the distribution of
expected utility is the expected utility of the centroids
of the projections. A centroid analysis gives a good
overview of a decision situation. The second way
(contraction analysis) is to use the centroid as a focal
point (contraction point) towards which the intervals
are shrunk while studying the overlap in first-order
expected utility intervals. The third way (distribution
analysis) is more elaborate, involving the analysis
of the resulting distributions of expected utility and
calculating the fraction of belief overlapping between
the alternatives being evaluated.
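The first two ways can be sketched as follows. To keep the example short it assumes precise probabilities and interval-valued utilities given as (low, centroid, high) triples, a linear contraction of each interval towards its centroid, and a search over a fixed grid of contraction levels. These are simplifying assumptions for illustration, and all names and numbers are hypothetical.

```python
def contract(triple, level):
    """Shrink [low, high] towards the centroid: level 0 keeps the interval,
    level 1 collapses it to the centroid."""
    low, centroid, high = triple
    return (low + level * (centroid - low), centroid, high - level * (high - centroid))


def eu_bounds(probs, utilities, level=0.0):
    """(lower, centroid, upper) expected utility at a given contraction level."""
    contracted = [contract(u, level) for u in utilities]
    lower = sum(p * u[0] for p, u in zip(probs, contracted))
    centre = sum(p * u[1] for p, u in zip(probs, contracted))
    upper = sum(p * u[2] for p, u in zip(probs, contracted))
    return lower, centre, upper


def separating_contraction(alt_a, alt_b, steps=100):
    """Smallest grid level at which the two expected utility intervals stop
    overlapping, or None if they overlap even at full contraction."""
    for step in range(steps + 1):
        level = step / steps
        lo_a, _, hi_a = eu_bounds(*alt_a, level=level)
        lo_b, _, hi_b = eu_bounds(*alt_b, level=level)
        if hi_a < lo_b or hi_b < lo_a:
            return level
    return None


# Two hypothetical alternatives: (probabilities, utility triples).
ALT_A = ([0.5, 0.5], [(20, 40, 60), (10, 30, 50)])
ALT_B = ([0.5, 0.5], [(0, 20, 40), (10, 20, 30)])

centroid_a = eu_bounds(*ALT_A)[1]
centroid_b = eu_bounds(*ALT_B)[1]
level = separating_contraction(ALT_A, ALT_B)
```

The centroid analysis ranks the first alternative higher (35 against 20), and the contraction analysis shows how much of the intervals must be cut away before that ranking also holds for the intervals themselves. A small separating level suggests a robust ranking, whereas a level close to 1 suggests that the alternatives are hard to tell apart.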



FUTURE TRENDS 

During recent years, activity within the area of
imprecise probabilities has increased substantially
(IPP), and special conferences (ISIPTA) and journals
(IJAR) are now dedicated to this theme. Second-order
theories will in the future be fully developed into functional
theories of decision making suitable for dynamic
use by distributed software components. Algorithms for
efficient evaluation by agents using at least the first
two ways of analysis above will be developed.



CONCLUSION 

In this article, we discuss various approaches to proba- 
bilistic decision making in agents. We point out that 
theories incorporating second-order belief can provide 
more powerful discrimination to the agent (software 
agent or human being) when handling aggregations of 
interval representations, such as in decision trees or 
probabilistic networks, and that interval estimates (up- 
per and lower bounds) in themselves are not complete. 
This applies to all kinds of decision trees and proba- 
bilistic networks since they all use multiplications for 
the evaluations. The key idea is to use the information 
available in efficient evaluation of decision structures. 
Using only interval estimates often does not provide 
enough discrimination power for the agent to gener- 
ate a preference order among alternatives considered. 



Second-order methods are not just nice theories, but 
should be taken into account to provide efficient deci- 
sion methods for agents, in particular when handling 
aggregations of imprecise representations as is the case 
in decision trees or probabilistic networks. 



REFERENCES 

Choquet, G. (1953/54). Theory of Capacities, Ann. 
Inst. Fourier 5, 131-295. 

Danielson, M. & Ekenberg, L. (1998). A Framework 
for Analysing Decisions under Risk, European Journal 
of Operational Research 104(3), 474-484. 

Danielson, M. & Ekenberg, L. (2007). Computing Upper 
and Lower Bounds in Interval Decision Trees, European 
Journal of Operational Research 181, 808-816. 

Danielson, M., Ekenberg, L., & Larsson, A. (2007).
Belief Distribution in Decision Trees, to appear in
International Journal of Approximate Reasoning, DOI
10.1016/j.ijar.2006.09.012.

Dempster, A.P. (1967). Upper and Lower Probabilities 
Induced by a Multivalued Mapping, Annals of Math- 
ematical Statistics xxxviii, 325-339. 

Denneberg, D. (1994). Non-Additive Measure and 
Integral, Kluwer Academic Publishers. 

Ekenberg, L. (2000a). Risk Constraints in Agent Based 
Decisions, A. Kent & J. G. Williams eds., Encyclopaedia 
of Computer Science and Technology 23:48, 263-280, 
Marcel Dekker. 

Ekenberg, L. (2000b). The Logic of Conflicts between 
Decision Making Agents, Journal of Logic and Com- 
putation 10(4), 583-602. 

Ekenberg, L., Boman, M., & Linneroth-Bayer, J. (2001).
General Risk Constraints, Journal of Risk Research
4(1), 31-47.

Ekenberg, L., Danielson, M., & Thorbiörnson, J. (2006).
Multiplicative Properties in Evaluation of Decision
Trees, International Journal of Uncertainty, Fuzziness
and Knowledge-Based Systems 14(3), 293-316.

Ekenberg, L. & Thorbiörnson, J. (2001). Second-Order
Decision Analysis, International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems
9(1), 13-38.






Decision Making in Intelligent Agents 



Ekenberg, L., Thorbiörnson, J., & Baidya, T. (2005).
Value Differences using Second-order Distributions,
International Journal of Approximate Reasoning 38(1),
81-97.

Fischhoff, B., Goitein, B., & Shapira, Z. (1983). Subjective
Expected Utility: A Model of Decision Making,
Decision Making under Uncertainty, R.W. Scholz,
ed., Elsevier Science Publishers B.V. North-Holland,
183-207.

Gärdenfors, P. & Sahlin, N.E. (1982). Unreliable Probabilities,
Risk Taking, and Decision Making, Synthese
53, 361-386.

Gärdenfors, P. & Sahlin, N.E. (1983). Decision Making
with Unreliable Probabilities, British Journal of Mathematical
and Statistical Psychology 36, 240-251.

Good, I.J. (1962). Subjective Probability as the Measure
of a Non-measurable Set, Logic, Methodology and the
Philosophy of Science, Suppes, Nagel, & Tarski, eds.,
Stanford University Press, 319-329.

Herstein, I.N. & Milnor, J. (1953). An Axiomatic 
Approach to Measurable Utility, Econometrica 21, 
291-297. 

Hodges, J.L. & Lehmann, E.L. (1952). The Use of 
Previous Experience in Reaching Statistical Decisions, 
The Annals of Mathematical Statistics 23, 396-407. 

Huber, P.J. (1973). The Case of Choquet Capacities 
in Statistics, Bulletin of the International Statistical 
Institute 45, 181-188. 

Huber, P.J. & Strassen, V. (1973). Minimax Tests and
the Neyman-Pearson Lemma for Capacities, Annals
of Statistics 1, 251-263.

Hurwicz, L. (1951). Optimality Criteria for Decision
Making under Ignorance, Cowles Commission Discussion
Paper 370.

International Journal of Approximate Reasoning 
(IJAR), http://www.sciencedirect.com. 

Imprecise Probability Project (IPP), http://ippserv.rug. 
ac.be/home/ipp.html. 

ISIPTA Conferences, http://www.sipta.org. 

Jaffray, J-Y. (1999). Rational Decision Making With
Imprecise Probabilities, Proceedings of ISIPTA '99.



Jeffrey, R. (1965/83). The Logic of Decision, 2nd ed., 
University of Chicago Press. (First edition 1965) 

Kyburg, H.E. (1961). Probability and the Logic of
Rational Belief, Connecticut: Wesleyan University
Press.

Levi, I. (1974). On Indeterminate Probabilities, The 
Journal of Philosophy 71, 391-418. 

Levi, I. (1980). The Enterprise of Knowledge, MIT 
Press. 

Luce, R.D. & Krantz, D. (1971). Conditional Expected
Utility, Econometrica 39, 253-271.

Nau, R.F. (2002). The aggregation of imprecise prob- 
abilities, Journal of Statistical Planning and Inference 
105, 265-282. 

von Neumann, J. & Morgenstern, O. (1947). Theory of 
Games and Economic Behaviour, 2nd ed., Princeton 
University Press. 

Nilsson, N. (1986). Probabilistic Logic, Artificial
Intelligence 28, 71-87.

Raiffa, H. (1968). Decision Analysis, Addison-Wesley.

Ramsey, F.P. (1926/78). Truth and Probability, Foundations:
Essays in Philosophy, Logic, Mathematics and
Economics, ed. Mellor, 58-100, Routledge and Kegan
Paul. (Originally from 1926)

Russell, S.J. & Norvig, P. (1995). Artificial Intelligence:
A Modern Approach, Prentice-Hall.

Savage, L. (1954/72). The Foundations of Statistics, 
2nd ed., John Wiley and Sons. (First edition 1954) 

Shafer, G., Gillet, P.R. & Scherl, R.B. (2003). Subjective
Probability and Lower and Upper Prevision: A New
Understanding, Proceedings of ISIPTA '03.

Shafer, G. (1976). A Mathematical Theory of Evidence,
Princeton University Press.

Shapira, Z. (1995). Risk Taking: A Managerial Perspective,
Russell Sage Foundation.

Smith, C.A.B. (1961). Consistency in Statistical Inference
and Decision, Journal of the Royal Statistical
Society, Series B xxiii, 1-25.






Suppes, P. (1956). The Role of Subjective Probability 
and Utility Maximization, Proceedings of the Third 
Berkeley Symposium on Mathematical Statistics and 
Probability 1954-55 5, 113-134. 

Utkin, L.V. & Augustin, T. (2003). Decision Making
with Imprecise Second-Order Probabilities, Proceedings
of ISIPTA '03.

Wald, A. (1950). Statistical Decision Functions, John
Wiley and Sons.

Walley, P. (1991). Statistical Reasoning with Imprecise
Probabilities, Chapman and Hall.

Walley, P. (1997). Statistical inferences based on a 
second-order possibility distribution, International 
Journal of General Systems 9, 337-383. 

Weichselberger, K. & Pohlman, S. (1990). A Method- 
ology for Uncertainty in Knowledge-Based Systems, 
Springer-Verlag. 

Wilson, N. (1999). A Logic of Extended Probability,
Proceedings of ISIPTA '99.



KEY TERMS

Admissible Alternative: Given a decision tree and two alternatives $A_i$ and $A_j$, $A_i$ is at least as good as $A_j$ iff $E(A_i) - E(A_j) \geq 0$, where $E(A_i)$ is the expected value of $A_i$, for all consistent variable assignments for the probabilities and values. $A_i$ is better than $A_j$ iff $A_i$ is at least as good as $A_j$ and $E(A_i) - E(A_j) > 0$ for some consistent variable assignments for the probabilities and values. $A_i$ is admissible iff no other $A_j$ is better.

Centroid: Given a belief distribution $F$ over a cube $B$, the centroid $F_c$ of $F$ is

$$F_c = \int_B x\, F(x)\, dV_B(x),$$

where $V_B$ is some k-dimensional Lebesgue measure on $B$.

Decision Tree: A decision tree consists of a root node, representing a decision, a set of intermediate (event) nodes, representing some kind of uncertainty, and consequence nodes, representing possible final outcomes. Usually, probability distributions are assigned in the form of weights in the probability nodes as measures of the uncertainties involved.

Expected Value: Given a decision tree with $r$ alternatives $A_i$ for $i = 1, \ldots, r$, the expression

$$E(A_i) = \sum_{i_1} p_{i i_1} \sum_{i_2} p_{i i_1 i_2} \cdots \sum_{i_{m-1}} p_{i i_1 i_2 \ldots i_{m-2} i_{m-1}} \sum_{i_m} p_{i i_1 i_2 \ldots i_{m-1} i_m}\, v_{i i_1 i_2 \ldots i_{m-1} i_m},$$

where $p_{\ldots i_j \ldots}$, $j \in \{1, \ldots, m\}$, denote probability variables and $v_{\ldots i_j \ldots}$ denote value variables, is the expected value of alternative $A_i$.

Joint Belief Distribution: Let a unit cube be represented by $B = (b_1, \ldots, b_k)$. By a joint belief distribution over $B$, we mean a positive distribution $F$ defined on the unit cube $B$ such that

$$\int_B F(x)\, dV_B(x) = 1,$$

where $V_B$ is some k-dimensional Lebesgue measure on $B$.

Marginal Belief Distribution: Let a unit cube $B = (b_1, \ldots, b_k)$ and $F \in BD(B)$ be given. Furthermore, let $B_i^- = (b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_k)$. Then

$$f_i(x_i) = \int_{B_i^-} F(x)\, dV_{B_i^-}(x)$$

is a marginal belief distribution over the axis $b_i$.

Projection: Let $B = (b_1, \ldots, b_k)$ and $A = (b_{i_1}, \ldots, b_{i_s})$, $i_j \in \{1, \ldots, k\}$, be unit cubes. Furthermore, let $F \in BD(B)$, and let

$$f_A(x) = \int_{B \setminus A} F(x)\, dV_{B \setminus A}(x).$$

Then $f_A$ is the projection of $F$ on $A$. A projection of a belief distribution is also a belief distribution.
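
The Expected Value definition above can be illustrated with a minimal sketch. It assumes precise (point-valued) probabilities and values, whereas the key terms allow any consistent assignment of the variables. The nested-list tree encoding and the names `expected_value`, `alternative_1` and `alternative_2` are hypothetical, chosen only for this illustration.

```python
from math import isclose

def expected_value(node):
    """Expected value of a (sub)tree.

    A consequence node is a plain number (its value); an event node is a
    list of (probability, child) pairs. Both encodings are assumptions
    made for this sketch.
    """
    if isinstance(node, (int, float)):
        return float(node)
    return sum(p * expected_value(child) for p, child in node)

# Hypothetical alternatives under one decision node.
alternative_1 = [(0.6, [(0.5, 100), (0.5, 0)]), (0.4, 20)]
alternative_2 = [(1.0, 35)]

ev_1 = expected_value(alternative_1)
ev_2 = expected_value(alternative_2)
print(ev_1, ev_2)
```

With these illustrative numbers the first alternative has the higher expected value, so under this single assignment it is at least as good as the second.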
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INTRODUCTION 

Many organisations, nowadays, have developed their 
own databases, in which a large amount of valuable 
information, e.g., customers' personal profiles, is 
stored. Such information plays an important role in 
organisations' development processes as it can help 
them gain a better understanding of customers' needs. 
To effectively extract such information and identify 
hidden relationships, there is a need to employ intel- 
ligent techniques, for example, data mining. 

Data mining is a process of knowledge discovery 
(Roiger & Geatz, 2003). There is a wide range of 
data mining techniques, one of which is decision trees. 
Decision trees, which can be used for the purposes of 
classifications and predictions, are a tool to support 
decision making (Lee et al., 2007). As a decision 
tree can accurately classify data and make effective 
predictions, it has already been employed for data 
analyses in many application domains. In this paper, 
we attempt to provide an overview of the applications 
that decision trees can support. In particular, we focus 
on business management, engineering, and health-care 
management. 

The structure of the paper is as follows. Firstly, Sec- 
tion 2 provides the theoretical background of decision 
trees. Section 3 then moves to discuss the applications 
that decision trees can support, with an emphasis on 
business management, engineering, and health-care 
management. For each application, how decision trees 
can help identify hidden relationships is described. 
Subsequently, Section 4 provides a critical discussion 



of limitations and identifies potential directions for 
future research. Finally, Section 5 presents the conclu- 
sions of the paper. 



BACKGROUND 

Decision trees are one of the most widely used classifica- 
tion and prediction tools. This is probably because the 
knowledge discovered by a decision tree is illustrated 
in a hierarchical structure, with which the discovered 
knowledge can easily be understood by individuals 
even though they are not experts in data mining (Chang 
et al., 2007). A decision tree model can be created in 
several ways using existing decision tree algorithms. 
In order to effectively adopt such algorithms, there is a 
need to have a solid understanding of the processes of 
creating a decision tree model and to assess the suitability 
of the decision tree algorithms used. These issues are 
described in the subsections below. 

Processes of Model Development 

A common way to create a decision tree model is to employ a top-down, recursive, divide-and-conquer approach (Greene & Smith, 1993). Such a modelling approach places the most significant attribute at the top level as the root node and the least significant attributes at the bottom level as leaf nodes (Chien et al., 2007). Each path between the root node and a leaf node can be interpreted as an 'if-then' rule, which can be used for making predictions (Chien et al., 2007; Kumar & Ravi, 2007).

To create a decision tree model on the basis of the above-mentioned approach, the modelling process can be divided into three stages: (1) tree growing, (2) tree pruning, and (3) tree selection.
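
The reading of each root-to-leaf path as an 'if-then' rule can be sketched as follows. This is a minimal illustration rather than any cited author's implementation: the tuple-based tree encoding, the toy attributes and the function name `extract_rules` are all assumptions.

```python
def extract_rules(node, conditions=()):
    """Return one 'if-then' rule (as a string) per root-to-leaf path.

    Assumed encoding: an internal node is (attribute, {value: subtree});
    a leaf is a plain class label.
    """
    if not isinstance(node, tuple):  # leaf reached: emit the accumulated rule
        premise = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"if {premise} then {node}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

# Hypothetical toy tree, not taken from any study cited in the text.
toy_tree = ("outlook", {
    "sunny": ("windy", {"yes": "stay in", "no": "play"}),
    "rainy": "stay in",
})

rules = extract_rules(toy_tree)
for rule in rules:
    print(rule)
```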



Tree Growing 

The initial stage of creating a decision tree model is tree growing, which includes two steps: tree merging and tree splitting. At the beginning, the non-significant predictor categories and the significant categories within a dataset are grouped together (tree merging). As the tree grows, impurities within the model will increase. Since impurities may reduce the accuracy of the model, there is a need to purify the tree. One possible way to do so is to separate the impurities into different leaves and ramifications (tree splitting) (Chang, 2007).
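
The text does not name an impurity measure. As one common choice, the sketch below assumes Shannon entropy and information gain (the criterion associated with ID3) to decide which attribute purifies a node most when it is split. The toy data and the names `entropy` and `information_gain` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: the label depends only on "price".
rows = [
    {"price": "low", "urgent": "yes"},
    {"price": "low", "urgent": "no"},
    {"price": "high", "urgent": "yes"},
    {"price": "high", "urgent": "no"},
]
labels = ["frequent", "frequent", "rare", "rare"]

gains = {a: information_gain(rows, labels, a) for a in ("price", "urgent")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Splitting on the attribute with the largest gain leaves the purest child nodes; here that is "price", since each of its branches contains a single class.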

Tree Pruning 

Tree pruning, the key element of the second stage, removes irrelevant splitting nodes (Kirkos et al., 2007). Removing irrelevant nodes helps reduce the chance of creating an over-fitted tree. Such a procedure is particularly useful because an over-fitted tree model may misclassify data in real-world applications (Breiman et al., 1984).

Tree Selection 

The final stage of developing a decision tree model is 
tree selection. At this stage, the created decision tree 
model will be evaluated by either using cross-validation 
or a testing dataset (Breiman et al., 1984). This stage 
is essential as it can reduce the chances of misclassify- 
ing data in real world applications, and consequently, 
minimise the cost of developing further applications. 
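
The cross-validation option mentioned above can be sketched as below. This is a minimal illustration under stated assumptions: folds are formed by simple interleaving, and a one-attribute majority-vote "stump" stands in for a full decision tree learner. The names `k_fold_indices`, `cross_validate`, `train_stump` and `predict_stump` are hypothetical.

```python
from collections import Counter, defaultdict

def k_fold_indices(n_items, k):
    """Split range(n_items) into k disjoint, interleaved test folds."""
    return [list(range(i, n_items, k)) for i in range(k)]

def cross_validate(rows, labels, k, train, predict):
    """Mean accuracy over k folds, each fold held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train(train_rows, train_labels)
        hits = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

def train_stump(rows, labels, attribute="x"):
    """Stand-in learner: majority class for each value of one attribute."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[row[attribute]][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def predict_stump(model, row, attribute="x"):
    return model.get(row[attribute])

# Hypothetical data in which the label is fully determined by "x".
rows = [{"x": i % 2} for i in range(10)]
labels = ["a" if row["x"] == 0 else "b" for row in rows]

mean_accuracy = cross_validate(rows, labels, 5, train_stump, predict_stump)
print(mean_accuracy)
```

Because every record is used for testing exactly once, the averaged accuracy is a less optimistic estimate than accuracy measured on the training data itself.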

Suitability of Decision Tree Algorithms 

A review of the existing literature shows that the most widely used decision tree algorithms include the Iterative Dichotomiser 3 (ID3) algorithm, the C4.5 algorithm, the Chi-squared Automatic Interaction Detector (CHAID) algorithm, and the Classification and Regression Tree (CART) algorithm. There are some differences amongst these algorithms, one of which is the capability of modelling different types of data. As a dataset may be constructed from different types of data, e.g., categorical data, numerical data, or a combination of both, there is a need to use a decision tree algorithm that supports the particular types of data used in the dataset. All of the above-mentioned algorithms support the modelling of categorical data, whilst only the C4.5 algorithm and the CART algorithm can be used for the modelling of numerical data (see Table 1). This difference can also be used as a guideline for the selection of a suitable decision tree algorithm. The other difference amongst these algorithms is the process of model development, especially at the stages of tree growing and tree pruning. In terms of the former, the ID3 and C4.5 algorithms split a tree model into as many ramifications as necessary, whereas the CART algorithm supports only binary splits. Regarding the latter, the pruning mechanisms within the C4.5 and CART algorithms support the removal of insignificant nodes and ramifications, whereas the CHAID algorithm halts the tree growing process before the training data are over-fitted (see Table 1).



DECISION TREE APPLICATIONS 

Business Management 

In recent decades, many organisations have created 
their own databases to enhance their customer services. 
Decision trees are a possible way to extract useful 
information from databases and they have already 
been employed in many applications in the domain 
of business and management. In particular, decision 
tree modelling is widely used in customer relationship 
management and fraud detection, which are presented 
in subsections below. 

Customer Relationship Management 

A frequently used approach to manage customers' 
relationships is to investigate how individuals ac- 
cess online services. Such an investigation is mainly 
performed by collecting and analyzing individuals' 
usage data and then providing recommendations based 
on the extracted information. Lee et al. (2007) apply 
decision trees to investigate the relationships between 
the customers' needs and preferences and the success 
of online shopping. In their study, the frequency of
using online shopping is used as a label to classify users 
into two categories: (a) users who rarely used online 
shopping and (b) users who frequently used online 
shopping. In terms of the former, the model suggests 
that the time customers need to spend in a transaction 
and how urgent customers need to purchase a product 
are the most important factors which need to be con- 
sidered. With respect to the latter, the created model 
indicates that price and the degree of human resources 
involved (e.g. the requirements of contacts with the 
employees of the company in having services) are the 
most important factors. The created decision trees also 
suggest that the success of online shopping depends 
heavily on the frequency of customers' purchases and 
the price of the products. Such findings help 
organisations understand their customers' 
needs and preferences. 

Fraudulent Statement Detection 

Another widely used business application is the detec- 
tion of Fraudulent Financial Statements (FFS). Such 
an application is particularly important because the 
existence of FFS may result in reducing the government's
tax income (Spathis et al., 2003). A traditional 
way to identify FFS is to employ statistical methods. 
However, it is difficult to discover all hidden informa- 
tion due to the necessity of making a huge number of 
assumptions and predefining the relationships among 
the large number of variables in a financial statement. 



Previous research has shown that creating a decision tree is a possible way to address this issue, as it can consider all variables during the model development process. Kirkos et al. (2007) created a decision tree model to identify and detect FFS. In their study, 76 Greek manufacturing firms were selected and their published financial statements, including balance sheets and income statements, were collected for modelling purposes. The created tree model correctly classified all non-fraud cases and 92% of the fraud cases. Such a finding indicates that decision trees can make a significant contribution to the detection of FFS owing to their high accuracy.

Engineering 

Another important application domain that decision 
trees can support is engineering. In particular, decision 
trees are widely used in energy consumption analysis and fault 
diagnosis, which are described in the subsections below. 

Energy Consumption 

Energy consumption concerns how much electricity has been used by individuals. The investigation of energy consumption has become an important issue as it helps utility companies identify the amount of energy needed. Although many existing methods can be used for the investigation of energy consumption, decision trees appear to be preferred. This is because the hierarchical structure provided by decision trees is useful for presenting deep levels of information and insight. For instance, Tso and Yau (2007) create a decision tree model to identify the relationships between a household and its electricity consumption in Hong Kong. Findings from their tree model illustrate that the number of household members is the most decisive factor in energy consumption in summer, and that the number of air-conditioners and the size of a flat are jointly the second most important factors. In addition, their tree model identifies households with four or more members and a flat size larger than 817 ft² as the highest electricity consumption group. On the other hand, households with fewer than four members and no air-conditioners are the lowest electricity consumption group. Such findings not only provide deeper insight into electricity consumption within an area but also give electricity companies guidance on when they need to generate more electricity.

Table 1. Characteristics of different decision tree algorithms

Decision tree algorithm | Data types | Numerical data splitting method | Possible tool
CHAID (Kass, 1980) | Categorical | N/A | SPSS Answer Tree (SPSS Inc, 2007)
ID3 (Quinlan, 1986) | Categorical | No restrictions | WEKA (Ian and Eibe, 2005)
C4.5 (Quinlan, 1993) | Categorical, numerical | No restrictions | WEKA (Ian and Eibe, 2005)
CART (Breiman et al., 1984) | Categorical, numerical | Binary splits | CART 5.0 (Salford Systems, 2004)

Fault Diagnosis 

Another widely used application in the engineering domain is the detection of faults, especially the identification of a faulty bearing in rotary machinery. This is probably because a bearing is one of the most important components that directly influence the operation of a rotary machine. To detect a faulty bearing, engineers tend to measure the vibration and acoustic emission (AE) signals emanating from the rotary machine. However, the measurement involves a number of variables, some of which may be less relevant to the investigation. Decision trees are a possible tool to remove such irrelevant variables, as they can be used for feature selection. Sugumaran and Ramachandran (2007) create a decision tree model to identify the features that may significantly affect the investigation of a faulty bearing. Through feature selection, three attributes were chosen to discriminate the faulty conditions of a bearing, i.e., the minimum value of the vibration signal, the standard deviation of the vibration signal, and kurtosis. The chosen attributes were subsequently used for creating another decision tree model. Evaluations of this model show that more than 95% of the testing dataset was correctly classified. Such high accuracy suggests that the removal of insignificant attributes within a dataset is another contribution of decision trees.



Healthcare Management 

As decision tree modelling can be used for making predictions, an increasing number of studies investigate the use of decision trees in health-care management. For instance, Chang (2007) has developed a decision tree model on the basis of 516 pieces of data to explore the hidden knowledge located within the medical history of developmentally-delayed children. The created model identifies that the majority of illnesses will result in delays in cognitive development, language development, and motor development, with accuracies of 77.3%, 97.8%, and 88.6% respectively. Such findings can assist health-care professionals in intervening early with developmentally-delayed children so as to help them catch up with their peers in development and growth. Another example of health-care management can be found in Delen et al. (2005). In their study, a decision tree is created to predict the survivability of breast cancer patients. The classification accuracy of their decision tree is 93.6%, which indicates that the created tree is highly accurate for predicting the survivability of breast cancer patients. These studies suggest that decision trees are a useful tool for discovering and exploring hidden information in health-care management.



FUTURE TRENDS 

The application domains mentioned above demonstrate 
that decision trees are a very useful tool for data analysis. 
However, there are still many limitations which we need 
to be aware of and which should be addressed in future work. 

Reliability of Findings 

Although decision trees are a powerful tool for data analysis, some data are misclassified in decision tree models. A possible way to address this issue is to exploit the extracted knowledge through human-computer collaboration. In other words, experts from different domains use their domain knowledge to filter findings from the created model. By doing so, irrelevant findings can be removed manually. However, the drawback of such a method is the large investment required, as it involves the cost and time of experts from different domains.

KEY TERMS 

Attributes: Pre-defined variables in a dataset. 

Classification: An allocation of items or objects to 
classes or categories according to their features. 

Customer Relationship Management: A dynamic 
process to manage the relationships between a company 
and its customers, including collecting, storing and 
analysing customers' information. 

Data Mining: Also known as knowledge discovery 
in databases (KDD), a process of knowledge 
discovery by analysing data and extracting information 
from a dataset using machine learning techniques. 

Decision Tree: A predictive model which can be 
visualized in a hierarchical structure using leaves and 
ramifications. 

Decision Tree Modelling: The process of creating 
a decision tree model. 

Fault Diagnosis: An action of identifying a mal- 
functioning system based on observing its behaviour. 

Fraud Detection Management: The detection of 
fraud, especially fraud in financial statements 
or business transactions, so as to reduce the risk 
of loss. 

Healthcare Management: The act of preventing, 
treating and managing illness, including the preservation 
of mental and physical well-being through the services 
provided by health professionals. 

Prediction: A statement or a claim that a particular 
event will happen in the future. 

INTRODUCTION 

The initial work introducing Dempster-Shafer (D-S) theory is found in Dempster (1967) and Shafer (1976). Since its introduction the very name has caused confusion, and a more general term often used is belief functions (the two terms are used interchangeably here). Nguyen (1978) pointed out, soon after its introduction, that the rudiments of D-S theory can be considered through distributions of random sets. A further comparison has been with traditional Bayesian theory, of which D-S theory has been considered a generalisation (Schubert, 1994). Cobb and Shenoy (2003) direct their attention to the comparison of D-S theory and the Bayesian formulation. Their conclusions are that the two have the same expressive power, but that one technique cannot simply take the role of the other.

The association with artificial intelligence (AI) is clearly outlined in Smets (1990), who at the time acknowledged that the AI community had started to show interest in what they call the Dempster-Shafer model. It is of interest that, even then, he highlighted that there was confusion over which version of D-S theory was being considered. D-S theory was employed in an event-driven integration reasoning scheme in Xia et al. (1997), associated with automated route planning, which they view as a very important branch in applications of AI. Liu (1999) investigated Gaussian belief functions and specifically considered their proposed computation scheme and its potential usage in AI and statistics. Huang and Lees (2005) apply a D-S theory model in natural-resource classification, comparing it with two other AI models.

Wadsworth and Hall (2007) considered D-S theory in combination with other techniques to investigate site-specific critical loads for conservation agencies. Pertinently, they outline its positioning with respect to AI (p. 400):

The approach was developed in the AI (artificial intelligence) community in an attempt to develop systems that could reason in a more human manner and particularly the ability of human experts to "diagnose" situations with limited information.

This statement is pertinent here, since emphasis 
within the examples later given is more towards the 
general human decision making problem and the han- 
dling of ignorance in AI. Dempster and Kong (1988) 
investigated how D-S theory fits in with being an artifi- 
cial analogy for human reasoning under uncertainty. 

An example problem is considered, the murder of 
Mr. White, where witness evidence is used to classify 
the belief in the identification of an assassin from 
considered suspects. The numerical analyses presented 
exposit a role played by D-S theory, including the dif- 
ferent ways it can act on incomplete knowledge. 



BACKGROUND 

The background section to this article covers the basic formulations of D-S theory, as well as certain developments. Formally, D-S theory is based on a finite set of $p$ elements $\Theta = \{s_1, s_2, \ldots, s_p\}$, called a frame of discernment. A mass value is a function $m: 2^\Theta \to [0, 1]$ such that $m(\emptyset) = 0$ (where $\emptyset$ is the empty set) and:

$$\sum_{s \in 2^\Theta} m(s) = 1$$

(where $2^\Theta$ is the power set of $\Theta$). Any proper subset $s$ of the frame of discernment $\Theta$, for which $m(s)$ is non-zero, is called a focal element and represents the exact belief in the proposition depicted by $s$. The notion of a proposition here is the collection of the hypotheses represented by the elements in a focal element.

In the original formulation of D-S theory, all mass values assigned from a single piece of evidence sum to unity and there is no belief in the empty set. In the case of the Transferable Belief Model (TBM), a fundamental development on the original D-S theory (see Smets and Kennes, 1994), a non-zero mass value can be assigned to the empty set, allowing $m(\emptyset) > 0$. The set of mass values associated with a single piece of evidence is called a body of evidence (BOE), often denoted $m(\cdot)$. The mass value $m(\Theta)$ assigned to the frame of discernment is considered the amount of ignorance within the BOE, since it represents the level of exact belief that cannot be discerned to any proper subsets of $\Theta$.
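
The definition of a BOE can be made concrete with a small sketch. It assumes a BOE is stored as a dictionary from focal elements (frozensets) to mass values. The function names `is_valid_boe` and `ignorance`, the `allow_empty` flag and the three-element frame are hypothetical, introduced only for this illustration.

```python
from math import isclose

def is_valid_boe(boe, frame, allow_empty=False):
    """Check the mass-value conditions for a BOE over `frame`.

    `allow_empty=True` relaxes the m(empty set) = 0 condition, as the
    TBM does; this flag is an assumption of the sketch.
    """
    if any(not focal <= frame for focal in boe):
        return False  # every focal element must be a subset of the frame
    if any(mass < 0 for mass in boe.values()):
        return False
    if not allow_empty and boe.get(frozenset(), 0.0) > 0:
        return False
    return isclose(sum(boe.values()), 1.0)

def ignorance(boe, frame):
    """Mass assigned to the whole frame of discernment."""
    return boe.get(frame, 0.0)

frame = frozenset("abc")                       # hypothetical frame {a, b, c}
boe = {frozenset("ab"): 0.7, frame: 0.3}       # original D-S style BOE
tbm_boe = {frozenset(): 0.4, frozenset("c"): 0.6}  # TBM-style BOE

print(is_valid_boe(boe, frame), ignorance(boe, frame))
print(is_valid_boe(tbm_boe, frame), is_valid_boe(tbm_boe, frame, allow_empty=True))
```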

D-S theory also provides a method to combine the BOEs from different pieces of evidence, using Dempster's rule of combination. This rule assumes that these pieces of evidence are independent; then the function $(m_1 \oplus m_2): 2^\Theta \to [0, 1]$, defined by:

$$(m_1 \oplus m_2)(x) = \begin{cases} 0, & x = \emptyset, \\ \dfrac{\sum_{s_1 \cap s_2 = x} m_1(s_1)\, m_2(s_2)}{1 - \sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2)}, & x \neq \emptyset, \end{cases} \qquad (1)$$

is a mass value, where $s_1$ and $s_2$ are focal elements from the BOEs $m_1(\cdot)$ and $m_2(\cdot)$, respectively. The denominator part of the combination expression includes:

$$\sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2)$$

that measures the level of conflict in the combination process (Murphy, 2000). It is the existence of the denominator part in this combination rule that separates D-S theory (includes it) from TBM (excludes it). Benouhiba and Nigro (2006) view this difference as whether considering the conflict mass:

$$\sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2)$$

as a further form of ignorance mass is an acceptable point of view.

D-S theory, along with TBM, also differs from the Bayesian approach in that it does not necessarily produce final results. Moreover, partial answers are present in the final BOE produced (through the combination of evidence), including focal elements with more than one element, unlike the Bayesian approach where probabilities on only individual elements would be accrued. This restriction of the Bayesian approach to consider singleton elements is clearly understood through the 'Principle of Insufficient Reason', see Beynon et al. (2000) and Beynon (2002, 2005).

To enable final results to be created with D-S theory, a number of concomitant functions exist, including:

i) The Belief function,

$$Bel(s_i) = \sum_{s_j \subseteq s_i} m(s_j)$$

for all $s_i \subseteq \Theta$, representing the confidence that a proposition $y$ lies in $s_i$ or any subset of $s_i$,

ii) The Plausibility function,

$$Pls(s_i) = \sum_{s_j \cap s_i \neq \emptyset} m(s_j)$$

for all $s_i \subseteq \Theta$, representing the extent to which we fail to disbelieve $s_i$,

iii) The Pignistic function (see Smets and Kennes, 1994),

$$BetP(s_i) = \sum_{s_j \subseteq \Theta,\ s_j \neq \emptyset} m(s_j)\, \frac{|s_i \cap s_j|}{|s_j|}$$

for all $s_i \subseteq \Theta$, which shares each mass value equally amongst the elements of its focal element.



From the definitions given above, the Belief function is cautious of the ignorance incumbent in the evidence, whereas the Plausibility function is more inclusive of its presence. The Pignistic function acts more like a probability function, partitioning levels of exact belief (mass) amongst the elements of the focal element it is associated with.

A non-specificity measure $N(m(\cdot))$ within D-S theory was introduced by Dubois and Prade (1985); the formula is defined as

$$N(m(\cdot)) = \sum_{s_i \in 2^\Theta} m(s_i) \log_2 |s_i|,$$

where $|s_i|$ is the number of elements in the focal element $s_i$. Hence, $N(m(\cdot))$ is considered the weighted average of the focal elements, with $m(\cdot)$ the degree of evidence focusing on $s_i$, while $\log_2 |s_i|$ indicates the lack of specificity of this evidential claim. The general range of this measure is $[0, \log_2 |\Theta|]$ (given in Klir and Wierman, 1998), where $|\Theta|$ is the number of elements in the frame of discernment $\Theta$.

Main Thrust 

The main thrust of this article is an exposition of the utilisation of D-S theory. The small example problem considered here relates to the assassination of Mr. White, of which many variants exist. An adaptation of the version of this problem given in Smets (1990) is discussed, in a more numerical form here, which allows interpretation with both D-S theory and its development, TBM.

There are three individuals who are suspects for the murder of Mr. White, namely Henry, Tom and Sarah. Within D-S theory they make up the frame of discernment, $\Theta = \{\text{Henry, Tom, Sarah}\}$. There are two witnesses who have information regarding the murder of Mr. White:

Witness 1 is 80% sure that the murderer was a man. It follows that the concomitant body of evidence (BOE), defined $m_1(\cdot)$, includes $m_1(\{\text{Henry, Tom}\}) = 0.8$. Since we know nothing about the remaining mass value, it is considered ignorance and allocated to $\Theta$, hence $m_1(\{\text{Henry, Tom, Sarah}\}) = 0.2$ $(= m_1(\Theta))$.

Witness 2 is 60% confident that Henry was leaving on a jet plane when the murder occurred, so a BOE defined $m_2(\cdot)$ includes $m_2(\{\text{Tom, Sarah}\}) = 0.6$ and $m_2(\{\text{Henry, Tom, Sarah}\}) = 0.4$.

The aggregation of these two sources of information (evidence from the two witnesses), using Dempster's combination rule (1), is based on the intersection and multiplication of the focal elements and mass values from the BOEs $m_1(\cdot)$ and $m_2(\cdot)$, see Table 1.

Table 1. Intermediate combination of BOEs, $m_1(\cdot)$ and $m_2(\cdot)$

$m_1(\cdot)$ \ $m_2(\cdot)$ | {Tom, Sarah}, 0.6 | $\Theta$, 0.4
{Henry, Tom}, 0.8 | {Tom}, 0.48 | {Henry, Tom}, 0.32
$\Theta$, 0.2 | {Tom, Sarah}, 0.12 | $\Theta$, 0.08

In Table 1, the intersection and multiplication of the focal elements and mass values from the BOEs $m_1(\cdot)$ and $m_2(\cdot)$ are presented. The new focal elements found are all non-empty, so the level of conflict is

$$\sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2) = 0,$$

and the resultant BOE, defined $m_3(\cdot)$, can be taken directly from the results in Table 1:

$m_3(\{\text{Tom}\}) = 0.48$, $m_3(\{\text{Henry, Tom}\}) = 0.32$, $m_3(\{\text{Tom, Sarah}\}) = 0.12$ and $m_3(\{\text{Henry, Tom, Sarah}\}) = 0.08$.
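
The intersection-and-multiplication procedure of Dempster's rule (1) can be sketched in a few lines. The dictionary representation of a BOE (frozenset focal elements mapped to mass values) and the function name `dempster_combine` are assumptions made for this sketch. The witness BOEs are those stated in the text.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two BOEs with Dempster's rule; return (combined BOE, conflict)."""
    joint = {}
    conflict = 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        x = s1 & s2                      # intersection of focal elements
        if x:
            joint[x] = joint.get(x, 0.0) + v1 * v2
        else:
            conflict += v1 * v2          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the rule is undefined")
    # Normalise by the denominator 1 - conflict.
    return {x: v / (1.0 - conflict) for x, v in joint.items()}, conflict

THETA = frozenset({"Henry", "Tom", "Sarah"})
m1 = {frozenset({"Henry", "Tom"}): 0.8, THETA: 0.2}   # Witness 1
m2 = {frozenset({"Tom", "Sarah"}): 0.6, THETA: 0.4}   # Witness 2

m3, conflict = dempster_combine(m1, m2)
for focal, mass in m3.items():
    print(sorted(focal), round(mass, 2))
```

Because no pair of focal elements has an empty intersection here, the conflict is zero and the normalisation leaves the products unchanged.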



Within this combination of evidence ($m_3(\cdot)$), the mass value assigned to ignorance ($m_3(\{\text{Henry, Tom, Sarah}\}) = 0.08$) is less than that present in the original constituent BOEs, as expected when combining evidence using D-S theory. To further exposit the effect of the combination of evidence, the respective non-specificity values associated with the BOEs shown here are calculated. For the two witnesses, with their BOEs $m_1(\cdot)$ and $m_2(\cdot)$:

$$N(m_1(\cdot)) = \sum_{s_i \in 2^\Theta} m_1(s_i) \log_2 |s_i|$$
$$= m_1(\{\text{Henry, Tom}\}) \log_2 |\{\text{Henry, Tom}\}| + m_1(\{\text{Henry, Tom, Sarah}\}) \log_2 |\{\text{Henry, Tom, Sarah}\}|$$
$$= 0.8 \times \log_2 2 + 0.2 \times \log_2 3 = 1.117,$$

and $N(m_2(\cdot)) = 1.234$. The non-specificity associated with the combined BOE is similarly calculated, and found to be $N(m_3(\cdot)) = 0.567$. The values further demonstrate the effect of the combination process, namely a lower level of concomitant non-specificity associated with the BOE $m_3(\cdot)$, found from the combination of the other two BOEs $m_1(\cdot)$ and $m_2(\cdot)$.
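
The non-specificity values quoted above can be checked with a short sketch. The dictionary representation of a BOE and the function name `non_specificity` are assumptions for illustration. The mass values are those given in the text.

```python
from math import log2

def non_specificity(boe):
    """N(m) = sum of m(s) * log2|s| over the non-empty focal elements."""
    return sum(mass * log2(len(focal)) for focal, mass in boe.items() if focal)

THETA = frozenset({"Henry", "Tom", "Sarah"})
m1 = {frozenset({"Henry", "Tom"}): 0.8, THETA: 0.2}
m2 = {frozenset({"Tom", "Sarah"}): 0.6, THETA: 0.4}
m3 = {
    frozenset({"Tom"}): 0.48,
    frozenset({"Henry", "Tom"}): 0.32,
    frozenset({"Tom", "Sarah"}): 0.12,
    THETA: 0.08,
}

n1, n2, n3 = (round(non_specificity(m), 3) for m in (m1, m2, m3))
print(n1, n2, n3)
```

The singleton {Tom} contributes nothing to the measure because $\log_2 1 = 0$, which is why moving mass onto it lowers the non-specificity of the combined BOE.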

To allow a comparison of the combination process under D-S theory with that under TBM, the evidence from witness 2 is changed slightly, becoming:

Witness 2 is 60% confident that Henry and Tom were leaving on a jet plane when the murder occurred, so a BOE defined $m_2(\cdot)$ includes $m_2(\{\text{Sarah}\}) = 0.6$ and $m_2(\{\text{Henry, Tom, Sarah}\}) = 0.4$.

The difference between the two 'Witness 2' statements is that, in the second statement, Tom is now also considered to be leaving on the jet plane with Henry. The new intermediate calculations when combining the evidence from the two witnesses are shown in Table 2.

In the intermediate results in Table 2, there is an occasion where the intersection of two focal elements from $m_1(\cdot)$ and $m_2(\cdot)$ results in an empty set ($\emptyset$). It follows that

$$\sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2) = 0.48,$$

so the value $1 - 0.48 = 0.52$ forms the denominator in the expression for the combination of this evidence (see (1)), and the resultant BOE, here defined $m_4(\cdot)$, is:

$m_4(\{\text{Henry, Tom}\}) = 0.32/0.52 = 0.615$, $m_4(\{\text{Sarah}\}) = 0.12/0.52 = 0.231$ and $m_4(\{\text{Henry, Tom, Sarah}\}) = 0.08/0.52 = 0.154$.

Comparison of the results in the BOEs $m_3(\cdot)$ and $m_4(\cdot)$ shows how the mass value associated with $m_3(\{\text{Tom}\}) = 0.48$ has been spread across the three focal elements which make up the $m_4(\cdot)$ BOE.



Table 2. Intermediate combination of BOEs, $m_1(\cdot)$ and $m_2(\cdot)$, with the new 'Witness 2' evidence

$m_1(\cdot)$ \ $m_2(\cdot)$ | {Sarah}, 0.6 | $\Theta$, 0.4
{Henry, Tom}, 0.8 | $\emptyset$, 0.48 | {Henry, Tom}, 0.32
$\Theta$, 0.2 | {Sarah}, 0.12 | $\Theta$, 0.08



This approach to countering the conflict possibly present when combining evidence is often viewed as inappropriate, with TBM introduced to offer a solution. Hence, using the second 'Witness 2' statement, the resultant combined BOE under TBM, defined $m_5(\cdot)$, is taken directly from Table 2:

$m_5(\emptyset) = 0.48$, $m_5(\{\text{Henry, Tom}\}) = 0.32$, $m_5(\{\text{Sarah}\}) = 0.12$ and $m_5(\{\text{Henry, Tom, Sarah}\}) = 0.08$.

The difference between the BOEs $m_4(\cdot)$ and $m_5(\cdot)$ is the inclusion of the focal element $\emptyset$, with $m_5(\emptyset) = 0.48$, allowed when employing TBM. Beyond the difference in the calculations made between D-S theory and TBM, the important point is the interpretation of $m_5(\emptyset)$ in TBM. Put succinctly, following Smets (1990), $m_5(\emptyset) = 0.48$ corresponds to the amount of belief allocated to none of the three suspects; taken further, it is the proposition that none of the three suspects is the murderer. Since the three individuals are only suspects, the murderer might be someone else. If the initial problem had said that one of the three individuals is the murderer, then the D-S theory approach should be adhered to.
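
The contrast between the two treatments of conflict can be sketched as follows, using the second 'Witness 2' statement. The function names `conjunctive_combine` and `normalise` and the dictionary representation of a BOE are assumptions made for illustration; the unnormalised result corresponds to the TBM reading and the normalised one to the original D-S reading.

```python
def conjunctive_combine(m1, m2):
    """Unnormalised combination: conflict mass stays on the empty set (TBM)."""
    combined = {}
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            x = s1 & s2
            combined[x] = combined.get(x, 0.0) + v1 * v2
    return combined

def normalise(boe):
    """Remove the empty-set mass and rescale the rest (original D-S theory)."""
    conflict = boe.get(frozenset(), 0.0)
    return {s: v / (1.0 - conflict) for s, v in boe.items() if s}

THETA = frozenset({"Henry", "Tom", "Sarah"})
m1 = {frozenset({"Henry", "Tom"}): 0.8, THETA: 0.2}       # Witness 1
m2_new = {frozenset({"Sarah"}): 0.6, THETA: 0.4}          # second 'Witness 2'

m5 = conjunctive_combine(m1, m2_new)   # TBM: keeps m5(empty set) = 0.48
m4 = normalise(m5)                     # D-S: divides by 1 - 0.48 = 0.52

print({tuple(sorted(s)): round(v, 3) for s, v in m5.items()})
print({tuple(sorted(s)): round(v, 3) for s, v in m4.items()})
```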

Returning to the analysis of the original witness statements, the partial results presented so far do not identify explicitly which suspect is most likely to have undertaken the murder of Mr. White. To achieve explicit results, the three measures $Bel(s_i)$, $Pls(s_i)$ and $BetP(s_i)$ previously defined are considered on singleton focal elements ($s_i$ are individual suspects):

$$Bel(\{\text{Henry}\}) = \sum_{s_j \subseteq \{\text{Henry}\}} m_3(s_j) = 0.00,$$

similarly $Bel(\{\text{Tom}\}) = 0.48$ and $Bel(\{\text{Sarah}\}) = 0.00$.

$$Pls(\{\text{Henry}\}) = \sum_{s_j \cap \{\text{Henry}\} \neq \emptyset} m_3(s_j) = m_3(\{\text{Henry, Tom}\}) + m_3(\{\text{Henry, Tom, Sarah}\}) = 0.32 + 0.08 = 0.40,$$

similarly $Pls(\{\text{Tom}\}) = 1.00$ and $Pls(\{\text{Sarah}\}) = 0.20$.

$$BetP(\{\text{Henry}\}) = \sum_{s_j \subseteq \Theta,\ s_j \neq \emptyset} m_3(s_j)\, \frac{|\{\text{Henry}\} \cap s_j|}{|s_j|}$$
$$= m_3(\{\text{Henry, Tom}\})\, \frac{|\{\text{Henry}\} \cap \{\text{Henry, Tom}\}|}{|\{\text{Henry, Tom}\}|} + m_3(\Theta)\, \frac{|\{\text{Henry}\} \cap \Theta|}{|\Theta|}$$
$$= 0.16 + 0.027 = 0.187,$$

similarly $BetP(\{\text{Tom}\}) = 0.727$ and $BetP(\{\text{Sarah}\}) = 0.087$.

In this small example, all three measures identify the
suspect Tom as having the most evidence pointing
to him being the murderer of Mr. White.
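
The three measures can be checked numerically with the sketch below. It rests on assumptions that should be read as such: the helper names `bel`, `pls` and `betp` are hypothetical, and the focal element {Tom, Sarah} with mass 0.12 is not quoted in this part of the text but is inferred here so that the masses sum to one and reproduce the reported values Pls({Tom}) = 1.00 and Pls({Sarah}) = 0.20. The pignistic function divides by 1 - m(empty set), which changes nothing here because no mass is assigned to the empty set.

```python
THETA = frozenset({"Henry", "Tom", "Sarah"})

# BOE m3; the {Tom, Sarah} entry is an inferred assumption (see lead-in).
m3 = {
    frozenset({"Tom"}): 0.48,
    frozenset({"Henry", "Tom"}): 0.32,
    frozenset({"Tom", "Sarah"}): 0.12,
    THETA: 0.08,
}


def bel(m, a):
    """Belief: total mass of non-empty focal elements contained in a."""
    return sum(v for s, v in m.items() if s and s <= a)


def pls(m, a):
    """Plausibility: total mass of focal elements that intersect a."""
    return sum(v for s, v in m.items() if s & a)


def betp(m, a):
    """Pignistic probability: each mass is shared equally among its members."""
    norm = 1.0 - m.get(frozenset(), 0.0)
    return sum(v * len(s & a) / len(s) for s, v in m.items() if s) / norm


for suspect in ("Henry", "Tom", "Sarah"):
    single = frozenset({suspect})
    print(suspect,
          round(bel(m3, single), 3),
          round(pls(m3, single), 3),
          round(betp(m3, single), 3))
```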



FUTURE TRENDS 

Dempster-Shafer (D-S) theory is a methodology that
offers an alternative, possibly a more general one,
to the assignment of frequency-based probability to
events; in its case, levels of subjective belief are assigned.
However, the issues surrounding its position with respect to other
methodologies, such as the better known Bayesian
approach, could be viewed as stifling its utilisation. The
important point to remember when considering D-S
theory is that it is a general methodology that requires
subsequent pertinent utilisation when deriving nascent
techniques.

Future work needs to aid in finding the position of
D-S theory relative to the other methodologies. That
is, unlike methodologies like fuzzy set theory, D-S
theory cannot be employed straight on top of
existing techniques to create a D-S type derivative of
the technique. Such derivatives, for example, could
operate on incomplete data, including when there are
missing values, the reason for their being missing possibly
being ignorance.



CONCLUSION 

Dempster-Shafer (D-S) theory, and its general
developments, continues to form the underlying structure
of an increasing number of specific techniques that
attempt to solve certain problems within the context of
uncertain reasoning. As mentioned in the future trends
section, the difficulty with D-S theory is that it needs
to be considered at the start of work on creating a new
technique for analysis. It follows that articles like this,
which show the rudimentary workings of D-S theory,
allow researchers the opportunity to see its operation,
and so may contribute to its further utilisation.



KEY TERMS

Belief: In Dempster-Shafer theory, the level of
confidence that a proposition lies in
a focal element or any subset of it.

Body of Evidence: In Dempster-Shafer theory, a 
series of focal elements and associated mass values. 

Focal Element: In Dempster-Shafer theory, a set 
of hypotheses with positive mass value in a body of 
evidence. 

Frame of Discernment: In Dempster-Shafer theory, 
the set of all hypotheses considered. 

Dempster-Shafer Theory: A general methodology,
also known as the theory of belief functions,
whose rudiments are closely associated with uncertain
reasoning.

Ignorance: In Dempster-Shafer theory, the level of 
mass value not discernible among the hypotheses. 

Mass Value: In Dempster-Shafer theory, the level 
of exact belief in a focal element. 

Non-Specificity: In Dempster-Shafer theory, the 
weighted average of the focal elements' mass values in 
a body of evidence, viewed as a species of a higher un- 
certainty type, encapsulated by the term ambiguity. 

Plausibility: In Dempster-Shafer theory, the extent
to which we fail to disbelieve that a proposition lies in a
focal element.



INTRODUCTION

Annotated data have recently become more important,
and thus more abundant, in computational linguistics.
They are used as training material for machine learning
systems for a wide variety of applications from parsing
to machine translation (Quirk et al., 2005). Dependency
representation is preferred for many languages
because linguistic and semantic information is easier to
retrieve from the more direct dependency representation.
Dependencies are relations that are defined on
words or smaller units, where a sentence is divided
into elements called heads and their arguments, e.g.
verbs and objects. Dependency parsing aims to predict
these dependency relations between lexical units to
retrieve information, mostly in the form of semantic
interpretation or syntactic structure.

Parsing is usually considered the first step of
Natural Language Processing (NLP). To train statistical
parsers, a sample of data annotated with the necessary
information is required. There are different views
on how informative or functional the representation of
natural language sentences should be. There are different
constraints on the design process, such as: 1)
how intuitive (natural) the representation is, 2) how easy it is to extract
information from it, and 3) how appropriately and
unambiguously it represents the phenomena that occur
in natural languages.

In this article, a review of statistical dependency 
parsing for different languages will be made and cur- 
rent challenges of designing dependency treebanks and 
dependency parsing will be discussed. 



DEPENDENCY GRAMMAR 

The concept of dependency grammar is usually attributed
to Tesnière (1959) and Hays (1964). Dependency
theory has since developed, especially
with the works of Gross (1964), Gaifman (1965),
Robinson (1970), Mel'čuk (1988), Starosta (1988),
Hudson (1984, 1990), Sgall et al. (1986), Barbero et al.
(1998), Duchier (2001), Menzel and Schröder (1998),
and Kruijff (2001).

Dependencies are defined as links between lexical
entities (words or morphemes) that connect heads and
their dependants. Dependencies may have labels, such
as subject, object, and determiner, or they can be unlabeled.
A dependency tree is often defined as a directed,
acyclic graph of links that are defined between words
in a sentence. Dependencies are usually represented
as trees where the root of the tree is a distinct node.
Sometimes dependency links cross; dependency graphs
of this type are non-projective. Projectivity means that
in surface structure a head and its dependants can only
be separated by other dependants of the same head
(and dependants of these dependants). Non-projective
dependency trees cannot be translated to phrase
structure trees unless treated specially. We can see
in Table 1 that non-projectivity occurs in most of the
languages listed, although it
is usually rare within any given language. The fact that it
is rare does not make it less important, because it is
this kind of phenomenon that makes natural languages
more interesting and that makes all the difference in
the generative capacity of a grammar that is proposed
to explain natural languages.

An example dependency tree is in Figure 1. The 
corresponding phrase structure tree is shown in Figure 
2. The ROOT of this tree is "hit". 
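
The notion of crossing links can be made concrete with a small sketch. It assumes a sentence is stored as a list of head indices (1-based word positions, 0 for an artificial ROOT placed before the first word), which is a common encoding but not one prescribed by this article. The function name `is_projective` and the head assignments below are illustrative: the analysis of the example sentence follows the usual convention that determiners and adjectives depend on their noun and the nouns depend on the verb "hit", and the second sentence is a made-up four-word case with one crossing link.

```python
def is_projective(heads):
    """Return True if no two dependency links cross.

    heads[i] is the position (1-based) of the head of word i + 1,
    with 0 standing for the artificial ROOT node.
    """
    # Store every link as a (left end, right end) span over word positions.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two links cross when exactly one end of the second link
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True


# "The black car hit the big motorcycle" (assumed head assignment)
example = [3, 3, 4, 0, 7, 7, 4]
# Hypothetical sentence in which link 1-3 crosses link 2-4
crossing = [3, 4, 0, 3]

print(is_projective(example))
print(is_projective(crossing))
```

The pairwise check is quadratic in sentence length, which is adequate for illustration; it treats the link from ROOT like any other link.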

Given the basic concept of dependency, different
theories of dependency grammar exist. Among the
best known are: Functional Generative Description
(Sgall et al., 1969, 1986; Petkevič, 1987, 1995),
Dependency Unification Grammar (DUG) (Hellwig, 1986,
2003), Meaning Text Theory (Gladkij and Mel'čuk,
1975; Mel'čuk, 1988), Lexicase (Starosta, 1988), and
Topological Dependency Grammar (Gerdes and Kahane,
2001). Kruijff (2001) also suggests a type of logic
for dependency grammar, "Dependency Grammar
Logic", which aims at transparent semantic interpretation
during parsing.

There are many open issues regarding the representation
of dependency structure. Hays (1964)
and Gaifman (1965) take dependency grammars as
special cases of phrase structure grammars, whereas
Barbero et al. (1998), Menzel and Schröder (1998),
Eisner (2000), Samuelsson (2000), Duchier (2001),
Gerdes and Kahane (2001), and Kruijff (2001) think they
are completely different.

Figure 1. Dependency Tree for the sentence "The black car hit the big motorcycle"

Figure 2. Phrase Structure Tree for the sentence in Figure 1

      NP                  VP
Art   Adj    N      V           NP
                          Art   Adj   N
The   black  car    hit   the   big   motorcycle

The generative capacity of dependency grammars
has long been discussed (Gross, 1964; Hays, 1964;
Gaifman, 1965; Robinson, 1970). Dependency
grammars were proved to be context-free (Gaifman,
1965). When natural languages were shown not to be
context-free, but to belong to a class called "Mildly
Context-Sensitive" (Joshi, 1985), dependency grammars were
abandoned until the 1990s, when Vijay-Shanker and Weir (1994) showed
that Head Grammars (Pollard, 1984), an extension of CFGs,
are mildly context-sensitive, like Tree Adjoining
Grammar (TAG) (Joshi et al., 1975) and Combinatory
Categorial Grammar (CCG) (Steedman, 2000).
Recently, Kuhlmann and Möhl (2007) defined "regular
dependency languages" and showed that applying different
combinations of gap-degree and well-nestedness
restrictions on non-projectivity in these languages gave
a class of mildly context-sensitive grammars.



DEPENDENCY TREEBANKS 
Why Dependency Trees? 

Many new corpora have been designed and created
in the past few years. Dependency representation is
preferred when these corpora are designed. This preference
can be justified by the following properties of dependency
trees:

1. They are easier to annotate than some other representation
   types like phrase structure trees (PST).
   There are fewer tags and labels (only as many
   as words in a sentence) and no internal nodes to
   name the phrases as in PSTs.

2. Some information, such as predicate-argument
   structure, can be extracted trivially from them,
   which is not the case for PSTs.

3. Another interesting result is that some dependency
   parsers run much faster than PST parsers. The
   computational complexity of a standard PST parser
   is O(n^5), whereas a non-projective DT parser runs
   in O(n^2).

Dependency Treebanks

Table 1 compares dependency corpora of 19 languages (see endnote 1).
This information is gathered from the CoNLL-X and
CoNLL 2007 shared tasks on dependency parsing. The
reader is referred to Buchholz and Marsi (2006) and
Nivre et al. (2007) for more information on the dependency
treebanks included in the tasks. Although the underlying
theory is the same in all of these treebanks, there are
major differences in the outcome that originate from
questions such as 1) how much information needs to be put
in the dependency trees, and 2) how strongly interlaced the
different modules, such as morphology and syntax, are in a
language. For instance, the Czech treebank (Böhmová et al.,
2003) has 3 different levels of representation, namely,
morphological, grammatical and tecto-grammatical
layers. The morphology-syntax interface in Turkish makes
word-based dependencies inappropriate (Çakıcı, 2008).
Therefore, dependencies are between morphological
sub-groups called inflectional groups (IG) rather than
words. These are two arguments among many on why
it is very important to make a good feasibility study
when designing a dependency treebank, as different
aspects of languages require different treatment.



DEPENDENCY PARSING 

Statistical or data-driven parsing methods have gained
more attention with the continuous introduction of new
linguistic data. Parsing research was previously focused on training
and parsing with phrase structure trees, and specifically
on English, because the Penn Treebank (Marcus
et al., 1993) was the only available source for a long
time. With the introduction of treebanks for different
languages it is now possible to explore the bounds of
multilingual parsing.

Table 1. Treebank information; #T = number of tokens (in thousands), #S = number of sentences (in thousands), #T/#S = tokens
per sentence, %NST = % of non-scoring tokens (only in CoNLL-X), %NPR = % of non-projective relations,
%NPS = % of non-projective sentences, IR = has informative root labels. For each measure, the first column refers to the
CoNLL-X (2006) training data and the second to the CoNLL 2007 training data; a dash marks a value that is not available.

Language    #T            #S            #T/#S         %NST          %NPR          %NPS          IR
            2006   2007   2006   2007   2006   2007   2006   2007   2006   2007   2006   2007   2006   2007
Arabic      54     112    1.5    2.9    37.2   38.3   8.8    -      0.4    0.4    11.2   10.1   Y      -
Basque      -      51     -      3.2    -      38.3   -      -      -      2.9    -      26.2   -      -
Bulgarian   190    -      12.8   -      14.8   -      14.4   -      0.4    -      5.4    -      N      -
Catalan     -      431    -      15     -      28.8   -      -      -      0.1    -      2.9    -      -
Chinese     337    337    57     57     5.9    5.9    0.8    -      0.0    0.0    0.0    0.0    N      -
Czech              432    72.7   25.4   17.2   17.0   14.9   -      1.9    1.9    23.2   23.2   Y      -
Danish      94     -      5.2    -      18.2   -      13.9   -      1.0    -      15.6   -      N      -
Dutch       195    -      13.3   -      14.6   -      11.3   -      5.4    -      36.4   -      N      -
English     -      447    -      18.6   -      24.0   -      -      -      0.3    -      6.7    -      -
German      700    -      39.2   -      17.8   -      11.5   -      2.3    -      27.8   -      N      -
Greek       -      65     -      2.7    -      24.2   -      -      -      1.1    -      20.3   -      -
Hungarian   -      132    -      6.0    -      21.8   -      -      -      2.9    -      26.4   -      -
Italian     -      71     -      3.1    -      22.9   -      -      -      0.5    -      7.4    -      -
Japanese    151    -      17     -      8.9    -      11.6   -      1.1    -      5.3    -      N      -
Portuguese  207    -      9.1    -      22.8   -      14.2   -      1.3    -      18.9   -      Y      -
Slovene     29     -      1.5    -      18.7   -      17.3   -      1.9    -      22.2   -      Y      -
Spanish     89     -      3.3    -      27     -      12.6   -      0.1    -      1.7    -      N      -
Swedish     91     -      11     -      17.3   -      11.0   -      1.0    -      9.8    -      N      -
Turkish     58     65     5      5.6    11.5   11.6   33.1   -      1.5    5.5    11.6   33.3   N      -

The early efforts in data-driven dependency parsing
were focused on translating dependency structures to
phrase structure trees, for which parsers already
existed. But it was quickly realised that doing this was
not as trivial as previously thought. It is much
simpler and more intuitive to represent some phenomena,
such as local and global scrambling (in other words,
free word order), with dependency trees rather than phrase
structure trees. Thus the incompatible translations of
dependency structures to phrase structure trees resulted
in varying degrees of loss of information.

Collins et al. (1999) report results on Czech. They
translate the dependency trees to phrase structure
trees in the flattest way possible and name the internal
nodes after the part-of-speech tag of the head word of
that node. They use Model 2 in Collins (1999) and then
evaluate the attachment score on the dependencies
extracted from the resulting phrase structure trees of
the parser. However, crossing dependencies cannot
be translated into phrase structure trees (Çakıcı and
Baldridge, 2006) unless the surface order of the words is
changed. Collins et al. (1999) do not mention
crossing dependencies; therefore, we do not know how
they handled non-projectivity.

One of the earliest statistical systems that aims at
parsing dependency structures directly, without an
internal representation of translation, is Eisner (1996). He
proposes 3 different generative models. He evaluates
them on the dependencies derived from the Wall Street
Journal part of the Penn Treebank. Eisner reports 90
percent for probabilistic parsing of English samples
from WSJ. He reports a 93 percent attachment score
when gold standard tags were used, which means 93
percent of all the dependencies are correct regardless
of the percentage of correct dependencies in each sentence.
Eisner's parser is a projective parser; thus it cannot
inherently predict crossing dependencies.
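
To make the attachment score concrete, the sketch below computes an unlabeled attachment score pooled over all tokens of a test set, so that long and short sentences contribute in proportion to their length. The function name `attachment_score`, the head-index encoding (0 for ROOT) and the two toy sentences are assumptions made for illustration; they are not taken from Eisner's evaluation.

```python
def attachment_score(gold, predicted):
    """Fraction of tokens whose predicted head equals the gold head.

    gold and predicted are lists of sentences; each sentence is a list of
    head positions (1-based, 0 for ROOT). Counts are pooled over all
    tokens rather than averaged per sentence.
    """
    total = 0
    correct = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("gold and predicted sentences differ in length")
        total += len(gold_heads)
        correct += sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Toy data: one wrong head (word 6) in the first sentence.
gold = [[3, 3, 4, 0, 7, 7, 4], [2, 0]]
pred = [[3, 3, 4, 0, 7, 4, 4], [2, 0]]

print(round(attachment_score(gold, pred), 3))
```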

Discriminative dependency parsers such as Kudo and
Matsumoto (2000, 2002), and Yamada and Matsumoto
(2003) were also developed. They use support vector
machines to predict the next action of a deterministic
parser. Nivre et al. (2004) do this by memory-based
learning. They are all deterministic parsers.

McDonald et al. (2005b) took a new approach and
applied graph spanning algorithms to dependency
parsing. They formalise dependency parsing as the
problem of finding a maximum spanning tree in a
directed graph. MIRA is used to determine the weights
of dependency links as part of this computation. This
algorithm has two major advantages: it runs in O(n^2)
time and it can handle non-projective dependencies
directly. They show that this algorithm significantly
improves performance on dependency parsing for
Czech, especially on sentences which contain at least
one crossed dependency. Variations of this parser have
been used in the CoNLL-X shared task and received the
highest ranking among the participants, averaged over
the results of all 13 languages (Buchholz and
Marsi, 2006). However, when no linguistic or global
constraints are applied, it may yield absurd dependency
sequences, such as assigning two subjects to a verb
(Riedel et al., 2006). McDonald (2005a) uses the MIRA
learning algorithm with Eisner's parser and reports
results for projective parsing.



FUTURE TRENDS 

There is a growing body of work on creating new treebanks
for different languages. Requirements for the
design of these treebanks are at least as diverse as these
natural languages themselves. For instance, some languages
have a much stronger morphological component
or freer word order than others. Understanding
and modelling these in the form of annotated linguistic
data will guide the understanding of natural language,
and technological advancement will hopefully make it
easier to understand the inner workings of the language
faculty of humans. There are challenges both for dependency
parsing and for dependency theory. For
instance, modelling long-distance dependencies and
multiple-head dependencies still awaits attention,
and there is much to do on the morphemic dependency
approach, where heads of phrases can be morphemes rather
than words in a sentence, for morphologically complex
languages. Although these constitute a fraction of all the
phenomena in natural languages, they are the "tricky"
part of NLP systems, which will never be perfect as
long as these natural phenomena are ignored.



CONCLUSION

This article has reviewed dependency grammar theory
together with recent advances in statistical dependency
parsing for different languages. Some current challenges
in building dependency treebanks and dependency
parsing have also been discussed. Dependency theory
and practical applications of dependency representations
have advantages and disadvantages. The fact that
dependency parsing is easy to adapt to new languages,
and is well-adapted to representing free word order,
makes it the preferred representation for many new
linguistic corpora. Dependency parsing is also developing
in the direction of multilingual parsing, where
a single system is required to be successful with different
languages. This research may bring us closer to
understanding the linguistic capacity of the human brain,
and thus to building better NLP systems.



KEY TERMS

Corpus (plural: corpora): A collection of
written or spoken material in machine-readable
form.

Machine Translation (MT): The act of translating
something by means of a machine, especially
a computer.

Morpheme: The smallest unit of meaning.
A word may consist of one morpheme (need),
two morphemes (need/less, need/ing) or more
(un/happi/ness).

Phrase Structure Tree: A structural representation
of a sentence in the form of an inverted tree,
with each node of the tree labelled according to
the phrasal constituent it represents.

Rule-Based Parser: A parser that uses hand-written
(designed) rules as opposed to rules that
are derived from the data.

Statistical Parser: A group of parsing methods 
within NLP. The methods have in common that they 
associate grammar rules with a probability. 

Treebank: A text-corpus in which each 
sentence is annotated with syntactic structure. 
Syntactic structure is commonly represented as a 
tree structure. Treebanks can be used in corpus 
linguistics for studying syntactic phenomena or 
in computational linguistics for training or testing 
parsers. 



ENDNOTE 

1 Some languages are not included in both tasks.
The information in the first and second columns
of each set belongs to the CoNLL 2006 (CoNLL-X) and CoNLL 2007
training data respectively.



INTRODUCTION

Systems such as robotic systems and systems with large
input-output data tend to be difficult to model using
mathematical techniques. These systems typically have
high dimensionality and degrees of uncertainty in
many parameters. Artificial intelligence techniques such
as neural networks, fuzzy logic, genetic algorithms and
evolutionary algorithms have created new opportunities
to solve complex systems. The application of fuzzy logic
[Bai, Y., Zhuang H. and Wang, D. (2006)] in particular,
to model and solve industrial problems, is now
widespread and widely accepted. Fuzzy modelling or
fuzzy identification has numerous practical applications
in control, prediction and inference. It has been found
useful when the system is either difficult to predict
or difficult to model by conventional methods. Fuzzy set
theory provides a means for representing uncertainties.
The underlying power of fuzzy logic is its ability to
represent imprecise values in an understandable form.
The majority of fuzzy logic systems to date have been
static and based upon knowledge derived from the imprecise
heuristic knowledge of experienced operators and,
where applicable, also upon the physical laws that govern
the dynamics of the process.

Although its application to industrial problems has
often produced results superior to classical control, the
design procedures are limited by the heuristic rules of
the system. It is simply assumed that the rules for the
system are readily available or can be obtained. This
implicit assumption limits the application of fuzzy logic
to systems with only a few parameters. The
number of parameters of a system could be large, and the
number of fuzzy rules of a system is directly dependent
on these parameters: as the number of parameters
increases, the number of fuzzy rules of the system grows
exponentially.

Genetic algorithms can be used as a tool for the
generation of fuzzy rules for a fuzzy logic system. This
automatic generation of fuzzy rules via genetic algorithms
can be categorised into two learning techniques,
supervised and unsupervised. In this paper unsupervised
learning of the fuzzy rules of hierarchical and multi-layer
fuzzy logic control systems is considered. In unsupervised
learning there is no external teacher or critic to
oversee the learning process. In other words, there are
no specific examples of the function to be learned by the
system. Rather, provision is made for a task-independent
measure of the quality of the representation that the system
is required to learn. That is, the system learns statistical
regularities of the input data and develops the ability
to learn the features of the input data and thereby create
new classes automatically [Mohammadian, M., Nainar,
I. and Kingham, M. (1997)].

To perform unsupervised learning, a competitive
learning strategy may be used. The individual strings
of the genetic algorithm compete with each other for the
"opportunity" to respond to features contained in the
input data. In its simplest form, the system operates
in accordance with the strategy that 'the fittest wins
and survives'. That is, the individual chromosome in a
population with the greatest fitness 'wins' the competition
and is selected for the genetic algorithm operations
(crossover and mutation). The other individuals in the
population then have to compete with the fit individual to
survive.

The diversity of the learning tasks shown in this
paper indicates the universality of genetic algorithms for
concept learning in an unsupervised manner. A hybrid
integrated architecture incorporating fuzzy logic and
genetic algorithms can generate fuzzy rules for problems
requiring supervised or unsupervised learning. In this
paper only unsupervised learning of fuzzy logic systems
is considered. The learning of fuzzy rules and internal
parameters in an unsupervised manner is performed
using genetic algorithms. Simulation results have
shown that the proposed system is capable of learning
the control rules for hierarchical and multi-layer
fuzzy logic systems. The application areas considered are
hierarchical control of a network of traffic lights
and robotic systems.

A first step in the construction of a fuzzy logic system
is to determine which variables are fundamentally
important. Any number of these decision variables may
appear, but the more that are used, the larger the rule set
that must be found. It is known [Raju, S., Zhou J. and
Kisner, R. A. (1990), Raju, G. V. S. and Zhou, J. (1993),
Kingham, M., Mohammadian, M. and Stonier, R. J.
(1998)] that the total number of rules in a system is an
exponential function of the number of system variables.
In order to design a fuzzy system with the required
accuracy, the number of rules needed increases exponentially
with the number of input variables and their associated
fuzzy sets. A way to avoid the
explosion of fuzzy rule bases in fuzzy logic systems is
to consider Hierarchical Fuzzy Logic Control (HFLC)
[Raju, G. V. S. and Zhou, J. (1993)]. A learning approach
based on genetic algorithms [Goldberg, D. (1989)] is
discussed in this paper for the determination of the rule
bases of hierarchical fuzzy logic systems.
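
As a rough illustration of this growth, the sketch below compares the size of a single (flat) rule base with that of a hierarchical arrangement. It makes simplifying assumptions that are not taken from the cited papers: every variable, including each intermediate output, uses the same number m of fuzzy sets, and the hierarchy is a chain of two-input rule bases in which each layer after the first takes one new input and the previous layer's output. Under these assumptions the flat rule base holds m^n rules and the chain holds (n - 1) m^2. The function names are hypothetical.

```python
def flat_rule_count(n_inputs, sets_per_variable):
    """Rules in one rule base covering every combination of input fuzzy sets."""
    return sets_per_variable ** n_inputs


def chained_rule_count(n_inputs, sets_per_variable):
    """Rules in a chain of two-input rule bases (assumed hierarchy layout).

    The first layer combines two inputs; each later layer combines one new
    input with the previous layer's output, giving n_inputs - 1 layers.
    """
    if n_inputs < 2:
        raise ValueError("the chained layout needs at least two inputs")
    return (n_inputs - 1) * sets_per_variable ** 2


for n in (2, 3, 5, 8):
    print(n, flat_rule_count(n, 7), chained_rule_count(n, 7))
```

With seven fuzzy sets per variable, the two counts coincide for two inputs and diverge quickly afterwards, which is the motivation for decomposing the rule base.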



THE GENETIC FUZZY RULE 
GENERATOR ARCHITECTURE 

In this section we show how to learn the fuzzy rules
in a fuzzy logic rule base using a genetic algorithm.
The full set of fuzzy rules is encoded as a single string
in the genetic algorithm population. To facilitate this
we develop the genetic fuzzy rule generator, whose
architecture consists of five basic steps:

1. Divide the input and output spaces of the system
to be controlled into fuzzy sets (regions),

2. Encode the fuzzy rules into bit-strings of 0 and 1,

3. Use a genetic algorithm as a learning procedure
to generate sets of fuzzy rules,

4. Use a fuzzy logic controller to assess each set of
fuzzy rules and assign a value to each generated
set of fuzzy rules,

5. Stop generating new sets of fuzzy rules once some
performance criterion is met.

Figure 1 shows the genetic fuzzy rule generator
architecture graphically. Suppose we wish to produce
fuzzy rules for a fuzzy logic controller with two inputs and
a single output. This simple two-input ($u_1$, $u_2$), single-output
($y$) case is chosen in order to clarify the basic ideas of
our new approach. Extensions to multi-output cases are
straightforward. For more information on multi-output
cases refer to [Mohammadian, M.
and Stonier, R. J. (1998)].

As a first step we divide the domain intervals of
$u_1$, $u_2$ and $y$ into different fuzzy sets. The number of
fuzzy sets is application dependent. Assume that
we divide the intervals for $u_1$, $u_2$ and $y$ into 5, 7 and 7
fuzzy sets respectively. For each fuzzy set we assign
a fuzzy membership function. Therefore a maximum
of 35 fuzzy rules can be constructed for this system.
Now the fuzzy rule base can be formed as a 5x7 table
with cells to hold the corresponding actions that must
be taken given that the conditions corresponding to $u_1$, $u_2$
are satisfied.

Figure 1. Genetic fuzzy rule generator architecture

In step 2 we encode the input and output fuzzy sets
into bit-strings (of 0 and 1). Each complete bit-string
consists of 35 fuzzy rules for this example, and each
fuzzy rule base has the same input conditions but
may have a different output control signal assigned to
it. Therefore we need only encode the output signals
of the fuzzy rules into a complete bit-string.
This saves processing time when encoding and
decoding the genetic algorithm's strings. In this case the
length of an individual string is reduced from
665 bits (i.e. 19x35) to 245 bits (i.e. 7x35). The choice
of output control signal to be set for each fuzzy rule is
made by the genetic algorithm. It randomly initialises
a population of complete bit-strings. Each of these
bit-strings is then decoded into fuzzy rules and evaluated
by the fuzzy logic controller to determine the fitness value
for that bit-string. Proportional selection,
mutation and one-point crossover operations can
then be applied. Selection and crossover are the same as
in simple genetic algorithms, while the mutation operation
is modified to suit this problem: an allele is selected at random
and replaced by a random number ranging from 1
to 7, which represents in this example the seven output
fuzzy sets. Crossover and mutation take place
based on the probability of crossover and mutation
respectively. The genetic algorithm process performs
a self-directed search according to fitness value. In
all applications in this paper we seek to minimise the
fitness function. The process can be terminated after
a desired number of generations or when the fitness
value of the best string in a generation is less than
some prescribed level.

Figure 2. Three adjacent intersections
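
The encoding and operators described in this section can be sketched as follows. This is a minimal illustration under several assumptions rather than the authors' implementation: each of the 35 rules is stored as one integer allele from 1 to 7 (instead of 7 bits), fitness is supplied by a placeholder `toy_fitness` that stands in for the fuzzy logic controller simulation, proportional selection for minimisation is approximated with weights 1 / (1 + fitness), and all function names and parameter values are hypothetical.

```python
import random

N_U1, N_U2, N_Y = 5, 7, 7   # fuzzy sets for u1, u2 and y, as in the text
N_RULES = N_U1 * N_U2       # 35 rules, one per cell of the 5x7 table


def random_chromosome(rng):
    """One output fuzzy set index (1..7) per rule."""
    return [rng.randint(1, N_Y) for _ in range(N_RULES)]


def decode(chromosome):
    """Reshape the flat string into the 5x7 rule table."""
    return [chromosome[r * N_U2:(r + 1) * N_U2] for r in range(N_U1)]


def mutate(chromosome, rng, p_mut):
    """With probability p_mut, replace one random allele by a value in 1..7."""
    child = list(chromosome)
    if rng.random() < p_mut:
        child[rng.randrange(N_RULES)] = rng.randint(1, N_Y)
    return child


def crossover(a, b, rng, p_cross):
    """One-point crossover applied with probability p_cross."""
    if rng.random() >= p_cross:
        return list(a), list(b)
    cut = rng.randrange(1, N_RULES)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def select(population, fitnesses, rng):
    """Proportional selection for minimisation (assumed weighting)."""
    weights = [1.0 / (1.0 + f) for f in fitnesses]
    return rng.choices(population, weights=weights, k=1)[0]


def toy_fitness(chromosome):
    """Placeholder for the controller simulation: distance to a made-up table."""
    target = [(i % N_Y) + 1 for i in range(N_RULES)]
    return sum(abs(a - b) for a, b in zip(chromosome, target))


def evolve(generations=50, pop_size=20, p_cross=0.6, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best = min(population, key=toy_fitness)
    for _ in range(generations):
        fitnesses = [toy_fitness(c) for c in population]
        offspring = []
        while len(offspring) < pop_size:
            a = select(population, fitnesses, rng)
            b = select(population, fitnesses, rng)
            c1, c2 = crossover(a, b, rng, p_cross)
            offspring.extend([mutate(c1, rng, p_mut), mutate(c2, rng, p_mut)])
        population = offspring[:pop_size]
        best = min(population + [best], key=toy_fitness)  # keep best so far
    return best


best = evolve()
print(toy_fitness(best))
for row in decode(best):
    print(row)
```

In a real application `toy_fitness` would be replaced by a run of the fuzzy logic controller with the decoded rule table, and the loop would stop once the best fitness falls below the prescribed level.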



VARIABLE SELECTION AND RULE 
BASE DECOMPOSITION 

Traffic Light Control 

Traffic light control is widely used to resolve conflicts
among vehicle movements at intersections. The
control system at each signalised intersection consists
of the following three control elements: cycle time,
phase splits and offset. Cycle time is the duration of
completing all phases of a signal; phase split is the
division of the cycle time into periods of green phase
for competing approaches; and offset is the time difference
in the starting times of the green phases of
adjacent intersections.

In [Nainar, I., Mohammadian, M., Stonier, R. J.
and Millar, J. (1996)] a fuzzy logic control scheme is
proposed to overcome the lack of interaction between
neighbouring intersections. First, a traffic model is
developed and a fuzzy control scheme for regulating
the traffic flow approaching a single traffic intersection
is proposed. A new fuzzy control scheme employing
a supervisory fuzzy logic controller is then proposed
to coordinate the three intersections based on the
traffic conditions at all three intersections. Simulation
results established the effectiveness of the proposed
scheme. Figure 2 shows the three intersections used
in the simulation.

A supervisory fuzzy logic control system is then
developed to coordinate the three intersections. The
fuzzy knowledge base of the supervisory fuzzy logic
controller was learnt using genetic algorithms. The
supervisory fuzzy logic controller coordinated the traffic
signals far more effectively than the three local fuzzy
logic controllers alone, because with the supervisory
fuzzy logic controller each intersection is coordinated
with all its neighbouring intersections [Nainar, I.,
Mohammadian, M., Stonier, R. J. and Millar, J. (1996)].
The proposed fuzzy logic control scheme can be effectively
applied to on-line traffic control because of its ability to
handle extensive traffic situations. Simulation results
have shown that the multi-layer fuzzy logic system,
consisting of three local fuzzy logic controllers and
the supervisory fuzzy logic controller, is capable of
reducing the waiting time on the network of traffic
intersections [Nainar, I., Mohammadian, M., Stonier,
R. J. and Millar, J. (1996)].

Collision-Avoidance in a Robot System 

Consider the following collision-avoidance problem in
a simulated, point-mass, two-robot system. A three-level
hierarchical fuzzy logic system was proposed to solve
the problem; full details can be found in [Mohammadian,
M. and Stonier, R. J. (1998)], see also [Mohammadian,
M. and Stonier, R. J. (1995)]. In the first layer, two
knowledge bases, one for each robot, are developed
to find the steering angle that controls each robot to its
target. In the second layer, two new knowledge bases
are developed using the knowledge in the first layer
to control the speed of each robot so that each robot
approaches its target with near-zero speed. Finally, in
the third layer, a single knowledge base is developed
to modify the controls of each robot to avoid collision
in a restricted common workspace; see Figure 3.



In Figure 3, (x, y) gives the physical position of the
robot on the plane, φ is the directional heading of the
robot, θ is the steering angle, D1 and D2 are the distances
of the two robots from their respective targets, S1 and
S2 are the speeds of the two robots, D is the distance
between the two robots, and the two pairs of
steering-angle updates (θ1, θ2) and the speed updates s1
and s2 are the updated outputs of the last two layers.

An important issue in this example is that of learning
knowledge in a given layer that is sufficient for use in
higher layers. In the first layer of the hierarchical fuzzy
logic system, ignoring the possibility of collision, the
steering angles for controlling each robot to its target
were determined by genetic algorithms. In the second
layer a genetic algorithm was used to determine
adjustments to the steering angle and speed of each robot
so as to control its speed on arrival at its target. A
further layer was then developed to adjust the speed and
steering angle of the robots to avoid collisions between
them.

Consider the knowledge base of a single robot in layer
one. It is not sufficient to learn a fuzzy knowledge base
from one initial configuration and then use this knowledge
base for information on the steering angle of the robot
when learning fuzzy controllers in the second layer. Quite
clearly this knowledge base is only guaranteed to be
effective from this initial configuration, as not all the
fuzzy rules will have fired in taking the robot to its
target. We have to find a knowledge base that is
effective, to some acceptable measure, in controlling the
robot to its target from 'any' initial configuration. One
way is first to learn a set of local fuzzy controllers,
each knowledge base learnt by a genetic algorithm from a
given initial configuration within a set of initial
configurations spread uniformly over the configuration
space. These knowledge bases can then be fused through
a fuzzy amalgamation process [Stonier, R. J. and
Mohammadian, M. (1995)] into the global (final) fuzzy
control knowledge base. An alternative approach
[Mohammadian, M. and Stonier, R. J. (1996), Stonier,
R. J. and Mohammadian, M. (1998)] is to develop a
genetic algorithm that learns the 'final' knowledge base
directly over the region of initial configurations.

Figure 3. Hierarchical structure for collision-avoidance

In conclusion, the proposed hierarchical fuzzy logic
system is capable of controlling the multi-robot system
successfully, and using a hierarchical fuzzy logic system
reduces the number of control laws. As described above,
genetic algorithms determined the steering angles in the
first layer (ignoring the possibility of collision), the
adjustments to steering angle and speed on arrival at
the target in the second layer, and the collision-avoiding
adjustments in the third layer. If only one fuzzy logic
system were used to solve this problem, with the inputs
x, y, φ of each robot and D, each with the same fuzzy
sets described in this paper, then 153125 fuzzy rules
would be needed for its fuzzy knowledge base. The
hierarchical fuzzy logic system needs a total of 1645
fuzzy rules. Hierarchical concept learning using the
proposed method eases the development of fuzzy logic
control systems, encouraging the development of fuzzy
logic controllers where the large number of system
parameters would otherwise inhibit the construction of
such controllers. For more details, we refer the reader
to the cited papers.
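
The rule counts quoted above follow from the fact that a
complete rule base needs one rule per combination of input
fuzzy sets. The sketch below checks the arithmetic. The
split into five inputs with 5 fuzzy sets and two inputs
with 7 fuzzy sets is an assumption: it is one factorisation
that reproduces 153125 for the seven inputs, but the text
does not state which input uses which partition.

```python
from math import prod


def rule_count(sets_per_input):
    """Rules in a complete rule base: one per combination of input fuzzy sets."""
    return prod(sets_per_input)


# Assumed partition (not stated in the text): seven inputs in total,
# five of them with 5 fuzzy sets and two of them with 7 fuzzy sets.
single_system = rule_count([5, 5, 5, 5, 5, 7, 7])
hierarchical_total = 1645  # figure quoted in the text

print(single_system)                       # 153125
print(single_system / hierarchical_total)  # roughly 93 times fewer rules
```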



ISSUES IN RULE BASE IDENTIFICATION

Research in this area, described as genetic fuzzy systems
and using the classical genetic algorithm, has been
surveyed by Cordon et al. [Cordon, O., Herrera, F. and
Zwir, I. (2002)], see also [Cordon, O., Herrera, F.,
Hoffmann, F. and Magdalena, L. (2001)]. Genetic
algorithms are employed to learn or tune different
components of a fuzzy logic system, such as the fuzzy
knowledge base and the membership functions in the
inference process. This is usually accomplished by using
a genetic algorithm to produce the "best" fuzzy rules
and membership functions/parameters with respect
to an optimisation criterion. There are three main
approaches in the literature for learning the rules in a
fuzzy knowledge base: the Pittsburgh approach, the
Michigan approach and the iterative rule-learning
approach [Cordon, O., Herrera, F., Hoffmann, F. and
Magdalena, L. (2001)]. The Pittsburgh and Michigan
approaches are the most commonly used methods in
the area.

Research by the authors, colleagues and postgraduate
students has predominantly used the Pittsburgh
approach, with success, to learn the fuzzy rules
in complex systems across hierarchical and
multi-layered structures [Stonier, R. J. and
Zajaczkowski, J. (2003), Kingham, M., Mohammadian,
M., and Stonier, R. J. (1998), Mohammadian, M.
and Kingham, M. (2004), Mohammadian, M. (2002),
Nainar, I., Mohammadian, M., Stonier, R. J. and
Millar, J. (1996), Mohammadian, M. and Stonier, R. J.
(1998), Stonier, R. J. and Mohammadian, M. (1995),
Thomas, P. J. and Stonier, R. J. (2003), Thomas, P. J.
and Stonier, R. J. (2003a)].



FUTURE TRENDS

In the Pittsburgh approach, coding the fuzzy rule base
as a linear string in an evolutionary algorithm has
drawbacks beyond the fact that the string may still be
relatively long even after decomposition into multi-layer
and hierarchical structures. One is that this is a
specific linear encoding of a nonlinear structure, and
typical one-point crossover, when implemented,
introduces bias when the coding is reversed to obtain the
fuzzy logic rule base. Co-evolutionary algorithms are
another option that needs further investigation.
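
The linear encoding and one-point crossover discussed under
FUTURE TRENDS can be sketched as follows. This is a minimal
illustration under assumed sizes (two inputs with three
fuzzy sets each, five output fuzzy sets labelled 1 to 5);
the names N_SETS_IN, N_SETS_OUT, decode and
one_point_crossover are hypothetical and are not taken
from the cited papers.

```python
import random
from itertools import product

N_SETS_IN = 3   # assumed number of fuzzy sets per input
N_SETS_OUT = 5  # assumed number of output fuzzy sets, labelled 1..5

# Fixed ordering of rule antecedents: (0, 0), (0, 1), ..., (2, 2).
ANTECEDENTS = list(product(range(N_SETS_IN), repeat=2))


def decode(chromosome):
    """Reverse the coding: linear string -> {(in1_set, in2_set): out_set}."""
    return dict(zip(ANTECEDENTS, chromosome))


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


rng = random.Random(0)
# In the Pittsburgh approach one string encodes a whole rule base.
a = [rng.randint(1, N_SETS_OUT) for _ in ANTECEDENTS]
b = [rng.randint(1, N_SETS_OUT) for _ in ANTECEDENTS]
child, _ = one_point_crossover(a, b, rng)
rules = decode(child)
print(rules)
```

With this ordering, rules that share the first input's
fuzzy set sit next to each other in the string and tend to
be inherited together, whereas rules that are neighbours
along the second input are three positions apart. That
asymmetry is one way to read the bias mentioned in the text.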



CONCLUSION 

This paper described the issues in the construction of
a hierarchical fuzzy logic system to model a complex
(nonlinear) system. The learning of fuzzy rules in such
systems using genetic algorithms was proposed and
shown to be feasible. Whilst decomposition into
hierarchical/multi-layered fuzzy logic sub-systems
greatly reduces the number of fuzzy rules to be defined
and learnt, other issues arise: the decomposition is not
unique, and it may give rise to variables with no physical
significance. This can then raise major difficulties in
obtaining a complete class of rules from experts, even
when the number of variables is small.



KEY TERMS

Fusing Variables: Fusing variables is a method for 
reducing the number of rules in a fuzzy rule base. The 
variables are fused (combined) together before input 
into the inference engine, thereby reducing the number 
of rules in the knowledge base. 
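
As a rough illustration of why fusing reduces the rule
count, the sketch below combines two crisp inputs into one
before inference. The weighted sum is only one possible
fusion and is an assumption here, as are the function names
and the example values.

```python
def fuse(values, weights):
    """Combine several crisp inputs into one fused input (weighted sum)."""
    if len(values) != len(weights):
        raise ValueError("one weight per value is required")
    return sum(w * v for w, v in zip(weights, values))


def complete_rule_count(n_inputs, sets_per_input):
    """Rules in a complete rule base with equal partitions on every input."""
    return sets_per_input ** n_inputs


# Hypothetical example: an error and its rate of change, 5 fuzzy sets each.
fused_input = fuse([2.0, -1.0], [0.5, 0.5])
rules_before = complete_rule_count(2, 5)  # two separate inputs
rules_after = complete_rule_count(1, 5)   # one fused input
print(fused_input, rules_before, rules_after)
```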

Fuzzy Logic: Fuzzy sets and fuzzy logic were
introduced in 1965 by Lotfi Zadeh as a new way to
represent vagueness in applications. Fuzzy sets are a
generalisation of sets in conventional set theory. Fuzzy
Logic (FL) aims at modelling imprecise modes of
reasoning, such as common-sense reasoning about
uncertain complex processes. It provides a system for
representing the meaning of lexically imprecise
propositions in natural language, with each proposition
represented as a fuzzy constraint on a variable. Fuzzy
logic controllers have been applied successfully to many
nonlinear control systems. Linguistic rather than crisp
numerical rules are used to control the processes.

Fuzzy Rule Base (Fuzzy If-Then rules): Fuzzy
If-Then rules, or fuzzy conditional statements, are
expressions of the form "If A Then B", where A and B
are labels of fuzzy sets characterised by appropriate
membership functions. Due to their concise form, fuzzy
If-Then rules are often employed to capture the imprecise
modes of reasoning that play an essential role in the
human ability to make decisions in an environment of
uncertainty and imprecision. The set of If-Then rules
relating to a fuzzy logic system, stored together,
is called a Fuzzy Rule Base.
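
A single rule of the form "If A Then B" can be sketched as
below. Triangular membership functions and Mamdani-style
min implication are common choices but are assumptions
here; the fuzzy sets LONG_QUEUE and LONG_GREEN and their
parameters are hypothetical.

```python
def triangular(x, left, peak, right):
    """Membership grade of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets: queue length (vehicles), green time (seconds).
LONG_QUEUE = (5.0, 15.0, 25.0)
LONG_GREEN = (10.0, 20.0, 30.0)


def fire_rule(queue_length):
    """'If queue is LONG then green is LONG': firing strength of the rule."""
    return triangular(queue_length, *LONG_QUEUE)


def clipped_consequent(queue_length, green_seconds):
    """Min implication: clip the consequent set at the firing strength."""
    return min(fire_rule(queue_length), triangular(green_seconds, *LONG_GREEN))


print(fire_rule(10.0))                 # 0.5
print(clipped_consequent(10.0, 20.0))  # 0.5
```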

Genetic Algorithms: Genetic Algorithms (GAs) are
algorithms that use operations found in natural genetics
to guide their way through a search space and are
increasingly being used in the field of optimisation. The
robust nature and simple mechanics of genetic algorithms
make them inviting tools for search, learning and
optimisation. Genetic algorithms are based on
computational models of fundamental evolutionary
processes such as selection, recombination and mutation.

Genetic Algorithms Components: In its simplest
form, a genetic algorithm has the following
components:

1. Fitness - A positive measure of utility, called
fitness, is determined for individuals in a population.
This fitness value is a quantitative measure of how
well a given individual compares to others in the
population.

2. Selection - Population individuals are assigned a
number of copies in a mating pool that is used to
construct a new population. The higher a population
individual's fitness, the more copies in the
mating pool it receives.

3. Recombination - Individuals from the mating pool
are recombined to form new individuals, called
children. A common recombination method is
one-point crossover.

4. Mutation - Each individual is mutated with some
small probability, much less than 1.0. Mutation is a
mechanism for maintaining diversity in the population.
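
The four components above can be put together in a few
lines. The sketch below is a minimal generational genetic
algorithm on bit strings, not the algorithm used in the
cited papers: the toy fitness function, the parameter
values and the fixed generation budget are assumptions, and
it maximises a positive fitness (as in the definition
above), whereas the chapter's applications minimise theirs.

```python
import random


def run_ga(fitness, n_bits, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Minimal generational GA on bit strings; pop_size is assumed even."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 1. Fitness: a positive utility value for every individual.
        scores = [fitness(ind) for ind in population]
        # 2. Selection: fitter individuals get more copies in the mating pool.
        pool = rng.choices(population, weights=scores, k=pop_size)
        # 3. Recombination: one-point crossover on consecutive pool pairs.
        children = []
        for mother, father in zip(pool[0::2], pool[1::2]):
            cut = rng.randrange(1, n_bits)
            children.append(mother[:cut] + father[cut:])
            children.append(father[:cut] + mother[cut:])
        # 4. Mutation: each child is mutated with a small probability.
        for child in children:
            if rng.random() < p_mut:
                child[rng.randrange(n_bits)] ^= 1
        population = children
        best = max(population + [best], key=fitness)
    return best


def ones_plus_one(bits):
    """Toy fitness (assumed): number of 1-bits, shifted to stay positive."""
    return 1 + sum(bits)


best = run_ga(ones_plus_one, n_bits=20)
print(best, ones_plus_one(best))
```

The function returns the best individual seen in any
generation, so the result is never worse than the best
member of the initial random population.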



Hierarchical Fuzzy Logic Systems: The idea of
hierarchical fuzzy logic control systems is to put the
input variables into a collection of low-dimensional
fuzzy logic control systems, instead of creating a single
high-dimensional rule base for a fuzzy logic control
system. Each low-dimensional fuzzy logic control
system constitutes a level in the hierarchical fuzzy
logic control system. Hierarchical fuzzy logic control
is one approach to avoiding the rule explosion problem.
It has the property that the number of rules needed to
construct the fuzzy system increases only linearly with
the number of variables in the system.
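
The linear growth can be seen in a short worked comparison.
It assumes n input variables, m fuzzy sets per variable
(intermediate variables included), and a hierarchy built
as a cascade of two-input subsystems; these assumptions are
for illustration and are not taken from the text.

```latex
% Assumptions: n inputs, m fuzzy sets per variable, and a cascade
% of (n - 1) two-input subsystems, each with a complete rule base.
\begin{align*}
N_{\text{single}} &= m^{n}, \\
N_{\text{hier}}   &= (n-1)\,m^{2}.
\end{align*}
% Example with n = 5 and m = 5:
%   N_single = 5^5 = 3125,   N_hier = 4 * 5^2 = 100.
```

Each extra input adds one two-input subsystem, so the
hierarchical count grows by m^2 rules per variable, whereas
the single rule base is multiplied by m.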

Unsupervised Learning: In unsupervised learning
there is no external teacher or critic to oversee the
learning process. In other words, there are no specific
examples of the function to be learned by the system.
Rather, provision is made for a task-independent
measure of the quality of representation that the system
is required to learn. That is, the system learns the
statistical regularities of the input data and develops
the ability to learn the features of the input data and
thereby create new classes automatically.



INTRODUCTION

Human intelligence is acquired through a prolonged 
period of maturation and growth during which a single 
fertilized egg first turns into an embryo, then grows 
into a newborn baby, and eventually becomes an adult 
individual — which, typically before growing old and 
dying, reproduces. The developmental process is 
inherently robust and flexible, and biological organisms 
show an amazing ability during their development to 
devise adaptive strategies and solutions to cope with 
environmental changes and guarantee their survival. 
Because evolution has selected development as the 
process through which to realize some of the highest 
known forms of intelligence, it is plausible to assume 
that development is mechanistically crucial to emulate 
such intelligence in human-made artifacts. 



BACKGROUND 

The idea that development might be a good avenue 
to understand and construct cognition is not new. 
Already Turing (1950) suggested that using some kind 
of developmental approach might be a good strategy. 
In the context of robotics, many of the original ideas 
can be traced back to embodied artificial intelligence 
(embodied AI), a movement started by Rodney 
Brooks at the beginning of the 1980s (Brooks et al., 
1998), and the notion of enaction (Varela et al., 1991) 
according to which cognitive structures emerge from 
recurrent sensorimotor patterns that enable action 
to be perceptually guided. Researchers of embodied 
AI believe that intelligence can only come from the 
reciprocal interaction across multiple time scales 
between brain and body of an agent, and its environment. 
In a sense, throughout life, experience is learned and
common sense is acquired, which then supports more
complex reasoning. This general bootstrapping of
intelligence has been called "cognitive incrementalism"
(Clark, 2001).



DEVELOPMENTAL ROBOTICS 

Developmental robotics (also known as epigenetic 
or ontogenetic robotics) is a highly interdisciplinary 
subfield of robotics in which ideas from artificial 
intelligence, developmental psychology, neuroscience, 
and dynamical systems theory play a pivotal role in 
motivating the research (Asada et al., 2001; Lungarella
et al., 2003; Weng et al., 2001; Zlatev & Balkenius,
2001). Developmental robotics aims to model the
development of increasingly complex cognitive 
processes in natural and artificial systems and to 
understand how such processes emerge through physical 
and social interaction. The idea is to realize artificial 
cognitive systems not by simply programming them 
to solve a specific task, but rather by initiating and 
maintaining a developmental process during which 
the systems interact with their physical environments 
(i.e. through their bodies or tools), as well as with their 
social environments (i.e. with people or other robots). 
Cognition, after all, is the result of a process of self- 
organization (spontaneous emergence of order) and 
co-development between a developing organism and its 
surrounding environment. Although some researchers 
use simulated environments and computational 
models (e.g. Mareschal et al., 2007), often robots 
are employed as testing platforms for theoretical 
models of the development of cognitive abilities 
- the rationale being that if a model is instantiated in 
a system interacting with the real world, a great deal 
can be learned about its strengths and potential flaws 
(Fig. 1). Unlike evolutionary robotics, which operates
on phylogenetic time scales and populations of many
individuals, developmental robotics capitalizes on
"short" (ontogenetic) time scales and single individuals
(or small groups of individuals).

Figure 1. Developmental robots: (a) iCub (http://www.robotcub.org); (b) Babybot (http://www.liralab.it/babybot/robot.htm);
(c) Infanoid (http://univ.nict.go.jp/people/xkozima/infanoid/robot-eng.html#infanoid).



AREAS OF INTEREST 

The spectrum of developmental robotics research 
can be roughly segmented into four primary areas of 
interest. Although instances may exist that fall into 
multiple categories, the suggested grouping should 
provide at least some order in the large spectrum of 
issues addressed by developmental roboticists. 

Socially oriented interaction: This category includes 
research on robots that communicate or learn particular 
skills via social interaction with humans or other robots. 
Examples are imitation learning, communication and 
language acquisition, attention sharing, turn-taking 
behavior, and social regulation (Dautenhahn, 2007; 
Steels, 2006). 

Non-social interaction: Studies on robots
characterized by a direct and strong coupling between
sensorimotor processes and the local environment
(e.g. inanimate objects), but which do not interact
with other robots or humans. Examples are
visually-guided grasping and manipulation, tool-use,
perceptual categorization, and navigation (Fitzpatrick
et al., 2007; Nabeshima et al., 2006).

Agent-centered sensorimotor control: In these 
studies, robots are used to investigate the exploration of 
bodily capabilities, the effect of morphological changes 
on motor skill acquisition, as well as self-supervised 
learning schemes not linked to any functional goal. 
Examples include self-exploration, categorization of 
motor patterns, motor babbling, and learning to walk or 
crawl (Demiris & Meltzoff, 2007; Lungarella, 2004). 

Mechanisms and principles: This category embraces 
research on principles, mechanisms or processes 
thought to increase the adaptivity of a behaving system. 
Examples are: developmental and neural plasticity, 
mirror neurons, motivation, freezing and freeing of 
degrees of freedom, and synergies; characterization 
of complexity and emergence, study of the effects of 
adaptation and growth, and practical work on body 
construction or development (Arbib et al., 2007; 
Oudeyer et al., 2007; Lungarella & Sporns, 2006).



PRINCIPLES FOR DEVELOPMENTAL SYSTEMS

In contrast to traditional disciplines such as physics
or mathematics, which are described by well-known
basic principles, the fundamental principles governing
the dynamics of developmental systems are unknown.
Could there be laws governing developmental systems,
or a theory? Although various attempts have been
initiated (Asada et al., 2001; Brooks et al., 1998;
Weng et al., 2001), it is fair to say that to date no such
theory has emerged. Here, en route to such a theory,
we point out a set of candidate principles. An approach 
based on principles is preferable for constructing 
intelligent autonomous systems, because it allows 
capturing design ideas and heuristics in a concise and 
pertinent way, avoiding blind trial-and-error. Principles 
can be abstracted from biological systems, and their 
inspiration can take place at several levels, ranging from 
a "faithful" replication of biological mechanisms to a 
rather generic implementation of biological principles 
leaving room for dynamics intrinsic to artifacts but 
not found in natural systems. In what follows we 
summarize five key principles revealed by observations 
of human development which may be used to construct 
developmental robots. 

The Value Principle 

Observations: Value systems are neural structures 
that mediate value and saliency and are found in 
virtually all vertebrate species. They are necessary 
for an organism's behavioral adaptation to salient 
(meaningful) environmental cues. By linking behavior 
and neuroplasticity, value systems are essential for 
deciding what to do in a particular situation (Sporns, 
2007). 

Lessons for robotics: The action of value systems
- through adaptive changes in sensorimotor connections
and inputs - enables an embodied agent to learn action
strategies without external supervision by increasing the
likelihood that a "good" movement pattern can recur in
the same behavioral context. Value systems may also be
used to guide an exploratory process and hence allow a
system to learn sensorimotor patterns more efficiently
than purely random or systematic exploration
(Gomez & Eggenberger, 2007). By imposing constraints
through value-dependent modulation of saliency, the
search space can be considerably reduced. Examples
of value systems in the brain include the dopaminergic,
cholinergic, and noradrenergic systems; based on them,
several models have been implemented and embedded
in developmental robots (Sporns, 2007).

The Principle of Information Self-Structuring

Observations: Infants frequently engage in repetitive 
(seemingly dull) behavioral patterns: they look at 
objects, grasp them, stick them into their mouths, 
bang them on the floor, and so on. It is through such 
interactions that intelligence in humans develops as 
children grow up interacting with their environment 
(Smith & Breazeal, 2007; Smith & Gasser, 2005). 

Lessons for robotics: The first important lesson is 
that information processing (neural coding) needs to 
be considered in the context of the embeddedness of 
the organism within its eco-niche. That is, robots and 
organisms are exposed to a barrage of sensory data 
shaped by sensorimotor interactions and morphology 
(Lungarella & Sporns, 2006). Information is not 
passively absorbed from the surrounding environment 
but is selected and shaped by actions on the environment. 
Second, information structure does not exist before 
the interaction occurs, but emerges only while the 
interaction is taking place. The absence of interaction 
would lead to a large amount of unstructured data 
and consequently to stronger requirements on neural 
coding, and - in the worst case - to the inability to 
learn. It follows that embodied interaction lies at the 
root of a powerful learning mechanism as it enables the 
creation of time-locked correlations and the discovery 
of higher-order regularities that transcend the individual
sensory modalities [Lungarella (2004; "principle of
information self-structuring")].

The Principle of Incremental Complexity 

Observations: Infants' early experiences are strongly 
constrained by the immaturity of their sensory, motor, 
and neural systems. Such early constraints, which at first
appear to be an inadequacy, are in fact an advantage,
because they effectively decrease the "information
overload" that otherwise would overwhelm the infant 
(Bjorklund & Green, 1992). 

Lessons for robotics: In order for an organism
- natural or artificial - to learn to control its own
complex brain-body system, it might be a good strategy
to start simple and gradually build on top of acquired
abilities. The well-timed and gradual co-development
of body morphology and neural system provides an
incremental approach to dealing with a complex and
unpredictable world. Early "morphological constraints"
and "cognitive limitations" can lead to more adaptive
systems as they allow exploiting the role that experience
plays in shaping the "cognitive" architecture. If an
organism were to begin by using its full complexity, it
would never be able to learn anything (Gomez et al.,
2004). It follows that designers should not try to "code"
a full-fledged ready-to-be-used intelligence module
directly into an artificial system. Instead, the system
should be able to discover on its own the most effective
ways of assembling low-level components into novel
solutions [Lungarella (2004; "starting simple"); Pfeifer
& Bongard (2007; "incremental process principle")].

The Principle of Interactive Emergence 

Observations: Development is not determined by innate 
mechanisms alone (in other words: not everything 
should be pre-programmed). Cognitive structure, for 
instance, is largely dependent on the interaction history 
of the developing system with the environment in which 
it is embedded (Hendriks-Jansen, 1996). 

Lessons for robotics: In traditional engineering 
the designer of the system imposes ("hard-wires") the 
structure of the controller and the controlled system. 
Designers of adaptive robots, however, should avoid 
implementing the robot's control structure according 
to their understanding of the robot's physics, but 
should endow the robot with means to acquire its 
own understanding through self-exploration and 
interaction with the environment. Systems designed 
for emergence tend to be more adaptive with respect to 
uncertainties and perturbations. The ability to maintain 
performance in the face of changes (such as growth or 
task modifications) is a long-recognized property of 
living systems. Such robustness is achieved through a 
host of mechanisms: feedback, modularity, redundancy, 
structural stability, and plasticity [Dautenhahn (2007; 
"interactive emergence"); Hendriks-Jansen (1996; 
"interactive emergence"); Prince et al. (2005; "ongoing 
emergence")]. 



The Principle of Cognitive Scaffolding 

Observations: Development takes place among 
conspecifics with similar internal systems and similar 
external bodies (Smith & Breazeal, 2007). Human 
infants, for instance, are endowed from an early age 
with the means to engage in simple, but nevertheless 
crucial social interactions, e.g. they show preferences 
for human faces, smell, and speech, and they imitate 
protruding tongues, smiles, and other facial expressions 
(Demiris & Meltzoff, 2007). 

Lessons for robotics: Social interaction bears 
many potential advantages for developmental robots: 
(a) it increases the system's behavioral diversity 
through mimicry and imitation (Demiris & Meltzoff, 
2007); (b) it supports the emergence of language and 
communication, and symbol grounding (Steels, 2006); 
and (c) it helps structure the robot's environment by 
simplifying and speeding up the learning of tasks and the 
acquisition of skills. Scaffolding is often employed by 
parents and caretakers (intentionally or not) to support, 
shape, and guide the development of infants. Similarly, 
the social world of the robot should be prepared to teach 
the robot progressively novel and more complex tasks 
without overwhelming its artificial cognitive structure 
[Lungarella (2004; "social interaction principle"); 
Mareschal et al. (2007; "ensocialment"); Smith & 
Breazeal (2007; "coupling to intelligent others")]. 



FUTURE TRENDS 

The further success of developmental robotics 
will depend on the extent to which theorists and 
experimentalists will be able to identify universal 
principles spanning the multiple levels at which 
developmental systems operate. Here, we briefly 
indicate some "hot" issues that need to be tackled en 
route to a theory of developmental systems. 

Semiotics: It is necessary to address the issue of how 
developmental robots (and embodied agents in general) 
can attribute meaning to symbols and construct semiotic 
systems. A promising approach, explored under the 
label of "semiotic dynamics", is that such semiotic 
systems and the associated information structure are 
continuously invented and negotiated by groups of 
people or agents, and are used for communication and 
information organization (Steels, 2006).

Core knowledge: An organism cannot develop
without some built-in ability. If all abilities are built 
in, however, the organism does not develop either. It 
will therefore be important to understand with what 
sort of core knowledge and explorative behaviors a 
developmental system has to be endowed, so that it 
can autonomously develop novel skills. One of the 
greatest challenges will be to identify core abilities 
and how they interact during development in building 
basic skills (Spelke, 2000). 

Core motives: It is necessary to conduct research 
on general capacities such as creativity, curiosity, 
motivations, action selection, and prediction (i.e. the
ability to foresee the consequences of actions). Ideally, no
tasks should be pre-specified to the robot, which should 
only be provided with an internal abstract reward 
function and a set of basic motivational (or emotional) 
drives that could push it to continuously master new 
know-how and skills (Lewis, 2000; Oudeyer et al., 
2007). 

Self-exploration: Another important challenge 
is the one of self-exploration or self-programming 
(Bongard et al., 2006). Control theory assumes that 
target values and states are initially provided by the 
system's designer, whereas in biology, such targets 
are created and revised continuously by the system 
itself. Such spontaneous "self-determined evolution" 
or "autonomous development" is beyond the scope of 
current control theory and needs to be addressed in 
future research. 

Learning causality: In a natural setting, no teacher 
can possibly provide a detailed learning signal and 
sufficient training data. Mechanisms will have to 
be created to characterize learning in an "ecological 
context" and for the developing agent to collect relevant 
learning material on its own. One significant future 
avenue will be to endow systems with the possibility 
to recognize progressively longer chains of cause and 
effect (Chater et al., 2006). 

Growth: As mentioned in the introduction, 
intelligence is acquired through a process of self- 
assembly, growth, and maturation. It will be important 
to study how physical growth, change of shape and 
body composition, as well as material properties of 
sensors and actuators affect and guide the emergence 
of cognition. This will allow connecting developmental 
robotics to computational developmental biology 
(Gomez & Eggenberger, 2007; Kumar & Bentley, 
2003). 



CONCLUSION 

The study of intelligent systems raises many
fundamental, but also very difficult questions. Can
machines think or feel? Can they autonomously acquire
novel skills? Can the interaction of the body, brain, and
environment be exploited to discover novel and creative
solutions to problems? Developmental robotics may be
an approach to explore such long-standing issues. At this
point, the field is bubbling with activity. Its popularity
is partly due to recent technological advances which
have allowed the design of robots whose "kinematic
complexity" is comparable to that of humans (Fig. 1).
The success of developmental robotics will ultimately
depend on whether it will be possible to crystallize its
central assumptions into a theory. While much additional
work is surely needed to arrive at or even approach a
general theory of intelligence, the beginnings of a new
synthesis are on the horizon. Perhaps, finally, we will
come closer to understanding and building (growing)
human-like intelligence. Exciting times are ahead of
us.



KEY TERMS

Adaptation: Refers to particular adjustments that 
organisms undergo to cope with environmental and 
morphological changes. In biology one can distinguish 
four types of adaptation: evolutionary, physiological, 
sensory, and learning. 

Bootstrapping: Designates the process of starting 
with a minimal set of functions and building increasingly 
more functionality in a step by step manner on top of 
structures already present in the system. 

Degrees of freedom problem: The problem 
of learning how to control a system with a very 
large number of degrees of freedom (also known as 
Bernstein's problem). 

Embodiment: Refers to the fact that intelligence 
requires a body, and cannot merely exist in the form 
of an abstract algorithm. 



Emergence: A process where phenomena at a certain 
level arise from interactions at lower levels. The term 
is sometimes used to denote a property of a system not 
contained in any one of its parts. 

Scaffolding: Encompasses all kinds of external 
support and aids that simplify the learning of tasks and 
the acquisition of new skills. 

Semiotic Dynamics: Field that studies how 
meaningful symbolic structures originate, spread, 
and evolve over time within populations, by combining 
linguistics and cognitive science with theoretical tools 
from complex systems and computer science. 
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INTRODUCTION 

This chapter starts from an exact gate-level reliability 
analysis of von Neumann multiplexing using majority 
gates of increasing fan-ins (A = 3, 5, 7, 9, 11) at the 
smallest redundancy factors (R_F = 2A), and details an 
accurate device-level analysis. The analysis complements 
well-known theoretical and simulation results. 
The gate-level analysis is exact because it is obtained by 
exhaustive counting. Extending this exact gate-level 
analysis to device-level errors allows us to analyze 
von Neumann majority multiplexing with respect to 
device malfunctions. These results explain abnormal 
behaviors of von Neumann multiplexing reported on the 
basis of Monte Carlo simulations. The analyses show that 
device-level reliability results are quite different from 
the gate-level ones, and could have profound implications 
for future (nano)circuit designs. 

SIA (2005) predicts that the semiconductor industry 
will continue its success in scaling CMOS for a few 
more generations. This scaling should become very 
difficult when approaching 16 nm. Scaling might 
continue further, but alternative nanodevices might be 
integrated with CMOS on the same platform. Besides 
the higher sensitivities of future ultra-small devices, 
the simultaneous increase of their numbers will create 
the ripe conditions for an inflection point in the way 
we deal with reliability. 

With shrinking geometries, the available reliability 
margins of future (nano)devices are being considerably 
reduced (Constantinescu, 2003), (Beiu et al., 
2004). From the chip designers' perspective, reliability 
currently manifests itself as time-dependent uncertainties 
and variations of electrical parameters. In the 
nano-era, these device-level parametric uncertainties 
are becoming too high to handle with prevailing worst-case 
design techniques without incurring a significant 
penalty in terms of area, delay, and power/energy. The 
global picture is that reliability looks like one of the 
greatest threats to the design of future ICs. For emerging 
nanodevices and their associated interconnects, 
the anticipated probabilities of failure could make 
future nano-ICs prohibitively unreliable. The present 
design approach based on the conventional zero-defect 
foundation is being seriously challenged. Therefore, 
fault- and defect-tolerance techniques will have to be 
considered from the early design phases. 

Reliability for beyond-CMOS technologies 
(Hutchby et al., 2002) (Waser, 2005) is expected to 
get even worse, as device failure rates are predicted 
to be as high as 10% for single electron technology, 
or SET (Likharev, 1999), going up to 30% for self-assembled 
DNA (Feldkamp & Niemeyer, 2006) (Lin et 
al., 2006). Additionally, a comprehensive analysis of 
carbon nanotubes for future interconnects (Massoud 
& Nieuwoudt, 2006) estimated the variations in delay 
at about 60% from the nominal value. Recently, defect 
rates of 60% were reported for a 160 Kbit molecular 
electronic memory (Green et al., 2007). Achieving 
100% correctness with 10^12 nanodevices will be not 
only outrageously expensive, but plainly impossible! 
Relaxing the requirement of 100% correctness should 
reduce manufacturing, verification, and test costs, while 
leading to more transient and permanent errors. It follows 
that most (if not all) of these errors will have to 
be compensated by architectural techniques (Nikolic 
et al., 2001) (Constantinescu, 2003) (Beiu et al., 2004) 
(Beiu & Rückert, 2009). 

From the system design perspective, errors fall 
into: permanent (defects), intermittent, and transient 



Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. 



Device-Level Majority von Neumann Multiplexing 



(faults). The origins of these errors can be found in the 
manufacturing process, the physical changes appearing 
during operation, as well as sensitivity to internal and 
external noises and variations. It is not clear whether 
emerging nanotechnologies will require new fault models, 
or whether multiple errors might have to be dealt with. Kuo 
(2006) even mentioned that: "we are unsure as to 
whether much of the knowledge that is based on past 
technologies is still valid for reliability analysis." The 
well-known approach for fighting against errors is to 
incorporate redundancy: either static (in space, time, 
or information) or dynamic (requiring fault detection, 
location, containment, and recovery). Space (hardware) 
redundancy relies on voters (generic, inexact, mid-value, 
median, weighted average, analog, hybrid, etc.) 
and includes: modular redundancy, cascaded modular 
redundancy, and multiplexing, such as von Neumann 
multiplexing or vN-MUX (von Neumann, 1952), enhanced 
vN-MUX (Roy & Beiu, 2004), and parallel restitution 
(Sadek et al., 2004). Time redundancy trades space 
for time, while information redundancy is based on 
error detection and error correction codes. 

This chapter explores the performance of vN-MUX 
when using majority gates of fan-in A (MAJ-A). The 
aim is to get a clear understanding of the trade-offs 
between the reliability enhancements obtained when 
using MAJ-A vN-MUX at the smallest redundancy 
factors R_F = 2A (see Fig. 1) on one side, versus both 
the fan-ins and the unreliable nanodevices on the other 
side. We start by reviewing some theoretical and 
simulation results for vN-MUX in the Background section. 
Exact gate-level simulations (based on an exhaustive 
counting algorithm) and accurate device-level 
estimates, including details of the effects played by 
nanodevices on MAJ-A vN-MUX, are introduced in 
the Main Focus of the Chapter section. Finally, implications 
and future trends are discussed in Future Trends, 
and conclusions and further directions of research 
end the chapter. 



BACKGROUND 

Multiplexing was introduced by von Neumann as 
a scheme for reliable computations (von Neumann, 
1952). vN-MUX is based on successive computing 
stages alternating with random interconnection stages. 
Each computing stage contains a set of redundant 
gates. Although vN-MUX was originally exemplified 
for NAND-2, it can be implemented using any type of 
gate, and could be applied at any level of abstraction 
(subcircuits, gates, or devices). The 'multiplexing' of 
each computation tries to reduce the likelihood of errors 
propagating further, by selecting the more-likely 
result(s) at each stage. Redundancy is quantified by a 
redundancy factor R_F, which indicates the multiplicative 
increase in the number of gates (subcircuits, or devices). 
In his original study, von Neumann (1952) assumed 
independent (uncorrelated) gate failures pf_GATE and 
very large R_F. The performance of NAND-2 vN-MUX 
was compared with other fault-tolerant techniques in 
(Forshaw et al., 2001), and it was analyzed at lower 
R_F (30 to 3,000) in (Han & Jonker, 2002), while the 
first exact analysis at very low R_F (3 to 100) for MAJ-3 
vN-MUX was done in (Roy & Beiu, 2004). 

Figure 1. Minimum redundancy MAJ-A vN-MUX: (a) MAJ-3 (R_F = 6); and (b) MAJ-5 (R_F = 10) 
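The alternation of a random interconnection stage and a computing stage can be illustrated with a small Monte Carlo sketch. It is only a toy model under stated assumptions, not the simulations cited in this chapter: it covers a single MAJ-3 computing stage, assumes the three input bundles should nominally carry the same logic value (so a majority of wrong inputs yields a wrong output), and treats gate failures as independent output flips. The function name and its parameters are hypothetical.

```python
import random

def simulate_maj3_stage(redundancy, pf_gate, p_input_error, trials=2000, seed=0):
    """Average fraction of wrong output wires after one MAJ-3 computing stage.

    Assumption: all three input bundles nominally carry the same logic value,
    so a gate output is wrong when at least two of its inputs are wrong,
    unless the gate itself fails (which flips the output).
    """
    rng = random.Random(seed)
    wrong_total = 0
    for _ in range(trials):
        # Three input bundles of `redundancy` wires; True marks a wrong wire.
        bundles = [[rng.random() < p_input_error for _ in range(redundancy)]
                   for _ in range(3)]
        # Random interconnection stage: permute each bundle independently.
        for bundle in bundles:
            rng.shuffle(bundle)
        # Computing stage: one faulty MAJ-3 gate per output wire.
        for wires in zip(*bundles):
            out_wrong = sum(wires) >= 2
            if rng.random() < pf_gate:
                out_wrong = not out_wrong
            wrong_total += out_wrong
    return wrong_total / (trials * redundancy)
```

Calling `simulate_maj3_stage(6, 0.01, 0.05)` estimates the fraction of wrong wires leaving one stage at R_F = 6; chaining several such stages would be needed to approximate a full multiplexed unit.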

The issue of which gate one should use is debatable 
(Ibrahim & Beiu, 2007). It was proven that using MAJ-3 
could lead to improved vN-MUX computations only for 
pf_MAJ-3 < 0.0197 (von Neumann, 1952), (Roy & Beiu, 
2004). This outperforms the NAND-2 error threshold 
pf_NAND-2 < 0.0107 (von Neumann, 1952), (Sadek et al., 
2004). Several other studies have shown that the error 
thresholds of MAJ are higher than those of NAND when 
used in vN-MUX. Evans (1994) proved that: 



The results based on exhaustive counting when 
varying pf_MAJ-A have confirmed both the theoretical and 
the simulation ones. MAJ-A vN-MUX at the minimum 
redundancy factor R_F = 2A improves the reliability 
over MAJ-A when pf_MAJ-A < 10%, and increasing A 
increases the reliability. When pf_MAJ-A ≥ 10%, using 
vN-MUX increases the reliability over that of MAJ-A as 
long as pf_MAJ-A is lower than a certain error threshold. If 
pf_MAJ-A is above the error threshold, the use of vN-MUX 
is detrimental, as the reliability of the system is lower 
than that of MAJ-A. Still, these do not explain the Monte 
Carlo simulation results mentioned earlier. 
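To make the idea of exhaustive counting concrete, the following sketch enumerates every pattern of wrong and correct inputs for a single faulty majority gate and sums the exact probability of a wrong output. This is an illustration only, not the counting algorithm of (Beiu et al., 2007): it assumes independent input errors with a common probability, a gate failure that flips the output, and hypothetical function and parameter names.

```python
from itertools import product

def maj_output_error(fan_in, pf_gate, p_input_error):
    """Exact probability that one faulty majority gate outputs the wrong value.

    Assumptions: each input is wrong independently with probability
    `p_input_error`; a gate failure (probability `pf_gate`) flips the output.
    """
    if fan_in % 2 == 0:
        raise ValueError("majority gates need an odd fan-in")
    p_wrong_majority = 0.0
    # Enumerate all 2^fan_in patterns; 1 marks a wrong input, 0 a correct one.
    for pattern in product((0, 1), repeat=fan_in):
        wrong = sum(pattern)
        prob = p_input_error ** wrong * (1 - p_input_error) ** (fan_in - wrong)
        if wrong > fan_in // 2:
            p_wrong_majority += prob
    # The output is wrong when exactly one of the two events occurs.
    return p_wrong_majority * (1 - pf_gate) + (1 - p_wrong_majority) * pf_gate
```

Because every pattern is counted, the result carries no sampling noise, which is what distinguishes this style of analysis from Monte Carlo estimates; the price is a running time that grows as 2^fan_in.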
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while the error threshold for NAND-A was determined 
in (Gao et al., 2005) by solving: 
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An approach for getting a better understanding of 
vN-MUX at very small R_F is to use Monte Carlo simulations 
(Beiu, 2005), (Beiu & Sulieman, 2006), (Beiu 
et al., 2006). These have revealed that the reliability of 
NAND-2 vN-MUX is in fact better than that of MAJ-3 
vN-MUX (at R_F = 6) for small geometrical variations 
v. As opposed to the theoretical results, where the 
reliability of MAJ-3 vN-MUX is always better than that of 
NAND-2 vN-MUX, the Monte Carlo simulations 
showed that MAJ-3 vN-MUX is better than NAND-2 
vN-MUX, but only for v > 3.4%. Such results were 
neither predicted (by theory) nor suggested by (gate-level) 
simulations. 

It is to be highlighted here that all the theoretical 
publications discuss unreliable organs, gates, nodes, 
circuits, or formulas, but very few mention devices. To 
get a clear picture, we have started by developing an 
exhaustive counting algorithm which exactly calculates 
the reliability of MAJ-A vN-MUX (Beiu et al., 2007). 
The probability of failure of MAJ-A vN-MUX is: 
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Both the original vN-MUX study and the subsequent 
theoretical ones have considered unreliable gates. They 
did not consider the elementary devices, and assumed 
that the gates have a fixed (bounding) pf_GATE. This 
assumption ignores the fact that different gates are built 
using different (numbers of) devices, logic styles, or 
(novel) technological principles. While a standard 
CMOS inverter has 2 transistors, NAND-2 and MAJ-3 
have 4 and 10 transistors, respectively. Forshaw et al. 
(2001) suggested that pf_GATE could be estimated as: 

pf_GATE = 1 - (1 - ε)^n     (4) 

where ε denotes the probability of failure of a nanodevice 
(e.g., transistor, junction, capacitor, molecule, 
quantum dot, etc.), and n is the number of nanodevices 
a gate has. 
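A short numerical sketch shows how strongly the device count drives the gate-level failure probability. It assumes the estimate has the form pf_GATE = 1 - (1 - ε)^n, the expression quoted in this chapter's conclusions, read here as "the gate fails if any one of its n independent devices fails". The function and dictionary names are hypothetical; the device counts are the CMOS transistor counts quoted above.

```python
def gate_failure_probability(eps, n_devices):
    """pf_GATE = 1 - (1 - eps)^n, assuming independent device failures."""
    return 1.0 - (1.0 - eps) ** n_devices

# Transistor counts quoted in the text for standard CMOS gates.
DEVICE_COUNTS = {"inverter": 2, "NAND-2": 4, "MAJ-3": 10}

def compare_gates(eps):
    """Gate failure probability of each listed gate at device failure rate eps."""
    return {name: gate_failure_probability(eps, n)
            for name, n in DEVICE_COUNTS.items()}

if __name__ == "__main__":
    for name, pf in compare_gates(0.01).items():
        print(f"{name}: pf_GATE = {pf:.4f}")
```

For the same ε, a gate built from more devices always has a larger pf_GATE under this estimate, which is why comparing gates at a fixed pf_GATE can be misleading.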



Using eq. (4) as pf_MAJ-A = 1 - (1 - ε)^(2A), the reliabilities 
have been estimated by modifying the exact counting 
results reported in (Beiu et al., 2007). The device-level 
estimates can be seen in Fig. 2. They show that increasing 
A will not necessarily increase the reliability of MAJ-A 
vN-MUX (over MAJ-A). This happens because 
MAJ-A with larger A require more nanodevices. In 
particular, while MAJ-11 vN-MUX is the best solution 
for ε < 1‰ (Fig. 2(a)), it becomes the worst for ε > 2% 
(Fig. 2(b)). Hence, larger fan-ins are advantageous for 
lower ε (< 1‰), while small fan-ins perform better for 
larger ε (> 1%). Obviously, there is a "swapping" region 
where the ranking is being reversed. We detail Fig. 2(b) 
for ε > 1% (Fig. 2(c)) and for the "swapping" region 
1‰ < ε < 2% (Fig. 2(d), where 'o' marks show the 
envelope). 

Figure 2. Probability of failure of MAJ-A vN-MUX plotted versus the probability of failure of the elementary 
(nano)device ε: (a) small ε (< 1‰); (b) large ε (> 1%); (c) detail for ε in between 1% and 10%; (d) detailed 
view of the "swapping" region (ε in between 1‰ and 2%) 

These results imply that increasing A and/or 
R_F does not necessarily improve the overall system's 
reliability. This is because increasing A and/or R_F leads 
to increasing the number of nanodevices: 

This quadratic dependence on A has to be accounted 
for. Basically, it is ε and the number of devices N, and 
not (only) R_F and pf_GATE, which should be used when 
trying to accurately predict the advantages of vN-MUX, 
or of any other redundancy scheme. 



The next step we took was to compare the device-level 
estimates of MAJ-A vN-MUX (Fig. 3(a)) with 
the Monte Carlo simulation ones (Fig. 3(b), adapted 
from (Beiu, 2005) (Beiu & Sulieman, 2006)). The two 
plots in Fig. 3 have the same vertical scale and exhibit 
similar shapes. Still, a direct comparison is not trivial, 
as they are mapped against different variables (ε and 
v, respectively). This similarity makes us confident that 
the estimated results are accurate, and supports the 
claim that a simple estimate for pf_GATE leads to good 
approximations at the system level. For other insights 
the interested reader should consult (Anghel & Nicolaidis, 
2007) and (Martorell et al., 2007). 

Figure 3. Probability of failure of MAJ-3 vN-MUX: (a) using device-level estimates and exact counting results; 
(b) C-SET Monte Carlo simulations for MAJ-3 vN-MUX 
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FUTURE TRENDS 

In a first set of experiments, we compared the reliability 
of MAJ-A with the reliability of MAJ-A vN-MUX at 
R_F = 2A. For device-level analyses the comparison is no 
longer obvious, because the MAJ-A curves do not lie on a 
45° line, which makes it hard to see where and by how 
much vN-MUX improves over MAJ-A. The results of these 
simulations can be seen in Fig. 4, where we have used the 
same interval ε ∈ [0, 0.11] on the horizontal axis. Here again 
it looks like the smallest fan-in is the best. 

A second set of experiments has studied the effect 
of changing A on the error threshold of MAJ-A vN- 
MUX. Fig. 5(a) shows the theoretical gate-level error 
thresholds (using eq. (1)), as well as the achievable 
gate-level error thresholds evaluated based on simu- 
lations using the exhaustive counting algorithm. Fig. 
5(a) shows that the exact gate-level error thresholds are 
higher than the theoretical gate-level error threshold 
values (by about 33%). It would appear that one could 
always enhance reliability by going for a larger fan-in. 
The extension to device-level estimates can be seen in 
Fig. 5(b), which reveals a completely different picture. 
These results imply that: 

• device-level error thresholds are about 10× lower 
than the gate-level error thresholds; 
• device-level error thresholds decrease with 
increasing fan-in (exactly the opposite of gate-level 
error thresholds); 
• for vN-MUX, the highest device-level error 
threshold of about 4% is achieved when using 
MAJ-3. 



CONCLUSION 

This chapter has presented a detailed analysis of MAJ-A 
vN-MUX for very small fan-ins: exact for the gate-level 
and estimated but accurate for the device-level. The 
main conclusions are as follows. 

• Exact gate-level error thresholds for MAJ-A vN-MUX 
are about 33% better than the theoretical 
ones and increase with increasing fan-in. 
• Estimated device-level error thresholds are about 
10× lower than gate-level error thresholds and 
decrease with increasing fan-in, making 
smaller fan-ins better (Beiu & Makaruk, 1998), 
(Ibrahim & Beiu, 2007). 
• The abnormal (nonlinear) behavior of vN-MUX 
(Beiu, 2005), (Beiu & Sulieman, 2006) is due 
to the fact that the elementary gates are made 
of unreliable nanodevices (implicitly accounted 
for by Monte Carlo simulations, but neglected 
by theoretical approaches and gate-level simulations). 
• Extending the exact gate-level simulations to 
device-level estimates using 1 - (1 - ε)^n leads 
to quite accurate approximations (as compared 
to Monte Carlo simulations). 
• Device-level estimates show much more complex 
behaviors than those revealed by gate-level 
analyses (nonlinear for large ε, leading to multiple 
and intricate crossings). 
• Device-level estimates suggest that reliability 
optimizations for large ε will be more difficult than 
what was expected from gate-level analyses. 
• One way to maximize reliability when ε is large 
and unknown (e.g., time varying (Srinivasan 
et al., 2005)) is to rely on 'adaptive' gates, so 
neural inspiration should play a(n important) 
role in future nano-IC designs (Beiu & Ibrahim, 
2007). 

Figure 4. Probability of failure of MAJ-A and MAJ-A vN-MUX plotted versus the device probability of failure 
ε: (a) A = 3; (b) A = 5; (c) A = 7; (d) A = 9. For uniformity, the same interval ε ∈ [0, 0.11] was used for all 
the plots 

Figure 5. Error thresholds for MAJ-A vN-MUX versus fan-in A: (a) gate-level error threshold, both theoretical 
(red) and exact (yellow) as obtained through exhaustive counting; (b) device-level error threshold 

Finally, precision is very important as "small errors 
... have a huge impact in estimating the required level 
of redundancy for achieving a specified/target reli- 
ability" (Roelke et al., 2007). It seems that the current 
gate-level models tend to underestimate reliability, 
while we do not do a good job at the device-level, with 
Monte Carlo simulation the only widely used method. 
More precise estimates than the ones presented in this 
chapter are possible using Monte Carlo simulations 
in combination with gate-level reliability algorithms. 
These are clearly needed and have just started to be 
investigated (Lazarova-Molnar et al., 2007). 
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KEY TERMS 

Circuit: Network of devices. 

Counting (Exhaustive): The mathematical action 
of repeated addition (exhaustive considers all possible 
combinations). 



Device: Any physical entity deliberately affecting 
the information-carrying particles (or their associated 
fields) in a desired manner, consistent with the intended 
function of the circuit. 

Error Threshold: The probability of failure of a 
component (gate, device) above which the multiplexed 
scheme is not able to improve over the component 
itself. 

Fan-In: Number of inputs (to a gate). 

Fault-Tolerant: The ability of a system (circuit) 
to continue to operate rather than failing completely 
(possibly at a reduced performance level) in the event 
of the failure of some of its components. 

Gate (Logic): Functional building block (in digital 
logic a gate performs a logical operation on its logic 
inputs). 

Majority (Gate): A logic gate of odd fan-in which 
outputs a logic value equal to that of the majority of 
its inputs. 

Monte Carlo: A class of stochastic (by using 
pseudorandom numbers) computational algorithms for 
simulating the behaviour of physical and mathemati- 
cal systems. 

Multiplexing (von Neumann): A scheme for 
reliable computations based on successive computing 
stages alternating with random interconnection stages 
(introduced by von Neumann in 1952). 

Redundancy (Factor): Multiplicative increase in 
the number of (identical) components (subsystems, 
blocks, gates, devices), which can (automatically) 
replace (or augment) failing component(s). 

Reliability: The ability of a circuit (system, gate, 
device) to perform and maintain its function(s) under 
given (as well as hostile or unexpected) conditions, for 
a certain period of time. 
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INTRODUCTION 

When working in artificial intelligence, one of the tasks 
we can undertake is optimizing the possible solutions 
of a problem. In optimization problems we search for 
the best solution, or one good enough, to a problem 
among many alternatives. 

The problems we try to solve are common in daily life. 
Every person constantly works out optimization problems, 
e.g. finding the quickest way from home to work 
taking into account traffic restrictions. Humans can find 
solutions to these problems efficiently because they 
are easy enough. Nevertheless, problems can be more 
complex, for example reducing the fuel consumption of a 
fleet of planes. Computational algorithms are required 
to tackle this kind of problem. A first approach to solving 
them is an exhaustive search. Theoretically, this 
method always finds the solution, but it is not efficient, 
as its execution time grows exponentially with the problem size. 
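As an illustration of why exhaustive search does not scale, the sketch below solves a small 0/1 knapsack instance by trying every subset of items. It is a minimal example with hypothetical function and argument names; with n items it evaluates 2^n candidate selections, so each additional item doubles the work.

```python
from itertools import product

def knapsack_exhaustive(profits, weights, capacity):
    """Try all 2^n item selections and return (best_profit, best_selection)."""
    best_profit = 0
    best_choice = tuple(0 for _ in profits)   # the empty selection is always feasible
    for choice in product((0, 1), repeat=len(profits)):
        weight = sum(w * x for w, x in zip(weights, choice))
        if weight > capacity:
            continue                           # infeasible selection
        profit = sum(p * x for p, x in zip(profits, choice))
        if profit > best_profit:
            best_profit, best_choice = profit, choice
    return best_profit, best_choice

if __name__ == "__main__":
    print(knapsack_exhaustive([60, 100, 120], [10, 20, 30], 50))
```

The search is guaranteed to find an optimal selection, but 30 items already require about a billion evaluations, which motivates the heuristic methods discussed next.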

To improve on this method, heuristics were 
proposed. Heuristics are intelligent techniques, methods 
or procedures that use expert knowledge to solve 
tasks; they try to obtain high performance in terms 
of solution quality and resources used. 

Metaheuristics, a term first used by Fred Glover in 
1986 (Glover, 1986), arose to improve heuristics, and 
can be defined as (Melian, Moreno & Moreno, 2003) 
'intelligent strategies for designing and improving very 
general heuristic procedures with a high performance'. 
Since Glover, the field has been extensively developed. 
The current trend is designing new metaheuristics that 
improve the solution to given problems. However, 
another very interesting line is to reuse existing 
metaheuristics in a coordinated system. In this article we 
present two different methods following this line. 



BACKGROUND 

Several studies have shown that heuristics and 
metaheuristics are successful tools for providing reasonably 
good solutions (excellent in some cases) using a moderate 
amount of resources. A brief look at recent literature 
(Glover & Kochenberger, 2003), (Hart, Krasnogor & 
Smith, 2004), (Pardalos & Resende, 2002) reveals 
the wide variety of problems and methods which appear 
under the overall topic of heuristic optimization. 
Within this, obtaining strategies which cooperate in 
parallel is an interesting trend. The interest arises 
for two reasons: larger problem instances 
may be solved, and robust tools, which offer high-quality 
solutions despite variations in the characteristics of the 
instances, may be obtained. 
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Different Approaches for Cooperation with Metaheuristics 



There are different ways of obtaining this cooperation. 
One way is through ant colony systems (Dorigo & 
Stützle, 2003) and swarm-based methods (Eberhart & 
Kennedy, 2001), which appear as some of the first cooperative 
mechanisms inspired by nature. Nevertheless, the 
cooperation principle they have presented to date is too 
rigid for a general-purpose model (Crainic & Toulouse, 
2003). Another way is through parallel metaheuristics, where 
very interesting perspectives appear. This is the line 
we will follow. 

There have been huge efforts to parallelize different 
metaheuristics. Thus we may find synchronous 
implementations of these methods where the information is shared 
at regular intervals, such as (Crainic, Toulouse & Gendreau, 
1997) using Tabu Search and (Lee & Lee, 1992) using 
Simulated Annealing. More recently there have been 
multi-thread asynchronous cooperative implementations 
(Crainic, Gendreau, Hansen & Mladenovic, 2004) or 
multilevel cooperative searches (Banos, Gil, Ortega 
& Montoya, 2004) which, according to the reports in 
(Crainic & Toulouse, 2003), provide better results than 
the synchronous implementations. 

However, it seems that a cooperative strategy 
based on a single metaheuristic does not cover all the 
possibilities and the use of strategies which combine 
different metaheuristics is recommended. The paper 
(Le Bouthillier & Crainic, 2005) is a good example. A 
whole new area of research opens up. Questions such as 
'What will be the role of each metaheuristic?' or 'What 
cooperation mechanisms should be used?' arise. 

Within parallel metaheuristics, we will focus, 
following the classification of (Crainic & Toulouse, 
2003), on multi-search metaheuristics, where several 
concurrent strategies search the solution space. Among 
them, we concentrate on those techniques, known as 
cooperative multi-search metaheuristics, where each 
strategy exchanges information with the others during 
the execution. 

Cooperative multi-search metaheuristics obtain 
better quality solutions than independent methods. But 
previous studies (Crainic & Toulouse, 2002), (Crainic, 
Toulouse & Sanso, 2004) demonstrate that cooperative 
methods with unrestricted access to shared 
information may experience problems of premature 
convergence. This seems to be due to the stabilization 
of the shared information, caused by the 
intense exchange of the best solutions. So it would 
be interesting to find a way of controlling this 
information exchange. 



In this context we propose two approaches in or- 
der to control the exchange of information, one using 
memory to cope with this problem, and the other using 
a process of knowledge extraction. 

The first approach (Pelta, Cruz, Sancho-Royo & 
Verdegay, 2006) proposes a cooperative strategy where 
a coordinating agent, modelled by a set of 'ad hoc' fuzzy 
rules, receives information from a set of solver agents 
and sends instructions to each of them telling them how to 
continue. Each solver agent is a clone implementing the 
Fuzzy Adaptive Neighbourhood Search (FANS) metaheuristic 
(Blanco, Pelta & Verdegay, 2002). FANS is 
conceived as an adaptive fuzzy neighbourhood-based 
metaheuristic. Its characteristics allow FANS to 
capture the qualitative behaviour of several metaheuristics, 
and thus it can be considered a "framework" 
of metaheuristics. 

The second approach (Cadenas, Garrido, Liern, 
Munoz & Serrano, 2007) uses the same structure 
but combines a set of different metaheuristics which 
cooperate within a single coordinated schema, where 
a coordinating agent modelled by a set of fuzzy rules 
receives information from the different metaheuristics 
and sends instructions to each of them. The difference 
with the previous system lies in the way the rules are 
obtained: here, they are the result of a knowledge extraction 
process (Cadenas, Garrido, Hernandez & Munoz, 
2006), (Cadenas, Diaz-Valladares, Garrido, Hernandez 
& Serrano, 2006). 



TWO COOPERATIVE MULTI-SEARCH 
METAHEURISTICS 

A Cooperative Multi-Search 
Metaheuristic Using Memory and 
Fuzzy Rules 

The idea of the first strategy can be explained with the 
help of the diagram in Fig. 1. Given a concrete prob- 
lem to solve, we have a set of solvers to deal with it. 
Each solver develops a particular strategy to solve the 
problem independently, and the whole set of solvers 
works simultaneously without direct interaction. In 
order to coordinate the solvers there is a coordinator 
which knows all the general aspects of the problem 
concerned and the particular solver features. The 
coordinator receives reports from the solvers with the 
obtained results, and returns orders to them. 
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Figure 1. Diagram of the first strategy 
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The inner workings of the strategy are quite simple. 
In the first step, the coordinator determines the initial 
behaviour (set of parameters) for each solver, which 
models a particular optimization strategy. This behaviour 
is passed to each solver, which is then executed. 
For each solver, the coordinator keeps the last two 
reports containing their times, and the corresponding 
solutions at such times. From the solver information 
set, the coordinator calculates performance measures 
that will be used to adapt the fuzzy rule base. This 
fuzzy rule base is generated manually following the 
principle: if a solver is working well, keep it; but 
if a solver seems to be trapped, do something to alter 
its behaviour. 

Solvers execute asynchronously, sending and 
receiving information. The coordinator checks which 
solver provided new information and decides whether 
its behaviour needs to be adapted using the fuzzy rule 
base. If this is the case, it obtains a new behaviour and 
sends it to the solver. Solver operation is quite simple: 
once execution has begun, performance information 
is sent and adaptation orders from the coordinator are 
received alternately. 
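The coordinator logic described above can be sketched in a few lines. This is an illustrative simplification, not the system of (Pelta, Cruz, Sancho-Royo & Verdegay, 2006): a single crisp threshold stands in for the fuzzy rule base, the objective is assumed to be a cost to minimise, and the class, method names and `stall_threshold` value are all hypothetical.

```python
from collections import deque

class Coordinator:
    """Keeps the last two reports per solver and decides whether to adapt it."""

    def __init__(self, stall_threshold=0.01):
        # Hypothetical tuning value: minimum cost improvement per time unit.
        self.stall_threshold = stall_threshold
        self.reports = {}  # solver id -> deque of (time, cost), at most two

    def receive(self, solver_id, time, cost):
        """Store a report; only the two most recent ones are kept."""
        history = self.reports.setdefault(solver_id, deque(maxlen=2))
        history.append((time, cost))

    def improvement_rate(self, solver_id):
        """Cost decrease per time unit between the last two reports."""
        history = self.reports.get(solver_id, ())
        if len(history) < 2:
            return None
        (t0, c0), (t1, c1) = history
        if t1 <= t0:
            return None
        return (c0 - c1) / (t1 - t0)

    def decide(self, solver_id):
        """Crisp stand-in for the rule 'working well: keep; trapped: alter'."""
        rate = self.improvement_rate(solver_id)
        if rate is None or rate > self.stall_threshold:
            return "keep"
        return "change_behaviour"
```

In a fuzzy version, the improvement rate would be mapped to membership degrees such as "working well" or "trapped", and the rule base would combine them to choose the new parameter set for the solver.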

Each solver thread is implemented by means of 
the FANS metaheuristic. We use this metaheuristic 
for three main reasons: 



• FANS is essentially a threshold-acceptance local 
search technique and is therefore easy to understand 
and implement, and does not require many 
computational resources. 
• FANS can be used as a heuristic template. Each 
set of parameters implies a different behaviour 
of the method, and therefore, the qualitative 
behaviour of other local search techniques can be 
simulated (Blanco, Pelta & Verdegay, 2002). 
• The previous point enables different search 
schemes to be built, and diversification and 
intensification procedures to be driven easily. 
We therefore do not need to implement different 
algorithms but merely use FANS as a template. 
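The notion of a threshold-acceptance local search used as a template can be illustrated with a generic sketch. This is not FANS itself, which relies on fuzzy valuations and an adaptive neighbourhood; it is a plain threshold-accepting loop in which the `threshold` parameter plays the role of a "behaviour". All names are hypothetical, and the objective is assumed to be a cost to minimise.

```python
import random

def threshold_local_search(initial, neighbour, cost, threshold,
                           max_iters=1000, seed=0):
    """Generic threshold-accepting local search (illustrative, not FANS).

    A candidate is accepted when it is at most `threshold` worse than the
    current solution. With threshold = 0 only non-worsening moves are taken;
    larger thresholds allow worsening moves and hence more diversification.
    """
    rng = random.Random(seed)
    current = best = initial
    for _ in range(max_iters):
        candidate = neighbour(current, rng)
        if cost(candidate) - cost(current) <= threshold:
            current = candidate
            if cost(current) < cost(best):
                best = current
    return best

if __name__ == "__main__":
    # Toy usage: minimise x^2 over the integers, starting from x = 10.
    step = lambda x, rng: x + rng.choice((-1, 1))
    print(threshold_local_search(10, step, lambda x: x * x, threshold=0))
```

Running the same loop with different parameter values yields qualitatively different searches, which is the sense in which one parameterised method can act as a template for several local search schemes.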

A Cooperative Multi-Search 
Metaheuristic Using Data Mining to 
Obtain Fuzzy Rules 

The idea of the second strategy is very similar to the 
first one, and can be seen in Fig. 2. Given a concrete 
problem to solve, we have a set of solvers to deal with 
it. Each solver implements a different metaheuristic, 
and has to solve the problem while coordinating 
with the rest of the metaheuristics. In order to perform the 
cooperation we use a coordinating agent which will 
control and modify the behaviour of the agents. 

Figure 2. Diagram of the second strategy 

Figure 3. The knowledge extraction process 

To perform the communication among the different 
metaheuristics an adapted blackboard model is used. In this 
model each agent controls a part of the blackboard where 
it writes its performance information, and periodically 
updates the solution found. The coordinator consults 
the blackboard in order to monitor the behaviour of 
each metaheuristic and to decide how their behaviour 
has to be modified. 
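A blackboard of this kind can be sketched as a shared structure with one section per agent. The sketch is a minimal single-process illustration under stated assumptions: the class, the `best_value` field and the helper `worst_agent` are hypothetical names, the objective is assumed to be maximised, and no concurrency control is shown.

```python
class Blackboard:
    """Shared store in which each agent owns and updates one section."""

    def __init__(self, agent_ids):
        self._sections = {agent_id: {} for agent_id in agent_ids}

    def write(self, agent_id, **info):
        # An agent only updates its own section; unknown ids raise KeyError.
        self._sections[agent_id].update(info)

    def read_all(self):
        # The coordinator gets copies, so it cannot alter agents' sections.
        return {agent_id: dict(section)
                for agent_id, section in self._sections.items()}

def worst_agent(blackboard):
    """Coordinator-side helper: agent with the lowest reported best value."""
    reported = {agent_id: section["best_value"]
                for agent_id, section in blackboard.read_all().items()
                if "best_value" in section}
    return min(reported, key=reported.get) if reported else None

if __name__ == "__main__":
    board = Blackboard(["genetic", "tabu", "annealing"])
    board.write("genetic", best_value=120, iterations=500)
    board.write("tabu", best_value=95, iterations=800)
    print(worst_agent(board))
```

The coordinator could then apply its rules to the agent returned by `worst_agent`, for instance by handing it a better solution or a new parameter set.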

To give intelligence to the coordinator we propose 
the use of a set of fuzzy rules obtained as a result of 
the methodological process proposed in (Cadenas, 
Garrido, Hernandez & Munoz, 2006), (Cadenas, 
Diaz-Valladares, Garrido, Hernandez & Serrano, 2006), and 
based on a process of knowledge extraction. 

The process of knowledge extraction is shown in 
Fig. 3 and is divided into the following phases: 

1. Preparation of data. In this phase we try to obtain 
a database containing useful information so that 
Data Mining can be applied. The metaheuristics 
that are going to be used in the system are 
chosen and applied to solve sets of instances of 
the problem, extracting from these executions 
interesting data such as the parameters of each 
metaheuristic or the solutions obtained. After 
that, it is advisable to preprocess the 
database in order to keep the most relevant 
attributes and instances. 

2. Data mining. In the second phase, Data Mining 
techniques are applied to the information obtained 
in the previous phase in order to get the model 
of the system coordinator. First a Data Mining 
technique is chosen, then it is applied to the database, 
and finally a set of fuzzy rules is deduced from 
the models obtained. 

3. Evaluation. In this phase we test the model 
of the coordinator, checking whether it performs 
efficiently with regard to computational cost 
and the solutions obtained. 

In this paper we present a synchronous implementa- 
tion of this model where three metaheuristics are used: 
a genetic algorithm, a tabu search and a simulated 
annealing. 

The resulting system operates as follows:
first, the coordinator sets the initial parameters
of each metaheuristic according to the knowledge
previously extracted. Each solver then starts its
search. Periodically all the solvers stop and write their
solutions; the coordinator evaluates them and,
using the fuzzy rule base, decides how to change
the solutions and parameters of each metaheuristic. To
formulate the rule base we used the knowledge extraction
process defined above, whose data mining phase
was performed using fuzzy decision trees.

Results Obtained by These Approaches 

Once we have studied both systems let us show the 
results obtained by them. Both strategies were tested 
solving the knapsack problem, whose mathematical 
formulation is 



\[ \max \sum_{j=1}^{n} p_j x_j \]

\[ \text{s.t.} \quad \sum_{j=1}^{n} w_j x_j \le C, \qquad x_j \in \{0, 1\}, \quad j = 1, \ldots, n \]

where n is the number of items, $x_j$ indicates whether
the item j is included in the knapsack or not, $p_j$ is
the profit associated with item j, $w_j \in [0, \ldots, r]$ is the
weight of item j, and C is the knapsack capacity.
We also assume that $w_j \le C, \forall j$ (every item fits in the
knapsack), and

\[ \sum_{j=1}^{n} w_j > C \]

(the whole set of items does not fit).
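
As a small illustration of the formulation above, the following sketch evaluates a candidate selection and checks the two instance assumptions. The function names and the toy instance are hypothetical and are not taken from the experiments reported here.

```python
from typing import Optional, Sequence


def is_valid_instance(w: Sequence[int], C: int) -> bool:
    """Check the two assumptions: every item fits on its own,
    but the whole set of items does not fit."""
    return all(wj <= C for wj in w) and sum(w) > C


def knapsack_profit(x: Sequence[int], p: Sequence[int],
                    w: Sequence[int], C: int) -> Optional[int]:
    """Total profit of the 0/1 selection x, or None if x exceeds capacity C."""
    if sum(wj * xj for wj, xj in zip(w, x)) > C:
        return None
    return sum(pj * xj for pj, xj in zip(p, x))


# Toy instance (illustrative values only).
profits, weights, capacity = [10, 7, 4], [5, 4, 3], 8
print(is_valid_instance(weights, capacity))                     # True
print(knapsack_profit([1, 0, 1], profits, weights, capacity))   # 14
print(knapsack_profit([1, 1, 0], profits, weights, capacity))   # None
```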

We chose the knapsack problem because
we can construct test instances of varying hardness
according to three characteristics: instance size, type
of correlation between weights and profits, and range
of the values available for the weights.

We finally carried out the tests with an implementation
of each system. The implementation of the memory
based system used six solvers, each one implementing
FANS with different parameters and executed for
30 seconds (Memory based in Table 1). The implementation
of the data mining based system, as said before,
used three solvers, each one implementing a different
metaheuristic (a genetic algorithm, a tabu search and a
simulated annealing) and executed for 60 seconds
(DM based in Table 1). In order to test the performance
of the systems we also executed each metaheuristic
individually (FANS, Tabu Search, Simulated Annealing,
Genetic Algorithm) for 180 seconds.

In Table 1 we show the average error obtained for
different types and sizes of instances, comparing the
Memory based approach with individual FANS, and
the DM based approach with the average results of the three
metaheuristics that compose it. As we can see, each
strategy outperforms its components.



Table 1. Average error for knapsack problem

Type     Size   Memory based   FANS    DM based   Avg. Individual
...      ...    ...            ...     0.84       3.2
...      ...    ...            2.04    1.32       4.31
SC       1000   0.94           3.16    0.93       3.6
SC       2000   1.69           3.81    2.29       5.38
Circle   1000   6.73           10.42   5.14       15.44
Circle   2000   8.10           13.41   12.11      18.26



FUTURE TRENDS

This work opens several lines of research. For the
first strategy, an open question is what kind of information
should be stored in the coordinator memory and how it is
used to control the global search behaviour of the
strategy. For the second, the knowledge extraction
process needs to be improved and tested with different
data mining techniques. For both strategies, the
improvement of the fuzzy rule base is another topic
to be addressed.

One last consideration is the application field. The 
knapsack problem is considered one of the "easiest" 
NP-hard problems, so it would be interesting to apply 
these strategies to more complex problems such as 
the p-median, p-hub median or the protein structure 
comparison. 



CONCLUSION 

This paper proposes two strategies to cope with the
convergence problems shown by cooperative multi-search
metaheuristics. The first strategy suggests the
use of memory in order to define a set of fuzzy rules
which control the exchanges of solutions in
a coordinated schema where similar metaheuristics
cooperate. The second strategy proposes the use of a
knowledge extraction process to obtain a set of fuzzy
rules which control the exchanges of solutions in a
coordinated system where different metaheuristics
cooperate. Both approaches have been tested and have
shown good performance.

KEY TERMS

Blackboard: A shared repository of problems, par- 
tial solutions, suggestions, and contributed information. 
The blackboard can be seen as a dynamic "library" of 
contributions to the current problem that have been 
recently "published" by other knowledge sources. 

Cooperative Multi-Search Metaheuristics: A
parallelization strategy for metaheuristics in which
parallelism is obtained from multiple concurrent
explorations of the solution space and where metaheuristics
exchange information during their execution in order
to cooperate.

Data Mining: The most characteristic stage of the 
Knowledge Extraction process, where the aim is to 
produce new useful knowledge by constructing a model 
from the data gathered for this purpose. 

Fuzzy Rules: Linguistic if-then constructions that 
have the general form "if A then B" where A and B 
are collections of propositions containing linguistic 
variables (A is called the premise and B is the con- 
sequence). The use of linguistic variables and fuzzy 
if-then rules exploits the tolerance for imprecision 
and uncertainty. 

Heuristic: A method or approach that tries to apply
expert knowledge in the resolution of a problem with
the aim of increasing the probability of solving it.

Knowledge Extraction/Discovery: The non-trivial
process of identifying valid, novel, potentially useful
and ultimately understandable patterns from large
data collections. It is the overall process and discipline of
extracting useful knowledge and includes data warehousing,
data cleansing and data manipulation tasks
right through to the interpretation and exploitation of
results.

Metaheuristic: A high-level strategy for solving
a very general class of computational problems by
combining user given black-box procedures, usually
heuristics, in a hopefully efficient way.

Optimization Problem: A computational problem
whose object is to find the best from all feasible
solutions.

Parallel Metaheuristics: Metaheuristics in which
different threads concurrently search the solution space.
They appear naturally in the development of metaheuristics as a
way of improving the acceleration factor in the search
for solutions.

Problem Instance: A concrete representation of a
problem with characteristics that distinguish it from
the rest.



INTRODUCTION

Many practical engineering applications can be formulated
as a global optimization problem in which the objective
function has many local minima and derivatives
of the objective function are unavailable. Differential
Evolution (DE) is a floating-point encoding evolutionary
algorithm for global optimization over continuous
spaces (Storn & Price, 1997) (Liu & Lampinen, 2005)
(Price, Storn & Lampinen, 2005) (Feoktistov, 2006).
Nowadays it is used as a powerful global optimization
method within a wide range of research areas.

Recent research indicates that self-adaptive DE
algorithms are considerably better than the original DE
algorithm. The necessity of changing control parameters
during the optimization process is also confirmed
by the experiments in (Brest, Greiner, Boskovic,
Mernik & Zumer, 2006a). DE with self-adaptive control
parameters has already been presented in (Brest et al.,
2006a).

This chapter presents self-adaptive approaches that
were recently proposed for control parameters in the DE
algorithm.



BACKGROUND 
Differential Evolution 

DE creates new candidate solutions by combining the
parent individual and several other individuals of the
same population. A candidate replaces the parent only
if it has a better fitness value.

The population of the original DE algorithm (Storn
& Price, 1995) (Storn & Price, 1997) contains NP
D-dimensional vectors $x_{i,G}$, $i = 1, 2, \ldots, NP$, where G denotes
the generation. The initial population is usually selected
uniformly at random between the lower and upper bounds.
The bounds are specified by the user according to the
nature of the problem. After initialization DE performs
several vector transforms (operations): mutation,
crossover, and selection.

The mutant vector $v_{i,G}$ can be created by using one of
the mutation strategies (Price et al., 2005). The most
useful strategy is 'rand/1':

\[ v_{i,G} = x_{r_1,G} + F \times (x_{r_2,G} - x_{r_3,G}), \]

where F is the mutation scale factor within the range [0,
2], usually less than 1. The indexes $r_1$, $r_2$, $r_3$ are
random, mutually distinct integers generated within the range
[1, NP], and also different from the index i.

After mutation, a 'binary' crossover operation forms
the trial vector $u_{i,G}$ from the i-th population vector
and its corresponding mutant vector $v_{i,G}$:

\[ \text{if } (rand \le CR \text{ or } j = j_{rand}) \text{ then } u_{i,j,G} = v_{i,j,G} \text{ else } u_{i,j,G} = x_{i,j,G}, \]

where $i = 1, 2, \ldots, NP$ and $j = 1, 2, \ldots, D$. CR is the
crossover parameter or factor within the range [0, 1]
and represents the probability of creating parameters
for the trial vector from the mutant vector. The uniform
random value rand is within [0, 1]. The index $j_{rand} \in [1, D]$
is a randomly chosen index and ensures that
the trial vector contains at least one parameter from
the mutant vector.

The selection operation selects, according to the
objective fitness value of the population vector $x_{i,G}$ and
its corresponding trial vector $u_{i,G}$, which vector will
survive to be a member of the next generation.
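
The three operations described above (mutation, binary crossover, selection) can be put together in a short sketch. This is a minimal illustration for minimisation with fixed F, CR and NP, not a reference implementation: the function name, the default parameter values and the absence of any bound handling for mutant vectors are assumptions made here for brevity.

```python
import random
from typing import Callable, List, Sequence, Tuple


def de_rand_1_bin(
    f: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    NP: int = 20,
    F: float = 0.5,
    CR: float = 0.9,
    generations: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimise f with the 'rand/1/bin' scheme (fixed F, CR and NP)."""
    rng = random.Random(seed)
    D = len(bounds)
    # Initial population: uniform random points inside the given bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(NP)]
    fit = [f(x) for x in pop]

    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(NP):
            # Mutation: r1, r2, r3 are mutually distinct and differ from i.
            r1, r2, r3 = rng.sample([k for k in range(NP) if k != i], 3)
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            # Binary crossover: j_rand forces one component from the mutant.
            j_rand = rng.randrange(D)
            u = [v[j] if (rng.random() <= CR or j == j_rand) else pop[i][j]
                 for j in range(D)]
            # Selection: the trial replaces the parent only if it is better.
            fu = f(u)
            if fu < fit[i]:
                new_pop.append(u)
                new_fit.append(fu)
            else:
                new_pop.append(pop[i])
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit

    best = min(range(NP), key=fit.__getitem__)
    return pop[best], fit[best]
```

A call such as `de_rand_1_bin(lambda x: sum(t * t for t in x), [(-5.0, 5.0)] * 3)` returns the best vector found and its objective value for a three-dimensional sphere function.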

The original DE has more strategies and Feoktistov 
(Feoktistov, 2006) proposed some general extensions 
to DE strategies. The question is which strategy is the 
most suitable to solve a particular problem. Recently 
some researchers used various combinations of two, 
three or even more strategies during the evolutionary 
process. 

Parameter Tuning and Parameter Control 

Globally, we distinguish between two major forms of
setting parameter values: parameter tuning and parameter
control (Eiben, Hinterding & Michalewicz, 1999).
The former means the commonly practiced approach
that tries to find good values for the parameters before
running the algorithm and then runs the algorithm using
these values, which remain fixed during the run. The
latter means that values for the parameters are changed
during the run. According to Eiben et al. (Eiben et
al., 1999) (Eiben & Smith, 2003), the change can be
categorized into three classes:

1. Deterministic parameter control takes place 
when the value of a parameter is altered by some 
deterministic rule. 

2. Adaptive parameter control is used when there is 
some form of feedback from the search that is used 
to determine the direction and/or the magnitude 
of the change to the parameter. 

3. Self-adaptive parameter control is the idea that 
"evolution of the evolution" can be used to imple- 
ment the self-adaptation of parameters. Here, 
the parameters to be adapted are encoded into 
the chromosome (individuals) and undergo the 
actions of genetic operators. The better values of 
these encoded parameters lead to better individu- 
als which, in turn, are more likely to survive and 
produce offspring and, hence, propagate these 
better parameter values. 

DE has three control parameters: amplification 
factor of the difference vector - F, crossover control 
parameter - CR, and population size - NP. The original 
DE algorithm keeps all three control parameters fixed 
during the optimization process. However, there still 
exists a lack of knowledge about how to find reason- 
ably good values for the control parameters of DE, for 
a given function (Liu & Lampinen, 2005). 

Although the DE algorithm has been shown to 
be a simple, yet powerful, evolutionary algorithm 
for optimizing continuous functions, users are still 
faced with the problem of preliminary testing and 
hand-tuning of its control parameters prior to com- 
mencing the actual optimization process (Teo, 2006). 
As a solution, self-adaptation has proved to be highly
beneficial for automatically and dynamically adjusting
control parameters. Self-adaptation allows an evolutionary
strategy to adapt itself to any general class of
problem by reconfiguring itself accordingly, and it does
this without any user interaction (Back, 2002) (Back,
Fogel & Michalewicz, 1997) (Eiben, Hinterding &
Michalewicz, 2003).



RELATED WORK 

Work Related to Differential Evolution

The DE (Storn & Price, 1995) (Storn & Price, 1997) 
algorithm was proposed by Storn and Price, and since 
then it has been used in many practical cases. The 
original DE was modified and many new versions 
have been proposed. Ali and Torn (Ali & Torn, 2004) 
proposed new versions of the DE algorithm, and also 
suggested some modifications to the classical DE, in 
order to improve its efficiency and robustness. They 
introduced an auxiliary population of NP individuals
alongside the original population (note that in (Ali, 2004)
a notation using sets is used). Next they proposed a rule
for calculating the control parameter F automatically.
Jiao et al. (Jiao, Dang, Leung & Hao, 2006) proposed a 
modification of the DE algorithm, applying a number- 
theoretical method for generating the initial population, 
and using simplified quadratic approximation with the 
three best points. Mezura-Montes et al. (Mezura-Mon- 
tes, Velazquez-Reyes & Coello Coello, 2006) conducted 
a comparative study of DE variants. They proposed a 
rule for changing control parameter F at random from 
interval [0.4, 1.0] at generation level. They used differ- 
ent values of control parameter CR for each problem. 
The best CR value for each problem was obtained by 
additional experimentation. Tvrdik (Tvrdik, 2006)
proposed a DE algorithm using competition between
different control parameter settings. The prominence
of the DE algorithm and its applications is shown in
recently published books (Price et al., 2005), (Feoktistov,
2006). Feoktistov in his book (Feoktistov, 2006, p.
18) says that "the concept of differential evolution is a
spontaneous self-adaptability to the function".

Work Related to Adaptive or
Self-Adaptive DE

Liu and Lampinen (Liu & Lampinen, 2005) proposed 
a version of DE, where the mutation control param- 
eter and the crossover control parameter are adaptive. 
A self-adaptive DE (SDE) was proposed by Omran et
al. (Omran, Salman & Engelbrecht, 2005) (Salman,
Engelbrecht & Omran, 2007), where parameter tuning
is not required. Self-adaptation was applied to the control
parameters F and CR. Teo (Teo, 2006) made an attempt at
self-adapting the population size parameter, in addition
to self-adapting crossover and mutation rates. Brest et
al. (Brest et al., 2006a) proposed a DE algorithm using
a self-adapting mechanism on the control parameters
F and CR. The performance of the self-adaptive differential
evolution algorithm was evaluated using a
set of benchmark functions provided for constrained
real parameter optimization (Brest, Zumer & Sepesy
Maucec, 2006b). Qin and Suganthan (Qin & Suganthan,
2005) proposed the "Self-adaptive Differential Evolution
algorithm (SaDE), where the choice of learning
strategy and the two control parameters F and CR do
not require pre-defining. During evolution, suitable
learning strategy and parameter settings are gradually
self-adapted, according to the learning experience."
Brest et al. (Brest, Boskovic, Greiner, Zumer & Sepesy 
Maucec, 2007) reported performance comparison of 
certain selected DE algorithms, which use different self- 
adaptive or adaptive control parameter mechanisms. In 
this paper the DE algorithms used more than one of the 
DE strategies (Price et al., 2005), (Feoktistov, 2006). 
Self-adaptation has been used extensively in evolutionary
programming (Fogel, 1995) and evolution strategies
(ES) (Back & Schwefel, 1993) to adjust the search step
size for each objective variable (Liang, Yao & Newton,
2001). Abbass (Abbass, 2002) proposed self-adaptive
DE for multi-objective optimization problems.



SELF-ADAPTIVE CONTROL
PARAMETERS IN DIFFERENTIAL
EVOLUTION

This section presents three self-adaptive DE approaches
that have been applied to the control parameters
F and CR.

The Self-Adaptive Control Parameters
Using Uniform Distribution

The Self-Adaptive DE refers to the self-adapting
mechanism on the control parameters proposed by
Brest et al. (Brest et al., 2006a). This self-adapting
mechanism used the 'rand/1/bin' strategy. Each individual
in the population was extended with the values of
two control parameters: $(x_{i,G}, F_{i,G}, CR_{i,G})$, $i \in \{1, 2, \ldots, NP\}$.
Both of the control parameters were applied at
individual level.

Brest et al. (Brest et al., 2006a) proposed a
self-adaptive DE where new control parameters $F_{i,G+1}$ and
$CR_{i,G+1}$ are calculated as follows:

\[ \text{if } rand_1 < \tau_1 \text{ then } F_{i,G+1} = F_l + rand_2 \times F_u \text{ else } F_{i,G+1} = F_{i,G}, \]

\[ \text{if } rand_3 < \tau_2 \text{ then } CR_{i,G+1} = rand_4 \text{ else } CR_{i,G+1} = CR_{i,G}, \]

and they produce control parameters F and CR in a
new vector. The quantities $rand_j$, $j \in \{1, 2, 3, 4\}$, are
uniform random values $\in [0, 1]$. The quantities $\tau_1$
and $\tau_2$ represent the probabilities of adjusting control
parameters F and CR, respectively. The parameters $\tau_1$,
$\tau_2$, $F_l$, $F_u$ were taken as fixed values 0.1, 0.1, 0.1, 0.9,
respectively. The new F takes a value from [0.1, 1.0]
in a random manner. The new CR takes a value from
[0, 1]. The new $F_{i,G+1}$ and $CR_{i,G+1}$ are obtained before
the mutation is performed, so they influence the
mutation, crossover and selection operations of the new
vector $x_{i,G+1}$.

In (Brest et al., 2006a) a self-adaptive control
mechanism was used to change the control parameters
F and CR during the evolutionary process. The third
control parameter NP was kept unchanged.
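
The update rule of Brest et al. for one individual can be sketched as follows. This is an illustrative sketch only: the function name `jde_update` is hypothetical, while the default values of `tau1`, `tau2`, `F_l` and `F_u` are the fixed values quoted in this section.

```python
import random
from typing import Tuple


def jde_update(
    F: float,
    CR: float,
    rng: random.Random,
    tau1: float = 0.1,
    tau2: float = 0.1,
    F_l: float = 0.1,
    F_u: float = 0.9,
) -> Tuple[float, float]:
    """Return (F_{i,G+1}, CR_{i,G+1}) for one individual.

    With probability tau1 a new F is drawn from [F_l, F_l + F_u];
    with probability tau2 a new CR is drawn from [0, 1];
    otherwise the values of the previous generation are kept.
    """
    rand1, rand2, rand3, rand4 = (rng.random() for _ in range(4))
    new_F = F_l + rand2 * F_u if rand1 < tau1 else F
    new_CR = rand4 if rand3 < tau2 else CR
    return new_F, new_CR
```

The new values would be computed for each individual before its mutant vector is built, and they survive to the next generation only if the resulting trial vector wins the selection.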



Abbass's Approach 

Abbass (Abbass, 2002) proposed the Self-adaptive Pareto
Differential Evolution (SPDE) algorithm. The SPDE
was used for multi-objective optimization problems.
New control parameters $F_{i,G+1}$ and $CR_{i,G+1}$ are calculated
as follows:

\[ F_{i,G+1} = N(0, 1), \]

\[ CR_{i,G+1} = CR_{r_1,G} + N(0, 1) \times (CR_{r_2,G} - CR_{r_3,G}), \]

where N(0, 1) is the Gaussian distribution. If the value of
$F_{i,G+1}$ is not in [0, 1], a simple rule is used to repair it,
and similarly for the value of $CR_{i,G+1}$. Then the mutant
vector $v_{i,G}$ is calculated:

\[ v_{i,G} = x_{r_1,G} + F_{i,G+1} \times (x_{r_2,G} - x_{r_3,G}) \]

and the crossover operation is performed:

\[ \text{if } (rand \le CR_{i,G+1} \text{ or } j = j_{rand}) \text{ then } u_{i,j,G} = v_{i,j,G} \text{ else } u_{i,j,G} = x_{i,j,G}. \]

The control parameter CR is self-adapted by encoding
it into each individual.
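
The parameter update used in SPDE can be sketched in a few lines. Two points are assumptions made for this sketch and are not stated in the text: the "simple rule" for repairing out-of-range values is taken to be clipping into [0, 1], and the indices r1, r2, r3 are chosen to differ from i. The function names are hypothetical.

```python
import random
from typing import List


def clamp01(value: float) -> float:
    """Assumed repair rule: clip the value into [0, 1]."""
    return min(1.0, max(0.0, value))


def spde_new_cr(cr: List[float], i: int, rng: random.Random) -> float:
    """CR_{i,G+1} = CR_{r1,G} + N(0,1) * (CR_{r2,G} - CR_{r3,G}), repaired."""
    # r1, r2, r3: mutually distinct indices, here also different from i.
    r1, r2, r3 = rng.sample([k for k in range(len(cr)) if k != i], 3)
    return clamp01(cr[r1] + rng.gauss(0.0, 1.0) * (cr[r2] - cr[r3]))


def spde_new_f(rng: random.Random) -> float:
    """F_{i,G+1} drawn from N(0,1) and repaired into [0, 1]."""
    return clamp01(rng.gauss(0.0, 1.0))
```

Because the new CR is a stochastic linear combination of values carried by other individuals, a population whose members all carry the same CR keeps that value unchanged.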

Approach Proposed by Omran, Salman
and Engelbrecht

Due to the success achieved in SPDE by self-adapting
CR, Omran et al. (Omran, Salman & Engelbrecht,
2005) (Salman, Engelbrecht & Omran, 2007) proposed
a self-adaptive DE (SDE), where the same mechanism
is applied to self-adapt the control parameter F. The
control parameter CR is generated for each individual
from a normal distribution (CR ~ N(0.5, 0.15)) in SDE
and the mutation operation changes as follows:

\[ v_{i,G} = x_{r_1,G} + F_{i,G+1} \times (x_{r_2,G} - x_{r_3,G}), \]

where

\[ F_{i,G+1} = F_{r_4,G} + N(0, 0.5) \times (F_{r_5,G} - F_{r_6,G}). \]

Indexes $r_4$, $r_5$, $r_6$ represent the random and distinct
integers generated within range [1, NP]. Thus, each
individual i has its own control parameter $F_i$, which
is calculated as a stochastic linear combination of the
control parameters of randomly selected individuals.

The presented self-adapting mechanisms on the
control parameters F and CR use the 'rand/1/bin' strategy.
Both of the control parameters are applied at individual
level. The third control parameter NP remains fixed
during the evolutionary process.



FUTURE TRENDS

The behaviour of DE is influenced by the values of its
parameters (F, CR, NP). During the last two decades
many papers have addressed the problem of gaining insight
into the behaviour of the algorithm (Zaharie,
2002). The theory of DE still lags behind the empirical
studies. Theoretical studies of DE are highly desirable
as future research.

It is not an easy task for one optimization algorithm
to be both fast (e.g. it needs a small number of function
evaluations) and robust (e.g. it does not get trapped in
a local optimum) at the same time. Based on our experience
with the DE algorithm (other authors have reported similar
observations), we can conclude that DE provides
more robustness if the population size is higher. If the
population size is increased, on the other hand, more
computational power (on average) is needed. The
population size is an important control parameter, and
thus adaptive and/or self-adaptive approaches on this
parameter are expected in the future.

In this chapter only the 'rand/1/bin' DE strategy is
used, but the DE algorithm has more strategies (Price
et al., 2005), (Feoktistov, 2006). Which combination of
(self-adaptive) DE strategies should one use to get
the best performance? One can use the involved strategies
with the same probability, or with different probabilities,
or even use self-adaptation to choose the most suitable
strategy during the optimization process.

Future work may also be directed towards testing the
proposed self-adaptive versions of the DE algorithm,
especially on constrained optimization problems.
Multi-objective optimization is also a challenge for
future work.



CONCLUSION 

This chapter presented the differential evolution (DE)
algorithm with a focus on self-adaptive control
parameters. Three self-adaptive approaches that
were recently proposed in the literature are described in
the chapter. The presented approaches apply the control
parameters at individual level. According to the
literature, the self-adaptive versions of the DE algorithm
usually give better performance than
the original DE algorithm. We can conclude that
self-adaptation can improve the performance of the
DE algorithm and that this powerful global optimization
algorithm could be used over a wide range of research
areas in the future.

KEY TERMS

Area of the Search Space: Set of specific ranges 
or values of the input variables that constitute a subset 
of the search space. 

Control Parameter: A parameter that determines the
behaviour of an evolutionary program (e.g. population
size).

Differential Evolution: An evolutionary algorithm
for global optimization, which realizes the evolution
of a population of individuals in a manner that uses
differences between individuals.

Evolutionary Computation: Solution approach 
guided by biological evolution, which begins with 
potential solution models, then iteratively applies al- 
gorithms to find the fittest models from the set to serve 
as inputs to the next iteration, ultimately leading to a 
model that best represents the data. 

Individual: An individual represents a candidate 
solution. During the optimization process an evolution- 
ary algorithm usually uses a population of individuals 
to solve a particular problem. 

Search Space: Set of all possible situations of the 
optimization problem that we want to solve. 

Self-Adaptation: The ability that allows an evolutionary
algorithm to adapt itself to any general class
of problems by reconfiguring itself accordingly, and to
do this without any user interaction.



INTRODUCTION

Knowledge Representation is an important part of AI. Its
purpose is to find the best possible representation of the
Universe of Discourse (UoD) by capturing entities,
concepts and the relations among them. With increased
understanding of various scientific and technological
disciplines, it is possible to derive rules that govern
the behaviour and outcome of the entities in the UoD.
In certain cases it is not possible to establish any
explicit rule, yet through experience or observation some
experts can define rules from their tacit knowledge in
a specific domain.

Knowledge representation techniques focus on
externalizing the implicit and explicit knowledge of
experts so that it can be reused when those experts are
not physically present. To ease this task, two parallel
lines of work have developed over time. One
investigates more efficient methods that best suit
the knowledge representation requirements, resulting
in theories and tools that allow the domain knowledge
to be captured (Brachman & Levesque, 2004). The other
concerns the harmonization of tools and techniques that
allow standards-based representation of knowledge
(Davies, Studer, & Warren, 2006).

Various languages have been proposed for representing
knowledge, and reasoning and classification algorithms
have been realized. As an outcome of the standardization
process, standards such as DAML+OIL (Horrocks &
Patel-Schneider, 2001), RDF (Manola & Miller, 2004) and
OWL (Antoniou & Harmelen, 2004) have been introduced.
Building on both developments, tooling has also come into
existence that allows knowledge bases to be created.

As a result of these developments, the amount of
publicly shared knowledge is continuously increasing.
At the time of this writing, Swoogle (Ding et al., 2004),
a search engine developed to index publicly available
Ontologies, is handling over 2,173,724 semantic web
documents containing 431,467,096 triples.

While these developments are yielding positive results
by making such a huge amount of knowledge available for
reuse, it has become difficult to select and reuse the required
knowledge from this vast pool. The concepts and
relations that are important to a given problem may
already have been defined in multiple Ontologies, from
different perspectives and at specific levels of detail. It
is very likely that, to obtain a complete representation of the
knowledge, multiple Ontologies must be used. This
requirement has introduced a new discipline within the
domain of knowledge representation, focused
on investigating techniques and tools that allow the
integration of multiple shared Ontologies.



BACKGROUND 

The problem of Ontology integration is not completely
new. Schema Matching is a similar problem being addressed
in the context of enterprise integration. In
Ontology matching, however, the scale and complexity are much
higher and require special consideration. (Shvaiko
& Euzenat, 2006) highlight the key similarities and
differences between the two techniques. In schema
matching, the semantics of a given term are guessed,
whereas ontology matching methods rely on
deriving the semantics from the explicit representation of
concepts and relations in the given Ontology. Numerous
methods and approaches have been proposed that attempt
to solve the problem by targeting specific aspects
of the represented knowledge (Ehring, 2007).

Apart from standards that guide the languages used
for the development of Ontologies, some standard Ontologies
have also been defined. The role of these Ontologies
is to provide a framework of very basic elements
and their relations, on which complex domain
knowledge can be built. SUO (Niles & Pease,
2001), SUMO (Niles & Pease, 2003) and OpenCyc (Sicilia
et al., 2004) are examples. SWEET (Raskin,
2003) provides standard Ontologies in the environmental
science domain. Hence, the levels in Ontology also
address an important dimension in knowledge engineering
through integrating available Ontologies.



ONTOLOGY MAPPING TECHNIQUES 

Research on the integration of multiple Ontologies has
resulted in various techniques and tools that have
successfully demonstrated their ability to produce the
required results (Noy, 2004b). Ontology integration
is addressed as Ontology mapping, matching, merging,
transforming and other such activities. Integration
is achieved by finding similarities among
the concepts of separate Ontologies. The similarity
or nearness can be established by various
techniques, and numerous approaches have been
published demonstrating the suitability of single or
hybrid approaches. Taxonomic overviews of existing
methodologies are provided in several survey papers
that offer a reasonable entry into the domain of
Ontology integration. (Kalfoglou et al., 2005) provides a
comprehensive survey of Ontology mapping approaches
and classifies them on a Semantic Intensity Spectrum. (Noy,
2004a), (Kalfoglou & Schorlemmer, 2005) and (Predoiu
et al., 2006) provide comprehensive surveys discussing
the state of the art of present research efforts.

Ontologies consist of concepts and elements.
The integration process that establishes the similarity
among concepts has three dimensions (Shvaiko
& Euzenat, 2006). The input dimension is related to the
underlying data model and can operate at schema level
or instance level. The second is the process dimension, which
classifies an approach as exact or approximate determination.
The third dimension deals with the output in the form of
cardinality, type of relation and confidence. Integration
can be done by identifying an Alignment.

Concept Level Approaches 

Concept level approaches are restricted to the
names of concepts and employ various methods to
match the whole or part of concept names that belong
to different Ontologies. Though these syntax-oriented
approaches prove less efficient when applied in
isolation, they are generally employed in the
pre-integration preparation phase (or normalization phase) of more
complex semantics-oriented approaches. Many
Schema Matching techniques are directly applicable
to concept level approaches.

String Level Concept Matching 

String level matching is based on the simple assumption that
similar concepts are represented by the same name in different
Ontologies. Once such string level
similarity is identified, the source Ontologies can be either mapped
or merged. The PROMPT (Noy & Musen, 2000) Ontology
merging tool employs a string level concept matching
approach.

Sub-String Level Concept Matching 

These approaches break the input concepts into smaller
segments on the basis of prefixes, suffixes and other
structures. Another approach establishes similarity by
computing the Edit Distance. For example, if Nikon
and NKN are under consideration, the Edit Distance
is the minimum number of insertions, deletions and
substitutions of characters required
to transform one into the other. The N-gram technique
derives a set of substrings by taking every run of
n consecutive characters from the input string. For example,
the trigrams of NIKON are NIK, IKO and KON. The
derived set can then be passed to a simple string
matcher for finding similarities.
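
As a minimal sketch of the two substring techniques described above, the following Python functions compute the Edit Distance (in its common Levenshtein form) and the n-grams of a concept name. The function names `edit_distance` and `ngrams` are illustrative and are not taken from any of the cited tools; the example assumes that names are compared in upper case.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngrams(s: str, n: int = 3) -> list[str]:
    """All substrings of n consecutive characters of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(edit_distance("NIKON", "NKN"))  # two deletions: I and O
print(ngrams("NIKON", 3))
```

For NIKON and NKN the distance is 2, and the trigrams of NIKON are NIK, IKO and KON, which agrees with the example in the text.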

Lexical Matching 

Lexical approaches are employed to identify and extract
tokens from the input string. This is particularly useful
when concept names are created using a mix of
alphanumeric characters that can be processed to separate
operators, numbers, punctuation and other types of
token to reveal processable substrings. LOM (Li, 2004), a
Lexicon-based Ontology Mapping tool, determines
similarity by matching the whole term,
word constituents, synsets, and types (Choi et al.,
2006). OLA (Euzenat et al., 2005) and Cupid (Madhavan
et al., 2001) also employ lexical techniques for finding
similarity among concepts.
Linguistic Similarity Approach

The Natural Language Processing domain offers various text
processing techniques such as stemming, tokenization,
tagging, elimination and expansion that can improve
the result of a similarity finding effort. Usage and grammar
of the language may result in mismatches, for example
between Product and Products; in such cases Lemmatization
can be used. Tokenization, which can remove grammatical
elements from a concept name, can be utilized.
Unnecessary articles, prepositions, conjunctions and other
features can be removed using the Elimination technique.
Lexical relations can also be identified using a sense-based
approach by exploring hypernyms, hyponyms etc. in
WordNet (Miller, 1995). HCONE-Merge (Kotis et al.,
2006) is an Ontology merging approach that uses the Latent
Semantic Analysis (LSA) technique, carries out WordNet
lookup and expands word senses by hyponymy.
Domain or application specific terminology can also
be integrated for disambiguation. Cupid, a schema
matching approach, employs linguistic matching in
its initial phase. Quick Ontology Mapping (Ehrig &
Sure, 2005) finds the linguistic similarity of
concepts. The ASCO (Le et al., 2004) technique calculates
linguistic similarity as a linear combination of the name,
label and description similarity of concepts.

Semantic Concept Matching Approach

iMAP (Dhamankar et al., 2004) employs semantic
matching for integration of heterogeneous data sources.
It addresses 1-1 and complex matches among concepts.
Its multiple search approach toward identification of
matches utilizes domain knowledge for improving
schema matching accuracy. S-Match (Giunchiglia et al.,
2005) derives semantic matching between two graph
structures. MAFRA (MApping FRAmework) (Maedche et al.,
2002) employs a Semantic Bridge and Service
Centric approach.

Structure Level Approaches

These approaches treat the ontology as a graph
structure and consider the upper and lower levels of the
given concepts to find similarity among concepts
of different Ontologies.

Structural Similarity

The COMA (Aumueller et al., 2005) system employs path
similarity as a basis for calculating similarity among
concepts. The SMART tool employs an algorithm that considers
the structure of the relations in the vicinity of the concept
being processed. It is implemented with the PROMPT
system, which can be plugged into Protege, a widely
accepted open-source knowledge engineering tool.
Anchor-PROMPT (Noy & Musen, 2001) extends the
simple PROMPT by calculating similarity measures
based on ontology structure.

Semantic Matching

S-Match focuses on computing semantic matching
between graph structures extracted from separate
Ontologies. With this technique it is possible to differentiate the
meaning of Concept of node and Concept at node.

Pairs

By transforming the input Ontologies into directed
labelled graphs, the Similarity Flooding technique
generates Pairwise Connectivity Graphs (PCG) in which
a node consists of a pair of matching elements from
the sources. The technique further assigns weights to
the edges indicating how well the similarity of a given
pair propagates to its neighbours.

Lattice

FCA-Merge (Stumme, 2005) incorporates a machine
learning technique to derive a lattice, which is then
used to derive the merged Ontology. Documents are
accepted as inputs that provide the concepts to build
the Ontology.

Machine Learning and Statistics Based 
Approach 

While exact mappings are determined by
establishing string and structure level similarities,
the performance can be further improved by employing
methods to approximate the nearness. Machine
learning, probability and statistics techniques can be
incorporated in mapping techniques to improve the
performance and achieve automation in the matching
process. GLUE (Doan et al., 2002) employs an
instance-based machine learning technique to semi-automatically
create mappings for input Ontologies. Concept
similarity is expressed in terms of the joint probability
distribution of instances of the given concepts. Ontology Mapping
Enhancer (OMEN) (Mitra et al., 2005) generates a
Bayesian Net on the basis of mappings defined a priori.
ITTalk (Cost et al., 2002) adopts Bayesian reasoning
along with text-based classification technology that
collects similarity information in source Ontologies.
The discovered semantic similarity is codified as labels
and arcs of a graph.

Community Oriented Approach

CAIMAN (Lacher & Groh, 2001) proposes a scenario
where members of communities want to express their
viewpoints on categorization in a community repository.
COMA++ (Aumueller et al., 2005) offers community
driven ontology mapping by providing support for
web-based acquisition and sharing of ontology mappings.

Logic Based Approach

Description Logic (DL) is widely accepted for knowledge
representation tasks. Ontology
creation is commonly carried out with
tools that support OWL-DL representation and reasoning.
The mapping language that is selected to represent
mappings can also benefit from the representation and
reasoning capability of DL. MAFRA (Maedche et al.,
2002) uses DL for representing the Semantic Bridge
Ontology (SBO). OntoMapO uses DL for creating a Meta
Ontology for consistent representation of Ontologies
and the mappings among them. CTXMatch and S-Match
employ satisfiability techniques that are realized using
DL. The ConcepTool (Compatangelo & Meisel, 2003)
system uses DL in formalizing a class-centered enhanced
entity relationship model.



FUTURE TRENDS 

Organizations are increasingly adopting knowledge-based
approaches in building systems. The Ontology
mapping techniques discussed here provide an overview
of efforts that enable integration of multiple Ontologies
in the context of a targeted application. Along with
the syntax and standardized content, it will become
necessary to consider Ontologies with various levels of
detail to be mapped for complete and accurate coverage.
Extending the current approaches, which take as input a few
Ontologies selected from a vast pool based on
the availability of required elements or structures, it
will be necessary to consider specific types of Ontology.

Figure 1. Ontology mediation requirement in future applications
(recoverable figure labels: maintained by individual organizations
using legacy systems; no alternatives; scope: administrative unit)

Figure 1 indicates one of the challenging open
problems: investigating appropriate methods that
allow integration of Ontologies defining Domain
Independent, Domain Specific, Local Specific and
Application Specific concepts to provide complete coverage of
knowledge representation. This approach ensures that
knowledge engineers can reuse the integrated part of
Domain Independent and Domain Specific concepts
and focus only on local specific concepts to suitably
integrate with the application being built. As
an example scenario, Figure 1 indicates the Ontology
integration required for Disaster Management Agencies
across the world. The Domain Specific and Domain
Independent concepts are required by every agency
and can be directly integrated with local specific and
application specific concepts that can be unique to each
implementing agency.



CONCLUSION 

With the proliferation of knowledge representation and
reasoning techniques and standards-based tools,
domain knowledge captured in the form of Ontologies is
increasingly available for reuse. The quality,
quantity and level of detail with which domain concepts
are defined differ considerably based on the discretion
of the knowledge engineer. Reuse requires concepts
defined in multiple such Ontologies to be extracted or
mapped to a resulting comprehensive representation that
suits the requirements of the problem at hand. This in turn
introduces the problem of syntactic and semantic
heterogeneity. This article presented the state of the art in present
techniques targeted at resolving heterogeneity.
KEY TERMS

Articulation Ontology: An articulation ontology
consists of concepts and relations that are identified
as links among the concepts defined in two separate
Ontologies, also known as articulation rules.

Context Aware Techniques: Techniques that are 
focused on nearness of weight assigned to specific rela- 
tions among concepts considering application context 
as basis for mapping. 

Extension Aware Techniques: Techniques that 
are focused on finding nearness among features of 
available instances of different Ontologies to form 
basis for mapping. 

Intension Aware Techniques: Based on
information flow theory, techniques that are focused on
finding two different tokens (instances) belonging to
separate Ontologies that map to a single type (concept)
as a basis for mapping.

Linguistic Similarity Techniques: Set of techniques
that refer to the linguistic nearness of concepts in
the form of synonyms, hypernyms, and hyponyms by
consulting related entries in a thesaurus as basis
for mapping.

Ontology Alignment: Ontology Alignment is a
process to articulate similarity in the form of
one-to-one equality relations between the elements of two
separate Ontologies.

Ontology Integration: Ontology Integration is a
process that results in generation of a new ontology
derived as a union of two or more source Ontologies
of different but related subject domains.



Ontology Mapping: Ontology Mapping is a process 
to articulate similarities among the concepts belonging 
to separate source Ontologies. 

Ontology Mediation: Ontology mediation is a 
process that reconciles difference between separate 
Ontologies to achieve semantic interoperability by 
performing alignment, mapping, merging and other 
required operations. 

Ontology Merging: Ontology Merging is a process
that results in generation of a new ontology derived
as a union of two or more source Ontologies of the same
subject domain.

Semantic Similarity Techniques: Techniques 
that are focused on logic satisfiability as basis of 
mapping. 

String Similarity Techniques: Set of techniques 
that uses syntactic similarity of concepts as basis of 
mapping. 

Structure Aware Techniques: Techniques that 
also consider structural hierarchy of concepts as basis 
of mapping. 
INTRODUCTION

The need to deal with large data sets is at the heart of
many real-world problems. In many organizations the
data size has already surpassed petabytes (10^15 bytes). It is
clear that, to process such an enormous amount of data,
the physical limitation of RAM is a major hurdle.
However, the media that can hold huge data sets, i.e., hard
disks, are about 10,000 to 1,000,000 times slower to
access than RAM. On the other hand, the costs for large
amounts of disk space have considerably decreased.
This growing disparity has led to rising attention to
the design of external memory algorithms (Sanders et
al., 2003) in recent years.

In a hard disk, random disk accesses are slow due 
to disk latency in moving the head on top of the data. 
But once the head is at its proper position, data can be 
read very rapidly. External memory algorithms exploit 
this fact by processing the data in the form of blocks. 
They are more informed about the future accesses to the 
data and can organize their execution to have minimum 
number of block accesses. 

Traditional graph search algorithms perform well 
as long as the graph can fit into the RAM. But for large 
graphs these algorithms are destined to fail. In the fol- 
lowing, we will review some of the advances in the field 
of search algorithms designed for large graphs. 



BACKGROUND 

Most modern operating systems provide a
general-purpose memory management scheme called Virtual
Memory to compensate for the limited RAM.
Unfortunately, such schemes pay off only when the
algorithm's memory accesses are local, i.e., it works on
a particular memory address range for a while before
switching its attention to another range. Search
algorithms, especially those that order the nodes on some
particular node property, do not show such behaviour.
They jump back and forth to pick the best node, in a
spatially unrelated way, for only marginal differences
in the node property.

External memory algorithms are designed with a 
hierarchy of memories in mind. They are analyzed on an 
external memory model as opposed to the traditional von 
Neumann RAM model. We use the two-level memory 
model by Vitter and Shriver (1994) to describe the search 
algorithms. The model provides the necessary tools to 
analyze the asymptotic number of block accesses (I/O 
operations) as the input size grows. It consists of 

M: Size of the internal memory in terms of the
number of elements,

N ≫ M: Size of the input in terms of the number
of elements, and

B: Size of the data block that can be transferred
between the internal memory and the hard disk;
transferring one such block is called a single
I/O operation.

The complexity of external memory algorithms
is conveniently expressed in terms of predefined I/O
operations, such as scan(N) for scanning a file of size
N with a complexity of $\Theta(N/B)$ I/Os, and sort(N) for
externally sorting a file of size N with a complexity of
$\Theta((N/B) \log_{M/B}(N/B))$ I/Os. With additional parameters
the model can accommodate multiple disks and multiple
processors too.
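
To give a feeling for the two primitives, the following sketch evaluates the scan and sort bounds for concrete values of N, M and B. It is only an estimate under stated assumptions: constant factors hidden by the asymptotic notation are ignored, and the function names `scan_ios` and `sort_ios` are hypothetical.

```python
import math


def scan_ios(n: int, b: int) -> int:
    """Number of block transfers needed to read n elements once."""
    return math.ceil(n / b)


def sort_ios(n: int, m: int, b: int) -> float:
    """Estimate (N/B) * log_{M/B}(N/B), ignoring constant factors."""
    if m <= b:
        raise ValueError("internal memory must hold more than one block")
    blocks = n / b
    if blocks <= 1:
        return blocks
    # At least one pass over the data is always required.
    return blocks * max(1.0, math.log(blocks, m / b))


# Example: 10^6 elements, memory for 10^4 elements, blocks of 100 elements.
print(scan_ios(10**6, 100))
print(sort_ios(10**6, 10**4, 100))
```

With these example values the input occupies 10^4 blocks and the logarithm to base M/B = 100 equals 2, so sorting costs roughly twice as many I/Os as a single scan.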

In the following, we assume a graph as a tuple (V, E,
c), where V is the set of nodes, E the set of edges, and
c the weight function that assigns a positive
integer to each edge. If all edges have the same weight,
the component c can be dropped and the graph is
called unweighted. Given a start node s and a goal
node g, we require the search algorithm to return an
optimal path with respect to the weight function.
EXTERNAL MEMORY SEARCH
ALGORITHMS 

External Memory Breadth-First Search 

Breadth-first search (BFS) is one of the basic search
algorithms. It explores a graph by first expanding the
nodes that are closest to the start node. BFS for
external memory has been proposed by Munagala and
Ranade (1999). It only considers undirected and explicit
(provided beforehand in the form of adjacency lists)
graphs. The working of the algorithm is illustrated on a
graph in Fig. 1. Let Open(i) be the set of nodes at BFS
level i residing on disk. The algorithm builds Open(i)
from Open(i-1) as follows. Let Succ(Open(i-1)) be the
multi-set of successors of nodes in Open(i-1); this set
is created by concatenating all adjacency lists of nodes
in Open(i-1). As there can be multiple copies of the
same node in this set, the next step is to remove these
duplicate nodes. In an internal memory setting this
can be done easily using a hash table. Unfortunately,
in an external setting a hash table is not affordable due
to random accesses to its contents. Therefore, we rely
on alternative methods of duplicate removal that are
well-suited for large data on disk. The first step is to
sort the successor set using external sorting algorithms,
resulting in duplicate nodes lying adjacent to each
other. By an external scan of this sorted set, all
duplicates are removed. Still, there can be nodes in
this set that have already been expanded in the
previous layers. Munagala and Ranade proved that for
undirected graphs, it is sufficient to subtract only two
layers, Open(i-1) and Open(i-2), from Open(i). Since
all three lists are sorted, this can be done by a parallel
external scan. The accumulated I/O complexity of
this algorithm is $O(|V| + \mathrm{sort}(|E|))$ I/Os, where $|V|$ is
for the unstructured access to the adjacency lists, and
$\mathrm{sort}(|E|)$ for duplicate removal.
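
The layer construction can be sketched in a few lines of Python. This is an in-memory illustration only: sorted Python lists stand in for files on disk, `list.sort` stands in for external sorting, and the helper names `subtract_sorted`, `next_layer` and `external_bfs_layers` are hypothetical. The graph is assumed to be undirected and given as adjacency lists.

```python
def subtract_sorted(a, b):
    """Elements of sorted list a that do not occur in sorted list b,
    computed by one parallel scan over both lists."""
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out


def next_layer(adj, prev, prev2):
    """Build Open(i) from Open(i-1) = prev and Open(i-2) = prev2."""
    succ = [v for u in prev for v in adj[u]]  # concatenate adjacency lists
    succ.sort()                               # stand-in for external sorting
    unique = []
    for v in succ:                            # scan: duplicates are adjacent
        if not unique or unique[-1] != v:
            unique.append(v)
    # Subtract the two preceding layers (sufficient for undirected graphs).
    return subtract_sorted(subtract_sorted(unique, prev), prev2)


def external_bfs_layers(adj, start):
    """Return the list of BFS layers reachable from start."""
    layers, prev2 = [[start]], []
    while layers[-1]:
        nxt = next_layer(adj, layers[-1], prev2)
        prev2 = layers[-1]
        layers.append(nxt)
    return layers[:-1]                        # drop the final empty layer


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(external_bfs_layers(adj, 0))
```

In a real external implementation each layer would be a file, and the sort and the scans would operate block-wise; only the order of the steps is meant to carry over.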

An implicit graph variant of the above algorithm
has been proposed by Korf (2003). It applies
$O(\mathrm{sort}(|\mathit{Succ}(\mathit{Open}(i-1))|) + \mathrm{scan}(|\mathit{Open}(i-1)| +
|\mathit{Open}(i-2)|))$ I/Os in each iteration. Since no explicit
access to the adjacency list is needed (as the state space
is generated on-the-fly), by using $\sum_i |\mathit{Succ}(\mathit{Open}(i))| =
O(|E|)$ and $\sum_i |\mathit{Open}(i)| = O(|V|)$, the total execution time
is bounded by $O(\mathrm{sort}(|E|) + \mathrm{scan}(|V|))$ I/Os.

To reconstruct a solution path, we may store
predecessor information with each node on disk (thus
doubling the state vector size). Starting from the goal
node, we recursively search for its predecessor in the
previous layer through external scanning. The process
continues until the first layer containing the start node
is reached. Since breadth-first search preserves
the shortest paths in a uniformly weighted graph, the
constructed path is the optimal one. The complexity
is bounded by the scanning time of all layers in
consideration, i.e., by $O(\mathrm{scan}(|V|))$ I/Os.

Figure 1. An example graph (left); Stages of External Breadth-First Search (right). Each horizontal bar
corresponds to a file. The grey-shaded A(2) and A'(2) are temporary files.

External Memory Heuristic Search 

Heuristic search algorithms utilize some form of
guidance, be it user-provided or automatically inferred
from the problem structure, to home in on the goal.
In practice, such algorithms are very effective when
compared with blind search algorithms like BFS.
A* (Hart et al., 1968) is one such algorithm that
prioritizes the nodes based on their actual distance from
the start node along with their heuristic estimate to the
goal state. Formally, if g(n) represents the path cost for
reaching a node n, and h(n) the heuristic estimate of
reaching the goal starting from n, then A* orders the
nodes based on their f-value, defined as f(n) = g(n) + h(n).
For an efficient implementation of A*, a priority queue
data structure is required that allows us to remove the
node with the minimum f-value for expansion. If the
heuristic function h is consistent, then on each search
path, no successor will have a smaller f-value than its
predecessor. Therefore, A*, traversing the node set in
f-order, expands each node at most once.

External A* by Edelkamp et al. (2004) maintains the
search frontier on disk as files. Each file corresponds
to an external representation of a bucket-based priority
queue data structure. A bucket is a set of nodes sharing
common properties. In a 1-level bucket implementation
of A* by Dial (1969), each bucket addressed with index
i contains all nodes n that have priority f(n) = i. A
refinement proposed by Jensen et al. (2002) distinguishes
between nodes with different g-values, and designates
bucket Open(i,j) to all nodes n with path length g(n) = i
and heuristic estimate h(n) = j. An external memory
representation of this data structure stores each bucket
in a different file. During the exploration process, only
nodes from Open(i,j) with i+j = f are expanded, up to its
exhaustion. Buckets are selected in lexicographic order
for (i,j). By that time, the buckets Open(i',j') with i' < i
and i'+j' = f are closed, whereas the buckets Open(i',j')
with i'+j' > f or with i' > i and i'+j' = f are open. Depending
on the expansion progress, nodes in the active bucket
are either open or closed.
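
The bucket ordering can be illustrated by a small in-memory sketch for unit-cost graphs. It is not the External A* implementation: buckets are Python sets rather than files, duplicates are caught with an in-memory closed set instead of sorting and file subtraction, and the name `bucket_astar` as well as the callback parameters are hypothetical. The heuristic is assumed to be consistent and integer-valued.

```python
from collections import defaultdict


def bucket_astar(successors, h, start, goal):
    """Search a unit-cost graph with buckets Open[(i, j)], where i = g(n)
    and j = h(n). Returns the optimal path length, or None if unreachable."""
    open_buckets = defaultdict(set)
    open_buckets[(0, h(start))].add(start)
    closed = set()
    while open_buckets:
        # Pick the bucket with minimal f = i + j, ties broken by smaller i.
        i, j = min(open_buckets, key=lambda k: (k[0] + k[1], k[0]))
        bucket = open_buckets.pop((i, j))
        for n in bucket:
            if n in closed:
                continue
            if n == goal:
                return i
            closed.add(n)
            for m in successors(n):
                if m not in closed:
                    open_buckets[(i + 1, h(m))].add(m)
    return None


def grid_successors(p):
    """Four-connected neighbours on a 4 x 4 grid (illustrative domain)."""
    x, y = p
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(x + dx, y + dy) for dx, dy in steps
            if 0 <= x + dx < 4 and 0 <= y + dy < 4]


manhattan = lambda p: abs(3 - p[0]) + abs(3 - p[1])
print(bucket_astar(grid_successors, manhattan, (0, 0), (3, 3)))
```

On the 4 x 4 grid the path from (0, 0) to (3, 3) has length 6; with the Manhattan heuristic every expanded bucket has f = 6, so only the index i advances.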

It is practical to pre-sort buffers in one bucket
immediately by an efficient internal sorting algorithm to
ease merging. Duplicates within an active bucket are
eliminated by merging all the pre-sorted buffers
corresponding to the same bucket, resulting in one sorted
file. This file can then be scanned to remove the duplicate
nodes from it. In fact, both the merging and the removal
of duplicates can be done simultaneously. Another
case of duplicate nodes appears when nodes
that have already been evaluated in the upper layers
are generated again. As in the algorithm of Munagala
and Ranade, External A* exploits the observation that,
in an undirected problem graph, duplicates of a node
with BFS-level i can at most occur in levels i, i-1 and
i-2. In addition, since h is a total function, we have
h(n) = h(n') if n = n'. These duplicate nodes can be
removed by file subtraction for the next active bucket
Open(g+1, h-1). We remove any node that has appeared
in buckets Open(g, h-1) and Open(g-1, h-1). This file
subtraction can be done by a mere parallel scan of the
pre-sorted files and by using a temporary file in which
the intermediate result is stored. It suffices to perform
the duplicate removal only for the bucket that is to be
expanded next.

Duplicate Detection Scope 

The number of previous layers that are sufficient for full
duplicate detection in directed graphs depends on
a property of the search graph called locality (Zhou and
Hansen, 2006). In the following, we generalize their
concept to weighted and directed search graphs. For a
problem graph with node set V, discrete cost function
c, successor set Succ, initial state s, and $\delta$ being defined
as the minimal cost between two states, the
shortest-path locality is defined as $L = \max\{\delta(s,n) - \delta(s,n') +
c(n,n') \mid n, n' \in V, n' \in \mathit{Succ}(n)\}$. In undirected unweighted graphs,
we have $c(n,n') = 1$ for all $n, n'$. Moreover, $\delta(s,n)$ and
$\delta(s,n')$ differ by at most 1, so that the locality is 2,
which is consistent with the observation of Munagala
and Ranade.
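
For a small explicit graph the shortest-path locality can be computed directly from its definition. The sketch below is an illustration under stated assumptions: the graph is a dictionary mapping each node to a dictionary of successors and positive edge costs, only nodes reachable from s are considered, and the function names `shortest_costs` and `locality` are hypothetical.

```python
import heapq


def shortest_costs(graph, s):
    """Dijkstra's algorithm: minimal cost delta(s, n) for reachable n."""
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # outdated queue entry
        for v, c in graph[u].items():
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return dist


def locality(graph, s):
    """max over edges (n, n') of delta(s, n) - delta(s, n') + c(n, n')."""
    delta = shortest_costs(graph, s)
    return max((delta[n] - delta[m] + c
                for n in delta for m, c in graph[n].items()),
               default=0)


# Undirected path 0 - 1 - 2 with unit costs (both edge directions listed).
path = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1}}
# Directed cycle 0 -> 1 -> 2 -> 3 -> 0 with unit costs.
cycle = {0: {1: 1}, 1: {2: 1}, 2: {3: 1}, 3: {0: 1}}
print(locality(path, 0), locality(cycle, 0))
```

The undirected path yields locality 2, while the directed cycle yields 4, because the edge from node 3 back to node 0 leads to a node that is three layers shallower.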

The locality determines the thickness of the search 
frontier needed to prevent duplicates from appearing 
in the search. While the locality is dependent on the 
graph, the duplicate detection scope also depends on 
the search algorithm applied. For BFS, the search tree 
is generated with increasing path lengths (number of 
edges), while for weighted graphs the search tree is 
generated with increasing path cost (this corresponds 
to Dijkstra's algorithm in the one-level bucket priority 
queue data structure). 

In a positively weighted search graph, the number
of buckets that need to be retained to prevent duplicate
search effort is equal to the shortest-path locality of
the search graph. Let us consider two nodes n and n',
with $n' \in \mathit{Succ}(n)$. Assume that n has been expanded
for the first time, generating the successor n', which has
already appeared in the layers $0, \ldots, \delta(s,n) - L$, implying
$\delta(s,n') \le \delta(s,n) - L$. We have $L \ge \delta(s,n) - \delta(s,n') +
c(n,n') \ge \delta(s,n) - (\delta(s,n) - L) + c(n,n') = L + c(n,n')$,
in contradiction to $c(n,n') > 0$.

Refinements 

Improvements of Munagala and Ranade's algorithm
for explicit undirected graphs have been proposed by
Mehlhorn and Meyer (2002), who suggest a more structured
access to the adjacency lists. Ajwani
et al. (2007) present an extensive empirical comparison
of these approaches on different kinds of graphs.

Hash-based delayed duplicate detection (Korf and 
Schultze, 2005) is designed to avoid the complexity of 
sorting. It is based on two orthogonal hash functions. 
The primary hash function distributes the nodes to 
different files. Once a file of successors has been gener- 
ated, duplicates are eliminated. The assumption is that 
all nodes with the same primary hash address fit into 
main memory. The secondary hash function maps all 
duplicates to the same hash address. 
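
The two-level hashing idea can be sketched as follows. This is an in-memory illustration, not the method of Korf and Schultze as published: Python lists stand in for the files on disk, a Python set plays the role of the secondary hash table, and the function name `hash_based_ddd` and its parameters are hypothetical.

```python
def hash_based_ddd(successors, num_files, primary=hash):
    """Remove duplicates from an iterable of nodes in two stages."""
    # Stage 1: the primary hash function distributes nodes to files, so
    # that all copies of a node end up in the same file.
    files = [[] for _ in range(num_files)]
    for node in successors:
        files[primary(node) % num_files].append(node)

    # Stage 2: each file is assumed to fit into main memory; an in-memory
    # hash table then detects the duplicates within that file.
    unique = []
    for contents in files:
        table = set()
        for node in contents:
            if node not in table:
                table.add(node)
                unique.append(node)
    return unique


print(sorted(hash_based_ddd([3, 1, 3, 2, 1, 7], num_files=4)))
```

No sorting is needed, but the approach relies on the assumption stated above that every file produced by the primary hash function fits into main memory.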

Structured duplicate detection (Zhou and Hansen, 
2004) builds up an abstract graph on top of the problem 
graph through a disjoint partitioning. For expansion, 
all states that are mapped to the same abstract node 
are loaded into the memory along with the nodes that 
are mapped to the neighbouring abstract nodes. The 
partition is defined in such a way that any two adja- 
cent nodes in the graph are mapped either to the same 
partition or to a neighbouring abstract partition. The 
successor nodes are checked against the neighbouring 
partitions and are removed if found as duplicates, as 
soon as they are generated. 

Applications 

Implementations for external model checking algorithms 
have been proposed by Kristensen and Mailund (2003), 
who suggested a sweep-line technique for scanning the 
search space according to a given partial order. For 
general LTL (Linear Temporal Logic) model checking, 
Edelkamp and Jabbar (2006) have extended External 
A* for safety and liveness checking and integrated it 
into the state-of-the-art model checker, SPIN. 
FUTURE TRENDS

It is often the case that external memory algorithms
can be lifted to parallel algorithms. With the advent of
multi-core processors and affordable PC clusters,
parallel algorithms become more and more important. Jabbar
and Edelkamp (2006) provide a parallel implementation
of External A* for model checking safety properties.
For probabilistic and non-deterministic models, value
iteration (VI) has been extended by Edelkamp et al.
(2007) to work on large state spaces that cannot fit into
the RAM. Instead of working on states, it works on
edges (n, n', a, h(n')), where n is called the predecessor
state, n' the stored state, a the action that transforms n
into n', and h(n') is the current value for n'. Similarly
to the internal version of VI, the external version of
VI works in two phases: a forward phase, where the
state space is generated, and a backward phase, where
the heuristic values are repeatedly updated until an
ε-optimal policy is computed or a maximum number of
iterations has been performed.



CONCLUSION 

We have presented a brief overview of disk-based search 
algorithms. These algorithms are especially designed 
for large graphs that cannot fit into the RAM. An ex- 
ternal variant of one of the basic search algorithms, 
i.e., Breadth-first search has been introduced. For the 
domains where a search algorithm can be equipped 
with some form of guidance to reduce the search ef- 
forts, External A* provides a complete and I/O efficient 
extension of the famous A* search algorithm. The whole
paradigm of disk-based search is largely dependent on
alternative forms of duplicate detection schemes. The
most general one is sorting-based delayed duplicate
detection. For special problems where a good disjoint
partitioning of the graph is possible, hash-based
duplicate detection and structured duplicate detection are
feasible choices. We have also presented a generalization
of the duplicate detection scope that dictates the
number of previous layers that have to be checked to 
guarantee that no node will be expanded twice. Finally, 
we saw some future trends directed towards an efficient 
utilization of modern multi-core hardware and to policy 
search methods. 

STXXL (Dementiev et al., 2005) provides an ef- 
ficient library for external memory data structures and 
algorithms. 
KEY TERMS

Delayed Duplicate Detection: In contrast to hash
tables that eliminate duplicate states on-the-fly during
the exploration, the process of duplicate detection can
be delayed until a large set of states is available. It is
very effective in external search, where it is efficiently
achieved by external sorting and scanning.

Graph: A set of nodes connected through edges. The
node at the head of an edge is called the target and
the node at the tail the source. A graph can be undirected, i.e.,
it is always possible to return to the source through the
same edge; otherwise it is a directed graph. If given
beforehand in the form of adjacency lists (e.g., a road
network), we call it an explicit graph. Implicit graphs,
another name for 'state spaces', are generated
on-the-fly from a start node and a set of rules/actions to
generate the new states (e.g., a checkers game).

Heuristic Function: A function that assigns to a node
an estimated distance to the goal node. For example, in
route-planning, the Euclidean distance can be used as
a heuristic function. A heuristic function is admissible
if it never overestimates the shortest path distance. It is
also consistent if it never decreases on any edge by more
than the edge weight, i.e., for a node n and its successor
n', $h(n) - h(n') \le c(n,n')$.

Memory Hierarchy: Modern hardware has a
hierarchy of storage media: starting from the fast
registers, the L1 and L2 caches, moving towards RAM
and all the way to the slow hard disks and tapes. The
latencies on different levels differ considerably,
e.g., registers: 2 ns, cache: 20 ns, hard disk: 10 ms, tape:
1 min.

Model Checking: An automated process that,
when given a model of a system and a property
specification, checks whether the property is satisfied by the system
or not. The properties requiring that 'something bad will
never happen' are referred to as safety properties, while
the ones requiring that 'something good will eventually
happen' are referred to as liveness properties.



Search Algorithm: An algorithm that when given 
two graph nodes, start and goal, returns a sequence 
of nodes that constitutes a path from start to the goal, 
if such a sequence exists. A search algorithm gener- 
ates the successors of a node through an expansion 
process, after which, the node is termed as a closed 
node. The newly generated successors are checked for 
duplicates, and when found as unique, are added to the 
set of open nodes. 

Value Iteration: Procedure that computes a policy 
(mapping from states to action) for a probabilistic or 
non-deterministic search problem most frequently in 
form of a Markov Decision Problem (MDP). 
INTRODUCTION

Distributed constraint reasoning is concerned with
modeling and solving naturally distributed problems.
It applies to coordination and negotiation between
semi-cooperative agents, namely agents that want to
achieve a common goal but would not give up private
information over secret constraints. Compared to
centralized constraint satisfaction (CSP) and constraint
optimization (COP), communication becomes one of the
most expensive operations. Other differences stem from
new coherence and privacy needs. We review approaches
based on asynchronous backtracking and depth-first
search spanning trees.

Distributed constraint reasoning started as an
outgrowth of research in constraints and multi-agent
systems. Take the sensor network problem in Figure
1, defined by a set of geographically distributed
sensors that have to track a set of mobile nodes. Each
sensor can watch only a subset of its neighborhood
at a given time. Three sensors need to simultaneously
focus on the same mobile node in order to locate it.
Approaches modeling and solving this problem with
distributed constraint reasoning are described in (Bejar,
Domshlak, Fernandez, Gomes, Krishnamachari, Selman,
& Valls, 2005).

Figure 1. Sensor network

There are two large classes of distributed constraint
problems. The first class is described by a set of Boolean
relations (aka constraints) on possible assignments of
variables, where the relations are distributed among 
agents. They are called distributed constraint satis- 
faction problems (DisCSPs). The challenge is to find 
assignments of variables to values such that all these 
relations are satisfied. However, the reasoning process 
has to be performed by collaboration among the agents. 
There exist several solutions to a problem, and ties have 
to be broken by some priority scheme. Such priorities 
may be imposed from the problem description where 
some agents, such as government agencies, are more 
important than others. In other problems it is important 
to ensure that different solutions or participants have 
equal chances, and this property is called uniformity. 
When no solution exists, one may still want to find an
assignment of the variables that violates as few
constraints as possible. The second class of problems refers
to numerical optimization described by a set of func- 
tions (weighted constraints) defined on assignments of 
variables and returning positive numerical values. The 
goal is to find assignments that minimize the objective 
function defined by the sum of these functions. The 
problems obtained in this way are called distributed 
constraint optimization problems (DisCOPs). Some 
problems require a fair distribution of the amount of 
dissatisfaction among agents, minimizing the dissat- 
isfaction of the most unsatisfied agent. 

There are also two different ways of distributing
a problem. The first way consists of distributing the
data associated with it. It is defined in terms of which
agents know which constraints. It can be shown that any
such problem can be translated into problems where all
non-shared constraints are unary (constraints involving
only one variable), also called domain constraints.
Here one can assume that there exists a single unary
constraint for each variable. This is because any
second unary constraint can be reformulated on a new
variable, required to be equal to the original variable.
The agent holding the unique domain constraint of a
variable is called the owner of that variable. Due to
the availability of this transformation, many solutions
focus on the case where only the unary constraints are
not shared by everybody (also said to be private to the
agents that know them). Another common simplification
consists in assuming that each agent has a single unary
constraint (i.e., a single variable). This simplification
does not reduce the generality of the addressable
problems, since an agent can participate in a computation
under several names, e.g., one instance for each unary
constraint of the original agent. Such false identities
for an agent are called pseudo-agents (Modi, Shen,
Tambe, & Yokoo, 2005), or abstract agents (Silaghi
& Faltings, 2005).

The second way of distributing a problem is in terms 
of who may propose instantiations of a variable. In such 
an approach each variable may be assigned a value 
solely by a subset of the agents while the other agents 
are only allowed to reject the proposed assignment. 
This distribution is similar to restrictions seen in some 
societies where only the parliament may propose a ref- 
erendum while the rest of the citizens can only approve 
or reject it. Approaches often assume the simultaneous 
presence of both ways of distributing the problem. They 
commonly assume that the only agent that can make 
a proposal on a variable is the agent holding the sole 
unary constraint on that variable, namely its owner 
(Yokoo, Durfee, Ishida, & Kuwabara, 1998). When 
several agents are allowed to propose assignments of 
a variable, these authorized agents are called modifiers 
of that variable. An example is where each holder of a 
constraint on a variable is a legitimate modifier of that 
variable (Silaghi & Faltings, 2005). 



BACKGROUND 

The first challenge addressed was the development of
asynchronous algorithms for solving distributed
problems. Synchronization forces distributed processes to
run at the speed of the slowest link. Algorithms that do
not use synchronization, namely where participants
are at no point aware of the current state of other
participants, are flexible but more difficult to design. With
the exception of a few solution detection techniques
(Yokoo & Hirayama, 2005; Silaghi & Faltings, 2005),
most approaches gather the answer to the problem by
reading the state of the agents after the system becomes
idle and reaches the so-called quiescence state (Yokoo
et al., 1998). Algorithms that eventually reach
quiescence are also called self-stabilizing (Collin, Dechter,
& Katz, 1991). A complete algorithm is one that is
guaranteed not to miss any existing solution. A
sound algorithm is one that never terminates
in a suboptimal state.

Another challenge taken up by distributed constraint
reasoning research consists of providing privacy for the
sub-problems known by agents (Yokoo et al., 1998).
The object of privacy can be of different types. The
existence of a constraint between two variables may
be secret, as may the existence of a variable itself.
Many approaches only try to ensure the secrecy of
the constraints, i.e., to hide the identity of the
valuations that are penalized by a constraint. For
optimization problems one also assumes a need to
keep secret the amount of the penalty induced by the
constraint. As mentioned previously, it is possible to
model such problems in a way where all secret
constraints are unary (an approach known as having private
domains). Some problems may have both secret and
public constraints. Such public constraints may be used
for efficient preprocessing prior to the expensive
negotiation implied by secret constraints. Solvers
that support guarantees of privacy at any cost employ
cryptographic multi-party computations (Yao, 1982).
There exist several cryptographic technologies for such
computations, and some of them can be used
interchangeably by distributed problem solvers. However,
some of them offer information-theoretic security
guarantees (Shamir, 1979), being resistant to any amount
of computation, while others offer only cryptographic
security (Cramer, Damgaard, & Nielsen, 2000) and
can be broken using large amounts of computation or
quantum computers. The result of a computation may
itself reveal secrets, and the damage can be reduced by
carefully formulating the query to the solver.
For example, less information is lost by requesting
the solution to be picked randomly than by requesting
the first solution. The computations can be done
cryptographically by a group of semi-trusted servers,
or they can be performed by the participants themselves.
A third issue in solving distributed problems is raised
by the size and the dynamism of the system.



DISTRIBUTED CONSTRAINT 
REASONING 

Framework 

A common definition of a distributed constraint
optimization problem (DisCOP) (Modi et al., 2005)
consists of a set of variables $X=\{x_1, \ldots, x_n\}$ and a set
of agents $A=\{A_1, \ldots, A_n\}$, each agent $A_i$ holding a set
of constraints. Each variable $x_i$ can be assigned only
those values that are allowed by a domain
constraint $D_i$. A constraint $\phi_i$ on a set of variables $X_i$
is a function associating a positive numerical value to
each combination of assignments to the variables in $X_i$.
The typical challenge is to find an assignment to the
variables in $X$ such that the sum of the values returned
by the constraints of the agents is minimized.
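
The objective in this definition can be illustrated with a small centralized sketch. It only shows what is being minimized; it is not a distributed solver, and the three variables, their domains, and the penalties 4, 3 and 2 are hypothetical values chosen for the example.

```python
from itertools import product

# Hypothetical domains D_i for three variables.
domains = {"x1": [1, 2], "x2": [1, 2], "x3": [1, 2]}

def penalty_if_equal(a, b, penalty):
    """Weighted constraint: cost `penalty` when the two variables are equal."""
    return lambda assignment: penalty if assignment[a] == assignment[b] else 0

# Hypothetical weighted constraints phi_12, phi_23, phi_13.
constraints = [
    penalty_if_equal("x1", "x2", 4),
    penalty_if_equal("x2", "x3", 3),
    penalty_if_equal("x1", "x3", 2),
]

def total_cost(assignment):
    """Objective: the sum of the values returned by all constraints."""
    return sum(constraint(assignment) for constraint in constraints)

def solve(domains):
    """Enumerate every full assignment and keep the first cheapest one."""
    names = list(domains)
    candidates = (dict(zip(names, values))
                  for values in product(*(domains[n] for n in names)))
    return min(candidates, key=total_cost)

best = solve(domains)
```

With two values and three pairwise inequality constraints, at least one constraint must be violated, so the sketch settles on violating the cheapest one.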

A tuple of assignments is also called a partial
solution. A restriction often used with DisCOPs requires
that each agent $A_i$ holds only constraints between $x_i$
and a subset of the previous variables, $\{x_1, \ldots, x_{i-1}\}$.
Also, for any agent $A_i$, the agents $\{A_1, \ldots, A_{i-1}\}$ are the
predecessors of $A_i$ and the agents $\{A_{i+1}, \ldots, A_n\}$ are its
successors.

To understand the generality and limitations of this
restriction, consider a conference organization problem
with 3 variables $x_1$ (time), $x_2$ (place), and $x_3$ (general
chair) and 3 constraints $\phi_{12}$ (between $x_1$ and $x_2$), $\phi_{23}$
(between $x_2$ and $x_3$), and $\phi_{13}$ (between $x_1$ and $x_3$), where
Alice has $\phi_{12}$, Bob enforces $\phi_{23}$, and Carol is interested
in $\phi_{13}$, Figure 3.

This problem can be modeled as a DisCOP with 4
agents. Alice uses two agents, $A_1$ and $A_2$. The original
participant is called a physical agent and the agents of
the model are called pseudo-agents. Bob uses the agent
$A_3$ and Carol uses an agent $A_4$. The new variable $x_4$ of
the agent $A_4$ is involved in a ternary constraint $\phi_{134}$
with $x_1$ and $x_3$. The constraint $\phi_{134}$ is constructed such
that its projection on $x_1$ and $x_3$ is $\phi_{13}$.

However, the restricted framework cannot help
general-purpose algorithms to learn and exploit the fact that
agents $A_1$ and $A_2$ know each other's constraints. It also
requires finding an optimal value for the variable $x_4$,
which is irrelevant to the query. To avoid these
limitations, some approaches remove the restriction on
which variables can be involved in the constraints of
an agent, and can obtain some improvements in speed
(Silaghi & Faltings, 2005). Other frameworks, typically
used with hill-climbing solvers, with solvers that
reorder agents, and with arc consistency, assume that
each agent $A_i$ knows all the constraints that involve the
variable $x_i$. This implies that any constraint between
two variables $x_i$ and $x_j$ is known by both agents $A_i$ and
$A_j$. In general, a problem modeled as a DisCOP where
any private constraint may be held by any agent can
be converted to its dual representation in order to
obtain a model within this framework. When penalties
for constraint violation can only take values in $\{0, \infty\}$,
corresponding to {true, false}, one obtains distributed
constraint satisfaction problems.

Figure 2. Translating between DisCOP frameworks

A protocol is a set of rules about what messages may
be exchanged by agents, when they may be sent, and
what may be contained in their payload. A distributed
algorithm is an implementation of a protocol, as it
specifies an exact sequence of operations to be performed as
a response to each event, such as the start of computation
or the receipt of a message. Autonomous self-interested
agents are more realistically expected to implement
protocols rather than to strictly adhere to algorithms.
Protocols can be theoretically proved correct. However,
experimental validation and efficiency evaluation of a
protocol are done by assuming that agents strictly follow
some algorithm implementing that protocol.

Efficiency Metrics 

The simplest metric for evaluating DisCOP solvers is
the time from the beginning of a distributed
computation to its end. Measuring it is possible only with a real
distributed system (or a very realistic simulation). The
network load for benchmarks is evaluated by counting the
total number of messages exchanged or the total number of
bytes exchanged. The total time taken by a simulator
yields the efficiency of a DisCOP solver when used as a
weighted CSP solver. Another common metric is given
by the highest value of the logical clocks (Lamport, 1978)
occurring during the computation. Lamport's logical clocks
associate a cost with each message and another cost with
each local computation. When the cost assigned to each
message is 1 and the cost of local computations is 0,
the obtained value gives the longest sequential chain of
causally ordered messages (Silaghi & Faltings, 2005).
When all message latencies are identical, this metric
is equivalent to the number of rounds of a simulator
where at each round an agent handles all messages
received in the previous round (Yokoo et al., 1998).
If the cost assigned to each message is 0 and the cost
of a constraint check is 1, the obtained value gives the
number of non-concurrent constraint checks (NCCC)
(Meisels, Kaplansky, Razgon, & Zivan, 2002). When
a constraint check is assumed to cost a fraction of a
message, the obtained value gives the equivalent
NCCCs (ENCCCs). One can evaluate the actual ratio
between message latencies and constraint checks at the
operating point (OP) of the target application (Silaghi
& Yokoo, 2007). However, many distributed solvers
do not check constraints directly but via nogoods, and
there is no standardized way of accounting for the
handling of the latter.
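
The logical-clock metrics can be made concrete with a short sketch. It assumes a simplified trace format invented for this example: each event is either ("check", agent, count) or ("msg", sender, receiver), listed in an order consistent with causality, and a message is taken to be delivered before the receiver's next listed event.

```python
def highest_logical_clock(trace, message_cost, check_cost):
    """Replay a trace and return the highest logical clock reached."""
    clock = {}
    for event in trace:
        if event[0] == "check":
            _, agent, count = event
            # Local computation advances only the agent's own clock.
            clock[agent] = clock.get(agent, 0) + count * check_cost
        else:
            _, sender, receiver = event
            clock.setdefault(sender, 0)
            stamp = clock[sender] + message_cost
            # The receiver jumps forward if the message carries a later time.
            clock[receiver] = max(clock.get(receiver, 0), stamp)
    return max(clock.values(), default=0)

# Hypothetical trace with three agents.
trace = [
    ("msg", "A1", "A2"),
    ("msg", "A1", "A3"),
    ("check", "A2", 5),
    ("msg", "A2", "A3"),
    ("check", "A3", 2),
    ("msg", "A3", "A1"),
]

longest_chain = highest_logical_clock(trace, message_cost=1, check_cost=0)
nccc = highest_logical_clock(trace, message_cost=0, check_cost=1)
```

Here the longest causal chain is A1 to A2 to A3 to A1, and the non-concurrent checks are the 5 checks of A2 followed by the 2 checks of A3. An ENCCC-style value would be obtained by passing a nonzero cost for both parameters.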

Techniques 

Solving algorithms span the range between full cen- 
tralization, where all constraints are submitted to a 
central server that returns a solution, through incre- 
mental centralization (Mailler & Lesser, 2004), to very 
decentralized approaches (Walsh, Yokoo, Hirayama, 
& Wellman, 2003). 

The Depth-First Search (DFS) spanning trees of
the constraint graph prove useful for DisCOP
solvers. When such a tree is used as a basis for ordering
agents, the assignment of any node of the tree makes
its subtrees independent (Collin, Dechter, & Katz,
2000). Such independence increases parallelism and
decreases the complexity of the problem. The
structure can be exploited in three ways. Subtrees can be
explored in parallel for an opportunistic evaluation of
the best branch, reminiscent of iterative A* (Modi et
al., 2005). Alternatively, a branch and bound approach
can systematically evaluate different values of the root
for each subtree (Chechetka & Sycara, 2006). A third
approach uses dynamic programming to evaluate the
DFS trees from the leaves towards the root (Petcu &
Faltings, 2006).
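
The independence property can be checked on a small sketch. In a DFS spanning tree, every constraint links a node to one of its ancestors, so no constraint crosses between sibling subtrees. The five-variable graph below is hypothetical, and the centralized traversal stands in for what would be a distributed construction.

```python
def dfs_tree(neighbors, root):
    """Return the parent map of a DFS spanning tree of the constraint graph."""
    parent = {root: None}

    def visit(node):
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                visit(nxt)

    visit(root)
    return parent

def ancestors(parent, node):
    """Collect all nodes on the tree path from node up to the root."""
    found = set()
    while parent[node] is not None:
        node = parent[node]
        found.add(node)
    return found

# Hypothetical constraint graph: an edge means two variables share a constraint.
neighbors = {
    "x1": {"x2", "x3", "x4"},
    "x2": {"x1", "x4"},
    "x3": {"x1", "x5"},
    "x4": {"x1", "x2"},
    "x5": {"x3"},
}
parent = dfs_tree(neighbors, "x1")

# Every constraint connects a node with one of its tree ancestors.
edges_respect_tree = all(
    u in ancestors(parent, v) or v in ancestors(parent, u)
    for u in neighbors for v in neighbors[u]
)
```

Once x1 is assigned, the subtree {x2, x4} and the subtree {x3, x5} share no constraint and can be solved in parallel.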

Asynchronous usage of lookahead techniques based
on maintenance of arc consistency and bound
consistency requires handling interacting data structures
corresponding to different concurrent computations.
Concurrent consistency achievement processes at
different depths in the search tree have to be coordinated,
giving priority to computations at low depths in the
tree (Silaghi & Faltings, 2005).

The concept at the basis of many asynchronous
algorithms is the nogood, namely a self-contained
statement about a restriction to the valuations of the
variables, inferred from the problem. A generalized
valued nogood has the form $[R, c, T]$, where $T$ specifies
a set of partial solutions $\{N_1, \ldots, N_k\}$ for which the set of
constraints $R$ specifies a penalty of at least $c$. A common
simplification, called valued nogood (Dago & Verfaillie,
1996), refers to a single partial solution, $[R, c, N]$.
Priority-induced vector clock timestamps called signatures can
be used to arbitrate between conflicting assignments
concurrently proposed by several modifiers (Silaghi &
Faltings, 2005). They can also handle other types of
conflicting proposals, such as new orderings.
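
A valued nogood $[R, c, N]$ can be held in a small record, as in the sketch below. The class and field names are assumptions made for illustration, as is the choice to store $R$ as one Boolean flag per agent; the sketch only tests whether a nogood applies to a given assignment and does not implement nogood inference.

```python
from dataclasses import dataclass

@dataclass
class ValuedNogood:
    used: tuple     # R: used[i] is True if constraints of agent i+1 were used
    cost: float     # c: lower bound on the penalty
    partial: dict   # N: the partial solution, as variable -> value

def applies(nogood, assignment):
    """True if the assignment extends the nogood's partial solution."""
    return all(assignment.get(var) == val
               for var, val in nogood.partial.items())

def lower_bound(nogoods, assignment):
    """Best penalty bound that a single applicable nogood gives."""
    return max((ng.cost for ng in nogoods if applies(ng, assignment)),
               default=0)

# Hypothetical nogood: the constraints of the third agent imply a penalty
# of at least 2 whenever x1 = 1.
ng = ValuedNogood(used=(False, False, True), cost=2, partial={"x1": 1})
```

Taking the maximum over applicable nogoods is a deliberately conservative choice: costs of different nogoods are not added here, since that would require checking that their constraint sets do not overlap.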

ADOPT-ing is an illustrative algorithm unifying
the basic DisCSP and DisCOP solvers ABT (Yokoo et
al., 1998) and ADOPT (Modi et al., 2005). It works by
having each agent concurrently choose for its variable the
best value given the known assignments of predecessors
and the cost estimations received from successors (Silaghi
& Yokoo, 2007). Each agent announces its assignments
to interested successors using ok? messages. Agents are
interested in variables involved in their constraints or
nogoods. When a nogood is received, agents announce
new interests using add-link messages. A forest of
DFS trees is dynamically built. Initially each agent is
a tree, having no ancestors. When a constraint is first
used, the agent adds its variables to its ancestors
list and defines its parent in the DFS tree as the
closest ancestor. Ancestors are informed of their own
new ancestors. Nogoods inferred by an agent using
resolution on its nogoods and constraints are sent to
targeted predecessors and to its parent in the DFS
tree using nogood messages, to guarantee optimality.
Known costs of DFS subtrees for some values can be
announced to those subtrees using threshold nogoods
attached to ok? messages.



Example

An asynchronous algorithm could solve the problem in
Figure 3 using the trace in Figure 4. In the messages of
Figure 4, constraints are represented as Boolean values
in an array. The $i$-th value in this array being set to T
signifies that the constraints of $A_i$ are used in the inference
of that nogood. The agents start by selecting values for
their variables and announcing them to interested
lower-priority agents. The first exchanged messages are ok?
messages sent by $A_1$ to both successors $A_2$ and $A_3$,
proposing the assignment $x_1=1$. $A_2$ sends an ok?
message to $A_3$ proposing $x_2=2$.

$A_3$ detects a conflict with $x_1$, inserts $A_1$ in its DFS
tree ancestors list, and sends a nogood with cost 2 to
$A_1$ (message 3). $A_1$ answers the received nogood by
switching its assignment to a value with lower
current estimated value, $x_1=2$ (message 4). $A_2$ reacts by
switching $x_2$ to its lowest-cost value, $x_2=1$ (message 5).
$A_3$ detects a conflict with $x_2$ and inserts $A_2$ in its
ancestors list, which becomes $\{A_1, A_2\}$. $A_3$ also announces
the conflict to $A_2$ using the nogood message 6. This
nogood received by $A_2$ is combined with the nogood
locally inferred by $A_2$ for its value 2 due to the constraint
$x_1 \neq x_2$ (#4). That inference also prompts the insertion of
$A_1$ in the ancestors list of $A_2$. The obtained nogood is
therefore sent to $A_1$ using message 7. $A_1$ and later $A_2$
switch their assignments to the values with the lowest
cost, attaching the latest nogoods received for those
values as threshold nogoods (messages 8, 9 and 10).
At this moment the system reaches quiescence.

Figure 3. The constraint graph of a DisCOP. The fact
that the penalty associated with not satisfying the
constraint $x_1 \neq x_2$ is 4 is denoted by the notation (#4).



FUTURE TRENDS

The main remaining challenges with distributed
constraint reasoning relate to efficient ways of
achieving privacy and to handling very large problems.

CONCLUSION

The distributed constraint reasoning paradigms allow
easy specification of new problems. The notion of
distribution varies widely from one researcher to another.
It can refer to the distribution of subproblems or it can
refer to the distribution of authority in assigning variables.
The reasons and goals of the distribution vary as well,
with privacy of constraints, parallelism in computation,
or size of data cited as the major concern.
Most algorithms can be easily translated from one
framework to the other, but they may not be
appropriate for a new goal.
Figure 4. Simplified trace of an asynchronous solver (ADOPT-ing (Silaghi & Yokoo, 2007)) on the problem in
Figure 3
KEY TERMS

Agent: A participant in a distributed computation, 
having its own constraints. 

Constraint: A relation between variables specifying
a subset of their Cartesian product that is not permit- 
ted. Optionally it can also specify numeric penalties 
for those tuples. 



DisCOP: Distributed Constraint Optimization 
Problem framework (also DCOP). 

DisCSP: Distributed Constraint Satisfaction Prob- 
lem framework (also DCSP). 

Nogood: A logic statement about combinations of
assignments that are penalized due to some constraints.

Optimality: The quality of an algorithm of return- 
ing only solutions that are at least as good as any other 
solution. 

Quiescence: The state of being inactive. The system 
will not change without an external stimulus. 
INTRODUCTION

AI models are often categorized in terms of the
connectionist vs. symbolic distinction. In addition to being
descriptively unhelpful, these terms are also typically
conflated with a host of issues that may have nothing
to do with the commitments entailed by a particular
model. A more useful distinction among cognitive
representations asks whether they are local or distributed
(van Gelder 1999).

Traditional symbol systems (grammar, predicate
calculus) use local representations: a given symbol has
no internal content and is located at a particular address
in memory. Although well understood and successful in
a number of domains, traditional representations suffer
from brittleness. The number of possible items to be
represented is fixed at some arbitrary hard limit, and a
single corrupt memory location or broken pointer can
wreck an entire structure.

In a distributed representation, on the other hand,
each entity is represented by a pattern of activity
distributed over many computing elements, and each
computing element is involved in representing many
different entities (Hinton 1984). Such representations
have a number of properties that make them
attractive for knowledge representation (McClelland,
Rumelhart, & Hinton 1986): they are robust to noise,
degrade gracefully, and support graded comparison
through distance metrics. These properties enable fast
associative memory and efficient comparison of entire
structures without unpacking the structures into their
component parts.

This article provides an overview of distributed
representations, setting the approach in its historical
context. The two essential operations necessary for
building distributed representations of structures
(binding and bundling) are described. We present example
applications of each model, and conclude by discussing
the current state of the art.



BACKGROUND

The invention of the backpropagation algorithm
(Rumelhart, Hinton, & Williams 1986) led to a flurry of
research in which neurally inspired models were
applied to tasks for which the use of traditional AI data
structures and algorithms was commonly assumed to
be the only viable approach. A compelling feature of
these new models was that they could "discover" the
representations best suited to the modelling domain,
unlike the manmade representations used in traditional
AI. These discovered or learned representations were
typically vectors of numbers in a fixed interval like [0,
1], representing the values of the hidden variables. A
statistical technique like principal component analysis
could be applied to such representations, revealing
interesting regularities in the training data (Elman 1990).

Issues concerning the nature of the representa- 
tions learned by backpropagation led to criticisms of 
this work. The most serious of these held that neural 
networks could not arrive at or exploit systematic, 
compositional representations of the sort used in tra- 
ditional cognitive science and AI (Fodor & Pylyshyn 
1988). A minimum requirement noted by critics was 
that a model that could represent e.g. the idea John loves 
Mary should also be able to represent Mary loves John 
(systematicity) and to represent John, Mary, and loves
individually in the same way in both (compositionality).
Critics claimed that neural networks are in principle
unable to meet this requirement.

Systematicity and compositionality can be thought 
of as the outcome of two essential operations: binding 
and bundling. Binding associates fillers (John, Mary) 
with roles (lover, beloved). Bundling combines role/ 
filler bindings to produce larger structures. Crucially, 
representations produced by binding and bundling 
must support an operation to recover the fillers of 
roles: it must be possible to ask "Who did what to
whom?" questions and get the right answer. Starting
around 1990, several researchers began to focus their
attention on building models that could perform these
operations reliably.



VARIETIES OF DISTRIBUTED 
REPRESENTATION 

This article describes the various approaches found in 
the recent neural network literature to implementing 
the binding and bundling operations. Although several 
different models have been developed, they fall into 
one of two broad categories, based on the way that 
roles are represented and how binding and bundling 
are performed. 

Recursive Auto-Associative Memory 

In Recursive Auto-Associative Memory, or RAAM
(Pollack 1990), fillers are represented as relatively
small vectors (N = 10-50 elements) of zeros and ones.
Roles are represented as N x N matrices of real values,
and role/filler binding as the vector/matrix product.
Bundling is performed by element-wise addition of
the resulting vectors. There are typically two or three
role matrices, representing general role categories like
agent and patient, plus another N x N matrix to
represent the predicate (loves, sees, knows, etc.).
Because all vectors are the same size N, vectors containing
bindings can be used as fillers, supporting structures of
potentially unlimited complexity (Bill knows Fred said
John loves Mary). The goal is to learn a set of matrix
values (weights) to encode a set of such structures.



In order to recover the fillers, a corresponding set
of matrices must be trained to decode the vectors
produced by the encoder matrices. Together, the encoder
and decoder matrices form an autoassociator network
(Ackley, Hinton, & Sejnowski 1985) that can be trained
with backpropagation. The only additional constraint
needed for backprop is that the vector/matrix products
be passed through a limiting function, like the sigmoidal
"squashing" function $f(x) = 1 / (1 + e^{-x})$, whose output
falls in the interval (0,1). Figure 1 shows an example
of autoassociative learning for a simple hypothetical
structure, using three roles, with N=4. The same
network is shown at different stages of training (sub-tree
and full tree) during a single backprop epoch. Note that
the network devises its own compositional
representations on its intermediate ("hidden") layer, based on
arbitrary binary vectors chosen by the experimenter.
Unlike these binary vectors (black and white units), the
intermediate representations can have values between
zero and one (grey scale).
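
The encoding half of this scheme can be sketched in a few lines. The sketch uses random, untrained matrices purely to show the data flow (binding as a matrix product, bundling as addition, then squashing); the filler codes, the seed, and all function names are assumptions made for the example, and no backpropagation is performed.

```python
import math
import random

N = 4  # vector size, as in the N=4 example

def sigmoid(x):
    """Squashing function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Product of an N x N matrix with a length-N vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(role_matrices, fillers):
    """Bind each filler to its role, bundle by addition, then squash."""
    total = [0.0] * N
    for matrix, filler in zip(role_matrices, fillers):
        bound = matvec(matrix, filler)                  # binding
        total = [t + b for t, b in zip(total, bound)]   # bundling
    return [sigmoid(t) for t in total]

rng = random.Random(0)
# Three untrained role matrices: predicate, agent, patient.
roles = [[[rng.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
         for _ in range(3)]

# Hypothetical binary filler codes chosen by the experimenter.
knows, loves = [0, 1, 0, 0], [1, 0, 0, 0]
bill, john, mary = [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]

inner_code = encode(roles, [loves, john, mary])        # (loves john mary)
outer_code = encode(roles, [knows, bill, inner_code])  # (knows bill (...))
```

Because the encoded sub-structure has the same size N as any filler, it can be fed back in as a filler, which is what makes the encoding recursive.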

Once the RAAM network has learned a set of 
structures, the decoder sub-network should be able to 
recursively unpack each learned representation into its 
constituent elements. As shown in Figure 2, decoding is 
a recursive process that terminates when the decoder's 
output is similar enough to a binary string and continues 
otherwise. In the original RAAM formulation, "similar 
enough" was determined by thresholds: if a unit's value 
was above 0.8, it was considered to be on, and if it was 
below 0.2 it was considered to be off. 
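
The threshold-based termination test can be written down directly. In the sketch below, `decoder` is a hypothetical stand-in for the trained decoder sub-network, and the depth limit is an added safeguard that is not part of the original formulation.

```python
def is_terminal(vector, high=0.8, low=0.2):
    """True if every unit is clearly on (> high) or clearly off (< low)."""
    return all(v > high or v < low for v in vector)

def decode(vector, decoder, max_depth=10):
    """Recursively unpack a representation into nested binary vectors."""
    if is_terminal(vector):
        return [1 if v > 0.8 else 0 for v in vector]
    if max_depth == 0:
        raise RecursionError("decoding did not terminate")
    return [decode(child, decoder, max_depth - 1)
            for child in decoder(vector)]

def toy_decoder(vector):
    """Toy stand-in: always returns two nearly binary children."""
    return [[0.9, 0.1], [0.05, 0.95]]

tree = decode([0.5, 0.6], toy_decoder)
```

A vector with any unit between 0.2 and 0.8 is treated as a non-terminal and handed back to the decoder.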

RAAM answered the challenge of showing how
neural networks could represent compositional structures
in a systematic way. The representations discovered by
RAAM could be compared directly via distance metrics,
and transformed in a rule-like way, without having to
recursively decompose the structure elements as in
traditional localist models (Chalmers 1990). Nevertheless,
the model failed to scale up reliably to data sets of more
than a few dozen different structures. This limitation
arose from the termination test, which created a variety
of "halting problem": decoding often terminated too
soon or continued indefinitely. In addition, encodings
of novel structures were typically decoded to
already-learned structures, a failure in generalization.

Figure 1. Learning the structure (knows bill (loves John mary)) with RAAM

A number of solutions were developed to deal 
with this problem. One solution (Levy and Pollack 
2001) built on the insight that the RAAM decoder is 
essentially an iterated function system, or IFS (Barnsley 
1993). Use of the sigmoidal squashing function en- 
sures that this IFS has an attractor, which is the infinite 
set (Cantor dust) of vectors reachable on an infinite 
number of feedback iterations from any initial vector 
input to the decoder. A more "natural" termination 
test is to check whether the output is a member of the 
set of vectors that make up this attractor. Fixing the 
numerical precision of the decoder results in a finite 
number of representable vectors, and a finite time to 
reach the attractor, so that membership in the attractor 
can be determined efficiently. 

This approach to the termination test produced a 
RAAM decoder that could store a provably infinite 
number of related structures (Melnik, Levy, & Pollack 
2001). Because the RAAM network was no longer an 
autoassociator, however, it was not clear what sort of 
algorithm could replace backpropagation for learning 
a specific, finite set of structures. 

The other solution to the RAAM scaling problem
discarded the nonlinear sigmoidal squashing function
and replaced backprop with principal components
analysis (PCA) as a means of learning internal
representations (Callan 1996). This approach yielded
the ability to learn a much larger set of structures in
many fewer iterations (Voegtlin & Dominey 2005),
and showed generalization similar in some respects to
what has been observed for children acquiring a first
language (Tomasello 1992).

Vector Symbolic Architectures 

Vector Symbolic Architectures is a term coined by Gayler
(2003) for a general class of distributed representation
models that implement binding and bundling directly,
without an iterative learning algorithm of the sort used
by RAAM. These models can trace their origin to
the Tensor Product model of Smolensky (1990).
Tensor-product models represent both fillers and roles as
vectors of binary or real-valued numbers. Binding is
implemented by taking the tensor (outer) product of a
role vector and a filler vector, resulting in a
mathematical object (matrix) having one more dimension than the
filler. Given vectors of sufficient length, each tensor
product will be unique. As with RAAM, bundling can
then be implemented as element-wise addition (Figure
3), and bundled structures can be used as roles, opening
the door to recursion. To recover a filler (role) from a
bundled tensor product representation, the product is
simply divided by the role (filler) vector.
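
Tensor-product binding, bundling, and recovery can be sketched as follows. The example assumes orthonormal role vectors, in which case "dividing by the role" reduces to contracting the bundled matrix with the role vector; the particular vectors are hypothetical, and the predicate is omitted for brevity.

```python
import math

def bind(role, filler):
    """Tensor (outer) product of a role vector and a filler vector."""
    return [[r * f for f in filler] for r in role]

def bundle(a, b):
    """Element-wise addition of two bound matrices."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def unbind(tensor, role):
    """Contract the bundled matrix with a role to recover its filler."""
    columns = len(tensor[0])
    return [sum(role[i] * tensor[i][j] for i in range(len(role)))
            for j in range(columns)]

s = 1.0 / math.sqrt(2.0)
lover, beloved = [s, s], [s, -s]   # orthonormal role vectors (assumed)
john, mary = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]

structure = bundle(bind(lover, john), bind(beloved, mary))
who_loves = unbind(structure, lover)
who_is_loved = unbind(structure, beloved)
```

Because the two roles are orthogonal, the contribution of the other binding cancels out and each filler is recovered up to floating-point error.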

Because the dimension of the tensor product
increases with each binding operation, this method
suffers from the well-known "curse of dimensionality"
(Bellman 1961). As more recursive embedding
is performed, the size of the representation grows
exponentially. The solution is to collapse the N x N
role/filler matrix back into a length-N vector. As shown
in Figure 4, there are two ways of doing this. In Binary
Spatter Coding, or BSC (Kanerva 1994), only the elements
along the main diagonal are kept, and the rest
are discarded. If bit vectors are used, this operation is
the same as taking the exclusive or (XOR) of the two
vectors. In Holographic Reduced Representations, or
HRR (Plate 1991), the sum of each diagonal is taken,
with wraparound (circular convolution) keeping the
length of all diagonals equal. Both approaches use very
large (N > 1000 elements) vectors of random values
drawn from a fixed set or interval.

Figure 2. Decoding a structure to its constituents
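
The two ways of collapsing the role/filler matrix can be illustrated with a short, hedged Python sketch. It assumes element-wise XOR for BSC binding and a naive O(N^2) circular convolution for HRR binding, with circular correlation as the approximate inverse; the vector length, the random distributions and all function names are choices made for this example only.

```python
import math
import random

def bsc_bind(a, b):
    """BSC-style binding: element-wise XOR of two bit vectors."""
    return [x ^ y for x, y in zip(a, b)]

def circular_convolution(a, b):
    """HRR-style binding: c[i] = sum_k a[k] * b[(i - k) mod n]."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]

def circular_correlation(a, c):
    """Approximate unbinding: d[i] = sum_k a[k] * c[(i + k) mod n]."""
    n = len(a)
    return [sum(a[k] * c[(i + k) % n] for k in range(n)) for i in range(n)]

def cosine(u, v):
    """Cosine similarity between two real vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

rng = random.Random(0)
N = 512  # kept small so the naive loops stay fast; an assumption of this sketch

# BSC: XOR is its own inverse, so unbinding is exact.
role_bits = [rng.randint(0, 1) for _ in range(N)]
filler_bits = [rng.randint(0, 1) for _ in range(N)]
recovered_bits = bsc_bind(bsc_bind(role_bits, filler_bits), role_bits)

# HRR: unbinding returns a noisy version of the filler.
def random_real():
    return [rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)]

role, filler, unrelated = random_real(), random_real(), random_real()
bound = circular_convolution(role, filler)
noisy_filler = circular_correlation(role, bound)
```

In this sketch the XOR round trip is exact, whereas the HRR result is only similar to the original filler (its cosine similarity to `filler` is clearly higher than to `unrelated`), which is the noise that a cleanup step has to remove.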

Despite the size of the vectors, VSA approaches
are computationally efficient, requiring no costly
backpropagation or other iterative algorithm, and can
be done in parallel. Even in a serial implementation,
the BSC approach is O(N) for a vector of length N, and
the HRR approach can be implemented using the Fast
Fourier Transform, which is O(N log N). The price
paid is that most of the crucial operations (circular
convolution, vector addition) are a form of lossy
compression that introduces noise into the representations.
The introduction of noise requires that the unbinding
process employ a "cleanup memory" to restore the
fillers to their original form. The cleanup memory can
be implemented using Hebbian auto-association, like
a Hopfield Network (Hopfield 1982) or Brain-State-in-a-Box
model (Anderson, Silverstein, Ritz, & Jones
1977). In such models the original fillers are attractor
basins in the network's dynamical state space. These
methods can be simulated by using a table that stores
the original vectors and returns the one closest to the
noisy version.
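
The table-based simulation of a cleanup memory mentioned above can be sketched as a nearest-neighbour lookup. The example below is a minimal illustration, assuming bit vectors, Hamming distance as the closeness measure, and a hypothetical three-item memory; none of these details come from the cited models.

```python
import random

def hamming_distance(u, v):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))

def cleanup(noisy, memory):
    """Return the name of the stored vector closest to the noisy vector."""
    return min(memory, key=lambda name: hamming_distance(noisy, memory[name]))

rng = random.Random(1)
N = 1000
# Hypothetical item memory: one random bit vector per known filler.
memory = {name: [rng.randint(0, 1) for _ in range(N)]
          for name in ("John", "Mary", "loves")}

# Simulate unbinding noise by flipping the first 200 bits of the stored vector.
noisy_mary = [bit ^ 1 if i < 200 else bit
              for i, bit in enumerate(memory["Mary"])]
restored = cleanup(noisy_mary, memory)
```

Because unrelated random bit vectors of this length differ in roughly half of their positions, a vector with 20% of its bits flipped is still far closer to its original than to any other stored item.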




Figure 3. Building a tensor product representation of John loves Mary

FUTURE TRENDS

CONCLUSION

way: Recursive Auto-Associative Memory (RAAM)
and Vector Symbolic Architectures (VSA). The main
difference between the two approaches lies in the
way that representations are learned: RAAM uses

KEY TERMS

Binary Spatter Codes (BSC): VSA using bit vectors
and element-wise exclusive-or (XOR) or multiplication
for role/filler binding.

Binding: In its most general sense, a term used to
describe the association of values with variables. In
AI and cognitive science, the variables are usually a
closed set of roles (AGENT, PATIENT, INSTRUMENT)
and the values an open set of fillers (entities).

Bundling: VSA operation for combining several 
items into a single item, through vector addition. 

Cleanup Memory: Mechanism required to compensate
for noise introduced by lossy compression in
VSA.

Distributed Representation: A general method
of representing and storing information in which the
representation of each item is spread across the entire
memory, each memory element simultaneously stores
the components of more than one item, and items are
retrieved by their content rather than their address.

Holographic Reduced Representation (HRR):
The most popular variety of VSA; it uses circular
convolution to bind fillers to roles and circular correlation
to recover the fillers or roles from the bindings.

Recursive Auto-Associative Memory (RAAM):
Neural network architecture that uses vector/matrix
multiplication for binding and iterative learning to
encode structures.

Tensor Products: Early form of VSA that uses the
outer (tensor) product as the binding operation,
thereby increasing the dimensionality of the
representations without bound.

Vector Symbolic Architecture (VSA): General
term for representations that use large vectors of random
numbers for roles and fillers, and fast, lossy
compression operations to bind fillers to roles.

INTRODUCTION

Evolutionary algorithms (EA) (Rechenberg, 1965) belong
to a family of stochastic search algorithms inspired
by natural evolution. In recent years, EAs have been used
successfully to produce efficient solutions for a great
number of hard optimization problems (Beasley, 1997).
These algorithms operate on a population of potential
solutions and apply a survival principle according to a
fitness measure associated with each solution to produce
better approximations of the optimal solution. At each
iteration, a new set of solutions is created by selecting
individuals according to their level of fitness and by
applying several operators to them. These operators
model natural processes, such as selection, recombination,
mutation, migration, locality and neighborhood.
Although the basic idea of EA is straightforward,
the solution coding, population size, fitness function
and operators must be defined to suit the
kind of problem to optimize.

Multi-class problems with binary SVM (Support
Vector Machine) classifiers are commonly treated by
decomposition into several binary sub-problems. An open
question is how to properly choose all the models for these
sub-problems in order to obtain the lowest error rate for
a specific SVM multi-class scheme. In this paper, we
propose a new approach to optimize the generalization
capacity of such SVM multi-class schemes. This
approach consists of a global selection of the models for
all sub-problems together and is denoted multi-model
selection. A multi-model selection can outperform the
classical individual model selection used until now in
the literature, but this type of selection defines a hard
optimization problem, because it corresponds to a search
for an efficient solution in a huge space. Therefore, we
propose an adapted EA to achieve that multi-model
selection by defining a specific fitness function and
recombination operator.



BACKGROUND 

The multi-class classification problem refers to
assigning to a feature vector a class from a set of possible
ones. Among all the possible inducers, Support Vector
Machines (SVMs) have particularly high generalization
abilities (Vapnik, 1998) and have become very popular
in the last few years. However, SVMs are binary
classifiers, and several combination schemes were
developed to extend SVMs to problems with more than
two classes (Rifkin & Klautau, 2004). These schemes
are based on different principles: probabilities (Price,
Knerr, Personnaz & Dreyfus, 1994), error-correcting
codes (Dietterich & Bakiri, 1995), correcting
classifiers (Moreira & Mayoraz, 1998) and evidence
theory (Quost, Denoeux & Masson, 2006). All these
combination schemes involve the following three
steps: 1) decomposition of a multi-class problem into
several binary sub-problems, 2) SVM training on all
sub-problems to produce the corresponding binary
decision functions and 3) a decoding strategy to take
a final decision from all binary decisions. Difficulties
lie in the choice of the combination scheme (Duan
& Keerthi, 2005) and in how to optimize it (Lebrun,
Charrier, Lezoray & Cardot, 2005).
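
Steps 1) and 3) can be illustrated with a small Python sketch for the one-versus-one scheme. Majority voting is used here as the decoding strategy purely as an assumption for the example; the function names and the toy decisions are hypothetical, and step 2) (SVM training) is replaced by hand-written binary decisions.

```python
from collections import Counter
from itertools import combinations

def one_versus_one_pairs(classes):
    """Step 1: one binary sub-problem per unordered pair of classes."""
    return list(combinations(classes, 2))

def decode_by_voting(pairwise_decisions):
    """Step 3: return the class that wins the most binary decisions.

    `pairwise_decisions` maps each class pair to the class chosen by the
    corresponding binary decision function. Ties are broken by first
    occurrence, which is an arbitrary choice of this sketch.
    """
    votes = Counter(pairwise_decisions.values())
    return votes.most_common(1)[0][0]

classes = ["a", "b", "c"]
pairs = one_versus_one_pairs(classes)

# Step 2 is assumed done: these are hypothetical binary decisions for one example.
decisions = {("a", "b"): "a", ("a", "c"): "a", ("b", "c"): "b"}
predicted = decode_by_voting(decisions)
```

The sketch also makes the error-correcting role of decoding visible: a wrong decision on the pair ("b", "c") does not change the final prediction as long as class "a" keeps its two votes.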

In this paper, we focus on step 2) when steps 1) and
3) are fixed. For that step, the SVM hyper-parameters (model)
of each binary problem must be properly tuned in
order to obtain a low global multi-class error rate with the
combination of all the binary decision functions involved.
The search for efficient values of hyper-parameters
is commonly designated by the term model selection.
The classical way to optimize multi-class
schemes is an individual model selection for each
related binary sub-problem. This methodology presupposes
that a multi-class scheme based on SVM combination
is optimal when each binary classifier involved in
that combination scheme is optimal on its dedicated
binary problem. But if a decoding
strategy can more or less easily correct the errors of binary
classifiers, then individual model selection on
each binary sub-problem cannot take these
error-correcting possibilities into account. For this main reason, we
believe that another way to optimize
multi-class schemes is a global multi-model selection
for all binary problems together. In fact, the goal is to
have a minimum of errors on the multi-class problem.
The selection of all sub-problem models (multi-model
selection) has to be performed globally to achieve that
goal, even if that means that error rates are not optimal
on all binary sub-problems when they are observed
individually. EA is an efficient meta-heuristic approach
to realize that multi-model selection.



EA MULTI-MODEL SELECTION 

This section is divided into three subsections. The first
subsection presents the multi-model optimization problem for
multi-class combination schemes, with more details
than in the previous section and notations useful for the next
subsections. The second subsection presents our
EA multi-model selection and describes the fitness
estimation of a multi-model and the crossover operator over
multi-models. The third subsection provides the experimental
protocol and the results obtained with our EA multi-model
selection.

Multi-Model Optimization Problem 

A multi-class combination scheme induces several
binary sub-problems. The number $k$ and the nature
of binary sub-problems depend on the decomposition
involved in the combination scheme. For each binary
sub-problem, an SVM must be trained to produce an
appropriate binary decision function $h_i$ ($1 \le i \le k$). The
quality of $h_i$ depends greatly on the selected model
$\theta_i$ and is characterized by the expected error rate $e_i$ for
new datasets with the same binary decomposition.
Each model $\theta_i$ contains all hyper-parameter values
for training an SVM on the dedicated binary sub-problem.
The expected error rate $e_i$ associated with a model $\theta_i$ is
commonly determined by cross-validation techniques. All
the $\theta_i$ models constitute the multi-model $\theta = (\theta_1, \ldots, \theta_k)$.
The expected error rate $e$ of an SVM multi-class
combination scheme is directly dependent on the selected
multi-model $\theta$. Let $\Theta$ denote the multi-model space
for a multi-class problem (i.e. $\forall \theta : \theta \in \Theta$) and $\Theta_i$ the
model space for the $i$-th binary sub-problem. The best
multi-model $\theta^*$ is the one for which the expected error $e$ is
minimum and corresponds to the following optimization
problem:

$$\theta^* = \arg\min_{\theta \in \Theta} e(\theta) \qquad (1)$$

where $e(\theta)$ denotes the expected error $e$ of a multi-class
combination scheme with the multi-model $\theta$. The huge
size of the multi-model space ($\Theta = \prod_{i \in [1,k]} \Theta_i$) makes
the optimization problem (1) very hard. To reduce
the complexity of the optimization problem, it is classic to
use the following approximation:

$$\tilde{\theta} = \{ \arg\min_{\theta_i \in \Theta_i} e_i(\theta_i) \mid i \in [1,k] \} \qquad (2)$$

The hypothesis is made that $e(\tilde{\theta}) \approx e(\theta^*)$.
This hypothesis also supposes that $e_i(\tilde{\theta}_i) \approx e_i(\theta_i^*)$.

Although it is evident that each individual model $\theta_i$ in the best
multi-model $\theta^*$ must correspond to an efficient SVM (i.e.
a low value of $e_i$) on the corresponding $i$-th binary
sub-problem, the best individual models $(\tilde{\theta}_1, \ldots, \tilde{\theta}_k)$, each
selected on its own sub-problem, do not
necessarily define the best multi-model $\theta^*$. The first
reason is that all error rates $e_i$ are estimated with some
tolerance, and the combination of all these deviations can
have a great impact on the final multi-class error rate
$e$. The second reason is that even if all the binary
classifiers of a combination scheme have identical error
rates $e_i$ for different multi-models, these binary classifiers
can produce different binary class predictions for the same
example depending on the multi-model used. Indeed,
the multi-class predictions obtained by combining these binary
classifiers can differ for the same feature vector,
since the correction performed by a given decoding
strategy depends on the nature of the internal
errors of the binary classifiers (mainly, the number of
errors). Thus, multi-class classification schemes with
the same internal errors $e_i$ but different multi-models
$\theta$ can have different generalization capacities. For
all these reasons, we claim that the multi-model optimization
problem (1) can outperform individual model
optimization (2).

Evolutionary Optimization Method 

Within our EA multi-model selection method, a fitness
measure $f$ is associated with a multi-model $\theta$; $f$ is
larger when the error $e$ associated with $\theta$ is smaller, which
makes it possible to solve the optimization problem (1). The fitness
value is normalized in order to have $f = 1$ when the error $e$
is zero and $f = 0$ when the error $e$ corresponds to a random
draw. Moreover, the numbers of examples in each class
are not always well balanced in many multi-class
datasets; to overcome this, the error $e$ corresponds to a
Balanced Error Rate (BER). With regard to these two key
points, the proposed fitness formulation is:

$$f = 1 - \frac{e}{1 - \frac{1}{n_c}} \qquad (3)$$

with $n_c$ denoting the number of classes in the multi-class
problem. In the same way, the internal fitness $f_i$ is
defined as $f_i = 1 - 2e_i$ for the $i$-th binary classifier with
corresponding BER $e_i$.
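
As a hedged illustration of this normalization, the sketch below computes a Balanced Error Rate as the mean of the per-class error rates (a common definition, assumed here because the text does not spell it out) and maps it to a fitness that is 1 for zero error and 0 at the random-draw level $1 - 1/n_c$. Function names and the toy labels are hypothetical.

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates (assumed definition of BER)."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        indices = [i for i, y in enumerate(y_true) if y == c]
        errors = sum(1 for i in indices if y_pred[i] != c)
        per_class.append(errors / len(indices))
    return sum(per_class) / len(per_class)

def fitness(ber, n_classes):
    """Normalized fitness: 1 when ber == 0, 0 when ber equals a random draw."""
    return 1.0 - ber / (1.0 - 1.0 / n_classes)

# Toy binary example with unbalanced classes (four examples of 0, two of 1).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1]
ber = balanced_error_rate(y_true, y_pred)   # (1/4 + 1/2) / 2
internal_fitness = fitness(ber, n_classes=2)  # same as 1 - 2 * ber
```

For two classes the general formula reduces to $1 - 2e_i$, which matches the internal fitness used for the binary classifiers.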

The EA crossover operator for the combination of
two multi-models $\theta^1$ and $\theta^2$ must favor the selection
of the most efficient models in these two multi-models.
It is worth noting that one should not systematically
select all the best models to produce an efficient child
multi-model, as explained in the previous subsection. For
each sub-problem, the internal fitnesses $f_i^1$ and $f_i^2$ are used to
determine the probability

$$p_i = \frac{(f_i^1)^2}{(f_i^1)^2 + (f_i^2)^2} \qquad (4)$$

of selecting the $i$-th model in $\theta^1$ as the $i$-th model in the
child multi-model $\theta$. Here $f_i^j$ denotes the internal fitness
of the $i$-th binary classifier with the multi-model
$\theta^j$. For the child multi-models generated by the
crossover operator, an important advantage is
that no new SVM training is necessary if all the
related binary classifiers have already been trained;
only the BERs of the child
multi-models have to be evaluated. SVM training
is necessary only for the first step of the EA and
when models go through the mutation operator.
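
A crossover of this kind can be sketched as follows. The sketch assumes that the probability of inheriting a model from the first parent is proportional to its squared internal fitness, which is one reading of equation (4); the function name, the tie rule for two zero fitnesses and the example models are hypothetical.

```python
import random

def crossover(parent1, parent2, fitness1, fitness2, rng):
    """Build a child multi-model by picking, for each binary sub-problem,
    the model of parent1 with probability f1^2 / (f1^2 + f2^2).

    Assumption: squared internal fitnesses are used as selection weights;
    when both fitnesses are zero the choice is made with probability 0.5.
    """
    child = []
    for model1, model2, f1, f2 in zip(parent1, parent2, fitness1, fitness2):
        denominator = f1 ** 2 + f2 ** 2
        p = 0.5 if denominator == 0 else f1 ** 2 / denominator
        child.append(model1 if rng.random() < p else model2)
    return child

rng = random.Random(0)
# Hypothetical multi-models: one (C, gamma) pair per binary sub-problem.
theta1 = [(1.0, 0.5), (8.0, 0.125), (2.0, 2.0)]
theta2 = [(4.0, 0.5), (0.5, 8.0), (32.0, 0.03125)]
f1 = [0.9, 0.6, 0.8]   # internal fitnesses of theta1
f2 = [0.7, 0.8, 0.8]   # internal fitnesses of theta2
child = crossover(theta1, theta2, f1, f2, rng)
```

Since the child only recombines models that already exist in the parents, the trained binary classifiers can be reused and only the multi-class BER of the child needs to be evaluated.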

The rest of our EA for multi-model selection is
similar to other EA approaches. First, at the initialization
step, a population of $\lambda$ multi-models is generated at
random. Each model $\theta_j^i$ ($1 \le i \le \lambda$, $1 \le j \le k$) is drawn
uniformly at random from all possible values of the SVM
hyper-parameters. New multi-models are produced
by combining pairs of multi-models selected by a
Stochastic Universal Sampling (SUS) strategy. A fixed
selective pressure $p_s$ is used for the SUS selection. Each
model $\theta_j^i$ has a probability $p_m/k$ of mutating (uniform
random draw, as for the initialization step of the EA). The fitness $f$
of all child multi-models is then evaluated. A second
selection step is used to define the population of the
next iteration of our EA: $\lambda$ individuals are selected by
a SUS strategy (with the same selective pressure $p_s$)
from both the $\lambda$ parents and the $\lambda$ children. They become
the multi-model population of the next iteration. The
number of iterations of the EA is fixed to $n_{max}$. At the end
of the EA, the multi-model with the best fitness $f$ from
all these iterations is selected as $\theta^*$.
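
The selection step can be illustrated with a minimal sketch of Stochastic Universal Sampling. It uses raw fitness values as selection weights; the selective pressure $p_s$ mentioned above (typically applied through a rank-based rescaling of the fitness) is deliberately left out, and the function name and example values are assumptions of this sketch.

```python
import random

def sus_select(fitnesses, n, rng):
    """Stochastic Universal Sampling: pick n indices with a single random
    offset and n equally spaced pointers over the cumulative fitness."""
    total = sum(fitnesses)
    step = total / n
    start = rng.random() * step          # single random draw in [0, step)
    picks = []
    index = 0
    cumulative = fitnesses[0]
    for j in range(n):
        pointer = start + j * step
        # Advance until the pointer falls inside the current individual's slice.
        while cumulative <= pointer and index < len(fitnesses) - 1:
            index += 1
            cumulative += fitnesses[index]
        picks.append(index)
    return picks

rng = random.Random(0)
fitnesses = [4.0, 1.0, 1.0, 2.0]   # hypothetical fitness values
selected = sus_select(fitnesses, n=4, rng=rng)
```

Compared with repeated roulette-wheel draws, the equally spaced pointers guarantee that an individual holding half of the total fitness is selected exactly twice out of four here, whatever the random offset.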

Experimental Results 

In this section, three well-known multi-class datasets
are used: Satimage ($n_c = 6$) and Letter ($n_c = 26$) from the
Statlog collection (Blake & Merz, 1998), and the USPS
($n_c = 10$) dataset (Vapnik, 1998). In (Wu, Lin & Weng,
2004), two sampling sizes of 300/500 and 800/1000
are used to constitute training/testing datasets. For
each sampling size, 20 random splits are generated.
We have used the same sampling sizes and the same
splits for the three datasets: Satimage, Letter and USPS.
Two optimization methods are used for the selection
of the best multi-model $\theta^*$ for each training dataset.
The first one is the classical individual model selection
and the second one is our EA multi-model selection.
For both methods, two combination schemes are used:
one-versus-one and one-versus-all (Rifkin & Klautau,
2004; see endnote 1). For each binary problem, an SVM with Gaussian
kernel $K(u,v) = \exp(-\gamma \|u - v\|^2)$ is trained (Vapnik,
1998). The possible SVM hyper-parameters for
a model are the SVM trade-off constant $C$ (Vapnik, 1998)
and the bandwidth $\gamma$ of the Gaussian kernel function
($\theta_i = (C_i, \gamma_i)$). For all binary problems:
$\theta_i \in \Theta_i = [2^{-5}, 2^{-3}, \ldots, 2^{15}] \times [2^{-5}, 2^{-3}, \ldots, 2^{15}]$.
The individual model space $\Theta_i$ is based
on grid search techniques (Chang & Lin, 2001). The BER
$e$ on a multi-class problem and the BERs $e_i$ on binary
sub-problems are estimated by five-fold cross-validation
(CV). These BER values are used by our EA for the
multi-model selection. The final BER $e$ of a
multi-model selected by our EA is estimated on a test dataset
not used during the multi-model selection process.
Our EA has several constants that must be fixed, and
we have made the following choices: $p_s = 2$, $\lambda = 50$,
$n_{max} = 100$, $p_m = 0.01$.

Table 1. Average BER with individual model selection (column e_classic) and our EA multi-model selection (column
e_EA). Negative values in column Δe (Δe = e_EA - e_classic) correspond to an improvement of the performance of a
multi-class combination scheme when our EA multi-model selection method is used.

                  Size 500                                  Size 1000
                  e_classic      e_EA           Δe          e_classic      e_EA           Δe
one-versus-one
  Satimage        14.7 ± 1.8 %   14.5 ± 2.1 %   -0.2 %      11.8 ± 0.9 %   11.8 ± 1.0 %   -0.0 %
  USPS            12.8 ± 1.2 %   11.0 ± 1.8 %   -1.8 %       8.9 ± 0.9 %    8.4 ± 1.6 %   -0.5 %
  Letter          40.5 ± 3.0 %   35.9 ± 2.9 %   -4.6 %      21.4 ± 1.7 %   18.6 ± 2.1 %   -2.8 %
one-versus-all
  Satimage        14.6 ± 1.7 %   14.5 ± 2.0 %   -0.1 %      11.5 ± 0.8 %   11.6 ± 1.0 %   +0.1 %
  USPS            11.9 ± 1.3 %   11.2 ± 1.5 %   -0.7 %       8.8 ± 1.3 %    8.5 ± 1.6 %   -0.3 %
  Letter          41.9 ± 3.3 %   36.3 ± 3.3 %   -5.6 %      22.1 ± 1.3 %   19.7 ± 1.8 %   -2.4 %
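
The hyper-parameter grid and the Gaussian kernel described above can be written down directly. The following sketch enumerates the grid of (C, γ) pairs with exponents from -5 to 15 in steps of 2 and evaluates the kernel on plain Python lists; the function names are hypothetical and no SVM training is performed.

```python
import math

def hyper_parameter_grid():
    """All (C, gamma) pairs with both values in {2^-5, 2^-3, ..., 2^15}."""
    values = [2.0 ** exponent for exponent in range(-5, 16, 2)]
    return [(c, gamma) for c in values for gamma in values]

def gaussian_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

grid = hyper_parameter_grid()
n_models = len(grid)   # 11 values for C times 11 values for gamma
```

Each binary sub-problem therefore has 121 candidate models, which is the base of the search-space size discussed with the results.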

Table 1 gives the average BER over all 20 splits of the
previously mentioned datasets for each training set size
(row Size of Table 1). This is done for the two combination
schemes (one-versus-one and one-versus-all) and for
the two above-mentioned selection methods (columns
e_classic and e_EA). Column Δe provides the average variation
of BER between our multi-model selection and the
classical one. The results in that column are particularly
important. For two datasets (USPS and Letter), our
optimization method produces SVM combination
schemes with better generalization capacities than the
classical one. That effect appears to be more marked
when the number of classes in the multi-class problem
increases. A reason is that the size of the multi-model search
space increases exponentially with the number $k$ of binary
problems involved in a combination scheme ($121^k$ for
those experiments). This effect is directly linked to the
number of classes $n_c$ and could explain why improvements
are not measurable with the Satimage dataset. In
some way, a classical optimization method explores the
multi-model space blindly, because the cumulative
effect of the combination of $k$ SVM decision functions
cannot be determined without an estimation of $e$. That
effect is emphasized when the estimated BERs $e_i$ are poor
(i.e. when training and testing data sizes are small). Comparing
the Δe values in Table 1 when the training/testing dataset size
changes illustrates this.
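
To give an order of magnitude for this growth, the sketch below computes the number of binary sub-problems $k$ and the resulting search-space size $121^k$ for the three datasets. It assumes the usual decompositions, $k = n_c(n_c - 1)/2$ for one-versus-one and $k = n_c$ for one-versus-all; the function names are hypothetical.

```python
def n_binary_problems(n_classes, scheme):
    """Number k of binary sub-problems for the usual decompositions."""
    if scheme == "one-versus-one":
        return n_classes * (n_classes - 1) // 2
    if scheme == "one-versus-all":
        return n_classes
    raise ValueError("unknown combination scheme: " + scheme)

def search_space_size(n_classes, scheme, models_per_problem=121):
    """Size of the multi-model space: models_per_problem ** k."""
    return models_per_problem ** n_binary_problems(n_classes, scheme)

datasets = {"Satimage": 6, "USPS": 10, "Letter": 26}
for name, n_classes in datasets.items():
    for scheme in ("one-versus-one", "one-versus-all"):
        k = n_binary_problems(n_classes, scheme)
        digits = len(str(search_space_size(n_classes, scheme)))
        print(f"{name:8s} {scheme:15s} k = {k:3d}, 121^k has {digits} digits")
```

Even for the smallest case exhaustive search is out of reach, while individual model selection only examines 121 models per sub-problem, that is $121 \cdot k$ models in total.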



FUTURE TRENDS 

The proposed EA multi-model selection method has
to be tested with other combination schemes (Rifkin
& Klautau, 2004), such as error-correcting output codes,
in order to measure their influence. The effect with other
datasets, covering a wide range in the number of classes,
must also be tested. Adding feature selection (Frohlich,
Chapelle & Scholkopf, 2004) abilities to our EA multi-model
selection is also of importance.

Another key point to take into account is the reduction
of the learning time of our EA method, which is
currently expensive. One way to explore this is to use
a fast CV error estimation technique (Lebrun, Charrier,
Lezoray & Cardot, 2006) for the estimation of BER.



CONCLUSION 

In this paper, a new EA multi-model selection method
is proposed to optimize the generalization capacities 
of SVM combination schemes. The definition of a 
cross-over operator based on internal fitness of SVM 
on each binary problem is the core of our EA method. 
Experimental results show that our method increases 
the generalization capacities of one-versus-one and 
one-versus-all combination schemes when compared 
with individual model selection method. 



REFERENCES 

Beasley, D. (1997). Possible applications of evolutionary
computation. Handbook of Evolutionary Computation.
97/1, A1.2. IOP Publishing Ltd. and Oxford
University Press.

Blake, C., & Merz, C. (1998). UCI repository of machine
learning databases. Advances in Kernel Methods,
Support Vector Learning. University of California,
Irvine, Dept. of Information and Computer Sciences.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library
for support vector machines. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Dietterich, T. G., & Bakiri, G. (1995). Solving Multiclass
Learning Problems via Error-Correcting Output
Codes. Journal of AI Research. (2) 263-286.

Duan, K.-B., & Keerthi, S. S. (2005). Which Is the Best
Multiclass SVM Method? An Empirical Study. Multiple
Classifier Systems. 278-285.

Frohlich, H., Chapelle, O., & Scholkopf, B. (2004).
Feature Selection for Support Vector Machines Using
Genetic Algorithms. International Journal on Artificial
Intelligence Tools. 13(4) 791-800.

Lebrun, G., Charrier, C., Lezoray, O., & Cardot, H.
(2005). Fast Pixel Classification by SVM Using Vector
Quantization, Tabu Search and Hybrid Color Space.
Computer Analysis of Images and Patterns. (LNCS,
Vol. 3691) 685-692.

Lebrun, G., Charrier, C., Lezoray, O., & Cardot, H.
(2006). Speed-up LOO CV with SVM classifier. Intelligent
Data Engineering and Automated Learning.
(LNCS, Vol. 4224) 108-115.

Lebrun, G., Charrier, C., Lezoray, O., & Cardot, H.
(2007). An EA multi-model selection for SVM multiclass
schemes. Computational and Ambient Intelligence.
(LNCS, Vol. 4507) 260-267.

Moreira, M., & Mayoraz, E. (1998). Improved Pairwise
Coupling Classification with Correcting Classifiers.
European Conference on Machine Learning. 160-171.

Price, D., Knerr, S., Personnaz, L., & Dreyfus, G.
(1994). Pairwise Neural Network Classifiers with
Probabilistic Outputs. Neural Information Processing
Systems. 1109-1116.

Quost, B., Denoeux, T., & Masson, M. (2006).
One-against-all classifier combination in the framework of
belief functions. Information Processing and Management
of Uncertainty in Knowledge-Based Systems.
(1) 356-363.

Rechenberg, I. (1965). Cybernetic Solution Path of an
Experimental Problem. Royal Aircraft Establishment
Library Translation.

Rifkin, R., & Klautau, A. (2004). In Defense of
One-Vs-All Classification. Journal of Machine Learning
Research. (5) 101-141.

Vapnik, V. N. (1998). Statistical Learning Theory.
Wiley Edition.

Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability
Estimates for Multi-class Classification by Pairwise
Coupling. Journal of Machine Learning Research. (5)
975-1005.



KEY TERMS 

Cross-Validation: A method of estimating the predictive
error of inducers. The cross-validation procedure splits
the dataset into k equal-sized pieces called folds. k
predictive functions are built, each tested on a distinct
fold after being trained on the remaining folds.
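
The fold construction in this definition can be sketched as an index generator. The interleaved assignment of examples to folds and the function name are choices made for this example; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i, test in enumerate(folds):
        # Train on every fold except the i-th, which is held out for testing.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
```

Averaging the error measured on the k held-out folds gives the cross-validated estimate, for example the five-fold BER used in the experiments above.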

Evolutionary Algorithm (EA): Meta-heuristic 
optimization approach inspired by natural evolution, 
which begins with potential solution models, then it- 
eratively applies algorithms to find the fittest models 
from the set to serve as inputs to the next iteration, 
ultimately leading to a sub-optimal solution which is 
close to the optimal one. 

Model Selection: Model selection for Support
Vector Machines concerns the tuning of SVM
hyper-parameters such as the trade-off constant C and the kernel
parameters.

Multi-Class Combination Scheme: A combination
of several binary classifiers to solve a given multi-class
problem.

Search Space: Set of all possible states that the
problem we want to solve could ever be in.



Support Vector Machine (SVM): SVM maps input 
data in a higher dimensional feature space by using a 
non linear function and finds in that feature space the 
optimal separating hyperplane maximizing the margin 
(that is the distance between the hyperplane and the 
closest points of the training set) and minimizing the 
number of misclassified patterns. 

Trade-Off Constant of SVM: The trade-off constant,
denoted C, sets the importance of increasing
the margin when selecting the optimal hyperplane
relative to reducing prediction errors (i.e.
examples which do not respect the margin distance from
the separating hyperplane).



ENDNOTE

1 More details on the combination schemes used are
given in (Lebrun, Charrier, Lezoray & Cardot,
2007).



INTRODUCTION

Throughout the last decades, one of society's concerns
has been the development of new tools to optimize
every aspect of daily life. One of the mechanisms that
can be applied to this effect is what is nowadays called
Artificial Intelligence (AI). This branch of science
enables the design of intelligent systems, meaning that
they display features that can be associated with human
intelligence, search methods being one of the most
remarkable. Amongst these, Evolutionary Computation
(EC) stands out. This technique is based on the modelling
of certain traits of nature, especially the capacity
shown by living beings to adapt to their environment,
using as a starting point Darwin's Theory of Evolution
following the principle of natural selection (Darwin,
1859). These models search for solutions in an automated
way. As a result, a series of search techniques
which solve problems in an automated and parallel
way has arisen. The most successful amongst these are
Genetic Algorithms (GA) and, more recently, Genetic
Programming (GP). The main difference between them
lies in the way solutions are coded, which implies
certain changes in their processing, even though the
operation of both systems is similar.

Like most disciplines, the field of Civil Engineering
is no stranger to optimization methods, which are
applied especially to construction, maintenance or
rehabilitation processes (Arciszewski and De Jong,
2001) (Shaw, Miles and Gray, 2003) (Kicinger, Arciszewski
and De Jong, 2005). For instance, in Structural
Engineering in general and in Structural Concrete in
particular, there are a number of problems, involving
a large number of factors, which are
solved simultaneously through theoretical studies
based on physical models and through experimental
benchmarks which validate and adjust those studies.
In these cases,
techniques based on Evolutionary Computation are
capable of optimizing constructive processes while
accounting for structural safety levels. In this way,
for each particular case, the type of materials, their
amount, their usage, etc. can be determined, leading
to an optimal development of the structure and thus
minimizing manufacturing costs (Rabunal, Varela,
Dorado, Gonzalez and Martinez, 2005).



GENETIC ALGORITHMS 

At the origin of what is now known as Genetic Algorithms
are the works of John Holland at the end of
the 1960s. He initially named them "Reproductive
Genetic Planning", and it wasn't until the 1970s that
they received the name under which they are known
today (Holland, 1975).

A GA is a search algorithm inspired by the biological
functioning of living beings. It is based upon reproductive
processes and the principle that
better environmentally adapted individuals have more
chances of surviving (Goldberg, 1989).

Like living beings, GAs use the basic unit of heredity,
the gene, to obtain a solution to a problem. The full
set of genes (parameters characterizing the problem) is
the chromosome, and the expression of the chromosome
is a particular individual.

In Computer Science terms, the representation of
each individual is a string, usually binary, assigning
a certain number of bits to each parameter. For each
variable represented, a conversion to discrete values has
to be performed. Obviously, not all parameters have
to be coded with the same number of bits. Each one
of the bits in a gene is usually called an allele. Once the
individuals' genotype (the structure for the creation of
a particular individual) is defined, we are ready to
carry out the evolutionary process that will reach the
solution to the problem posed.

We start with a random set of individuals, called
a population. Each of these individuals is a potential
solution to the problem. This would be the initial population
or zero generation, and successive generations
will be created from it until the solution is reached.
The mechanisms used in the individuals' evolution are
analogous to the functioning of living beings:

Selection of individuals for reproduction. All
selection algorithms are based on the choice of
individuals by giving higher survival probabilities
to those which offer a better solution to the
problem, but allowing worse individuals to be
also selected so that genetic diversity will not be lost.
Unselected individuals will be copied through to
the following generation.

Once the individuals have been selected, crossover
is performed. Typically, two individuals
(parents) are crossed to produce two new ones. A
position is established before which the bits will
correspond to one parent, with the rest belonging
to the other. This crossover is named single-point
crossover, but a number of points could be used,
where bit substrings separated by points would
belong alternately to one or the other parent.

Once the new individuals have been obtained,
small variations in a low percentage of them are
performed. This is called mutation, and its goal
is to carry out an exploration of the state space.

Once the process is over, the new individuals will
be inserted in the new population, constituting the next
generation.

New generations will be produced until the population
has created a sufficiently adequate solution, the
maximum number of generations has been reached,
or the population has converged and all individuals
are equal.
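
The crossover and mutation operators described in this section can be sketched on binary strings as follows. The function names, the bit-flip form of mutation and the per-bit mutation rate are assumptions made for this illustration.

```python
import random

def single_point_crossover(parent1, parent2, point):
    """Swap the tails of two equal-length bit strings after `point`."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[bit] if rng.random() < rate else bit
                   for bit in individual)

rng = random.Random(0)
child_a, child_b = single_point_crossover("00000000", "11111111", point=3)
mutant = mutate(child_a, rate=0.05, rng=rng)
```

With a low mutation rate most children are left unchanged, so mutation acts as an occasional exploration step rather than the main search mechanism.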



GENETIC PROGRAMMING 

Genetic Programming (GP), like GAs, is a search
mechanism inspired by the functioning of living beings.
The main difference between the two methods lies
in the way solutions are coded. In GP, coding is carried
out as a tree structure (Koza, 1990) (Koza, 1992). The
main goal of GP is to produce solutions to problems 
through program and algorithm induction. 

The general functioning is similar to that of the GAs. 
Nevertheless, due to the difference in solution coding, 
great variations exist in the genetic operations of initial 
solution generation, crossover and mutation. The rest 
of operations, selection and replacement algorithms, 
remain the same, as do the metrics used to evaluate 
individuals (fitness). 
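
As a hedged illustration of tree-coded solutions, the sketch below represents a program as nested tuples, evaluates it, and replaces one subtree by another, which is the basic move behind subtree crossover and mutation. The tuple encoding, the small function set and the path convention are assumptions of this example, not a description of a particular GP system.

```python
import operator

# Hypothetical function set: each internal node is (symbol, left, right).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, variables):
    """Evaluate a tree whose leaves are constants or variable names."""
    if isinstance(tree, tuple):
        symbol, left, right = tree
        return FUNCTIONS[symbol](evaluate(left, variables),
                                 evaluate(right, variables))
    if isinstance(tree, str):
        return variables[tree]
    return tree

def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced.

    `path` is a list of child positions (1 = left, 2 = right) from the root.
    """
    if not path:
        return new_subtree
    node = list(tree)
    node[path[0]] = replace_subtree(tree[path[0]], path[1:], new_subtree)
    return tuple(node)

program = ("+", ("*", "x", 2), 3)                 # encodes x * 2 + 3
donor_subtree = ("-", "x", 1)                     # encodes x - 1
offspring = replace_subtree(program, [1], donor_subtree)
```

Evaluating `program` with x = 4 gives 11, while `offspring`, in which the left branch was replaced by the donor subtree, encodes (x - 1) + 3 and evaluates to 6.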

We will now describe two cases where both techniques
have been applied. They refer to questions
related to Structural Concrete, approached both as
a material and as a structure.

EXAMPLE 1 : Procedure to determine optimal mixture 
proportion in High Performance Concrete. 

High Performance Concrete (HPC) is a type of concrete 
designed to attain greater durability together with high 
resistance and good properties in its fresh state to al- 
low ease of mixing, placing and curing (Forster, 1994) 
(Neville and Aitcin, 1998). Its basic components are 
the same as those of ordinary concrete, supplemented in 
varying quantities by additions (fly ash or silica fume, 
byproducts of other industries that display pozzolanic 
properties) plus air-entraining and/or fluidifying 
admixtures. Indeed, an adequate proportion 
of these components with quality cement, water and 
aggregates produces concrete with excellent behavior, 
generally after an experimental process to adjust the 
optimal content in each material. When very high re- 
sistance is not a requirement, the addition introduced is 
fly ash (FA) ; air-entraining admixtures (AE) are used 
to improve behavior in frost/defrost situations. When 
high resistance is needed, FA is substituted by a silica 
fume (SF) addition, eliminating AE altogether. In every 
case high cement contents and low water/binder (W/B) 
ratios are used (both cement and pozzolanic additions 
are considered binders). 

A number of mixture proportioning methods exist, 
based on experimental approaches and developed by 
different authors. The product of such mixtures can 
be controlled through various tests, two of which are 
particularly representative of the fresh and hardened 
states of concrete: the measurement of workability and 
evaluation of compressive strength, respectively. The 
first one is carried out through the slump test, where 
the subsidence of a molded specimen after removal of 
the mold is measured. A high value ensures adequate 
placing of concrete. The second test consists of 
compressing hardened concrete until failure 
occurs; HPC can resist compressive stresses in the 
range of 40 to 120 MPa. 

The goal of mixture proportioning methods is to 
adjust the amount of each component to obtain slump 
and strength values within a chosen range. There is a 
large body of experience, and basic mixtures which 
require some experimental adjustment are available. 
It is difficult to develop a theoretical model to predict 
slump and resistance of a particular specimen, though 
roughly adequate fitted curves do exist (Neville, 1981). 
It is even harder to approach theoretically the reverse 
problem, that is, to unveil mixture proportioning starting 
from the goal results (slump and strength). 

Chul-Hyun Lim et al. (Chul-Hyun, Young-Soo and 
Joong-Hoon, 2004) have developed a GA based appli- 
cation to determine the relationship between different 
parameters used to characterize a HPC mixture. Two 
types of mixture are considered regarding their goal 
strength: mixtures that reach values between 40 and 
80 MPa and mixtures that reach between 80 and 120 
MPa. If a good database is available, it is not hard to 
obtain correlations between different variables to predict 
slump and strength values for a particular specimen. 
Nevertheless, the authors use GAs to solve the reverse 
problem, that is, to obtain mixture parameters when 
input data are slump and strength. 

To that effect they use a database with 104 mixtures 
in the 40 to 80 MPa range and 77 in the 80 to 120 MPa 
range. In the first group, essential parameters are, ac- 
cording to prior knowledge, W/B ratio (%), amount of 
water (W, kg/m 3 ), fine aggregate to total aggregate ratio 
(s/a, %), amount of AE admixture (kg/m 3 ), amount of 
cement replaced by FA (%), and amount of high-range 
water-reducing admixture (superplasticizer, SP, kg/m 3 ). 
Tests on the different mixtures give slump values (be- 
tween and 300 mm) and compressive strength. In the 
second group, essential parameters are W/B, W, s/a, 
amount of cement replaced by silica fume (SF, %) and 
SP. Tests give out slump and resistance values. 



Using multiple regression techniques, the authors 
first obtain, for each group, two fitted curves 
that predict slump and strength from the starting variables. 
GAs are used to solve the reverse problem. For each 
group, they first reach an individual which determines 
optimal W/B, W, s/a, FA and AE, or W/B, W, s/a and 
SF for a specific compressive strength. From these 
parameters, using the prediction curve previously 
obtained, the optimal SP value for a particular slump 
is calculated. 

The genetic algorithms used are based on programs 
by Houck et al. (Houck, Joines, and Kay, 1996). In this 
case, "ranking selection based on normalized geometric 
distribution" has been used for individual selection. 
One-point, two-point, and uniform crossover algorithms 
are used. Different strategies are used for mutation 
operations: boundary mutation, multi-nonuniform 
mutation, nonuniform mutation, and uniform mutation. 
The first trials were carried out with an initial 
population of only 15 individuals, which led to local 
minima only. Optimal results are obtained by increasing 
the population to 75 individuals. 
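As an illustration of the selection scheme named above, normalized geometric ranking is commonly stated as follows: individuals are sorted from best to worst, and the individual of rank r (r = 1 for the best) is selected with probability q'(1 - q)^(r - 1), where q' = q / (1 - (1 - q)^N) and N is the population size. The sketch below assumes that formulation; the value of q and the function names are illustrative and are not taken from the cited study.

```python
import random

def normalized_geometric_probabilities(pop_size, q=0.08):
    """Selection probability for ranks 1..pop_size (q is an assumed parameter)."""
    q_norm = q / (1.0 - (1.0 - q) ** pop_size)
    return [q_norm * (1.0 - q) ** rank for rank in range(pop_size)]

def ranking_selection(population, fitness, n_select, q=0.08, seed=None):
    """Draw n_select individuals, favouring the best ranked but excluding none."""
    rng = random.Random(seed)
    ranked = sorted(population, key=fitness, reverse=True)  # best first
    weights = normalized_geometric_probabilities(len(ranked), q)
    return rng.choices(ranked, weights=weights, k=n_select)
```

Because the probabilities depend only on rank, a single individual with a very large fitness value cannot dominate the selection, which helps to preserve diversity in small populations.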

It should be pointed out that the reverse problem is 
thus solved in two phases. In the first phase all compo- 
nent amounts are fixed except for SP, with compressive 
strength as a target; following this, SP is fixed to attain 
the desired slump. Initial fitting functions are used as 
a simple but accurate approach to the real database, to 
avoid using it directly, which would make solving the 
reverse problem a more difficult task. 
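The two-phase idea can be sketched as follows. Everything numerical here is hypothetical: the two fitted curves, their coefficients and the parameter bounds are invented for illustration, only three mixture parameters are kept, a plain random search stands in for the GA of the first phase, and bisection is assumed for the second phase because the invented slump curve increases with SP.

```python
import random

# Hypothetical fitted curves (invented coefficients, not those of the study).
def predicted_strength(wb, water, sa):
    """Compressive strength in MPa from W/B (%), water (kg/m3) and s/a (%)."""
    return 150.0 - 2.0 * wb - 0.05 * water - 0.1 * sa

def predicted_slump(wb, water, sa, sp):
    """Slump in mm; sp is the superplasticizer content (kg/m3)."""
    return 2.0 * wb + 0.5 * water + 0.5 * sa + 8.0 * sp - 50.0

# Assumed search bounds for the phase-one parameters.
BOUNDS = {"wb": (25.0, 45.0), "water": (150.0, 180.0), "sa": (35.0, 50.0)}

def phase_one(target_strength, trials=5000, seed=0):
    """Fix every parameter except SP (random search standing in for the GA)."""
    rng = random.Random(seed)
    best, best_error = None, float("inf")
    for _ in range(trials):
        candidate = {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        error = abs(predicted_strength(**candidate) - target_strength)
        if error < best_error:
            best, best_error = candidate, error
    return best

def phase_two(mix, target_slump, sp_range=(0.0, 30.0), tol=1e-6):
    """Find SP by bisection; assumes the target lies inside sp_range."""
    lo, hi = sp_range
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if predicted_slump(sp=mid, **mix) < target_slump:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A call such as `mix = phase_one(70.0)` followed by `phase_two(mix, 200.0)` first fixes the mixture for a 70 MPa target and then adjusts SP for a 200 mm slump, mirroring the order of the two phases.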

For the first group, highest errors correspond to 
AE and SP determination, up to 12.5% and 15% re- 
spectively. Errors committed in the second group are 
smaller. In any case, errors are relatively minor since 
these materials are included as admixtures in very small 
amounts. As a conclusion for this example, it is interest- 
ing to point out that the procedure is not only a useful 
application of GAs to concrete mixture proportioning, 
but also constitutes by itself a new mixture proportion- 
ing method that requires GAs for its application. 

EXAMPLE 2: Determination of shear strength in re- 
inforced concrete deep beams 

Deep beams are those that can be considered short (span, 
L) in relation to their depth (h). There is no consensus 
in international codes as to what is the span-to-depth 
threshold value dividing conventional and deep beams. 
In the example shown here, L/h ratios between 0.9 and 
4.14 are considered for simply supported elements 
working under two point loads, as shown in Figure 1. 
Shear failure is a beam collapse mode produced 
by the generation of excessive shear stresses in the 
vicinity of the supports. Shear strength depends on various 
parameters commonly used for beam design (see Figure 1): 
a/h; L/h; section area ($A_{sv}$ and $A_{sh}$), strength 
($f_{yv}$ and $f_{yh}$) and spacing ($s_v$ and $s_h$) of the 
vertical and horizontal steel rebars placed in the failure 
zone, which will be respectively called vertical and 
horizontal transverse reinforcement; section area 
($A_{sb}$ and $A_{st}$) and strength ($f_{yb}$ and $f_{yt}$) 
of longitudinal rebars placed in the lower and upper 
zones of the beam, which will be respectively called 
bottom and top reinforcement; compressive strength 
($f_c$) of the concrete used in the beam, and width (b) 
of the latter. 

Figure 1. Deep beam parameters 

The dependence of shear strength on these param- 
eters is known through multiple studies (Leonhardt and 
Walther, 1970) and it can be presented in a normalized 
form through simple relationships. Relevant parameters 
are reduced to the following: 

$$
\begin{aligned}
X_1 &= a/h \\
X_2 &= L/h \\
X_3 &= (A_{sv} f_{yv})/(b\, s_v f_c) \\
X_4 &= (A_{sh} f_{yh})/(b\, s_h f_c) \\
X_5 &= (A_{sb} f_{yb})/(b\, h f_c) \\
X_6 &= (A_{st} f_{yt})/(b\, h f_c)
\end{aligned}
$$



Normalized shear strength can be written as 
$R = P/(b\, h f_c)$, where P is failure load (Figure 1). 
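The normalization can be sketched in a few lines. The sketch assumes the ratios read as X1 = a/h, X2 = L/h, X3 = A_sv f_yv / (b s_v f_c), X4 = A_sh f_yh / (b s_h f_c), X5 = A_sb f_yb / (b h f_c) and X6 = A_st f_yt / (b h f_c); the class and field names are hypothetical, and consistent units (for example N, mm and MPa) are assumed.

```python
from dataclasses import dataclass

@dataclass
class DeepBeam:
    a: float      # shear span, as shown in the beam figure
    h: float      # beam depth
    L: float      # span
    b: float      # beam width
    f_c: float    # concrete compressive strength
    A_sv: float   # vertical transverse reinforcement area
    f_yv: float   # vertical transverse reinforcement strength
    s_v: float    # spacing of vertical rebars
    A_sh: float   # horizontal transverse reinforcement area
    f_yh: float   # horizontal transverse reinforcement strength
    s_h: float    # spacing of horizontal rebars
    A_sb: float   # bottom reinforcement area
    f_yb: float   # bottom reinforcement strength
    A_st: float   # top reinforcement area
    f_yt: float   # top reinforcement strength

def normalized_parameters(beam):
    """Return the six dimensionless ratios X1..X6 for one beam."""
    return {
        "X1": beam.a / beam.h,
        "X2": beam.L / beam.h,
        "X3": beam.A_sv * beam.f_yv / (beam.b * beam.s_v * beam.f_c),
        "X4": beam.A_sh * beam.f_yh / (beam.b * beam.s_h * beam.f_c),
        "X5": beam.A_sb * beam.f_yb / (beam.b * beam.h * beam.f_c),
        "X6": beam.A_st * beam.f_yt / (beam.b * beam.h * beam.f_c),
    }

def normalized_shear_strength(failure_load, beam):
    """R = P / (b h f_c)."""
    return failure_load / (beam.b * beam.h * beam.f_c)
```

Working with dimensionless ratios lets tests on beams of different sizes and materials be pooled in a single database.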

Ashour et al. (Ashour, Alvarez and Toropov, 2003) 
undertake the task of finding an expression that can 
predict shear strength from these variables. It is not easy 
to develop an accurate mathematical model capable of 
predicting shear strength. Usual curve fitting techniques 
do not produce good results either, though they have 
been the base of diverse international codes that include 
design prescriptions for these structural elements. GP 
appears to be a tool that can reach consistent results. 
The authors develop various expressions with differ- 
ent complexity levels, obtaining different degrees of 
accuracy. A database of 141 tests available in scientific 
literature is used. 

A remarkable feature of GP techniques is that if the 
possibility of using complex operators is introduced, 
fitting capacity is impressive when variables are well 
distributed and their range is wide. Notwithstanding, if 
the goal of the process is to obtain an expression that 
can be used by engineers, one of its requirements is 
simplicity. The authors of this study take into account 
this premise and only choose as operators addition, 
multiplication, division and squaring. A first application 
leads to a complex expression that reveals the very low 
influence of parameter $X_2$, which is thus eliminated. 
With the 5 remaining variables, a simple and accurate 
expression is obtained (root mean square -RMS- train- 
ing error equal to 0.033; average ratio between predicted 
and real R equal to 1.008 and standard deviation equal 
to 0.23). Validation of this expression is performed 
with 15 new tests and also found to be accurate: RMS 
error equal to 0.035; average ratio between predicted 
and real R values equal to 1.11 and standard deviation 
equal to 0.21. 
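The three figures quoted for training and validation can be computed as in the sketch below. It assumes that the RMS error is taken over the differences between predicted and measured R, and that the standard deviation refers to the predicted-to-measured ratio and is the population form; the source does not state these details.

```python
import math
import statistics

def fit_metrics(predicted, measured):
    """RMS error of R plus mean and standard deviation of predicted/measured."""
    errors = [p - m for p, m in zip(predicted, measured)]
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    ratios = [p / m for p, m in zip(predicted, measured)]
    return {
        "rms": rms,
        "mean_ratio": statistics.mean(ratios),
        "std_ratio": statistics.pstdev(ratios),  # assumption: population form
    }
```

A mean ratio close to 1 indicates an unbiased expression, while the standard deviation of the ratio measures its scatter, which is the quantity a design code would need in order to set a safety factor.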

As a conclusion to this example, it can be pointed 
out that expressions obtained through GP bring the 
added value of acting as virtual laboratories. 
Indeed, by fixing one or several variables, it is possible 
to study the influence on the response of the variation 
of a specific one, and also to determine which variables 
are most important in the studied phenomenon (in this 
case, $X_1$ and $X_5$). 
Finally, GP techniques prove to be a powerful tool 
for the development and improvement of codes that 
regulate concrete structure design. Even when the 
precise physical mechanism is unknown, GP allows for 
the development of accurate expressions that can be 
factored a posteriori to reach safety levels associated 
to an acceptable failure probability. 



CONCLUSION 

Evolutionary Computation is a valid technique for 
optimization and regression problems in the fields 
of Structural Engineering in general and Structural 
Concrete in particular. 

In the first of the two examples analyzed, GAs have 
been used to determine optimal mixture proportion 
for High Performance Concrete, using as target data 
its compressive strength and workability. In this case, 
evolutionary techniques show their power by solving 
the reverse problem, producing a new mixture propor- 
tioning method for concrete. 

In the second example, GP techniques were used 
to accurately predict structural response of concrete 
beams from benchmark experimental data series. The 
advantage brought forth by Evolutionary Computation 
is the capacity to analyze physically complex phe- 
nomena, by creating a "virtual laboratory". The line 
of work opened towards the improvement of design 
codes and rules sometimes purely based on testing is 
also of great importance. 

Application of EC techniques is growing expo- 
nentially in this field thanks to fruitful collaboration 
between EC and Structural Engineering experts. 

KEY TERMS 

Compressive Strength: The measured maximum 
resistance of a concrete or mortar specimen to axial 
compressive loading; expressed as force per unit cross- 
sectional area; or the specified resistance used in design 
calculations. 

Deep Beam: A flexural member whose span-to- 
depth ratio is too low to accurately apply the principles 
of sectional design through sectional properties and 
internal forces. 

Shear Strength: The maximum shearing stress a 
flexural member can support at a specific location 
as controlled by the combined effects of shear 
forces and bending moment. 

High Performance Concrete: Concrete meeting 
special combinations of performance and uniformity 
requirements that cannot always be achieved routinely 
using conventional constituents and normal mixing, 
placing, and curing practices. 

Mixture Proportion: The proportions of ingredi- 
ents that make the most economical use of available 
materials to produce mortar or concrete of the required 
properties. 

Slump: A measure of consistency of freshly mixed 
concrete, mortar, or stucco equal to the subsidence 
measured to the nearest 1/4 in. (6 mm) of the molded 
specimen immediately after removal of the slump 
cone. 

Superplasticizer or High-Range Water-Reducing 
Admixture: A water-reducing admixture capable of 
producing large water reduction or great flowability 
without causing undue set retardation or entrainment 
of air in mortar or concrete. 

Workability: That property of freshly mixed con- 
crete or mortar that determines the ease with which it 
can be mixed, placed, consolidated, and finished to a 
homogenous condition. 



INTRODUCTION 

E-learning and the impact of new technologies across 
contemporary life form a very significant field for 
education. The challenge that technology poses to conventional 
learning patterns cannot be ignored and in itself raises 
a host of questions: can online learning facilitate deep 
learning? How well does video conferencing alleviate 
the challenge of distance? In what ways can collabora- 
tive learning communities be developed and sustained 
using current and new technologies? At the same time, 
new communications technologies are impacting on the 
ways in which we understand ourselves and the worlds 
in which we live. In this context, the aim of today's 
education is not to learn certain contents, but rather 
to learn how to learn over the course of a whole lifetime. 

The study of the learning process can help us to find 
the relevant points to set up some interesting character- 
istics of a really functional e-learning system. 



THE LEARNING PROCESS 

The learning process consists of a modification of our 
conduct that, by extracting knowledge from acquired 
experience, enables us to tackle problems (Pedreira, 
2004a). This definition highlights the two basic aspects 
of all learning processes: knowledge acquisition, and 
the experience that leads to it. 

Most studies on the nature of knowledge agree on 
the fact that knowledge is at the top of the hierarchical 
structure called information. According to this vision, 
data represent facts or concepts in a formalised way that 
allows their communication, interpretation or elabora- 
tion by human beings or by automatic means (syntactic 
level of the information). The so-called "news" is the 
meaning that an intelligent being attaches to data based 
on the conventional rules used for their representation 
(semantic level). Knowledge implies the judgement of 
facts and situations, and consists of inferred data and 
news, tacit relations between objects, concepts, events 
and situations, and of the necessary control actions to 
manage all these elements in an effective way. As such, 
knowledge concerns the pragmatic aspect of informa- 
tion because it combines the received news with the 
knowledge that the observer already possesses. 



EDUCATION IN KNOWLEDGE SOCIETY 

In recent years, so many changes have affected education 
that education itself needs to be updated. The amount 
of knowledge that we deal with is much bigger than 
before, the interrelations between different forms of 
information are much more complex, and the sources 
are dispersed. Such being the case, the linear model, 
in which each question has a place and a moment, is 
no longer adequate for today's information. Logical 
hierarchies are replaced by multiple and simultaneous 
media that respond to the needs of the knowledge proc- 
ess. The inevitable increase in complexity and quantity 
of the information that is available and necessary has 
led to a need for continuous learning. 

Furthermore, in modern society, knowledge is not 
exclusively related to education. We live in what is 
called the "information or knowledge society", where 
the possession of knowledge is a determining factor. 

Knowledge handling requires a profound transformation 
of learning and teaching methods: from a model 
in which the teacher is the monopolising agent and 
the authorised representative of knowledge, we must 
move towards a model that offers the student room for 
individual exploration and self-learning. The student 
needs to build relations, discover the process from 
within, and feel stimulated to draw his own roadmap 
(Piaget, 1999). 

This kind of learning can only be obtained through 
action strategies that are not perceived as restricting 
obligations but rather as interesting learning options. 
Contents, for instance, should be presented not as 
an object of study but rather as necessary elements 
towards a series of objectives that will be discovered in 
the course of various tests. Computer games apply the 
same strategy by making their users learn to proceed 
from one phase to another based on acquired experience 
and improved dexterity. In this way they keep users 
entertained for hours on end through trial and error. 

Besides, students come from different environments 
and have different ages and educational backgrounds, 
which makes it more complicated to integrate them 
into one single group. Real personalised attention 
would require many more teachers and much more 
time. Add to that the increasing demand for continu- 
ous education, with flexible timetables and subjects, 
and it becomes clear that the current programmes are 
much too rigid. 

The advantages of e-learning include convenience 
and portability (physical and temporal flexibility), cost 
and selection (wide range of courses and prices, different 
levels), individualisation and a higher level of 
student involvement (WorldWideLearn, 2007). 

However, if the contents of the learning platforms 
remain the same as those of traditional systems, even 
if their presentation format is adapted, they do not 
substantially contribute to the improvement of the 
learning process (Martinez, 2002). The same happens 
with the use of computational systems that support ex 
cathedra teaching and improve the acquisition of certain 
skills, such as simulators and games. Simulators can 
only be used when certain concepts are already clearly 
understood, and in most cases, their interface is quite 
complicated. Computer games are mostly used for 
concrete aspects and in elementary courses. 

Instructional Design for e-Learning has been perfected 
and refined over many years using established 
teaching principles, with many benefits to students, but 
further research in this area is necessary 
because the results are still not as good as desired. 



NEW TECHNOLOGIES PROPOSAL 

Even so, current communication technologies, including 
Artificial Intelligence, allow the implementation of 
learning strategies based on action (e.g. videogames), 
the incorporation of systems that improve knowledge 
management (Wiig, 1995), the recovery of the 
one-to-one learning model (master-apprentice becomes 
teacher-student), and the implementation of a new 
learning model ("many teachers for one student"). A 
computer model including all these characteristics can 
be a solid basis for the improvement of the learning 
process and the existing e-learning systems. It could 
teach the students more than just certain contents: it 
could teach them how to learn, by selecting and sharing 
the appropriate information at each moment. 

In this section, we highlight some pedagogical 
characteristics of e-learning computer models that 
are known to improve the learning process. For each 
of these characteristics we propose a feature that can 
be implemented by using New Technologies. 

Pedagogical Characteristic 1 

Dealing with information from different sources allows 
students to see different points of view on the 
same realities, making them easier to understand and 
to remember. 

New Technologies Feature 1 

In the Institutional Memory of a Knowledge Management 
System, we will find all the information concerning 
every thematic unit, its different levels and its associated 
tasks. Being able to solve different tasks 
and having access to information from different 
sources allows the learner to acquire the information 
by different means, so that his knowledge will be more 
complete and longer lasting. 

Pedagogical Characteristic 2 

An e-learning model should provide individual attention, 
taking into account each student's preferences 
about learning strategies and kinds of materials, 
previous knowledge, etc. 

New Technologies Feature 2 

To achieve this, intelligent agents can take 
charge of selecting and showing, in each case, the suitable 
information (from the information repository) 
according to the preferences and level of each student. 
These agents can perform different tasks, which are 
divided between the users' computers and the server where 
the Institutional Memory is stored. 
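A minimal sketch of such a selection step is given below. The data model is hypothetical: the `Material` and `StudentProfile` classes, the numeric level scale and the ranking rule (preferred format first, then the most advanced material the student can handle) are assumptions introduced only to illustrate how an agent might filter the repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Material:
    title: str
    unit: str     # thematic unit the material belongs to
    level: int    # difficulty level (assumed scale, 1 = introductory)
    fmt: str      # presentation format, e.g. "text" or "video"

@dataclass(frozen=True)
class StudentProfile:
    level: int
    preferred_formats: tuple  # ordered from most to least preferred

def select_materials(repository, profile, unit):
    """Return the materials of a unit that suit the student, best match first."""
    candidates = [m for m in repository
                  if m.unit == unit and m.level <= profile.level]

    def rank(material):
        formats = profile.preferred_formats
        fmt_rank = formats.index(material.fmt) if material.fmt in formats else len(formats)
        return (fmt_rank, -material.level)

    return sorted(candidates, key=rank)
```

In a deployed system the repository would live on the server holding the Institutional Memory, while the profile could be maintained by an agent on the student's own computer.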

Pedagogical Characteristic 3 

An e-learning model must provide the students with all the 
available information, in different formats and coming 
from different sources, so that they learn how to 
choose the most relevant elements for their learning. 

New Technologies Feature 3 

To achieve this, a global ontology can be used, establishing 
a classification in levels and the relationships 
between the available information, managed through 
the Knowledge Management System. 

Pedagogical Characteristic 4 

Computer e-learning models ought to organise 
learning by means of assignments and problems to 
solve, so that the students' knowledge grows as they 
progress in solving them. 

New Technologies Feature 4 

It is necessary to establish many different kinds of 
assignments, at different levels, for each unit the student 
must prepare. The tasks, as well as the available information, 
will be found in the Institutional Memory, 
organized under an ontology, which makes access to 
the relevant information easy at any moment. By carrying out 
the tasks, the learner will build his own knowledge 
(Nonaka, 1995). 

Pedagogical Characteristic 5 

An e-learning model should incorporate strategies to build 
and raise the students' motivation and encourage their 
inquisitiveness, relating the available information to 
their interests, offering the possibility of exploring 
the same or related subjects in greater depth and using the 
computer game strategies that encourage investigation. 

New Technologies Feature 5 

When the contents of the course are part of an Institutional 
Memory, the existence of a global ontology can 
facilitate the display of the elements, highlighting the 
connections between them. Besides, as stated previously, 
the use of intelligent agents allows us to show these 
connections according to individual preferences. 
The strategies used in computer games, including 
learning through action, will help to attract 
and maintain the student's interest. 



FUTURE WORKS 

Some prototypes for the aspects mentioned in the previous 
section have been developed in our research laboratory 
for testing the proposed features (Pedreira, 2004a, 
2004b, 2005a, 2005b). Each of them has achieved quite 
good results. These first approaches show that the use of 
New Technologies in education allows the students to 
extend or improve their problem-solving methods and 
their abilities to transfer knowledge (Friss de Kereki, 
2004). We are now working on 
integrating the prototypes and extending them with 
some characteristics that have not yet been tested. 



CONCLUSION 

In this article we suggest several features that e-learning 
systems should have in order to improve online learn- 
ing, which can be achieved by using New Technolo- 
gies. In short, we propose a computer model based on 
a Knowledge Management System which, by using 
a global ontology, maintains the highest quantity of 
relationships between the available information and 
its classification at different levels. With this 
knowledge support, learning can be organised 
through the proposal of tasks, based on computer game 
strategies. Using the philosophy of intelligent agents, 
these systems can interact with the students, showing 
them the information according to their preferences, in 
order to motivate them and to stimulate their capacity 
to raise questions. 

KEY TERMS 

Best Practice: A management idea which asserts 
that there is a technique, method, process, activity, 
incentive or reward that is more effective at delivering 
a particular outcome than any other technique, method 
or process. 

Computer Game: A video game played on a per- 
sonal computer, rather than on a video game console 
or arcade machine. 

Computer Model: A computer program that at- 
tempts to simulate an abstract model of a particular 
system. 

E-Learning: Learning that is accomplished over the 
Internet, a computer network, via CD-ROM, interactive 
TV, or satellite broadcast. 

Intelligent Agent: A real time software system that 
interacts with its environment to perform non-repetitive 
computer-related tasks. 

Knowledge Management: The collection, orga- 
nization, analysis, and sharing of information held by 
workers and groups within an organization. 

Learn to Learn: In this context, learn to manage 
(select, extract, classify) the great amount of information 
existing in today's society, in order to identify real 
and significant knowledge. 

New Technologies: In this context, Computer, 
Information and Communication Technologies. 

Virtual: Not physical. 



INTRODUCTION 

The world of Virtual Environments and Immersive 
Technologies (Sutherland, 1965) (Kalawsky, 1993) 
is evolving quite rapidly. As the range and complexity 
of applications increases, so does the requirement 
for intelligent interaction. The now relatively simple 
environments of the OZ project (Bates, Loyall & Reilly, 
1992) have been superseded by Virtual Theatres (Doyle 
& Hayes-Roth, 1997) (Giannachi, 2004), Tactical Combat 
Air (Jones, Tambe, Laird & Rosenbloom, 1993) 
training prototypes and Air Flight Control Simulators 
(Wangermann & Stengel, 1998). 

This article presents a brief summary of present 
and future technologies and emerging applications 
that require the use of AI expertise in the area of 
immersive technologies and virtual environments. 
The applications are placed within a context of prior 
research projects. 



BACKGROUND 

Visualisation is defined as the use of computer-based, 
interactive visual representations of data to amplify 
cognition. The much cited process driven visualisation 
pipeline proposed by Upson et al (1989) is shown in 
Figure 1. Upson and his colleagues define three processes 
consisting of filtering, mapping and rendering 
the data. The image presented allows the user to draw 
some inference and gain insight into the data. 



The Filter process is when data of interest are derived 
from the raw input data; for example, an interpolation 
of scattered data onto a regular grid. This data is then 
Mapped into geometric primitives that can then be 
Rendered and displayed as an image to the user. The 
user may then gain an improved understanding and 
greater insight into the original raw data. The type of 
data and application area heavily influence the nature 
of the mapping process, that is, the choice of the actual 
visualisation technique to be used. For 
example, if the data consists of 1D scalar data, then a 
simple line graph can be used to represent the data. If 
the filtered data consists of 3D scalar data, then some 
form of 3D isosurfaces or direct volume rendering 
technique would be more appropriate. Through the 
various specifications and conceptualisations of the 
filter-map pipeline above, we would propose an ontol- 
ogy that describes the relationships between data type 
and mapping processes that facilitates the automatic 
selection of visualisation techniques based on the raw 
data type. As the applications become more sophisti- 
cated the visualisation process can make use of the data 
ontology to drive AI controlled characters and agents 
appropriate for the application and data. 
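The filter, map and render stages, together with an automatic choice of technique driven by data type, can be sketched as below. The rule table is a toy stand-in for the proposed ontology: its two entries are the examples given in the text, and the function names and the string-based "rendering" are assumptions made for illustration.

```python
# Toy stand-in for the ontology: (kind of data, dimensions) -> technique.
MAPPING_RULES = {
    ("scalar", 1): "line graph",
    ("scalar", 3): "isosurface or direct volume rendering",
}

def select_technique(kind, dimensions):
    """Look up the visualisation technique for a data type."""
    try:
        return MAPPING_RULES[(kind, dimensions)]
    except KeyError:
        raise ValueError(f"no technique known for {dimensions}D {kind} data")

def filter_stage(raw, keep):
    """Derive the data of interest from the raw input."""
    return [sample for sample in raw if keep(sample)]

def map_stage(samples, kind, dimensions):
    """Attach the chosen technique to the filtered samples."""
    return {"technique": select_technique(kind, dimensions), "primitives": samples}

def render_stage(mapped):
    """Placeholder for image generation: describe what would be drawn."""
    return f"{mapped['technique']} with {len(mapped['primitives'])} primitives"
```

For example, `render_stage(map_stage(filter_stage(data, keep), "scalar", 1))` runs the three stages in order for a 1D scalar data set. A fuller ontology would replace the dictionary with richer relationships between data types and mapping processes.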

Figure 1. Upson et al's visualisation pipeline 

A starting place for this can be seen in the area 
of believable agents (Bates, Loyall & Reilly, 1992), 
where the research ranges from animation issues to 
models of emotion and cognition and annotated 
environments. Innovative learning environments and Animated 
Pedagogical Agents (Johnson, Rickel & Lester, 2000) 
provide further areas for development, as do industrial 
applications, for example a virtual training prototype 
system for Computer Numerical Control (CNC) milling 
operations (Lina, Yeb, Duffy & Sue, 2002). Teaching 
environments using multiple interface devices in virtual 
reality include Steve (Soar Training Expert for Virtual 
Environments), which supports the learning process 
(Rickel & Johnson, 1999), with collaborators including 
Lockheed Martin AI Center. SOAR (Laird, Hucka & 
Huffman, 1991) has also been used for training simulation 
in air combat (TacAir-Soar) (Jones, Tambe, Laird 
& Rosenbloom, 1993). This autonomous system for 
modelling the tactical air domain brings together areas 
of AI research covering cognitive architectures, Human 
Behavior Representation (HBR) and Computer Generated 
Forces (CGF). SOF-Soar (Special Operations Forces 
Modeling) (Tambe, Johnson, Jones, Koss, Laird, Rosenbloom 
and Schwamb, 1995) uses the same underlying 
framework and methods for behavior generation based 
on the Soar model of human cognition; each entity is 
capable of autonomous, goal-directed decision-making, 
planning, and reactive, real-time behavior. As the 
world of digital media expands, emerging applications 
will draw on the worlds of Virtual Theatres (Giannachi, 
2004), interactive storyline games and other forms of 
entertainment (Wardrip-Fruin & Harrigan, 2004) to 
enhance the visualisation experience, especially where 
virtual worlds involving human artifacts and past and 
current civilisations are involved. 



SWARM INTELLIGENCE FOR 
VISUALISATION 

Understanding the behaviour of biological agents 
in their natural environment is of great importance 
to ethologists and biologists. When these creatures 
move in large numbers, representing them is a 
challenge for orthodox visualisation. 

Working with marine biologists, a 3D model of large 
numbers of swarming krill (Figure 2) has been created. 
The model augments the classic swarming functions of 
separation, alignment and cohesion outlined by Reyn- 
olds (1987). The generated 3D model allows cameras 
to be placed on individual krill in order to generate an 
in-swarm perspective. New research on Antarctic krill 
(Tarling & Johnson, 2006) reveals that they absorb and 
transfer more carbon from the Earth's surface than 
was previously understood. Scientists from the British 
Antarctic Survey (BAS) and Scarborough Centre of 
Coastal Studies at the University of Hull discovered 
that rather than doing so once per 24 hours, Antarctic 
krill 'parachute' from the ocean surface to deeper lay- 
ers several times during the night. In the process they 
inject more carbon into the deep sea when they excrete 
their waste than had previously been understood. Our 
objective has been to provide marine biologists with 
a visualisation and statistical tool that permits them to 
change a number of parameters within the krill marine 
environment and examine the effects of those changes 
over time. The software can also be used as a teaching 
tool for the classroom at varying academic levels. 
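
The three classic swarming rules named above (separation, alignment and cohesion) can be sketched in a few lines. This is a minimal illustration of Reynolds-style flocking, not the krill model itself: the function name `step`, the weights, the neighbourhood radius and the speed limit are all illustrative assumptions, and krill-specific parameters such as foraging or exhaustion are left out.

```python
import math


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _scale(a, k):
    return tuple(x * k for x in a)


def _norm(a):
    return math.sqrt(sum(x * x for x in a))


def _mean(vectors):
    vectors = list(vectors)
    total = (0.0, 0.0, 0.0)
    for v in vectors:
        total = _add(total, v)
    return _scale(total, 1.0 / len(vectors))


def step(positions, velocities, radius=5.0, min_dist=1.0,
         w_sep=0.05, w_ali=0.05, w_coh=0.01, max_speed=2.0, dt=1.0):
    """Advance every agent by one time-step (all weights are assumed values)."""
    new_velocities = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        # Neighbours are the other agents inside the perception radius.
        neighbours = [j for j, q in enumerate(positions)
                      if j != i and _norm(_sub(q, p)) < radius]
        sep = ali = coh = (0.0, 0.0, 0.0)
        if neighbours:
            # Cohesion: steer towards the neighbours' centre of mass.
            coh = _sub(_mean(positions[j] for j in neighbours), p)
            # Alignment: steer towards the neighbours' mean velocity.
            ali = _sub(_mean(velocities[j] for j in neighbours), v)
            # Separation: move away from neighbours that are too close.
            for j in neighbours:
                offset = _sub(p, positions[j])
                if _norm(offset) < min_dist:
                    sep = _add(sep, offset)
        v = _add(v, _add(_scale(sep, w_sep),
                         _add(_scale(ali, w_ali), _scale(coh, w_coh))))
        # Clamp the speed so that agents cannot accelerate without bound.
        speed = _norm(v)
        if speed > max_speed:
            v = _scale(v, max_speed / speed)
        new_velocities.append(v)
    new_positions = [_add(p, _scale(v, dt))
                     for p, v in zip(positions, new_velocities)]
    return new_positions, new_velocities
```

Calling `step` repeatedly with a smaller `dt` slows the simulation down, which is one simple way to expose a run-speed control of the kind the model offers.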




Figure 2. 3D krill and sample 3D swarm 
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The marine biologist may modify parameters re- 
lating to an individual krill's field of view, foraging 
speed, collision avoidance, exhaustion speed, desire 
for food etc. The researcher may also modify more 
global parameters relating to sea currents, temperature 
and quantity and density of algae (food) etc. 

Upon execution, the biologist may interact with the 
model by changing the 3D viewpoint and modifying the 
time-step which controls the run speed of the model. 
Krill states are represented using different colours. 
For example, a red krill represents a starved, unhealthy 
krill, green means active and healthy, and blue represents 
a digestion period during which krill activity is at a 
minimum. Recent advances in processor and graphics 
technology mean that these sorts of simulations are 
possible using high-specification desktop computers. 
Marine biologists have been very interested in seeing how 
minor changes to certain variables, such as the field of 
view of the krill, can have major consequences for the 
flocking behaviour of the entire swarm over time. 



Figure 3. VR paragliding simulator 




PARAGLIDING SIMULATOR 



The Department of Computer Science at the University 
of Hull have recently developed the world's first ever 
paragliding simulator (Sim Vis, 2007). The system 
provides a paragliding pilot with a virtual reality im- 
mersive flying experience (Figure 3). As far as the user 
is concerned, they are flying a real paraglider. They sit 
in a real harness and all physical user inputs are the 
same as real life. Visuals are controlled via a computer 
and displayed to the user via a head-tracked helmet 
mounted display. The simulator accurately models 
winds (including thermals and updrafts), photorealistic 
terrain and other computer-controlled AI pilots. 

Figure 4 shows a typical view from the paragliding 
simulator. To the right of the image, the user can see 
four paragliding pilots circling a thermal. It is in the 
user's interests to fly to this region to share in the uplift, 
gaining altitude and therefore flight time. 

This prototype is being developed along the lines 
of SOF-Soar (Special Operations Forces Modeling) 
(Tambe, Johnson, Jones, Koss, Laird, Rosenbloom and 
Schwamb, 1995). It requires an expert tutoring system 
encompassing the knowledge of expert pilots. Like 
some virtual learning environments, it needs to build 
a trainee profile from a default and adapt to the needs 
of the novice flyer. For example, the system can create 
an AI pilot flying directly at the user, forcing the user 
to practise collision avoidance rules. If pilots collide 
in the simulator, both pilots will become wrapped in 
each other's canopies and they will plummet down to 
the earth and die. As virtual fly-time accumulates, the 
system needs to adapt to the changing profile of the 
user. An expert system could be used to force the user 
to make certain flight manoeuvres that test the user's 
knowledge of CAA air laws. For example, the AI 
system may decide that the user is an advanced pilot 
because of their excellent use of thermal updrafts. The 
system therefore works out how to put the pilot into 
a compromising situation that would test their skills 
and ability, such as plotting a collision course with the 
pilot when they are flying alongside a cliff edge. If the 
pilot is a novice, then the system would present simpler 
challenges such as basic collision avoidance. This is an 
advanced knowledge engineering project that blends 
traditional AI areas such as knowledge engineering with 
more nouvelle fields such as agent-based reactive and 
cognitive architectures and state-of-the-art visualisation 
and immersive technologies. 
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Figure 4. Real-time pilots view from the paragliding simulator including other AI controlled pilots 





CONCLUSION 

This article suggests that cognitive science and artificial 
intelligence have a major role to play in emerging inter- 
face devices that require believable agents. As virtual 
environments and immersive technologies become ever 
more sophisticated, the capabilities of the interacting 
components will need to become smarter, making use 
of artificial-life, artificial intelligence and cognitive 
science. This will include the simulation of human 
behaviour in interactive worlds whether in the model- 
ling of the way data and information is manipulated or 
in the use of hardware devices (such as the paraglider) 
within a virtual environment. 
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KEY TERMS 

Air Flight Control: A service provided by ground-based 
controllers who direct aircraft on the ground and in 
the air. A controller's primary task is to separate 
aircraft, that is, to prevent them from coming too close 
to each other by use of lateral, vertical and longitudinal 
separation. Secondary tasks include ensuring an orderly 
and expeditious flow of traffic and providing information 
to pilots, such as weather, navigation information 
and NOTAMs (Notices to Airmen). 

Flocking: A computer model of coordinated animal 
motion such as bird flocks and fish schools. Typically 
based on three dimensional computational geometry 
of the sort normally used in computer animation or 
computer aided design. 

Knowledge Engineering: Knowledge engineering 
is a field within artificial intelligence that develops 
knowledge-based systems. Such systems are computer 
programs that contain large amounts of knowledge, 
rules and reasoning mechanisms to provide solutions to 
real-world problems. A major form of knowledge-based 
system is an expert system, one designed to emulate 
the reasoning processes of an expert practitioner (i.e. 
one having performed in a professional role for very 
many years). 

Virtual and Immersive Environments: Virtual 
environments coupled with immersive technologies 
provide the sensory experience of being in a computer 
generated, simulated space. They have potential uses 
in applications ranging from education and training to 
design and prototyping. 

Virtual Theatres: The concept of "Virtual Theatre" 
is vague and there seems to be no commonly accepted 
definition of the term. It can be defined as a virtual world 
inhabited by autonomous agents that are acting and 
interacting in an independent way. These agents may 
follow a predetermined manuscript, or act completely 
on their own initiative. 
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INTRODUCTION 

The use of speech in human-machine interaction is 
increasing as computer interfaces become more 
complex but also more usable. These interfaces make 
use of the information obtained from the user through 
the analysis of different modalities and respond 
by means of different media. The origin of 
multimodal systems can be found in their precursor, the 
"Put-That-There" system (Bolt, 1980), an application 
operated by speech and gesture recognition. 

The use of speech as one of these modalities, both to 
receive commands from users and to provide spoken 
information, makes human-machine communication more 
natural. There is a growing number of applications that 
use speech-to-text conversion and animated characters 
with speech synthesis. 

One way to improve the naturalness of these interfaces 
is to incorporate the recognition of the user's 
emotional states (Campbell, 2000). This generally 
requires the creation of speech databases with 
authentic emotional content that allow robust analysis. 
Cowie, Douglas-Cowie & Cox (2005) present some 
databases and show an increase in multimodal databases, 
and Ververidis & Kotropoulos (2006) describe 
64 databases and their application. When creating 
this kind of database, the main problem that arises is 
the naturalness of the locutions, which depends directly 
on the method used in the recordings, given that they 
must be controlled without interfering with the 
authenticity of the locutions. Campbell (2000) and 
Schroder (2004) propose four different sources for 
obtaining emotional speech, ordered from less control 
but more authenticity to more control but less 
authenticity: i) natural occurrences, ii) provocation of 
authentic emotions in laboratory conditions, 
iii) stimulated emotions by means of prepared texts, and 
iv) acted speech reading the same texts with different 
emotional states, usually performed by actors. 

On the one hand, corpora designed to synthesize 
emotional speech are based on studies centred on the 
listener, following the distinction made by Schroder 
(2004), because they model the speech parameters 
in order to transmit a specific emotion. On the other 
hand, emotion recognition implies studies centred on 
the speaker, because they relate the speaker's 
emotional state to the parameters of the speech. The 
validation of a corpus used for synthesis involves both 
kinds of studies: the former since it will be used for 
synthesis and the latter since recognition is needed to 
evaluate its content. The best validation system is the 
selection of the valid utterances 1 of the corpus by 
human listeners. However, the large size of a corpus 
makes this process impractical. 



BACKGROUND 

Emotion recognition has long been an active research 
field in human-machine interaction, as can be 
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observed in Cowie et al. (2001). Some studies have 
been carried out to observe the influence of emotion on 
speech signals, such as the work presented by Rodriguez 
et al. (1999). More recently, owing to the increasing 
power of modern computers, which allows huge amounts 
of data to be analysed in a relatively short time, machine 
learning techniques have been used to recognise 
emotions automatically from labelled expressive speech 
corpora. Most of these studies have been centred on 
a few algorithms and small sets of parameters. 

However, recent works have performed more 
exhaustive experiments testing different machine 
learning techniques and datasets, such as those described 
by Oudeyer (2003). All these studies had the goal 
of achieving the best possible recognition rate, 
obtaining, in many cases, better results than those 
obtained in subjective tests (Oudeyer, 2003; Planet, 
Moran & Formiga, 2006; Iriondo, Planet, Socoro & Alias, 
2007). Nevertheless, many differences can be found 
when analyzing the results obtained from objective 
and subjective classifications and, to our knowledge, 
there were no studies with the goal of emulating these 
subjective criteria before those carried out by Iriondo, 
Planet, Alias, Socoro & Martinez (2007). 



VALIDATION OF AN EXPRESSIVE 
SPEECH CORPUS BY MAPPING 
SUBJECTIVE CRITERIA 



The creation of a speech corpus with authentic 
emotional content is one of the most important challenges 
in the study of expressive speech. Once the corpus is 
recorded, a validation process is required to prune those 
utterances whose perceived emotion differs from their 
label. This article is based on the work presented by 
Iriondo, Planet, Alias, Socoro & Martinez (2007) and 
describes the production of an expressive speech corpus 
in Spanish intended for use in a synthesis system, 
validated by automatically pruning "bad" utterances 
so as to emulate the criteria of human listeners. 

The Production of the Corpus 

The recording of the corpus has been carried out by a 
female professional speaker. There is a high consensus 
in the scientific community for obtaining emotional 
speech by means of this strategy for synthesis purposes 
(Cowie et al., 2005), although other authors argue in 
favor of constructing enormous corpora gathered from 
recordings of daily life (Campbell, 2005). For the 
design of texts semantically related to different 
expressive styles, we have made use of an existing textual 
database of advertisements extracted from newspapers 
and magazines. Based on a study of the voice in 
audio-visual publicity (Montoya, 1998), five categories 
of the textual corpus have been chosen and the most 
suitable emotion/style has been assigned to each: 
new technologies (neutral-mature), education 
(joy-elation), cosmetics (sensual-sweet), automobiles 
(aggressive-hard) and trips (sad-melancholic). The 
recorded database has 4638 sentences and is 5 hours 
12 minutes long. 

From these categories, a set of sentences has been 
chosen by means of a greedy algorithm (François & 
Boeffard, 2002) that has allowed us to select 
phonetically balanced sentences. In addition to looking 
for a phonetic balance, phrases that contain foreign 
words and abbreviations have been discarded because they 
hinder the automatic process of phonetic transcription 
and labeling. 

The corpus has been segmented into phrases and then 
into phonemes by means of a semiautomatic process based 
on forced alignment with Hidden Markov Models. 

Acoustic Analysis 

Cowie et al. (2001) show how prosodic features of 
speech (fundamental frequency (F0), energy, duration 
of phones, and frequency of pauses) are related to vocal 
expression of emotion. The analysis of F0 performed 
in this work is based on the result of the pitch marks 
algorithm described by Alias, Monzo & Socoro (2006). 
This system can assign marks over the whole signal, 
interpolating values from the neighbouring phonemes in 
unvoiced segments and silences. Energy is measured 
with 20 ms rectangular windows and 50% overlap, 
computing the mean energy in decibels (dB) every 10 
ms. Also, rhythm parameters have been incorporated 
using the z-score as a means to analyze the temporal 
structure of speech (Schweitzer & Mobius, 2003). 
Moreover, for each utterance two parameters are 
considered: the number of pauses per time unit and the 
percentage of silence with respect to the total time. 
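
The energy measurement described above can be illustrated with a short sketch. It assumes a mono signal given as a list of floating-point samples; the function name `frame_energy_db` and the small floor used to avoid taking the logarithm of zero are assumptions made for illustration, not details taken from the original work. With 20 ms windows and 50% overlap, consecutive frames start 10 ms apart, which matches the one-value-every-10-ms rate in the text.

```python
import math


def frame_energy_db(samples, sample_rate, win_ms=20.0, overlap=0.5, floor=1e-12):
    """Mean energy in dB for rectangular windows with the given overlap."""
    win = int(round(sample_rate * win_ms / 1000.0))
    # A 50% overlap of a 20 ms window gives a 10 ms hop between frames.
    hop = max(1, int(round(win * (1.0 - overlap))))
    energies = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        mean_power = sum(x * x for x in frame) / win
        # The floor (an assumed value) keeps silent frames finite.
        energies.append(10.0 * math.log10(max(mean_power, floor)))
    return energies
```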






Emulating Subjective Criteria in Corpus Validation 



Subjective Test 

A subjective test makes it possible to validate the 
expressiveness of a corpus of acted speech from a user's 
point of view. Nevertheless, an extensive evaluation of 
the complete corpus would be very tedious because of the 
large number of elements in it. For this reason, only 
about 10 percent of the corpus utterances have been 
selected for this test. A forced answer test has been 
designed using the TRUE platform (Planet, Iriondo, 
Martinez, & Montero, 2008) with the question: What 
emotional state do you recognize from the voice of the 
speaker in this phrase? The possible answers are the 5 
emotional styles of the corpus plus one more option, 
Don't know / Another (Dk/A), in order to minimize the 
bias that confusing cases would introduce in the results. 
Adding this option carries the risk that some users 
overuse it to finish the test sooner (Navas, Hernaez 
& Luengo, 2006). However, this effect has not been 
observed in this experiment. The evaluators were 30 
volunteers with a quite heterogeneous profile. 

The average classification accuracy achieved in the 
subjective test is 87%. The test also reveals that the sad 
style (SAD) is the best rated (98.5% on average). The 
second and third best rated styles are sensual (SEN) 
(87.2%) and neutral (NEU) (86.1%), followed by happy 
(HAP) (81.9%) and aggressive (AGR) (81.6%). The 
aggressive and happy styles are often confused with 
each other. Moreover, sensual is occasionally 
misclassified as sad or neutral. The Dk/A option is 
hardly used, although it is more frequent for neutral and 
sensual than for the rest of the styles. 

To decide whether utterances are not optimally performed 
by the speaker, two simple rules have been created 
from the subjective test results by empirically adjusting 
two thresholds. These rules remove utterances whose 
identification percentage is lower than 50% or whose 
Dk/A percentage is larger than 12%. There are 33 out of 
the 480 utterances of the subjective test that satisfy at 
least one rule. 
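
The two pruning rules can be written down directly. The sketch below assumes that the listeners' answers for one utterance are available as a dictionary of vote counts keyed by style label, with `"Dk/A"` as the key for the extra option; this data layout and the function name `should_prune` are hypothetical, while the two thresholds are the ones given in the text.

```python
def should_prune(votes, intended, min_identification=0.50, max_dka=0.12):
    """Return True if an utterance satisfies at least one removal rule.

    votes    -- mapping from answer label to number of listeners (assumed layout)
    intended -- the style label the utterance was recorded with
    """
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no votes recorded for this utterance")
    identification = votes.get(intended, 0) / total
    dka = votes.get("Dk/A", 0) / total
    # Rule 1: identified as the intended style by fewer than 50% of listeners.
    # Rule 2: more than 12% of listeners chose "Don't know / Another".
    return identification < min_identification or dka > max_dka
```

For example, with 30 listeners an utterance recorded as HAP that receives 14 HAP votes and 16 AGR votes is identified by fewer than half of the listeners and would be flagged.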

Statistical Analysis, Datasets and 
Supervised Classification 

Iriondo, Planet, Socoro & Alias (2007) present an 
emotion recognition experiment covering different 
datasets and algorithms. The experiment is done 
with the same corpus that is considered in this 
article. Each utterance is defined by 464 attributes 
representing the speech signal characteristics, but this 
first dataset is divided into different subsets to reduce 
its dimensionality. The experiments show almost the 
same results with the full dataset and with a dataset 
reduced to 68 parameters, so the reduced dataset is used 
in this work. In this dataset, the prosody of an utterance 
is represented by the vectors of logarithmic F0, energy 
in dB and normalized durations. For each sequence, 
the first derivative is also calculated. Some statistics 
are obtained from these sequences: mean, variance, 
maximum, minimum, range, skewness, kurtosis, 
quartiles, and interquartile range. Thus, 68 parameters 
per utterance are computed, including the two 
pause-related parameters described previously. 
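
One reading of this description that reproduces the figure of 68 parameters is sketched below: eleven statistics (counting the quartiles as three values) for each of the three prosodic sequences and for each first derivative give 3 x 2 x 11 = 66 values, and the two pause parameters bring the total to 68. That decomposition is an assumption made here because the count matches; the source does not spell it out. The function names, the use of population moments and the quartile method of Python's `statistics.quantiles` are likewise illustrative choices.

```python
import statistics


def describe(seq):
    """Eleven summary statistics of one sequence (needs at least 2 values)."""
    n = len(seq)
    mean = statistics.fmean(seq)
    var = statistics.pvariance(seq, mu=mean)
    std = var ** 0.5
    if std > 0:
        skew = sum((x - mean) ** 3 for x in seq) / (n * std ** 3)
        kurt = sum((x - mean) ** 4 for x in seq) / (n * std ** 4)
    else:
        skew = kurt = 0.0  # a constant sequence has no shape to measure
    q1, q2, q3 = statistics.quantiles(seq, n=4)
    return [mean, var, max(seq), min(seq), max(seq) - min(seq),
            skew, kurt, q1, q2, q3, q3 - q1]


def derivative(seq):
    """First difference, used here as the first derivative of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def utterance_features(log_f0, energy_db, durations,
                       pauses_per_second, silence_ratio):
    """Build the 68-value vector: 3 sequences x 2 x 11 statistics + 2 pause values."""
    features = []
    for seq in (log_f0, energy_db, durations):
        features.extend(describe(seq))
        features.extend(describe(derivative(seq)))
    features.extend([pauses_per_second, silence_ratio])
    return features
```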

In the referenced work, twelve machine learning 
algorithms are tested considering different datasets. All 
the experiments are carried out using Weka software 
(Witten & Frank, 2005) by means of ten-fold 
cross-validation. Very high recognition rates are 
achieved, as in other previously referenced works: SMO 
(the SVM of Weka) obtained the best results (~97%), 
followed by Naive-Bayes (NB) with 94.6% and J48 (the 
Weka decision tree based on C4.5) with 93.5%, considering 
the average of all the results. The conclusion is that, 
in general, the styles of the developed speech corpus 
can be clearly differentiated. Moreover, the results of 
the subjective test showed a good authenticity of the 
expressive speech content. However, it is considered 
necessary to go one step further by developing a method 
that validates each utterance of the corpus following 
subjective criteria and not only the automatic 
classification from the space of attributes. 

Subjective Based Attribute Selection 

The proposed approach to finding the optimum classifier 
schema able to follow the subjective criteria consists 
of the development of an attribute selection method 
guided by the results obtained in the subjective test. 
Once the best schema is found, it will be applied to the 
whole corpus in order to generate a list of candidate 
utterances to be pruned. Two methods for attribute 
selection have been developed in order to determine 
the subset of attributes that allows a better mapping of 
the subjective test results. As previously mentioned, 
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the original set has 68 attributes per utterance, so an 
exhaustive exploration of subsets is not viable and 
greedy search procedures are used instead. On the one 
hand, a Forward Selection (FW) process is chosen, 
which starts without any attribute and adds one at a 
time; on the other hand, the Backward Elimination 
(BW) technique is considered, which starts from 
the full set and deletes one attribute at a time. At each 
step, the classifier is tested with the 480 utterances that 
are considered in the subjective test, having previously 
been trained with the 4158 remaining utterances. The 
misclassified cases are then used to evaluate the subset 
of attributes involved. The novelty of this 
process is the use of a subjective-based measure to 
evaluate the expected performance of the subset at each 
iteration. The measure used is the F1-score computed 
from the precision and the recall of the misclassified 
utterances with respect to the 33 utterances rejected by 
the subjective test. 
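
The forward variant of this search, scored with the subjective-based F1, can be sketched as follows. The callback `misclassified_with` is hypothetical: it stands for training a classifier on the remaining utterances with the given attributes and returning the identifiers of the test utterances it misclassifies. The stopping rule (stop when no addition improves the score) follows the usual definition of greedy forward selection and is an assumption about the details of the original procedure.

```python
def f1_against_reference(misclassified, rejected):
    """F1 of the misclassified set measured against the rejected set."""
    misclassified, rejected = set(misclassified), set(rejected)
    hits = len(misclassified & rejected)
    if hits == 0:
        return 0.0
    precision = hits / len(misclassified)
    recall = hits / len(rejected)
    return 2 * precision * recall / (precision + recall)


def forward_selection(attributes, misclassified_with, rejected):
    """Greedy forward selection guided by the subjective-based F1.

    misclassified_with(subset) is a hypothetical callback returning the ids
    of the test utterances misclassified when only `subset` is used.
    """
    selected, best = [], 0.0
    remaining = list(attributes)
    while remaining:
        scored = [(f1_against_reference(misclassified_with(selected + [a]),
                                        rejected), a)
                  for a in remaining]
        score, attribute = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break  # no remaining attribute improves the mapping
        selected.append(attribute)
        remaining.remove(attribute)
        best = score
    return selected, best
```

Backward elimination would mirror this loop, starting from the full attribute list and removing one attribute per iteration.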

Results 

The process consists of six iterations: one per 
combination of algorithm (SMO, NB and J48) and attribute 
selection technique (FW and BW). The SMO algorithm 
obtains practically the same F1-score (~0.50) with both 
the FW and BW attribute selection techniques. The 
results for NB are also similar for both techniques 
(F1 ~ 0.43). The main difference is for J48, which 
obtains a better F1 with the FW process (0.45) than 
with the BW process (0.37). Moreover, SMO-FW seems to 
be the most stable configuration, as it achieves almost 
the same result (F1 = 0.49) over a broad range of 
attribute subsets (the results are very similar with 
between 18 and 35 attributes). Results show that 
J48-FW has the best recall (18/33), which implies the 
highest number of coincidences; however, the precision 
is quite low (18/51), indicating an excessive 
number of general misclassifications. 



FUTURE TRENDS 

Future work will consist of applying this automatic 
process to the full corpus. For instance, ten-fold 
cross-validation would be a good technique to cover 
the whole corpus. The misclassified utterances would 
be candidates for pruning. A first approach would 
consist of running different classifiers and selecting 
the final candidates by a stacking technique (Witten & 
Frank, 2005). Also, we will evaluate the suitability of 
the proposed method by performing acoustic modeling 
of the emotional speech that remains after pruning and 
comparing it with the results obtained with the whole 
corpus. A lower error in the prosodic estimation would 
support the hypothesis that it is advisable to eliminate 
the bad utterances from the corpus beforehand. 



CONCLUSION 

This article sets out the need for automatic validation 
of an expressive speech corpus, given the impossibility 
of carrying out a subjective test on a large corpus. Also, 
an approach to achieve this goal has been presented, 
based on a subjective test with 30 subjects and 
approximately 10 percent of the utterances. The result of 
this test has shown that some utterances are perceived 
with a dissimilar or poor expressive content with respect 
to their labeling. The proposed automatic classification 
tries to learn from the result of the subjective test in 
order to generalize the solution to the rest of the 
corpus, by means of a suitable attribute selection carried 
out by two different strategies (Forward Selection and 
Backward Elimination) using the F1-measure computed 
with the utterances rejected by the subjective test as 
the reference. 
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KEY TERMS 

Backward Elimination Strategy: Greedy attribute 
selection method that evaluates the effect of removing 
one attribute from a dataset. The attribute whose 
deletion improves the performance of the dataset is 
chosen to be removed for the next iteration. The process 
begins with the full set of attributes and stops when no 
removal improves performance. 

Decision Trees: Classifier consisting of a tree 
structure. A test sample is classified by evaluating it 
at each node, starting at the top one and choosing a 
specific branch depending on this evaluation. The 
classification of the sample is the class assigned in 
the bottom node. 

F1-Measure: The F1-measure is a way of 
combining the precision and recall measures of a 
classifier by means of an evenly weighted harmonic mean 
of the two. Its expression is 
F1-measure = (2 x precision x recall) / (precision + recall). 

Forward Selection Strategy: Greedy attribute 
selection method that evaluates the effect of adding 
one attribute to a dataset. The attribute that improves 
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the performance of the dataset is chosen to be added to 
the dataset for the next iteration. Process begins with 
no attributes and stops when adding new attributes 
provides no performance improvement. 

Greedy Algorithm: Algorithm, usually applied to 
optimization problems, based on the idea of finding a 
global solution for a problem (although not necessarily 
the optimal one) by choosing locally optimal solutions 
in different iterations. 

Naive-Bayes: Probabilistic classifier based on 
Bayes' rule that assumes that all the parameter-value 
pairs that define a case are independent. 

Precision: Measure that indicates the percentage 
of correctly classified cases of one class with regard 
to the number of cases that are classified (correctly or 
not) as members of that class. This measure says if the 
classifier is assuming as members of one specific class 
cases from other different classes. 



Recall: Measure that indicates the percentage of 
correctly classified cases of one class with regard to the 
total number of cases that actually belong to this class. 
This measure says if the classifier is ignoring cases that 
should be classified as members of one specific class 
when doing a classification. 

SVM: Acronym of Support Vector Machines. SVMs 
are models able to distinguish members of classes 
whose boundaries are not linear. This is possible by 
means of a non-linear transformation of the input data 
that maps it into a higher-dimensional space where the 
data can be easily divided by a maximum margin 
hyperplane. 



ENDNOTE 

1 Considering as valid utterances those with the 
adequate expressiveness. 
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INTRODUCTION 

Deformable models are well-known examples of 
artificially intelligent systems (AIS). They have played 
an important role in the challenging problem of extracting 
useful information about regions and areas of interest 
(ROIs) imaged through different modalities. The 
challenge also lies in extracting boundary elements 
belonging to the same ROI and integrating them into a 
coherent and consistent model of the structure. 
Traditional low-level image processing techniques that 
consider only local information can make incorrect 
assumptions during this integration process and generate 
unfeasible object boundaries. To solve this problem, 
deformable models were introduced (Ivins, 1994), 
(McInerney, 1996), (Wang, 2000). These AI models are 
currently important tools in many scientific disciplines 
and engineering applications (Duncan, 2000). 

Deformable models offer a powerful approach to 
accommodate the significant variability of structures 
within a ROI over time and across different individu- 
als. Therefore, they are able to segment, match and 
track images of structures by exploiting (bottom-up) 
constraints derived from the image data together with 
(top-down) a priori knowledge about the location, size, 
and shape of these structures. 

The mathematical foundations of deformable mod- 
els represent the confluence of geometry, physics and 
approximation theory. Geometry serves to represent 
object shape, physics imposes constraints on how the 
shape may vary over space and time, and optimal ap- 
proximation theory provides the formal mechanisms 



for fitting the models to data. The physical interpreta- 
tion views deformable models as elastic bodies which 
respond to applied force and constraints. 



BACKGROUND 

The deformable model that has attracted the most 
attention to date is the active contour model (ACM), 
well known as snakes, presented by Kass et al. (Kass, 
1987), (Cootes & Taylor, 1992). The mathematical 
basis of snake models is common to all deformable 
models, which are based on energy-minimizing 
techniques. 

Recently, there has been increasing interest in 
level set or geodesic segmentation methods, introduced 
in (Osher & Sethian, 1988), (Malladi, 1995) and 
(Caselles, 1997). The level set approach solves the 
ACM minimization problem by computing a 
minimal distance curve. This method allows 
topological changes within the ROIs and extension to 3D. 
Therefore, for some applications it is an improvement 
on the classical ACM. 

Other approaches to deformable models are those 
based on dynamic models or physically based 
techniques, for example superquadrics (Terzopoulos, 1991) 
and the finite element model (FEM) (Pentland, 1991). 
The FEM accurately describes changes in position, 
orientation and shape. The FEM can be used to solve 
fitting, interpolation or correspondence problems. In 
the FEM, interpolation functions are developed that 
allow continuous material properties, such as mass 
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and stiffness, to be integrated across the ROIs. This 
last property makes them different from the previous 
models and therefore more suitable for some artificial 
vision applications. 

The next sections contain a brief introduction to the 
mathematical foundations of deformable models. 



ENERGY MINIMIZING DEFORMABLE 
MODELS 

Geometrically, an active contour model is a parametric 
contour embedded in the image plane (x, y) ∈ ℜ². The 
dynamic contour is represented as a time-varying curve, 
v(s,t) = (x(s,t), y(s,t)), where x and y are the coordinate 
functions and s ∈ [0, 1] is the parametric domain. The 
curve evolves until the ROI, subject to constraints 
from a given image I(x, y), reaches an equilibrium. 
Thus, initially a curve is set around the ROI that, via 
minimization of an energy functional, moves normal to 
itself and stops at the boundary of the ROI. The energy 
functional is defined as: 

$$E_{snake}(v,t) = \int_{0}^{1} \left\{ E_{internal}(v(s,t)) + E_{external}(v(s,t)) \right\} ds \qquad (1)$$



The first term, E_internal, represents the internal energy 
of the spline curve due to mechanical properties of 
the contour, stretching and bending. It is a sum of two 
components, the elasticity and rigidity energy: 
where α controls the tension of the contour, while β 
controls its rigidity. Thus, these functions determine 
how the snake can stretch or bend at any point s of the 
spline curve. The second term couples the snake to 
the image: 

$$E_{ext\_potential}(s,t) = P(v(s,t))$$




where P(v(s,t)) denotes a scalar potential function 
defined on the image plane. It is responsible for 
attracting the contour towards the object in the image 
(external energy). Therefore, it can be expressed as a 
weighted combination of energy functions. 
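
The energy terms described above can be evaluated on a discretised contour. The sketch below assumes the standard form of the internal energy from Kass et al., with a tension term weighted by alpha times the squared first derivative of v and a rigidity term weighted by beta times the squared second derivative; the expression itself is not reproduced in this copy of the text, so that form is an assumption. It also assumes a closed contour, constant alpha and beta, finite differences in which the sampling step is absorbed into the weights, and a user-supplied `potential(x, y)` function. The name `snake_energy` is illustrative.

```python
def snake_energy(points, potential, alpha=1.0, beta=1.0):
    """Discrete energy of a closed contour given as a list of (x, y) vertices."""
    n = len(points)
    total = 0.0
    for i in range(n):
        px, py = points[i - 1]            # previous vertex (wraps around)
        cx, cy = points[i]                # current vertex
        nx, ny = points[(i + 1) % n]      # next vertex (wraps around)
        # Tension: squared first difference, penalises stretching.
        stretch = (nx - cx) ** 2 + (ny - cy) ** 2
        # Rigidity: squared second difference, penalises bending.
        bend = (nx - 2 * cx + px) ** 2 + (ny - 2 * cy + py) ** 2
        # External term: the image potential sampled at the vertex.
        total += alpha * stretch + beta * bend + potential(cx, cy)
    return total
```

With a zero potential and beta set to zero, shrinking a contour always lowers this energy, which is why an image-derived potential is needed to stop the curve at the boundary of the ROI.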

To apply snakes to images, external potentials are 
designed whose local minima coincide with intensity 
extrema, edges and other image features of interest. 
For example, the contour will be attracted to intensity 
edges in an image by choosing a potential 

$$P(v(s,t)) = -c \, \left| \nabla \left[ G * I(x,y) \right] \right|$$

where c controls the magnitude of the potential, ∇ is 
the gradient operator and G * I(x,y) denotes the image 
convolved with a Gaussian smoothing filter. 

In accordance with the calculus of variations, the 
contour v(s,t) that minimizes the energy of (1) must 
satisfy the Euler-Lagrange equation. Moreover, the La- 
grange equation of motion for a snake with the internal 
and external energy given by equation (1) is: 
with a mass density μ and a damping density γ. This 
leads to dynamic deformable models that unify the 
description of shape and motion, making it possible 
to quantify not just static shape, but also shape evolu- 
tion through time. The first two terms of this partial 
differential equation represent inertial and damping 
forces. The remaining terms represent the internal 
stretching, the bending forces and the external forces. 
Equilibrium is achieved when the internal and exter- 
nal forces balance and the contour comes to rest, i.e., 
inertial and damping forces are zero, which yields the 
equilibrium condition. 

Traditional snake models are known to be limited
in several aspects, such as their sensitivity to the initial
contours. They are not parameter-free and do not
handle changes in the topology of the shape. That is,
when the image contains more than one object and,
for instance, the initial prediction of v(s,t) surrounds
all of them, it is not possible to detect every object.
Special topology-handling procedures must be added.
Some techniques have been proposed to solve these 
drawbacks. These techniques are based on information 
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fusion, dealing with ACM in addition to curvature driven 
flows and geometrical distance conditions (Solaiman, 
1999), (Caselles, 1997). 



GEODESIC ACTIVE CONTOUR MODEL 

The geodesic active contour model is based on the
computation of a minimal-distance curve. Thereby, the AC
evolves following the geometric heat flow equation. Let
us consider a particular class of snake models in which
the rigidity coefficient is set to zero, that is β = 0. Two
main reasons motivate this selection: i) it allows
us to derive the relation between these energy-based
active contours and geometric curve evolution ones; ii)
the regularization effect on the geodesic active contours
comes from curvature-based curve flows, obtained only
from the other terms in equation (1). This makes it
possible to achieve smooth curves in the proposed approach
without the high-order smoothness given by β
in energy-based approaches.

Moreover, this smoothness component in equation
(1) appears in order to minimize the total squared
curvature. It is possible to prove that the curvature flow used
in the geodesic model decreases the total curvature. The
use of curvature-driven curve motions as smoothing
terms has proven to be very efficient. Therefore, curve
smoothing is also obtained with β = 0, keeping only
the first regularization term. Assuming this, equation
(1) may be reduced to:

$$E_{geo}(s) = \int \alpha \, |v_s(s,t)|^2 \, ds - \int c \, |\nabla I[v(s,t)]| \, ds \qquad (2)$$



Observe that, by minimizing the functional of equation
(2), we are trying to locate the curve at the points
of maximum |∇I| (acting as an edge detector) while keeping
a certain smoothness in the curve (object boundary).
This is actually the goal in the general formulation
(equation 1) as well.

It is possible to extend equation (2), generalizing
the edge detector part in the following way: let g: [0,
∞[ → ℝ⁺ be a strictly decreasing function, which acts
as a function of the image gradient used for the stopping
criterion. Hence we can replace -|∇I| with g(|∇I|)²,
obtaining a general energy function:
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(3) 



The solution of the particular energy snake model of
equation (3) is given by a geodesic curve in a Riemannian
space induced from the image I(x,y), where a geodesic
curve is a local minimal-distance path between given
points. To show this, the classical Maupertuis' principle
from dynamical systems is used together with Fermat's
principle.

By assuming E_internal = E_ext_potential, it is possible to
reduce the minimization of equation (1) to the following
form:

$$\min_{v(s)} \int g(|\nabla I(v(s))|) \, |v_s(s)| \, ds$$



This is done by using the Euler-Lagrange equation and defining
an embedding function ψ(s,t) of the set of curves v(s),
that is, an implicit representation of v(s). Assuming
that v(s) is a level set of a function ψ(s,t): [0,a] × [0,b]
→ ℝ, the following equation for curve/surface evolution
is derived:

$$\frac{\partial \psi}{\partial t} = g(|\nabla I|) \, (C + K) \, |\nabla \psi|$$

where C is a positive real constant and K is the Euclidean
curvature in the direction of the normal.

To summarize, the force (C + K) acts as the internal
force in the classical energy-based snake model,
smoothness being provided by the curvature part of the
flow. The heat flow K|∇ψ| is the regularization curvature
flow that replaces the second-order smoothness term
in equation (1). The external, image-dependent force
is given by the stopping function g(|∇I|). Thus, this
function stops the evolving curve when it arrives at
the object's boundaries.
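
The role of the stopping function can be illustrated with one explicit time step of the evolution equation on a regular grid. This is only a sketch: the choice g(r) = 1 / (1 + r²), the central differences, the unchanged border pixels and the sign convention are assumptions made for the example, and practical implementations usually rely on more careful (upwind) schemes.

```python
import math


def stopping_function(grad_magnitude):
    """One common strictly decreasing choice (an assumption): g(r) = 1 / (1 + r^2)."""
    return 1.0 / (1.0 + grad_magnitude ** 2)


def level_set_step(psi, grad_image, c=1.0, dt=0.1, eps=1e-8):
    """One explicit Euler step of d(psi)/dt = g(|grad I|) (C + K) |grad psi|.

    psi and grad_image are lists of rows with the same shape; grad_image
    holds |grad I| per pixel. Border pixels are left unchanged for simplicity.
    """
    h, w = len(psi), len(psi[0])
    new_psi = [row[:] for row in psi]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            px = (psi[y][x + 1] - psi[y][x - 1]) / 2.0
            py = (psi[y + 1][x] - psi[y - 1][x]) / 2.0
            pxx = psi[y][x + 1] - 2.0 * psi[y][x] + psi[y][x - 1]
            pyy = psi[y + 1][x] - 2.0 * psi[y][x] + psi[y - 1][x]
            pxy = (psi[y + 1][x + 1] - psi[y + 1][x - 1]
                   - psi[y - 1][x + 1] + psi[y - 1][x - 1]) / 4.0
            norm2 = px * px + py * py
            norm = math.sqrt(norm2)
            # Curvature of the level sets of psi: div(grad psi / |grad psi|)
            kappa = (pxx * py * py - 2.0 * px * py * pxy + pyy * px * px) / (norm2 ** 1.5 + eps)
            g = stopping_function(grad_image[y][x])
            new_psi[y][x] = psi[y][x] + dt * g * (c + kappa) * norm
    return new_psi


# A planar psi (straight level sets, zero curvature) on a 5x5 grid.
psi = [[float(x) for x in range(5)] for _ in range(5)]
flat = [[0.0] * 5 for _ in range(5)]    # no image gradient: the front moves freely
edge = [[1e6] * 5 for _ in range(5)]    # very strong gradient: the front is stopped
moved = level_set_step(psi, flat)
stopped = level_set_step(psi, edge)
```

With no image gradient the interior values advance by dt * C, whereas a strong gradient drives g towards zero and the update almost vanishes.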

The advantage of using a level set is that one can 
perform numerical computations involving curves and 
surfaces on a fixed Cartesian grid without having to 
parameterize these objects (this is called the Eulerian 
approach). Moreover, level set representation can 
handle topological shape changes as the surface evolves 
and it is less sensitive to initialisation. Figure 1 shows 
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the results of the level set approach applied to a
biomedical colour image. Figure 1.a) shows the original
image and b) shows the initialization, which is composed
of multiple curves. This is also an advantage over the
classical ACM. Figure 1.c) shows the results after 50
iterations and d) shows the final ROI detection.

Figure 2 shows different results for ROI detection
in biomedical images with the geodesic active contour
model. Columns (a) and (c) show the original images and
(b) and (d) the final segmentation results.

The model is efficient in time and accuracy. However,
it also has some drawbacks in terms of efficiency
and convergence: it is not parameter-free, and ψ
depends on the time step, Δt, and on the spatial step.



DEFORMABLE FINITE ELEMENT 
MODEL 

A powerful approach to computing the local minima 
of a functional such as equation (1) is to construct 
a dynamical system that is governed by the energy 
minimization function and allows the system to evolve 
to equilibrium. 



Figure 1. Geodesic active contour model with multiple initializations 
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Figure 2. Results of ROI detection with the geodesic active contour model 












In order to compute a minimum energy solution
numerically, it is necessary to discretize the energy (1).
Thus, the continuous model v is represented in discrete
form by a vector U of shape parameters associated
with the local-support basis functions. The discretized
version of the Lagrangian dynamics equation may be
written as a set of second-order ordinary differential
equations for the displacements U of the discrete nodal
points, that is, a vector of the (Δx, Δy, Δz) displacements of
the n nodal points that represent the ROI, thus:



$$M \ddot{U} + C \dot{U} + K U = F \qquad (4)$$



This is the governing equation of the physical
model, which is characterized by its mass matrix M,
its stiffness matrix K and its damping matrix C; the
vector F on the right-hand side describes the x, y
and z components of the forces acting on the nodes.
Equation (4) is known as the governing equation of the
FEM, and may be interpreted as assigning a
certain mass to each nodal point and a certain material
stiffness and damping between nodal points.
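
To make the dynamics of equation (4) concrete, the following sketch integrates a tiny two-node system until it comes to rest. The lumped (diagonal) mass and damping, the semi-implicit Euler scheme and all numerical values are assumptions chosen for the illustration.

```python
def simulate(mass, damping, stiffness, force, dt=0.01, steps=5000):
    """Integrate M U'' + C U' + K U = F with semi-implicit Euler.

    mass and damping are per-node lists (lumped, i.e. diagonal M and C, an
    assumption of this sketch); stiffness is a full n x n matrix.
    """
    n = len(force)
    u = [0.0] * n   # nodal displacements
    v = [0.0] * n   # nodal velocities
    for _ in range(steps):
        # Update all velocities from the current displacements...
        for i in range(n):
            elastic = sum(stiffness[i][j] * u[j] for j in range(n))
            accel = (force[i] - damping[i] * v[i] - elastic) / mass[i]
            v[i] += dt * accel
        # ...then advance the displacements with the new velocities.
        for i in range(n):
            u[i] += dt * v[i]
    return u


K = [[2.0, -1.0], [-1.0, 2.0]]
F = [1.0, 0.0]
U = simulate(mass=[1.0, 1.0], damping=[2.0, 2.0], stiffness=K, force=F)
```

Once the inertial and damping forces have died out, U satisfies K U = F, which for this toy system is U = (2/3, 1/3).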

The main drawback of the FEM is its large computational
expense. Another important problem when using
the FEM for vision is that all the degrees of freedom
are coupled. Therefore, closed-form solutions are not
possible. In order to solve the problem, the system may
be diagonalized by means of a linear transform into
the vibration modal space of the mesh modelling the
ROI (Pentland, 1991). Thus, vector U is transformed
by a matrix P derived from the free vibration modes
of the equilibrium equation. Therefore, an eigenvalue
problem can be derived with the basis set of eigensolutions
composed of {ω_i, φ_i}. The eigenvector φ_i is called
the i-th mode's shape vector and ω_i is the corresponding
frequency of vibration. Each eigenvector φ_i consists of
the (x, y, z) displacements for each mode that parameterizes
the ROI. Usually the basis set is reduced by its
Karhunen-Loeve (KL) expansion.
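
The decoupling obtained in the modal space can be sketched with a small symmetric eigenvalue solver. The example assumes a unit mass matrix, so that the free-vibration problem reduces to K φ = ω² φ, and uses the classical Jacobi rotation method; the function name and the test matrix are illustrative only.

```python
import math


def jacobi_eigen(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues and eigenvectors of a symmetric matrix by Jacobi rotations.

    Returns (eigenvalues, vectors), where column i of `vectors` is the
    mode shape associated with eigenvalues[i].
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):      # rotate rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):      # accumulate mode shapes: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


K = [[2.0, -1.0], [-1.0, 2.0]]
eigenvalues, modes = jacobi_eigen(K)
frequencies = sorted(math.sqrt(value) for value in eigenvalues)
```

In the basis of the mode shapes the stiffness matrix is diagonal, so each modal amplitude evolves independently of the others.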

The model may also be extended to contain a set of
ROIs. Thus, for a given image in a training set, a vector
containing the largest vibration modes describing
the different deformable ROI surfaces is created. This
random vector may be statistically constrained by
retaining the most significant variation modes of its KL
expansion on the image data set. By these means, the
set of ROI surfaces may be deformed according
to the variability observed in the training set.

Figure 3 shows the results of the FEM model applied
to a brain magnetic resonance image and a lung
computed tomography image. Figure 3.a) shows the
original structure inside the initial mesh surface and
b) shows the model obtained after evolving the spherical
mesh by means of equation (3). Figure 3.c) shows the
original lung structure and d) the final model, also
using a spherical mesh as the initial surface.



FUTURE TRENDS 

Deformable models are suitable for different applications
and domains such as computer vision, computational
fluid dynamics, computer graphics and biomechanics.
However, each model is usually application
dependent, and further improvement and processing
of the model are needed to achieve satisfactory
results. Techniques based on region properties, fuzzy
logic theory and combination of different models have 




Figure 3. Results of the FEM model 
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already been suggested (Ray, 2001), (Bueno, 2004), 
(Yu, 2002), (Bridson, 2005). 

Moreover, these techniques have been broadly used 
to model real life phenomena, like fire, water, cloth, 
fracturing materials, etc. Results and models from these 
research areas may be of practical significance if they 
are applied in Artificial Vision. 



CONCLUSION 

Energy minimizing active models have been shown to be a
powerful technique to detect, track and model interfaces
and shapes of ROIs. Since the work on ACM by Kass
(1987), energy minimizing active models have been applied
and improved by using different mathematical and
physical techniques. The most promising models are
those presented here: the geodesic model for ROI tracking
and the FEM model for ROI modeling.

The models are suitable for color and 3D images. 
The geodesic model can handle topological changes
on the ROI surface and is less sensitive to initialization.
The FEM model may represent the relative location of 
different ROI surfaces and it is able to accommodate 
their significant variability across different images of 
the ROI. The surfaces of each ROI are parameterized by 
the amplitudes of the vibration modes of a deformable 
geometrical mesh, which can handle small rotations and 
translations. However, as mentioned before there is still 
room for further improvements within these models. 
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KEY TERMS 

Active Model: It is a numerical technique for track- 
ing interfaces and shapes based on partial differential 
equations. The model is a curve or surface which itera- 
tively deforms to fit to an object in an image. 

Conventional Mathematical Modeling: The applied
science of creating computerized models. A model is a
theoretical construct that represents a system composed
of a set of regions of interest, with a set of parameters
and variables together with logical and quantitative
relationships between them, using mathematical
language to describe the behavior of the system.
Parameters are determined by finding a curve in 2D or
a surface in 3D, each patch of which is defined by a net
of curves in two parametric directions, which matches
a series of data points and possibly other constraints.

Finite Element Method: Numerical technique used
for finding approximate solutions of partial differential
equations (PDEs) as well as of integral equations. The
solution approach is based either on eliminating the dif- 
ferential equation completely (steady state problems), 
or rendering the PDE into an equivalent ordinary dif- 
ferential equation, which is then solved using standard 
techniques such as finite differences. 



Geodesic Curve: In the presence of a metric, geodesics
are defined to be (locally) the shortest paths between
points in the space. In the presence of an affine connection,
geodesics are defined to be curves whose tangent
vectors remain parallel when transported along the curve.
Geodesics describe the motion of point particles.

Karhunen-Loeve: Mathematical technique
equivalent to the Principal Component Analysis transform,
which aims to reduce multidimensional data sets to lower
dimensions for analysis of their variance.

Modal Analysis: Study of the dynamic properties 
and response of structures and/or fluids under vibrational
excitation. Typical excitation signals can be classed as 
impulse, broadband, swept sine, chirp, and possibly 
others. The resulting response will show one or more 
resonances, whose characteristic mass, frequency and 
damping can be estimated from the measurements. 

Tracking: Tracking is the process of locating a
moving object (or several) over time. An algorithm
analyses the image sequence and outputs the location
of moving targets within the image. A visual tracking
system has two major components: Target
Representation and Localization, and Filtering and
Data Association. The first is mostly a bottom-up
process, which involves segmentation and matching. The
second is mostly a top-down process, which involves
incorporating prior information about the scene or
object, dealing with object dynamics, and evaluating
different hypotheses.
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INTRODUCTION 

"Machine Learning (ML) is the subfield of Artificial 
Intelligence conceived with the bold objective to de- 
velop computational methods that would implement 
various forms of learning, in particular mechanisms 
capable of inducing knowledge from examples or data"
(Kubat, Bratko & Michalski, 1998, p. 3). 

The simplest and best-understood ML task is known
as supervised learning. In supervised learning, each
example consists of a vector of features (x) and a class
(y). The goal of the learning algorithm is, given a set of
examples and their classes, to find a function f that can
be applied to assign the correct class to new examples.
When the function f takes values from a discrete set
of classes {C_1, ..., C_K}, f is called a classifier
(Dietterich, 2002).

Over the last decades it has been reported that, for learning
tasks in which the unknown function f takes more than
two values (multi-class learning problems), a better
approach is to decompose the problem into multiple
two-class classification problems (Ou & Murphey,
2007) (Dietterich & Bakiri, 1995) (Massulli & Valentini,
2000).

This article describes the implementation of a system
whose main task is to classify prohibition road signs
into several categories. In order to reduce the complexity
of the learning problem and to improve the classification
performance, the system is composed of a collection
(ensemble) of independent binary classifiers. In the
proposed approach, each binary classifier is a single-output
neural network (NN) trained to distinguish a
particular kind of road sign from the others.

The proposed system is part of a Driver Support
System (DSS) supported by the Spanish Government
under project TRA2004-07441-C03-C02. For this
reason, one of the main system requirements is that it
should be implemented in hardware so that it can be used
aboard a vehicle for real-time categorization. In order
to fulfill this constraint, the number of
features that describe the instances must be reduced.
As a consequence, if there are k generic road sign types,
k binary NNs are used and k feature selection processes
are executed.



BACKGROUND 

It is known that road signs carry essential information 
for safe driving. Among other things, they permit or 
prohibit certain maneuvers, warn about risk factors, set 
speed limits and provide information about directions, 
destinations, etc. Therefore, road sign recognition is an 
essential task for the development of an autonomous 
Driver Support System. 

In spite of the increasing interest in the last years, 
traffic sign recognition is one of the less studied 
subjects in the field of Intelligent Transport Systems. 
Approaches in this area have been mainly focused on 
the resolution of other problems, such as road border 
detection (Dickmanns & Zapp, 1986) (Pomerlau & 
Jochem, 1996) or the recognition of obstacles in the 
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vehicle's path such as pedestrians (Franke, Gavrilla, 
Gorxig, Lindner, Paetzold & Wohler, 1998) (Handmann,
Kalinke, Tzomakas, Werner & Seelen, 1999) or other 
vehicles (Bertozzy & Broggi, 1998). 

When the number of road sign types is large, the road
sign recognition task is separated into two processes:
detection and classification. The detection process is
responsible for the localization and extraction of the
potential signs from images captured by cameras. Only
once the potential signs have been detected can they
be classified as one of the available road sign types.

In the published research, detection is based on the
color and/or shape of traffic signs (Lalonde & Li, 1995).
To solve the classification task, on the other hand,
several ML algorithms have been used. Among these
techniques it is worth mentioning the Markov Model
(Hsien & Chen, 2003), Artificial Neural Networks (Es- 
calera, Moreno, Salich & Armingol, 1997) (Yang, Liu 
& Huang, 2003), Ring Partitioned Method (Soetedjo & 
Yammada 2005), the Matching Pursuit Filter (Hsu & 
Huang 2001) or the Laplace Kernel classifier (Paclik, 
Novovicova, Pudil, & Somol, 1999). 



A NEURAL NETWORK BASED SYSTEM 
FOR TRAFFIC SIGN RECOGNITION 

In this work, we present the architecture of a system 
whose task is to classify prohibition road signs into 
several categories. This task can be described as a 
supervised learning problem in which the input infor- 
mation comes from a set of road signs arranged in a 
fixed number of categories (classes) and the goal is to 
extract, from the input data, the real knowledge needed 
to classify correctly new signs. 

The proposed system is a Multilayer Perceptron
(MLP) based classifier trained with the Back-Propagation
algorithm. In order to integrate this classification
system into a DSS capable of performing real-time traffic
sign categorization, a hardware implementation on a Field
Programmable Gate Array (FPGA) is necessary.

With the aim of reducing the problem complexity,
an ensemble of specialized neural networks is proposed.
In addition, due to the strict size limitations of ANN
implementations on FPGAs (Zhu & Sutton, 2003), the
construction of each specialized MLP is combined
with a specific reduction in the number of features that
describe the examples.



Traffic Sign Pre-Processing 

Since the signs to be classified are embodied in images 
acquired by a camera attached to a moving vehicle, it can 
be assumed that the signs have a varying size (signs get 
bigger as the vehicle moves toward them). Therefore, 
once the traffic signs have been detected, the first step 
is to normalize them to a specific size. The aim of this 
process is to ensure that all the signs (examples) are 
described by the same number of pixels (features). In 
our approach we have used 32x32 pixel signs. 

Once the signs have been normalized, a grayscale 
conversion is performed. Since the original images are 
represented in the RGB (Red, Green and Blue) color 
space, this conversion is done by adding the red, green 
and blue values for each pixel and dividing by three. As a
result of both processes, each road sign is transformed
into a 1024-element vector in which each pixel is
represented by a real number in the range [0.0, 1.0].
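
The two pre-processing steps can be sketched as follows. The text does not state how the resizing is interpolated or how the values are scaled to [0.0, 1.0]; nearest-neighbour sampling and 8-bit colour channels are therefore assumptions of this example, as is the function name.

```python
def preprocess(rgb_image, size=32):
    """Turn an RGB image into a flat grayscale feature vector.

    rgb_image is a list of rows of (r, g, b) tuples with values in 0..255.
    The image is resized to size x size by nearest-neighbour sampling,
    converted to grayscale by averaging the three channels, and scaled
    to the range [0.0, 1.0].
    """
    height, width = len(rgb_image), len(rgb_image[0])
    features = []
    for row in range(size):
        src_row = row * height // size
        for col in range(size):
            src_col = col * width // size
            r, g, b = rgb_image[src_row][src_col]
            gray = (r + g + b) / 3.0          # average of the three channels
            features.append(gray / 255.0)     # assumes 8-bit channels
    return features


# A 64x48 dummy image that is pure red everywhere.
dummy = [[(255, 0, 0)] * 64 for _ in range(48)]
vector = preprocess(dummy)
```

Whatever the size of the detected sign, the result is always a vector of 32 x 32 = 1024 features.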

System Architecture 

The general framework of the proposed system (Figure 
1) is composed of two modules: the Data Preprocessing 
Module (DPM) and the Classification Module (CLM). 
The DPM's function is to select, from among the 1024
attributes that describe a sign, the subset that each
specialized neural network inside the CLM must receive.
On the other hand, the CLM's function is to classify
each input data set as one of the available prohibition 
road sign-types. Since this module is composed of 
several independent classifiers, in order to obtain the 
final classification, an integration of the individual 
predictions is required. 

To build both, the DPM and the CLM, a new data 
encoding schema is necessary. In particular, the multi- 
class problem has to be decomposed into a set of binary 
subproblems. 

Data Preprocessing Module 

Practical experience shows that using as much input
information (features) as possible does not imply higher
output accuracy. Feature subset selection (Witten &
Frank, 2005) (Hall, 1998) is the procedure of selecting
just the relevant information, avoiding irrelevant and
redundant information and reducing the dimensionality
of the learning task.
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Figure 1. General framework of the proposed architecture 
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The proposed architecture adopts a model in which 
the feature subset that describes an example is not 
unique but depends on the task associated to each clas- 
sifier. In other words, since the classification problem 
is divided in k binary sub-problems, k feature selection 
procedures are necessary. 

In this work, the feature selection module has been 
built using the Weka tool (Witten & Frank, 2005). At 
first, several feature selection algorithms from those 
included in Weka were considered (Sesmero, Alonso- 
Weber, Gutierrez, Ledezma & Sanchis, 2007). After
analyzing both the feature set size and the experimental
results, a combination of Best First (Russell & Norvig,
2003) and Correlation-based Feature Selection (Hall,
1998) was selected as the basis for the DPM
construction.
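
The idea behind Correlation-based Feature Selection can be illustrated with its subset merit, k r_cf / sqrt(k + k (k - 1) r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below only evaluates this merit from correlations supplied by the caller; it is not the Weka implementation, and the search strategy (Best First) is left out.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset in the spirit of CFS.

    feature_class_corr: correlation of each selected feature with the class.
    feature_feature_corr: correlation of every pair of selected features
    (empty when the subset holds a single feature).
    """
    k = len(feature_class_corr)
    mean_cf = sum(feature_class_corr) / k
    mean_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
               if feature_feature_corr else 0.0)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


# Two equally predictive features: a redundant pair versus a complementary pair.
redundant = cfs_merit([0.8, 0.8], [0.9])
complementary = cfs_merit([0.8, 0.8], [0.1])
single = cfs_merit([0.8], [])
```

The complementary pair scores higher than the redundant one, which is why the heuristic tends to discard pixels that carry the same information.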

Classification Module 

The Classification Module is based on a One Against
All (OAA) model. In this model, the final classifier
is composed of a collection of binary classifiers,
each of which is specialized in discriminating a specific
road sign type from the others. Therefore, for a classification
problem where k road sign types have to be
separated, this approach results in a system in which
a different NN is used for each existing class.

Decomposing the global classifier into a set of independent
NNs not only reduces the problem complexity
but also allows the DPM to select the most
significant attribute set for each binary classification
task. In addition, each NN can have its own architecture
(number of hidden nodes, activation function, learning
rate, etc.) and, since there is no connection between the
individual networks, the training can be distributed
over several processors.

ANNs Output Combination 

Once the binary NNs have been trained, the global
classifier system can be generated. However, since each
classifier makes its own prediction, a decision module
that integrates the results from the set of classifiers and
produces a unique final classification is required.
Experimentally, it was found that, for the proposed classification
task, the most efficient decision criterion is selecting
the NN with the highest output value. Therefore, the
formula used in the decision module is:



$$f(x, f_1, f_2, \ldots, f_k) = \arg\max_{i=1,\ldots,k} (f_i) \qquad (1)$$

where f_i is the output value of the neural network
associated with the i-th class.

Classification Process 

When the system receives an unlabeled road sign to be
classified into one of the fixed categories, the sign is
sent to each classifier's input module. The DPM selects
the pixel subset according to its relevant attribute list.
The chosen pixels are used as the input for the asso- 
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dated ANN, which applies its knowledge to make a 
prediction. The individual predictions are sent to the
decision module, which integrates the
received information and produces a unique final
classification. This process is shown in Figure 2.
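
The whole classification process can be summarized in a few lines. In this sketch the lists of selected pixels and the "networks" are hypothetical stand-ins for the trained components: any callable that maps the reduced vector to a score can play the role of a binary NN.

```python
def classify(sign_vector, selected_features, networks):
    """Return the index of the winning class for one 1024-element sign vector.

    selected_features[i] lists the pixel indices kept for class i (the DPM's
    role); networks[i] is any callable mapping that reduced vector to a score.
    Both are hypothetical stand-ins for the trained components.
    """
    outputs = []
    for indices, network in zip(selected_features, networks):
        reduced = [sign_vector[j] for j in indices]
        outputs.append(network(reduced))
    # Decision module: choose the class whose network gives the highest output.
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Toy setup with three "networks" that simply average their inputs.
selected_features = [[0, 1], [2, 3], [4, 5]]
networks = [lambda values: sum(values) / len(values)] * 3
sign = [0.1, 0.2, 0.9, 0.8, 0.3, 0.4] + [0.0] * 1018
winner = classify(sign, selected_features, networks)
```

Each network sees only its own pixel subset, so the ensemble never needs the full 1024-element input at any single classifier.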

Empirical Evaluation 

The proposed system has been validated over 5000
examples arranged in ten generic kinds of prohibition
road signs: no pedestrians, no left/right turn ahead, no
stopping and no parking, no overtaking, and 20, 30, 40,
50, 60 and 100 km/h speed limits.

In order to evaluate our approach, three classifica- 
tion methods have been compared: 

1. The direct multi-class approach,
2. The OAA approach with the full feature space, and
3. The OAA approach with feature selection.

In the direct multi-class approach (experiment 1),
the classification problem has been solved with an MLP
with 1024 (32 x 32) input nodes, one hidden layer with
50 neurons and one output layer with 10 neurons. In
this approach the class associated with each learning pattern
is encoded using a vector C, which has as many
components c_i as there are classes (10). The component
value c_i is 1 if the sign belongs to class i, and 0
in any other case.

In the OAA approach with the full feature space
(experiment 2) the previous net is split into ten binary
ANNs. In other words, this approach uses 10 binary
MLPs with 1024 input nodes, 36 hidden nodes and 1
output node each.

Finally, in the OAA approach with Feature Selection
(experiment 3), the problem is solved with an ensemble
containing 10 binary MLPs with 36 hidden nodes
each. The number of input units and, therefore, the
feature space used by each ANN is determined by the
DPM. This number is shown in Table 1.

In order to build the binary classifiers used in the
last two experiments, a new class encoding schema is
necessary. In both cases, the class associated with each
pattern is encoded using a single bit. Since in both experiments
the i-th binary classifier is trained to distinguish
class i from all the other classes, the new encoding is
equivalent to selecting the c_i component from the previous
codification.
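
The relation between the two encodings can be shown with a short sketch. Classes are indexed from 0 here, and the function names are illustrative.

```python
def one_hot(label, num_classes):
    """Vector C of the direct approach: 1 at the class position, 0 elsewhere."""
    return [1 if i == label else 0 for i in range(num_classes)]


def binary_targets(labels, num_classes):
    """targets[i][n] is the bit of example n in the i-th one-against-all task,
    i.e. the c_i component of its one-hot vector."""
    return [[one_hot(label, num_classes)[i] for label in labels]
            for i in range(num_classes)]


labels = [0, 2, 1, 2]                    # classes of four training examples
targets = binary_targets(labels, num_classes=3)
```

The i-th binary classifier is then trained on the same examples with targets[i] as its desired output.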

In Table 2 we show the estimated classification
accuracy for the described experiments when a 10-fold
cross-validation process is used.

The experimental evaluation reflects that splitting 
the classification task into binary subtasks (experiment 
2) increases the classification accuracy. 

On the other hand, the loss of classification accu- 
racy when the feature selection process is performed 
(experiment 3) is not very significant compared with 
the benefits of the drastic input data reduction. 
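
The size of the input reduction can be estimated by counting the adjustable parameters of the three configurations. The count below assumes fully connected layers with one bias per neuron, which the text does not state explicitly; the feature counts are those of Table 1.

```python
def mlp_weights(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a fully connected one-hidden-layer MLP."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


# Number of selected features per class, as listed in Table 1.
selected = [116, 91, 44, 114, 114, 110, 100, 114, 103, 87]

experiment_1 = mlp_weights(1024, 50, 10)                       # direct multi-class
experiment_2 = 10 * mlp_weights(1024, 36, 1)                   # OAA, full feature space
experiment_3 = sum(mlp_weights(n, 36, 1) for n in selected)    # OAA, feature selection
```

Under these assumptions the three configurations hold 51760, 369370 and 36478 parameters respectively, so the ensemble with feature selection is about ten times smaller than the ensemble without it.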



FUTURE TRENDS 

Future work will mainly focus on extending
the system to cope with other kinds of signs (regulatory,
warning, indication, etc.), i.e., with a larger number of
classes. This task will allow us to investigate and de- 




Figure 2. Classification process 
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f = argmax(f_i), i = 1, ..., k
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Table 1. Number of selected features; in the first column appears the label of each class 



Class   Prohibition road sign          Number of selected features
C1      no pedestrians                 116
C2      no (left, right) turn ahead    91
C3      no stopping and no parking     44
C4      no passing                     114
C5      60 km/h speed limit            114
C6      50 km/h speed limit            110
C7      40 km/h speed limit            100
C8      30 km/h speed limit            114
C9      20 km/h speed limit            103
C10     100 km/h speed limit           87



Table 2. Summary of estimate classification accuracy (percentage) 



Experiment   C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    Global
1            100    98.0   100    100    96.0   100    90.0   100    96.0   100    97.8
2            100    100    100    100    96.0   100    96.0   100    98.0   100    99.0
3            100    100    98.0   100    92.0   96.0   88.0   100    98.0   98.0   97.0



velop new procedures that will contribute to the design 
of a more versatile system. 

In the design of this new system, other multi-class
approaches such as One Against Higher Order Modeling
(Lu & Ito, 1999) and Error-Correcting Output Codes
(Dietterich & Bakiri, 1995) will be analyzed.



CONCLUSION 

In this work, an architecture for traffic sign classifica- 
tion has been described. The software implementation 
shows very high recognition rates. For this reason this 
architecture can be considered as a good solution for 
the traffic sign classification problem. 

Moreover, the features of this architecture make 
it possible to implement this system on FPGAs and, 
therefore, to use it in real-time applications. 
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KEY TERMS 

Artificial Neural Network: Structure composed
of a group of interconnected artificial neurons or units.
The objective of an NN is to transform the inputs into
meaningful outputs.

Correlation-based Feature Selection: Feature
selection algorithm whose heuristic measures the
correlation between attributes and rewards those feature
subsets in which each feature is highly correlated with
the class and uncorrelated with the other subset features.

Feature Selection: Process, commonly used
in machine learning, of identifying and removing as 
much of the irrelevant and redundant information as 
possible. 
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Feature Space: n-dimensional space where each 
example (pattern) is represented as a point. The dimen- 
sion of this space is equal to the number of features 
used to describe the patterns. 

Field Programmable Gate Array (FPGA): An FPGA
is an integrated circuit that can be programmed in the
field after manufacture.

K-Cross-Validation: Method to estimate the accuracy
of a classifier system. In this approach, the dataset
D is randomly split into K mutually exclusive subsets
(folds) of equal size (D_1, D_2, ..., D_K) and K classifiers
are built. The i-th classifier is trained on the union of
all D_j with j ≠ i and tested on D_i. The estimated accuracy is
the overall number of correct classifications divided
by the number of instances in the dataset.
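
A minimal sketch of this procedure is shown below. The callback `train_and_count_correct` is a hypothetical placeholder for training a classifier on the training indices and counting its correct predictions on the test indices; the fixed random seed is also a choice of the example.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Split range(n_examples) into k disjoint folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_examples, k, train_and_count_correct):
    """Estimate accuracy as total correct test predictions / n_examples.

    train_and_count_correct(train_idx, test_idx) is a hypothetical callback
    that trains on train_idx and returns the number of correct predictions
    on test_idx.
    """
    folds = k_fold_indices(n_examples, k)
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        correct += train_and_count_correct(train_idx, test_idx)
    return correct / n_examples


folds = k_fold_indices(5000, 10)
# A dummy "classifier" that is always right, just to exercise the loop.
accuracy = cross_validate(5000, 10, lambda train_idx, test_idx: len(test_idx))
```

With 5000 examples and K = 10, each classifier is trained on 4500 examples and tested on the remaining 500.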



Machine Learning: Field of computer science
focused on the design, analysis and implementation
of algorithms that learn from experience.

One Against All: Approach to solving multi-class
classification problems which creates one binary problem
for each of the K classes. The classifier for class
i is trained to distinguish examples in class i from all
other examples.

Weka: Collection of machine learning algorithms 
for solving data mining problems implemented in Java 
and open sourced under the GPL. 
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INTRODUCTION 

Unsolicited commercial email, also known as spam,
is becoming a serious problem for Internet users and
providers (Fawcett, 2003). Several researchers have applied
machine learning techniques in order to improve
the detection of spam messages. Naive Bayes models
are the most popular (Androutsopoulos, 2000) but
other authors have applied Support Vector Machines
(SVM) (Drucker, 1999), boosting and decision trees
(Carreras, 2001) with remarkable results. The SVM has
proved particularly attractive in this application because
it is robust against noise and is able to handle a
large number of features (Vapnik, 1998).

Errors in anti-spam email filtering are strongly asymmetric:
false positive errors, that is, valid messages
that are blocked, are prohibitively expensive. Several
authors have proposed new versions of the original
SVM algorithm that help to reduce the false positive
errors (Kolz, 2001, Valentini, 2004 & Kittler, 1998).
In particular, it has been suggested that combining
non-optimal classifiers can help to reduce the
variance of the predictor (Valentini, 2004 & Kittler,
1998) and consequently the misclassification errors.
In order to achieve this goal, different versions of the
classifier are usually built by sampling the patterns
or the features (Breiman, 1996). However, in our application
it is expected that the aggregation of strong
classifiers will help to reduce the false positive
errors further (Provost, 2001 & Hershop, 2005).

In this paper, we address the problem of reducing
the false positive errors by combining classifiers based
on multiple dissimilarities. To this end, a diverse set
of classifiers is built considering dissimilarities that
reflect different features of the data.

The dissimilarities are first embedded into a
Euclidean space where an SVM is adjusted for each
measure. Next, the classifiers are aggregated using a
voting strategy (Kittler, 1998). The proposed method
has been applied to the Spam UCI machine learning
database (Hastie, 2001) with remarkable results.



THE PROBLEM OF DISSIMILARITIES 
REVISITED 

An important step in the design of a classifier is the
choice of a proper dissimilarity that reflects the
proximities among the objects. However, choosing a
good dissimilarity for the problem at hand is not an
easy task. Each measure reflects different features of
the dataset and no dissimilarity outperforms the others
in a wide range of problems. In this section, we briefly
comment on the main differences among several
dissimilarities that can be applied to model the
proximities among emails. For a deeper description and
definitions see for instance (Cox, 2001).

The Euclidean distance evaluates whether the features
that codify the spam differ significantly among the
messages. This measure is sensitive to the size of the
emails. The cosine dissimilarity reflects the angle
between the spam messages. Its value is independent of
the message length, and it differs significantly from
the Euclidean distance when the data are not normalized.
The correlation measure checks whether the features
that codify the spam change in the same way in different
emails. Correlation-based measures tend to group
together samples whose features are linearly related.
The correlation differs significantly from the cosine
if the means of the vectors that represent the emails
are not zero. This measure is distorted by outliers. The
Spearman rank correlation avoids this problem by
computing a correlation between the ranks of the
features. Another kind of correlation measure that helps
to overcome the problem of outliers is the Kendall-τ
index, which is related to the Mutual Information
probabilistic measure.

When the emails are codified in high dimensional
and noisy spaces, the dissimilarities mentioned above
are affected by the 'curse of dimensionality'
(Aggarwal, 2001 & Martin-Merino, 2004). Hence, most
of the dissimilarities become almost constant and the
differences among dissimilarities are lost (Hinneburg,
2000 & Martin-Merino, 2005). This problem can be
avoided by selecting a small number of features before
the dissimilarities are computed.
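
The contrast between these measures can be made concrete with a short sketch. The definitions used here (one minus the cosine, one minus the Pearson correlation, and the correlation of ranks) are common conventions and are an assumption of this example; the source refers to (Cox, 2001) for the exact definitions. The toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_dissimilarity(u, v):
    """1 - cos(angle between u and v); unaffected by rescaling either vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def correlation_dissimilarity(u, v):
    """1 - Pearson correlation: the cosine dissimilarity of the centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_dissimilarity([a - mu for a in u], [b - mv for b in v])


def ranks(u):
    """Rank of each entry (1 = smallest); ties are not handled in this sketch."""
    order = sorted(range(len(u)), key=lambda i: u[i])
    r = [0] * len(u)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman_dissimilarity(u, v):
    return correlation_dissimilarity(ranks(u), ranks(v))


short = [1.0, 2.0, 0.0, 4.0]
long_ = [10.0, 20.0, 0.0, 40.0]      # same profile, ten times "longer"
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
outlier = [1.0, 2.0, 3.0, 4.0, 100.0]
```

For `short` and `long_` the Euclidean distance is large while the cosine dissimilarity is essentially zero, which mirrors the remark about message length. For `clean` and `outlier` the correlation dissimilarity is clearly positive, whereas the rank-based Spearman version stays at zero.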



COMBINING DISSIMILARITY BASED 
CLASSIFIERS 

In this section, we explain how the SVM can be
extended to work directly from a dissimilarity measure.
Next, the ensemble of classifiers based on multiple
dissimilarities is presented. Finally, we briefly
comment on related work.

The SVM is a powerful machine learning technique
that is able to deal with high dimensional and noisy
data (Vapnik, 1998). In spite of this, the original SVM
algorithm is not able to work directly from a
dissimilarity matrix. To overcome this problem, we
follow the approach of (Pekalska, 2001). First, the
dissimilarities are embedded into a Euclidean space
such that the inter-pattern distances approximately
reflect the original dissimilarity matrix. Next, the
test points are embedded via a linear algebra operation
and finally the SVM is trained and evaluated. We briefly
comment on the mathematical details.

Let $D \in \mathbb{R}^{n \times n}$ be the dissimilarity matrix made up of
the object proximities for the training set. A configuration
in a low dimensional Euclidean space can be found via a
metric multidimensional scaling algorithm (MDS)
(Cox, 2001) such that the original dissimilarities are
approximately preserved. Let $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$
be the matrix of the object coordinates for the training
patterns. Define $B = X X^T$ as the matrix of inner
products, which is related to the dissimilarity matrix
via the following equation:

$$ B = -\frac{1}{2} J D^{(2)} J \tag{1} $$

where $J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \in \mathbb{R}^{n \times n}$ is the centering matrix, $I$ is
the identity matrix and $D^{(2)} = (\delta_{ij}^2)$ is the matrix of the
squared dissimilarities for the training patterns. If $B$ is
positive semi-definite, the object coordinates in the low
dimensional Euclidean space $\mathbb{R}^k$ can be found through
a singular value decomposition (Golub, 1996):

$$ X_k = V_k \Lambda_k^{1/2} \tag{2} $$

where $V_k \in \mathbb{R}^{n \times k}$ is an orthogonal matrix whose columns are
the first k eigenvectors of $X X^T$ and $\Lambda_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k) \in \mathbb{R}^{k \times k}$
is a diagonal matrix with $\lambda_i$ the i-th eigenvalue.
Several dissimilarities introduced in section 2 generate
inner product matrices $B$ that are not positive semi-definite.
Fortunately, the negative eigenvalues are small in our
application and therefore can be neglected without
losing relevant information about the data (Pekalska,
2001).

Once the training patterns have been embedded into
a low dimensional Euclidean space, the test patterns can
be added to this space via a linear projection (Pekalska,
2001). Next we briefly comment on the derivation.

Let $X_k \in \mathbb{R}^{n \times k}$ be the object configuration for the
training patterns in $\mathbb{R}^k$ and $X_n = [x_1, \ldots, x_s]^T \in \mathbb{R}^{s \times k}$ the
matrix of the object coordinates sought for the test
patterns. Let $D_n^{(2)} \in \mathbb{R}^{s \times n}$ be the matrix of the squared
dissimilarities between the s test patterns and the n
training patterns that have already been projected. The
matrix $B_n \in \mathbb{R}^{s \times n}$ of inner products among the test and
training patterns can be found as:

$$ B_n = -\frac{1}{2} \left( D_n^{(2)} J - U D^{(2)} J \right) \tag{3} $$

where $J \in \mathbb{R}^{n \times n}$ is the centering matrix and
$U = \frac{1}{n} \mathbf{1}_s \mathbf{1}_n^T \in \mathbb{R}^{s \times n}$ is the matrix whose entries all equal $1/n$.
The derivation of equation (3) is detailed in
(Pekalska, 2001). Since the matrix of inner products
verifies

$$ B_n = X_n X_k^T \tag{4} $$

then $X_n$ can be found as the least mean-square error
solution to (4), that is:

$$ X_n = B_n X_k (X_k^T X_k)^{-1} \tag{5} $$

Given that $X_k^T X_k = \Lambda_k$ and considering that
$X_k = V_k \Lambda_k^{1/2}$, the coordinates for the test points can be
obtained as:

$$ X_n = B_n V_k \Lambda_k^{-1/2} \tag{6} $$

which can be easily evaluated through simple linear
algebraic operations.
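
Equations (1), (2), (3) and (6) can be checked numerically on a toy case. The sketch below assumes one-dimensional points with ordinary distances as dissimilarities, so that k = 1 and a plain power iteration (a simplification chosen here to avoid external libraries, not the decomposition used by the authors) is enough to obtain the leading eigenpair. All function names are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def centering(n):
    """J = I - (1/n) 1 1^T."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]


def leading_eigenpair(B, iters=200):
    """Power iteration for the largest eigenvalue/eigenvector of a symmetric matrix."""
    n = len(B)
    v = [float(i + 1) for i in range(n)]          # arbitrary start vector
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v


def embed_1d(train, test):
    n, s = len(train), len(test)
    D2 = [[(a - b) ** 2 for b in train] for a in train]    # D^(2), n x n
    Dn2 = [[(t - b) ** 2 for b in train] for t in test]    # D_n^(2), s x n
    J = centering(n)
    B = [[-0.5 * x for x in row] for row in matmul(matmul(J, D2), J)]   # eq. (1)
    lam, v = leading_eigenpair(B)
    X_k = [math.sqrt(lam) * vi for vi in v]                # eq. (2) with k = 1
    U = [[1.0 / n] * n for _ in range(s)]
    DnJ, UDJ = matmul(Dn2, J), matmul(U, matmul(D2, J))
    B_n = [[-0.5 * (DnJ[i][j] - UDJ[i][j]) for j in range(n)]
           for i in range(s)]                              # eq. (3)
    X_n = [sum(B_n[i][j] * v[j] for j in range(n)) / math.sqrt(lam)
           for i in range(s)]                              # eq. (6)
    return X_k, X_n


X_k, X_n = embed_1d(train=[0.0, 1.0, 3.0, 6.0], test=[2.0, 5.0])
```

Because the toy data are exactly one-dimensional, the embedded coordinates reproduce the original distances up to a shift and a possible sign flip, both among training points and between test and training points.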
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Figure 1. Aggregation of classifiers using a voting 
strategy. Bold patterns are misclassified by a single
hyperplane but not by the combination.


The combination strategy proposed here is based on
the evidence that different dissimilarities reflect
different features of the dataset (see section 2).
Therefore, classifiers based on different measures will
misclassify different sets of patterns.

Figure 1 shows, for instance, that the bold patterns
are assigned to the wrong class by only one classifier,
but with a voting strategy they are assigned to the
right class.

Hence, our combination algorithm proceeds as follows.
First, the dissimilarities introduced in section 2
are computed. Each dissimilarity is embedded into a
Euclidean space; training and test pattern coordinates
are obtained using equations (2) and (6) respectively.
To increase the diversity of classifiers, once the
dissimilarities are embedded a bootstrap sample of the
patterns is drawn. Next, we train an SVM for each
dissimilarity and bootstrap sample. Thus, it is expected
that misclassification errors will change from one
classifier to another, so the combination of classifiers
by a voting strategy will help to reduce the
misclassification errors.
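
The resampling and voting steps of this procedure might look as follows. This is only a sketch: the classifiers are represented by arbitrary callables standing in for the trained SVMs, and the function names and the unweighted majority vote are assumptions of this example.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n pattern indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one pattern.

    Ties go to the label that was encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, pattern_views):
    """classifiers[i] is a predict callable trained on the i-th embedding;
    pattern_views[i] is the same test pattern embedded with dissimilarity i."""
    votes = [clf(view) for clf, view in zip(classifiers, pattern_views)]
    return majority_vote(votes)
```

Each classifier sees the test pattern through its own embedding, so a pattern misclassified under one dissimilarity can still be outvoted by the others.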

A related technique to combine classifiers is
Bagging (Breiman, 1996 & Bauer, 1999). This method
generates a diversity of classifiers that are trained
using several bootstrap samples. Next, the classifiers
are aggregated using a voting strategy. Nevertheless,
there are three important differences between bagging
and the method proposed in this section.

First, our method generates the diversity of
classifiers by considering different dissimilarities and
thus induces a stronger diversity among classifiers.
Second, our method is able to work directly with a
dissimilarity matrix. Finally, the combination of
several dissimilarities avoids the problem of choosing
a particular dissimilarity for the application at hand,
which is a difficult and time-consuming task.

Notice that the algorithm proposed above can be
easily applied to other distance-based classifiers such
as the k-nearest neighbor algorithm.



EXPERIMENTAL RESULTS 

In this section, the ensemble of classifiers proposed is 
applied to the identification of spam messages. 

The spam collection considered is available from
the UCI Machine Learning database (Hastie, 2001). The
corpus is made up of 4601 emails, of which 39.4% are
spam and 60.6% are legitimate messages. The emails are
codified with 57 features, which are described in
(Hastie, 2001).




Figure 2. Eigenvalues for the multidimensional scaling 
algorithm with the cosine dissimilarity 



The dissimilarities have been computed without
normalizing the variables because this may increase 
the correlation among them. Once the dissimilarities 
have been embedded in a Euclidean space, the variables 
are normalized to unit variance and zero mean. This 
preprocessing improves the SVM accuracy and the 
speed of convergence. 

Regarding the ensemble of classifiers, an important
issue is the dimensionality in which the dissimilarity
matrix is embedded. To determine it, a metric
Multidimensional Scaling algorithm is first run. The
number of eigenvectors considered is determined by the
curve induced by the eigenvalues. For the dataset
considered, Figure 2 shows that the first twenty
eigenvalues preserve the main structure of the dataset.

The combination strategy proposed in this paper 
has been also applied to the k-nearest neighbor clas- 
sifier. An important parameter in this algorithm is the 
number of neighbors which has been estimated using 
20 % of the patterns as a validation set. 

The classifiers have been evaluated from two
different points of view. On the one hand, we have
computed the misclassification errors. On the other
hand, because false positive errors are very expensive
in our application and should be avoided, false positive
errors are also computed.
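
Both figures can be computed from the true and predicted labels, as in the sketch below. It assumes that the two quantities are expressed as fractions of all test messages; the source does not state which denominator was used for the false positive figure, so this is an assumption, as are the function name and the label strings.

```python
def error_rates(y_true, y_pred, positive="spam"):
    """Return (misclassification rate, false positive rate).

    A false positive is a legitimate message predicted as spam. Both rates
    are fractions of all test messages (an assumption of this sketch).
    """
    n = len(y_true)
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    false_pos = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return errors / n, false_pos / n
```

For example, with true labels `["ham", "ham", "spam", "spam", "ham"]` and predictions `["spam", "ham", "ham", "spam", "ham"]`, two of five messages are misclassified and one of them is a blocked legitimate message.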



Table 1. Experimental results for the ensemble of SVM classifiers. Classifiers based solely on a single dissimilarity and Bagging have been taken as reference

                     Linear Kernel             Polynomial Kernel
Method               Error    False positive   Error    False positive
Euclidean            8.1%     4.0%             15%      11%
Cosine               19.1%    15.3%            30.4%    8%
Correlation          18.7%    9.8%             31%      7.8%
Manhattan            12.6%    6.3%             19.2%    7.1%
Kendall-τ            6.5%     3.1%             11.1%    5.4%
Spearman             6.6%     3.1%             11.1%    5.4%
Bagging Euclidean    7.3%     3.0%             14.3%    4%
Combination          6.1%     3%               11.1%    1.8%

Parameters: Linear kernel: C=0.1, m=20; Polynomial kernel: Degree=2, C=5, m=20



Table 2. Experimental results for the ensemble of k-NN classifiers. Classifiers based solely on a single dissimilarity and Bagging have been taken as reference

Method         Error    False positive
Euclidean      22.5%    9.3%
Cosine         23.3%    14.0%
Correlation    23.2%    14.0%
Manhattan      23.2%    12.2%
Kendall-τ      21.7%    6%
Bagging        19.1%    11.6%
Combination    11.5%    5.5%

Parameters: k=2

Finally, the errors have been evaluated considering a
subset of 20% of the patterns drawn randomly without
replacement from the original dataset.

Table 1 shows the experimental results for the
ensemble of classifiers using the SVM. The proposed
method has been compared with bagging, introduced
in section 3, and with classifiers based on a single
dissimilarity.

From the analysis of Table 1 the following
conclusions can be drawn:

• The combination strategy significantly improves on
the Euclidean distance, which is the measure usually
considered by most SVM algorithms.
• The combination strategy with the polynomial kernel
significantly reduces the false positive errors of the
best single classifier. The improvement is smaller
for the linear kernel. This can be explained because
the non-linear kernel allows us to build classifiers
with larger variance, and therefore the combination
strategy can achieve a larger improvement of the
false positive errors. We also report that for the
combination strategy, as the C parameter increases
the false positive errors converge to
although the false negative errors increase.
• The proposed combination strategy outperforms
a widely used aggregation method such as Bagging.
The improvement is particularly important
for the polynomial kernel.

Table 2 shows the experimental results for the
ensemble of k-NN classifiers. As in the previous case,
the proposed combination strategy particularly improves
the false positive errors of classifiers based on
a single distance. We also report that Bagging is not
able to reduce the false positive errors of the Euclidean
distance. Besides, our combination strategy significantly
improves on the Bagging algorithm. Finally, we observe
that the misclassification errors are larger for k-NN
than for the SVM. This can be explained because the
SVM has a higher generalization ability.



CONCLUSIONS AND FUTURE 
RESEARCH TRENDS 

In this paper, we have proposed an ensemble of
classifiers based on a diversity of dissimilarities. Our
approach aims particularly to reduce the false positive
errors of classifiers based solely on a single distance.
Besides, the algorithm is able to work directly from a
dissimilarity matrix. The algorithm has been applied
to the identification of spam messages.

The experimental results suggest that the proposed
method helps to improve both misclassification errors
and false positive errors. We also report that our
algorithm outperforms classifiers based on a single
dissimilarity and other combination strategies such
as bagging.

As future research trends, we will try to apply other
combination strategies that assign a different weight to
each classifier.
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KEY TERMS 

Bootstrap: Resampling technique based on several
random samples drawn with replacement.

Dissimilarity: A measure of proximity that need
not obey the triangle inequality.

Kernel: Non-linear transformation to a high
dimensional feature space.

K-NN: K-Nearest Neighbor algorithm for
classification purposes.

MDS: Multidimensional Scaling Algorithm applied
for the visualization of high dimensional data.

SVD: Singular Value Decomposition. Linear 
algebra operation that is used by many optimization 
algorithms. 

SVM: Support Vector Machines classifier. 

UCE: Unsolicited Commercial Email, also known 
as Spam. 
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INTRODUCTION 

Evolutionary algorithms are well-known optimization
techniques suitable for solving various kinds
of problems (Ruano, 2005). A new application of
evolutionary algorithms is their use in detecting
biased control loop functions caused by controlled
variable sensor discredibility (Klimanek, Sulc, 2005).
Sensor discredibility occurs when a sensor transmitting
values of the controlled variable provides inexact
information, although the information is not yet
completely faulty. The use of discredible sensors in
control circuits may cause the real values of controlled
variables to exceed the range of tolerated differences
while zero control error is displayed. However,
this is not the only negative consequence. Sometimes,
sensor discredibility is accompanied by undesirable
and hardly recognizable side effects. The most typical
is an increase in harmful emissions in the case of
combustion control (Sulc, Klimanek, 2005).

We have found that evolutionary algorithms are
useful tools for solving the particular problem of
finding a software-based way (so-called software
redundancy) of detecting sensor discredibility. Software
redundancy is more economical than the usual
hardware redundancy, which is otherwise necessary
to protect the control loop against the occurrence of
this small, invisible control error.

Namely, the standard genetic algorithm and the
simulated annealing algorithm have been successfully
applied and tested to minimize the given cost function;
by means of these algorithms, the newly developed
method is able to detect controlled variable sensor
discredibility. When applied to combustion processes,
production of harmful emissions can be kept within
accepted limits.

The application of evolutionary algorithms described
here, including the transfer of terminology to this
application area, can serve as an explanatory case study
that helps readers to better understand how evolutionary
algorithms operate.



BACKGROUND 

The above-mentioned controlled variable sensor
discredibility detection represents a specific part of the
fault detection field in control engineering. According to
some authors (Venkatasubramanian, Rengaswamy, 2003,
Korbic, 2004), fault detection methods are classified
into three general categories: quantitative model-based
methods, qualitative model-based methods, and process
history based methods. In contrast to these approaches,
where a priori knowledge about the process is needed,
controlled variable sensor discredibility detection can
usefully employ evolutionary algorithms. The main
advantage of such a solution is that the necessary
information about changes in the controlled variable
sensor properties can be obtained with the help of
evolutionary algorithms from the standard process data,
which are in any case acquired and recorded for the
sake of process control.

In order to apply evolutionary algorithms to
controlled variable sensor discredibility detection, a cost
function was designed as a residual function e defined
by the absolute value of the difference between the sensor
model output ($y_m$) and the real sensor output ($y_{real}$),

$$ e = |y_{real} - y_m| \tag{1} $$

The design of the residual function e has been explained
in detail (e.g. in Sulc, Klimanek, 2005).

In most sensor models it is assumed that the sensor
output is proportional to only one input (Koushanfar,
2003), so that the sensor model equation is

$$ y_m = k_m x_{est} + q_m \tag{2} $$

where parameter $k_m$ represents the gain of the sensor
model, parameter $q_m$ expresses the shift factor, and $x_{est}$
is the estimated sensor model input, which represents
the physical (real) value of the control variable. The
physical value of the control variable is not available
to us because we expect that the sensor is not reliable
and we want to detect this stage. However, we can
estimate this value from other process data that are
usually acquired for the purposes of the information
system. This estimation is usually based on steady-state
data, so it is important to detect the steady state
of the process.

Basically, the underlying idea of applying the
evolutionary algorithm is to find the vector of sensor
model parameters for which the value of the
residual function e is minimal.
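
As a minimal sketch of this idea, the sensor model (2) and the residual (1) can be written as two small functions. Summing the residual over all recorded samples is an assumption made here so that a whole data record yields a single cost value; the function names are hypothetical.

```python
def sensor_model(x_est, k_m, q_m):
    """Linear sensor model, equation (2): y_m = k_m * x_est + q_m."""
    return k_m * x_est + q_m


def residual(params, x_est, y_real):
    """Residual e = |y_real - y_m| of equation (1), summed over the samples.

    params is the sensor model parameter vector (k_m, q_m); the summation
    over samples is an assumption of this sketch.
    """
    k_m, q_m = params
    return sum(abs(y - sensor_model(x, k_m, q_m)) for x, y in zip(x_est, y_real))
```

A parameter vector that matches the real sensor gives a residual close to zero, while a drifting gain shows up as a growing residual for the original parameters.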

Advantages of the Evolutionary 
Algorithm Applied to Discredibility 
Detection 

In principle, any optimization method could be used
for the mentioned optimization task. The problem is
that the sensor model input is an unknown, dynamically
changing variable. Therefore, the choice of the
parameters must include a certain element of random
selection from many alternatives, which evolutionary
algorithms provide. The higher computational time
requirements do not matter in the case of sensor
discredibility detection, because the loss of credibility
is the result of a gradual development.

Problem Statement

The particular task of the evolutionary algorithms in
the solved problem is to find an extreme of a given cost
function. We have used the evolutionary algorithms
to minimize the given cost function (in fault detection
terminology, a residual function). Based on this
minimization, it is possible to detect that the control
variable sensor is providing biased data.



THE STANDARD GENETIC ALGORITHM
AND THE SIMULATED ANNEALING
ALGORITHM IN DISCREDIBILITY
DETECTION

Both methods have been tested and proved to be
suitable for use. Unlike general presentations of the
methods, we present them in a transformed way, based
on the use of terms from the field of fault detection.
From the engineering viewpoint this should facilitate
understanding of both procedures (Klimanek, Sulc,
2005). In our text, the terms introduced in the theory
of evolutionary algorithms are indicated by the
abbreviation "ET".

The Standard Genetic Algorithm 

In controlled variable sensor discredibility detection
that uses genetic algorithm methods, the following
steps are required (procedure by Fleming & Purshouse,
1995) (Figure 1):

1. Initialization - during initialization, the evolutionary
   time is set to zero and an initial set of vectors
   containing the sensor model parameters (called
   population in ET) is randomly generated within an
   expected range of reasonable values for each of the
   parameters. For each of the parameter vectors of the
   sensor model (in ET, individuals of the population),
   the value of the residual function (1) is evaluated.
   Also, the average value of the residual function
   values is computed.

2. New set of parameter vectors - after starting the
   iteration process, a new set of parameter vectors
   is generated (in ET, new population) by employing
   the selection operator. The new parameter vectors
   are selected as follows: a parameter vector of the
   sensor model that provides a residual value lower
   than the average value is replicated into the next
   subset for generating new parameter candidates in
   more copies than in the original set, and the
   individuals with above-average residual values are
   rejected (Witczak, Obuchowicz, Korbicz, 2002).

3. Crossover and mutation (ET) - next comes the
   crossover operation over randomly selected pairs of
   parameter vectors of the sensor model from the
   current set. In the presented application of the
   standard genetic algorithm, the selected sensor
   model parameters are coded into binary strings
   and the standard one-point crossover operator
   is used. The mutation operator mimics random
   mutations (Fleming, Purshouse, 1995). The newly
   created parameters are coded into binary strings
   and one bit of each string is switched with random
   probability. The value of the residual function
   (1) for the current run is evaluated. Also, the
   average value of the residual function values is
   computed.

4. Stopping-criterion decision - if the stopping
   criterion is not met, a return to step 2 repeats the
   process. The stopping criterion is met, e.g., when
   the difference between the average of the residual
   values from the current run and the average of the
   residual values from the previous run is lower than
   the given size (Fleming, Purshouse, 2002).

Figure 1. A flow chart of the standard genetic algorithm
applied for discredibility detection
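
The four steps above can be sketched in code. This is an illustration under stated assumptions, not the authors' implementation: the bit length, the parameter range, the population size, the mutation probability and the fixed number of generations (used instead of the average-residual stopping criterion of step 4) are all hypothetical choices, and the sketch additionally remembers the best vector seen so far.

```python
import random

BITS = 16                # bits per parameter (assumed)
LOW, HIGH = 0.0, 2.0     # expected range of both parameters (assumed)


def decode(bits):
    """Binary string -> sensor model parameters (k_m, q_m) within [LOW, HIGH]."""
    def to_real(chunk):
        return LOW + (HIGH - LOW) * int(chunk, 2) / (2 ** BITS - 1)
    return to_real(bits[:BITS]), to_real(bits[BITS:])


def residual(bits, x_est, y_real):
    k_m, q_m = decode(bits)
    return sum(abs(y - (k_m * x + q_m)) for x, y in zip(x_est, y_real))


def genetic_search(x_est, y_real, pop_size=40, generations=60, p_mut=0.3, seed=1):
    rng = random.Random(seed)
    # Step 1: random initial population of binary-coded parameter vectors.
    pop = ["".join(rng.choice("01") for _ in range(2 * BITS)) for _ in range(pop_size)]
    best = min(pop, key=lambda b: residual(b, x_est, y_real))
    for _ in range(generations):
        # Step 2: selection - vectors with above-average residual are rejected.
        res = [residual(b, x_est, y_real) for b in pop]
        avg = sum(res) / len(res)
        parents = [b for b, r in zip(pop, res) if r <= avg]
        # Step 3: one-point crossover on random pairs, then single-bit mutation.
        children = []
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            cut = rng.randrange(1, 2 * BITS)
            child = a[:cut] + b[cut:]
            if rng.random() < p_mut:
                i = rng.randrange(2 * BITS)
                child = child[:i] + ("1" if child[i] == "0" else "0") + child[i + 1:]
            children.append(child)
        pop = children
        best = min(pop + [best], key=lambda b: residual(b, x_est, y_real))
    # Step 4 is simplified here to a fixed number of generations.
    return decode(best), residual(best, x_est, y_real)
```

Calling `genetic_search(x, y)` on recorded input estimates `x` and sensor readings `y` returns the decoded parameter pair and its residual; a drift of the returned gain over successive evaluation periods is what would indicate discredibility.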



Simulated Annealing Algorithm

Controlled variable sensor discredibility detection via
simulated annealing can be described by the following
steps (Figure 2):

1. Initialization - an initial control parameter is set
   (in ET, initial annealing temperature). The control
   parameter is used to evaluate the Boltzmann criterion
   (King, 1999), which affects the acceptance of
   the current parameter vector of the sensor model
   in step 3. A vector of random values of the sensor
   model parameters is selected and the value of the
   residual function is computed.

2. New set of parameter vectors - the iteration index
   is increased and, using a stochastic strategy, a new
   vector of the sensor model parameters is randomly
   generated (in ET, generating new individuals) and
   the corresponding value of the residual variable is
   obtained (in ET, value of the cost function).

3. Boltzmann criterion computation - the difference
   between the residual value obtained in step 2 and
   the residual value from the previous iteration is
   evaluated. If the difference is negative, the new
   parameter vector is accepted automatically.
   Otherwise, the algorithm may accept the new
   parameter vector based on the Boltzmann criterion.
   The control parameter is weighted with a coefficient
   X (in ET, gradual temperature reduction).
   If the control parameter is less than or equal to
   the given final control parameter, the stop
   criterion is met and the current vector of the sensor
   model parameters is accepted. Otherwise, returning
   to step 2 repeats the process of optimizing
   search.

Figure 2. A flow chart of the simulated annealing
algorithm applied for discredibility detection

Comparison of Usability of the
Algorithms for Discredibility Detection

The comparison of both evolutionary algorithms applied
to controlled variable sensor discredibility detection
is shown in Figure 5. This comparison represents a
part of the results from the paper by Sulc, Klimanek (2006).
It is evident that the simulated annealing algorithm
needs more evaluation time for one evaluation period:
a period for simulated annealing required 80 iterations,
while the genetic algorithm needed 40 iterations.
This difference arises because the genetic algorithm
works with a group of potential solutions, while
simulated annealing compares only two potential
solutions and accepts the better one.

No difference was found between the two evolutionary
algorithms used here; their good convergence
depends mainly on the algorithm settings. Although
evolutionary algorithms are generally much more
time consuming than other optimizing procedures, this
consideration does not matter in control variable sensor
discredibility detection. This is because control variable
sensor discredibility has no conclusive impact on the
control results and the time needed for the detection
does not affect the control process.
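
For reference, the simulated annealing steps listed earlier in this section might be sketched as follows. The geometric cooling factor, the step size of the random perturbation, the temperature limits and the function signature are assumptions of this example rather than settings reported by the authors, and the sketch also remembers the best vector visited.

```python
import math
import random


def simulated_annealing(residual, start, t_init=1.0, t_final=1e-3,
                        cooling=0.95, step=0.1, seed=0):
    """Minimise residual(params) starting from the parameter vector `start`."""
    rng = random.Random(seed)
    # Step 1: initial control parameter ("temperature") and initial vector.
    t = t_init
    current, e_cur = list(start), residual(start)
    best, e_best = list(current), e_cur
    while t > t_final:
        # Step 2: randomly generated neighbour of the current parameter vector.
        cand = [p + rng.uniform(-step, step) for p in current]
        e_new = residual(cand)
        # Step 3: Boltzmann criterion - improvements are always accepted,
        # worse vectors only with probability exp(-delta / t).
        delta = e_new - e_cur
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, e_cur = cand, e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
        t *= cooling          # gradual reduction of the control parameter
    return best, e_best
```

Only two candidate vectors are compared per iteration, which is consistent with the remark above that simulated annealing needs more iterations per evaluation period than the population-based genetic algorithm.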

Testing Model-Based Sensor 
Discredibility Detection Method 

The model-based control variable sensor discredibility
detection method using evolutionary algorithms was
tested to find whether the method is able to detect
changes in the control variable sensor properties via
the presented evolutionary algorithms. The simulation
experiments are described in more detail in (Klimanek,
Sulc 2006a, Klimanek, Sulc 2006b). Results from the
simulated experiments are summarized and shown
graphically in the following section.



Figure 3. Detection of gradual changes of the level sensor gain via genetic algorithm and the simulated anneal- 
ing algorithm 
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Results and Findings from the Tests 

Figure 4 depicts a simulation run during which the
sensor gain was gradually decreased from a starting
(correct) value. It can be seen that after the sensor
properties have changed, the measured value of
the controlled variable (in this case the water level)
differs from the correct value.

It is apparent that the algorithm used for sensor model
parameter detection (in this case the genetic algorithm)
is able to find the sensor model gain $k_m$, because the
development of the sensor model parameter corresponds
to the simulated change of the real sensor parameter.




Figure 4. Detection of gradual changes of the level sensor gain via genetic algorithm

The level sensor discredibility detection results
obtained using the simulated annealing algorithm
were similar. Figure 5 shows results obtained when
the model-based method using the simulated annealing
algorithm was tested. A step change of sensor gain was
simulated and it is obvious that the algorithm was able
to capture the change.

By this method the operator is informed about the
estimated time remaining before sensor discredibility
occurs. If the time is critical, the operator also receives
a timely warning about the situation.



Figure 5. Detection of the step change of the level sensor gain via simulated annealing method 
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FUTURE TRENDS 

It can be expected that in the future tools for
controlled variable discredibility detection will be a
standard accessory of any standard PID controller. At
present, for economic reasons, the cases in which
implementation of the described method using
evolutionary algorithms is mostly justified are linked to
the revelation of undesirable side effects (increased
production of harmful emissions CO, NOx, or, in the
case of bioenergetic processes, an increase in unwanted
production of CO2). Besides its use for standard
detection of changes in the controlled variable sensor,
discredibility detection makes it possible to warn the
operator about the occurrence of control loop
inaccuracy, or even to forecast it.



CONCLUSION 

Implementation feasibility of the evolutionary
algorithms in model-based controlled variable sensor
discredibility detection has been demonstrated here. The
designed procedure of sensor discredibility detection
was presented in two variations of the standard
evolutionary algorithms: the genetic algorithm and the
simulated annealing algorithm.

In both cases, the time needed for the evaluation
was several minutes. In the application to discredibility
detection, this time demand does not matter because
such small malfunctions do not lead to fatal errors in
control loop operation and discredibility usually
develops over a long time.

Evolutionary algorithms have become a useful tool
for discovering hidden inaccuracy in control loops.
Discredibility detection saves the costs of redundant
controlled variable sensors, which are required if
controlled variable sensor discredibility is detected via
hardware redundancy, assuming, of course, that the
costs of additional sensors are not negligible.

The importance of discredibility detection using
evolutionary algorithms can be seen, e.g., in biomass
combustion processes (due to the penalties for
exceeding limits on harmful emissions), and also in the
food-processing industry, where side effects may not
be harmful, but rather unpleasant (e.g. bad odors).
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KEY TERMS 

The following terms and definitions introduce a fault 
detection engineering interpretation of the terms usual 
in the evolutionary algorithm vocabulary. This should 
facilitate orientation in the presented engineering 
problem. 

Chromosome: A particular sensor model parameter
vector (a term for individuals used in evolutionary
terminology).

Cost Function: A criterion evaluating the level of
congruence between the sensor model output and the
real sensor output. In fault detection terminology,
the cost function corresponds to the term residuum (or
residual function).

Evolutionary Time: The number assigned to each step
in the sequence of iterations performed during a search
for sensor model parameters based on evolutionary
development.

Individual: A vector of the sensor model parameters 
in a set of possible values (see population). 

Initial Annealing Temperature: An initial algo- 
rithm parameter. Annealing temperature is used as a 
measure of evolutionary progress during the simulated 
annealing algorithm run. 

Population: A set of the vectors of the sensor model
parameters with which the sensor model has a chance
to approach the minimum of the residual function.

Population Size: The number of sensor model
parameter vectors taken into consideration in the
population.

Sensor Discredibility: A stage of the controlled
variable sensor at which the sensor is not completely
out of function yet, but its properties have gradually
changed to the extent that the data provided by the
sensor are so biased that the tolerated inaccuracy of the
controlled variable is exceeded, which is usually linked
with possible side effects. 
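
The cost function defined above can be illustrated with a minimal sketch. The first-order lag sensor model, the parameter names `gain` and `time_constant`, and the sample data are illustrative assumptions, not the model used in the article; only the idea of a residual between model output and real sensor output comes from the definitions.

```python
def sensor_model(u, gain, time_constant, dt=1.0):
    """Hypothetical first-order lag sensor model; returns the simulated output."""
    y, out = 0.0, []
    for u_k in u:
        # Discrete first-order lag: y moves toward gain * u_k.
        y += dt / time_constant * (gain * u_k - y)
        out.append(y)
    return out

def cost(chromosome, u, y_measured):
    """Residual: sum of squared differences between model and real sensor output."""
    gain, time_constant = chromosome
    y_model = sensor_model(u, gain, time_constant)
    return sum((ym - yr) ** 2 for ym, yr in zip(y_model, y_measured))

u = [1.0] * 20
# Stand-in for measured data: generated here from known parameters.
y_real = sensor_model(u, gain=1.0, time_constant=4.0)
print(cost((1.0, 4.0), u, y_real))   # parameters that match the data
print(cost((0.9, 4.0), u, y_real))   # a chromosome with a drifted gain
```

Each chromosome is one candidate parameter vector; an evolutionary search would look for the chromosome with the smallest residual.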



INTRODUCTION

Artificial neural networks (ANNs) are computational 
models, loosely inspired by biological neural networks, 
consisting of interconnected groups of artificial neu- 
rons which process information using a connectionist 
approach. 

ANNs are widely applied to problems like pattern 
recognition, classification, and time series analysis. The 
success of an ANN application usually requires a high 
number of experiments. Moreover, several parameters 
of an ANN can affect the accuracy of solutions. A
particular type of evolving system, namely the
neuro-genetic system, has become a very important research
topic in ANN design. Such systems make up the so-called
Evolutionary Artificial Neural Networks (EANNs), i.e.,
biologically inspired computational models that use
evolutionary algorithms (EAs) in conjunction with ANNs.

Evolutionary algorithms and state-of-the-art design 
of EANN were introduced first in the milestone survey 
by Xin Yao (1999), and, more recently, by Abraham 
(2004), by Cantu-Paz and Kamath (2005), and then by 
Castellani (2006). 

The aim of this article is to present the main evolu- 
tionary techniques used to optimize the ANN design, 
providing a description of the topics related to neural 
network design and corresponding issues, and then, 
some of the most recent developments of EANNs found 
in the literature. Finally a brief summary is given, with 
a few concluding remarks. 



ARTIFICIAL NEURAL NETWORK 
DESIGN 

In ANN design, the successful application of an ANN
usually demands much experimentation. There are
many parameters to set. Some of them involve the ANN
type, others the number of layers and nodes defining
the architecture and the connection weights. The
training data are also an important factor, and a great deal
of attention must be paid to the test data to make sure
that the network will generalize correctly to data on
which it has not been trained.

Feature selection, structure design, and weight 
training can be regarded as three search problems in 
the discrete space of subsets of data attributes, the 
discrete space of the possible ANN configurations, 
and the continuous space of the ANN parameters, 
respectively. 

Architecture design is crucial in the successful ap- 
plication of ANNs because it has a significant impact on 
their information-processing capabilities. Indeed, given 
a learning task, an ANN with only a few connections 
and linear nodes may not be able to perform the task at 
all, while an ANN with a large number of connections 
and nonlinear nodes may overfit noise in the training 
data and lack generalization. The main problem is 
that there is no systematic way to design an optimal 
architecture for a given task automatically. 

Several methods have been proposed to overcome 
these shortcomings. This chapter focuses on one of 
them, namely EANNs. One distinct feature of EANNs 
is their adaptability to a dynamic environment. EANNs 
can be regarded as a general framework for adaptive 
systems, i.e., systems that can change their architec- 
tures and learning rules appropriately without human 
intervention. 

In order to improve the performance of EAs, different
selection schemes and genetic operators have been
proposed in the literature. This kind of evolutionary
learning for ANNs has also been introduced to reduce
and, if possible, avoid the main problem of traditional
gradient descent techniques, such as Backpropagation
(BP), namely trapping in local minima. EAs are
known to be relatively insensitive to initial training
conditions, since they are global optimization methods,
while a gradient descent algorithm can only find a local
optimum in a neighbourhood of the initial solution.
EANNs provide a solution to these problems and an 
alternative for controlling network complexity. 
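
The contrast between local and global search can be sketched on a toy one-dimensional error surface. The function, the starting point, and the simple mutation-plus-truncation scheme are assumptions chosen for illustration; they are not taken from any of the cited works.

```python
import random

def f(x):
    """Toy error surface: local minimum near x = 1.13, global minimum near x = -1.30."""
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def evolutionary_search(pop_size=40, generations=50, sigma=0.1, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(-2.0, 2.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Each parent produces one Gaussian-mutated child; the best pop_size survive.
        children = [x + rng.gauss(0.0, sigma) for x in pop]
        pop = sorted(pop + children, key=f)[:pop_size]
    return pop[0]

x_gd = gradient_descent(1.5)      # starts in the basin of the local minimum
x_ea = evolutionary_search()      # starts from a population spread over [-2, 2]
print(x_gd, f(x_gd))
print(x_ea, f(x_ea))
```

Gradient descent stays in the basin where it starts, whereas the population samples both basins and selection keeps the individuals from the better one.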

ANN design can be regarded as an optimization
problem. Tettamanzi and Tomassini (2001) presented a
discussion about evolutionary systems and their
interaction with neural and fuzzy systems, and Cantu-Paz and
Kamath (2005) described an empirical comparison
of EAs and ANNs for classification problems.



EVOLUTIONARY ARTIFICIAL NEURAL 
NETWORKS 

There are several approaches to evolving ANNs, which
usually fall into two broad categories: problem-independent
and problem-dependent representations of EAs. The
former are based on a general representation, independent
of the type and structure of the ANN sought, and
require the definition of an encoding scheme suitable
for Genetic Algorithms (GAs). They can include a
mapping between ANNs and a binary representation, taking
care of decoders or repair algorithms, but this task is
not usually easy.

The latter are EAs where chromosome representa- 
tion is a specific data structure that naturally maps to an 
ANN, to which appropriate genetic operators apply. 
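
As an illustration of a problem-independent representation, the following minimal sketch maps real-valued connection weights to a fixed-width binary string and back. The 8-bit resolution, the weight range [-1, 1], and the function names are assumptions made for this example.

```python
def encode(weights, bits=8, w_min=-1.0, w_max=1.0):
    """Map each real weight to a fixed-width binary gene and concatenate the genes."""
    levels = 2**bits - 1
    genome = ""
    for w in weights:
        w = min(max(w, w_min), w_max)            # clip to the representable range
        q = round((w - w_min) / (w_max - w_min) * levels)
        genome += format(q, f"0{bits}b")
    return genome

def decode(genome, bits=8, w_min=-1.0, w_max=1.0):
    """Decoder: split the bit string into genes and map each back to a real weight."""
    levels = 2**bits - 1
    return [w_min + int(genome[i:i + bits], 2) / levels * (w_max - w_min)
            for i in range(0, len(genome), bits)]

weights = [0.5, -0.25, 1.0]
genome = encode(weights)
print(genome)
print(decode(genome))
```

The decoded weights differ from the originals by at most half a quantization step, which is one reason real-valued encodings are often preferred for weight evolution.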

EAs are used to perform various tasks, such as connection
weight training, architecture design, learning
rule adaptation, input feature selection, connection
weight initialization, rule extraction from ANNs, etc.
The three most popular ones act at the following levels:

Connection weights: evolution at this level concentrates
just on weight optimization, assuming that the
architecture of the network is given. The evolution of
weights introduces an adaptive and global approach to
training, especially in the reinforcement learning
and recurrent network learning paradigms, where
gradient-based training algorithms often experience
great difficulties.

Learning rules: evolution at this level can be regarded
as a process of "learning how to learn" in ANNs, where
the adaptation of learning rules is achieved through
evolution. It can also be regarded as an adaptive
process of automatic discovery of novel learning
rules.

Architecture: evolution at this level enables ANNs to
adapt their topologies to different tasks without human
intervention. It also provides an approach to automatic
ANN design, as both weights and structures can be
evolved. In this case a further subdivision can be
made by defining a "pure" architecture evolution
and a simultaneous evolution of both architecture
and weights.

Other approaches consider the evolution of transfer 
functions of an ANN and input feature selection, but 
they are usually applied in conjunction with one of the 
three methods above in order to obtain better results. 

The use of evolutionary learning for ANN design
is no more than two decades old. However, substantial
work has been done in these years, whose main
outcomes are presented below.

Weight Optimization 

Evolution of weights may be regarded as an alterna- 
tive training algorithm. The primary motivation for 
using evolutionary techniques instead of traditional 
gradient-descent techniques such as BP, as reported 
by Rumelhart et al. (1986), lies in avoiding trapping 
in local minima and the requirement that the activa- 
tion function be differentiable. For this reason, rather 
than adapting weights based on local improvement 
only, EAs evolve weights based on the fitness of the 
whole network. 
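
A minimal sketch of weight evolution on a fixed architecture is given below. It assumes a 2-2-1 network with a non-differentiable threshold activation, the XOR patterns as training set, and a simple elitist scheme with Gaussian mutation only; all of these are illustrative choices rather than any of the cited methods.

```python
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
N_WEIGHTS = 9  # 2-2-1 network: 4 weights + 2 biases (hidden), 2 weights + 1 bias (output)

def step(z):
    """Threshold activation: not differentiable, so plain BP cannot use it."""
    return 1.0 if z > 0 else 0.0

def forward(w, x):
    h0 = step(w[0] * x[0] + w[1] * x[1] + w[2])
    h1 = step(w[3] * x[0] + w[4] * x[1] + w[5])
    return step(w[6] * h0 + w[7] * h1 + w[8])

def error(w):
    """Fitness of the whole network: number of misclassified patterns."""
    return sum(forward(w, x) != t for x, t in XOR)

def evolve(pop_size=40, generations=100, sigma=0.5, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(N_WEIGHTS)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=error)
        history.append(error(pop[0]))
        parents = pop[:pop_size // 2]            # truncation selection keeps the elite
        children = [[g + rng.gauss(0, sigma) for g in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    pop.sort(key=error)
    return pop[0], history + [error(pop[0])]

best, history = evolve()
print(error(best), history[:5])
```

Because the fitness only needs the network output, no gradient of the activation function is required, and the best error never increases from one generation to the next.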

Some approaches use GAs with real encodings
for biases and weights, as in the work presented by
Montana and Davis (1989); others, such as Whitley et
al. (1990), used binary weight encoding at first and then
implemented a modified version with real encodings.
Mordaunt and Zalzala (2002) implemented a real num- 
ber representation to evolve weights, analyzing evolu- 
tion with mutation and a multi-point crossover, while 
Seiffert (2001) described an approach to completely 
substitute a traditional gradient descent algorithm by 
a GA in the training phase. 
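
The operators mentioned above can be sketched for real-valued chromosomes as follows. This is a generic multi-point crossover and Gaussian mutation, with hypothetical function names and parameter values; it does not reproduce the exact operators of the cited papers.

```python
import random

def multipoint_crossover(a, b, n_points, rng):
    """Swap alternating segments of two equal-length parents at n_points cut positions."""
    cuts = sorted(rng.sample(range(1, len(a)), n_points))
    child1, child2, swap, prev = [], [], False, 0
    for cut in cuts + [len(a)]:
        src1, src2 = (b, a) if swap else (a, b)
        child1 += src1[prev:cut]
        child2 += src2[prev:cut]
        swap, prev = not swap, cut
    return child1, child2

def gaussian_mutation(chrom, rate, sigma, rng):
    """Perturb each gene with probability `rate` by zero-mean Gaussian noise."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g for g in chrom]

rng = random.Random(42)
p1 = [0.1 * i for i in range(8)]
p2 = [-0.1 * i for i in range(8)]
c1, c2 = multipoint_crossover(p1, p2, n_points=2, rng=rng)
print(c1)
print(c2)
print(gaussian_mutation(c1, rate=0.2, sigma=0.05, rng=rng))
```

With `n_points=2` this reduces to the familiar two-point crossover; each gene position of the two children always holds the two parental values for that position.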

During the application of GAs, problems such as
premature convergence and stagnation of the solution
often occur, as reported by Goldberg (1992). In
order to solve these problems, an improved algorithm
was proposed by Yang et al. (2002), where a genetic
algorithm based on an evolutionary stable strategy was
implemented to keep the balance between population
diversity and convergence speed during evolution.
Recently, a new GA was proposed by Pai (2004),
where a genetic inheritance operator was implemented
to determine the weights of an EANN, without considering
mutation operators, but only two-point crossover for
reproduction, applied to decimal chromosomes.

Learning Rule Optimization 

In supervised learning algorithms, standard BP is the 
most popular method for training multilayer networks. 
The design of training algorithms, in particular the 
learning rules used to adjust connection weights, 
depends on the type of ANN architecture considered. 
Several standard learning rules have been proposed, 
but designing an optimal learning rule becomes very 
difficult when there is little prior knowledge about the 
network topology, producing a very complex relation- 
ship between evolution and learning. The evolutionary 
approach becomes important in modelling the creative 
process since newly evolved learning rules can deal 
with a complex and dynamic environment. 

The first kind of optimization considers the adjustment
of learning parameters and can be seen as the
first attempt to evolve learning rules. These parameters
comprise BP parameters, like the learning rate and
momentum, and genetic parameters, like mutation and
crossover probabilities. For instance,
Merelo et al. (2002) presented several solutions
for the optimal learning parameters of multilayer com- 
petitive-learning neural networks. 
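
The evolution of learning parameters can be sketched with a chromosome that holds a learning rate and a momentum value. The toy quadratic error surface, the parameter ranges, and the selection scheme are assumptions for illustration only; a real application would evaluate each chromosome by training the ANN itself.

```python
import random

def train_loss(learning_rate, momentum, steps=50):
    """Loss after gradient descent with momentum on E(w) = 0.5 * (w1^2 + 25 * w2^2)."""
    w, v = [1.0, 1.0], [0.0, 0.0]
    curv = [1.0, 25.0]
    for _ in range(steps):
        for i in range(2):
            g = curv[i] * w[i]
            v[i] = momentum * v[i] - learning_rate * g   # momentum update
            w[i] += v[i]
    return 0.5 * sum(c * x * x for c, x in zip(curv, w))

def clip(value, low, high):
    return min(max(value, low), high)

def evolve_parameters(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Chromosome: (learning rate, momentum), kept inside ranges where training is stable.
    pop = [(rng.uniform(0.001, 0.07), rng.uniform(0.0, 0.9)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: train_loss(*p))
        parents = pop[:pop_size // 2]
        children = [(clip(lr + rng.gauss(0, 0.005), 0.001, 0.07),
                     clip(mom + rng.gauss(0, 0.05), 0.0, 0.9))
                    for lr, mom in parents]
        pop = parents + children
    return min(pop, key=lambda p: train_loss(*p))

best = evolve_parameters()
print(best, train_loss(*best))
```

The fitness of a chromosome is the loss reached after a fixed training budget, so the search favours parameter pairs that make training converge quickly.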

Considering learning-rule optimization, one of the
first studies was conducted by Chalmers (1990). He
noticed that discovering complex learning rules using
GAs is not easy: a highly complex genetic coding makes
the search space large and hard to explore, while a
simpler coding that admits known learning rules as a
possibility makes the search very biased. In order to
overcome these limitations, Chalmers suggested applying
GP, a particular kind of GA. Several studies have been
carried out in this direction and some of them are
described, together with a new approach, by Poli and
Radi (2002).

Architecture Optimization 

The design of an optimal architecture can be formulated
as a search problem in the architecture space, where
each point represents an ANN topology. As pointed out
by Yao (1999), given some performance (optimality)
criteria about architectures, e.g., minimum error,
learning speed, lower complexity, etc., the performance
level of all architectures forms a surface in the design
space. Several approaches have been proposed in this
direction. A neuro-evolutionary approach using augmenting
topologies was presented by Miikkulainen and Stanley
(2002). It was designed to outperform fixed-topology
methods by employing a principled method of
crossover of different topologies, protecting structural
innovation using speciation, and incrementally growing
from minimal structure. 
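
The idea that each point of the architecture space is one topology can be sketched with a direct encoding, in which the genotype lists the entries of a connectivity matrix. Restricting the matrix to its upper triangle (feedforward connections only) and the node roles in the example are assumptions made for this illustration.

```python
def decode_architecture(genome, n_nodes):
    """Interpret a flat bit list as the strict upper triangle of a connectivity matrix.

    Entry (i, j) with i < j equal to 1 means a feedforward connection from node i to node j.
    """
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    k = 0
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            matrix[i][j] = genome[k]
            k += 1
    return matrix

def genome_length(n_nodes):
    return n_nodes * (n_nodes - 1) // 2

# 4 nodes: 0 and 1 are inputs, 2 is hidden, 3 is the output.
genome = [0, 1, 1,   # node 0 -> nodes 1, 2, 3
          1, 0,      # node 1 -> nodes 2, 3
          1]         # node 2 -> node 3
for row in decode_architecture(genome, 4):
    print(row)
```

With 4 nodes the genotype has 6 bits, so this small architecture space contains 2^6 = 64 candidate topologies, each of which would need to be trained and evaluated to obtain its fitness.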

Another work, carried out by Wang et al. (2002),
defined an optimal network through a combination of
constructing and pruning by GAs, while, more recently,
Bevilacqua et al. (2006) presented a multi-objective GA
approach to optimize the search for the optimal topology,
based on Schema Theory.

One of the most important forms of deception in
ANN structure optimization arises from the many-to-one
and one-to-many mappings from genotypes in the
representation space to phenotypes in the
evaluation space. The existence of functionally
equivalent networks with different encodings makes
evolution inefficient. This is known as the
competing conventions problem. Other important issues
involve representation and the definition of the EA. In
the encoding phase, an important aspect is to decide
how much information about the architecture should be
encoded into the genotype. Moreover, the performance of
an ANN strongly depends on its topology, considering
size and structure; consequently, the definition of the
topology determines network features such as learning
speed, learning precision, noise tolerance, and
generalization capability. 
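
The many-to-one mapping behind the competing conventions problem can be shown with a small sketch: reordering the hidden units of a network changes the genotype but not the function computed. The one-input network and its weight values are arbitrary illustrative choices.

```python
import math

def forward(hidden, x):
    """1-input, 1-output network; each hidden unit is (w_in, bias, w_out)."""
    return sum(w_out * math.tanh(w_in * x + b) for w_in, b, w_out in hidden)

net_a = [(0.5, 0.1, 1.0), (-1.2, 0.3, 0.7), (2.0, -0.4, -0.5)]
net_b = [net_a[2], net_a[0], net_a[1]]          # same hidden units, different order

# Genotype: the hidden units flattened in the order in which they are stored.
genotype_a = [g for unit in net_a for g in unit]
genotype_b = [g for unit in net_b for g in unit]

same_function = all(
    math.isclose(forward(net_a, x), forward(net_b, x), abs_tol=1e-12)
    for x in (-1.0, 0.0, 2.5)
)
print(genotype_a != genotype_b, same_function)
```

Crossover between two such genotypes can combine duplicated hidden units and lose others, which is why equivalent encodings make recombination inefficient.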

Transfer Function Optimization 

Transfer function perturbations can begin with a fixed
function, such as linear, sigmoidal or Gaussian, and allow
the GA to adapt to a useful combination according to
the situation. Some work has been carried out by Yao 
and Liu (1996) in order to apply a transfer function 
adaptation over generations, and by Figueira and Poli 
(1999), with a GP algorithm evolving functions. 
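
A minimal sketch of transfer function adaptation is to give each node a gene naming its transfer function and let mutation change it. The three candidate functions follow the text; the gene representation, the mutation operator, and the rate are assumptions for illustration and do not reproduce the cited methods.

```python
import math
import random

TRANSFER = {
    "linear": lambda z: z,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "gaussian": lambda z: math.exp(-z * z),
}

def mutate_transfer(genes, rate, rng):
    """With probability `rate`, replace a node's transfer function by a random one."""
    names = list(TRANSFER)
    return [rng.choice(names) if rng.random() < rate else g for g in genes]

def node_output(gene, z):
    return TRANSFER[gene](z)

rng = random.Random(3)
genes = ["sigmoid"] * 5               # start from a fixed function for every node
mutated = mutate_transfer(genes, rate=0.3, rng=rng)
print(mutated)
print([node_output(g, 0.0) for g in mutated])
```

Selection over generations would then keep the combinations of transfer functions that yield the fittest networks.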

To improve solutions, this kind of evolution is often
carried out together with the other kinds of ANN
optimization described here.

Input Data Selection

One of the most important factors for training neural
networks is the availability and the integrity of data.
They should represent all possible states of the problem
considered, and they should contain enough patterns to
build the test and validation sets as well.

The consistency of all data has to be guaranteed, and
the training data must be representative of the problem,
in order to avoid overfitting.

Input data selection can be regarded as a search
problem in the discrete space of the subsets of data
attributes. The solution requires the removal of
unnecessary, conflicting, overlapping and redundant
features in order to maximize the classifier accuracy,
compactness, and learning capabilities.

Input data reduction has been approached by Reeves
and Taylor (1998), who applied genetic algorithms to
select training sets for a kind of ANN, and by Castellani
(2006), who embedded the search for the optimal feature
set into the training phase.

Joint Evolution of Architecture and
Weights

The drawbacks of evolving architecture and weights
separately can be overcome with approaches that
consider them jointly.

The advantage of combining these two basic elements
of an ANN is that a completely functioning
network can be evolved without any intervention by
an expert.

Several methods that evolve both the network
structure and the connection weights were proposed
in the literature.

Castillo et al. (1999) presented a method to search
for the optimal set of weights, the optimal topology
and learning parameters using a GA for the network
evolution and BP for network training, while Yao et
al. (1997, 2003) implemented, respectively, an
evolutionary system for evolving feedforward ANNs based
on evolutionary programming, and, more recently, a
novel constructive algorithm for training cooperative
NN ensembles. Azzini and Tettamanzi (2006) presented
a neuro-genetic approach for the joint optimization of
network structures and weights, taking advantage of
BP as a specialized decoder, and Pedrajas et al. (2003)
proposed a cooperative co-evolutionary method for
ANN design.



OPEN ISSUES

There are still several open issues in EANN
research.

Regarding connection weights, a critical aspect is
that the structure has to be predetermined, which causes
problems when such a topology is difficult to define
in the first place.

Also in learning rule evolution, the design of train- 
ing algorithms, in particular the learning rules, depends 
on the type of the network architecture. Therefore, the 
design of such rules can become very difficult when 
there is little prior knowledge about the network topol- 
ogy, giving a complex relationship between evolution 
and learning. 

The architecture evolution has an important impact 
on the neural network evolution, and the evolution of 
pure architecture presents difficulties in evaluating 
fitness accurately. 

The simultaneous evolution of architecture and
weights is one of the most interesting evolutionary
ANN techniques, and nowadays it provides useful
solutions for ANN design. Various works are being
carried out in these directions, which remain open
issues. Some of them concern the application of
cooperative or competitive co-evolutionary approaches,
others the design of NN ensembles.



CONCLUSION 

This work presents a survey of the state of the art of
evolutionary systems investigated over the last decades
and presented in the literature. In particular, this work
focuses on the application of evolutionary algorithms 
to neural network design optimization. 

Several approaches for NN evolution are presented, 
together with some related works, and for each method 
the most important features are presented together with 
their main advantages and shortcomings. 



KEY TERMS

Adaptive System: System able to adapt its behavior 
according to changes in its environment or in parts of 
the system itself. 

Artificial Neural Networks: Models inspired by 
the working of the brain, considered as a combination 
of neurons and synaptic connections, which are capable 
of transmitting data through multiple layers, giving a 
system able to solve different problems like pattern 
recognition and classification. 

Backpropagation Algorithm: A supervised learn- 
ing technique used for training ANNs. It is based on a 
set of recursive formulas for computing the gradient 
vector of the error function, that can be used in a first- 
order method like gradient descent. 



Error Backpropagation: Essentially a search 
procedure that attempts to minimize a whole network 
error function such as the sum of the squared error of 
the network output over a set of training input/output 
pairs. 

Evolutionary Algorithms: Algorithms based on 
models that consider 'artificial' or 'simulated' genetic 
evolution of individuals in a defined environment. They 
are a broad class of stochastic optimization algorithms, 
inspired by biology and in particular by those biological 
processes that allow populations of organisms to adapt 
to their surrounding environment: genetic inheritance 
and survival of the fittest. 

Evolutionary Artificial Neural Networks: Special
class of artificial neural networks in which evolution 
is another fundamental form of adaptation in addi- 
tion to learning. They are represented by biologically 
inspired computational models that use evolutionary 
algorithms in conjunction with neural networks to 
solve problems. 

Evolutionary Computation: In computer science it 
is a subfield of artificial intelligence (more particularly 
computational intelligence) involving combinatorial 
optimization problems. Evolutionary computation 
defines the quite young field of the study of computa- 
tional systems based on the idea of natural evolution 
and adaptation. 

Multi-Layer Perceptrons (MLPs): Class of neural 
networks that consists of a feed-forward fully connected 
network with an input layer of neurons, one or more 
hidden layers and an output layer. The output value is 
obtained through the sequence of activation functions 
defined in each hidden layer. Usually, in this kind of 
network, the supervised learning process is the back- 
propagation algorithm. 

Search Space: Set of all possible situations that the
problem we want to solve could ever be in.



INTRODUCTION

The importance of juice beverages in daily food hab- 
its makes juice authentication an important issue, for 
example, to avoid fraudulent practices. 

A successful classification model should address two 
important cornerstones of the quality control of juice- 
based beverages: to monitor the amount of juice and 
to monitor the amount (and nature) of other substances 
added to the beverages. Particularly, sugar addition is 
a common and simple adulteration, though difficult to 
characterize. Other adulteration methods, either alone 
or combined, include addition of water, pulp wash, 
cheaper juices, colorants, and other undeclared additives 
(intended to mimic the compositional profiles of pure 
juices) (Saavedra, Garcia, & Barbas, 2000). 



VARIABLE SELECTION BY MEANS OF
EVOLUTIONARY TECHNIQUES 

This chapter presents several approaches to address the 
variable selection problem. All of them are based on 
evolutionary techniques. They can be divided into two
groups. The first group of techniques is based on different
codifications of a traditional Genetic Algorithm (GA)
population and different specifications for the
evaluation function. The second group modifies
the traditional Genetic Algorithm to improve the
generalization capability, by adding a new population
and by using an approach based on the evolution of
subspecies within the genetic population.



BACKGROUND 

A range of analytical techniques has been used to
deal with authentication problems. These include high
performance liquid chromatography (Yuan & Chen,
1999), gas chromatography (Stober, Martin & Peppard,
1998) and isotopic methods (Jamin, Gonzalez,
Remaud, Naulet & Martin, 1997). Unfortunately, they
are expensive and slow.

Infrared Spectrometry (IR) (Rodriguez-Saona, Fry, 
McLaughlin, & Calvey, 2001) is a fast and convenient 
technique to perform screening studies in order to assess 
the quantity of pure juice in commercial beverages. 
The interest lies in developing, from the spectroscopy 
data, classification methods that might enable the de- 
termination of the amount of natural juice contained 
in a sample. 

However, the information gathered from the IR
analyses has some fuzzy characteristics (random
noise, unclear chemical assignment, etc.), so analytical
chemists tend to use techniques like Artificial Neural
Networks (ANN) (Haykin, 1999) or develop ad-hoc
classification models. Previous studies (Gestal,
Gomez-Carracedo, Andrade, Dorado, Fernandez, Prada, &
Pazos, 2005) showed that ANN classified apple juice
beverages according to the concentration of natural
juice they contained and that ANN had advantages over
classical statistical methods, such as robust models and
easy application of the methodology in R&D
laboratories. Disappointingly, the large number of variables
derived from IR spectrometry makes ANNs time-consuming
to train and, most importantly, makes it
very difficult to establish relationships between these
variables and the analytical knowledge.

Several approaches were used to reduce the number 
of variables to a small subset, which should retain the 
classification capabilities of the overall dataset. Hence, 
the ANN training process and the interpretation of the 
results would be highly improved. 

Furthermore, prior variable selection would
yield other advantages: cost reduction (if the
classification model requires a reduced set of data, the time
needed to obtain them will be shorter); increased
efficiency (if the system processes less information, less
time will be required to process it); and improved
understanding (if two models resolve the same task,
but one of them uses less information, that one can be
interpreted more thoroughly; therefore, the simpler
the model, the easier the knowledge extraction, the
understanding, and the validation).

In addition, it was shown that the analysis of IR data
involved a highly multimodal problem, as many
combinations of variables (each obtained using a different
method) led to similar results when the samples were
classified.



GENETIC ALGORITHMS 

A GA (Holland, 1975; Goldberg, 1989) is a recurrent
and stochastic process that operates with a group of
potential solutions to a problem, known as the genetic
population, based on one of Darwin's principles:
the survival of the best individuals (Darwin, 1859).

Briefly, a GA works as follows. Initially, a population
of solutions is generated randomly and the solutions
evolve continuously through consecutive stages of
crossovers and mutations. Every individual in the population
has an associated value that quantifies its
usefulness (adjustment or fitness), in accordance with its
adequacy to solve the problem. This value has to be
obtained for each potential solution and constitutes the 
quantitative information the evolutionary algorithm 
will use to guide the search. The process will continue 
until a predetermined stopping criterion is reached. This 
might be a particular threshold error for the solution or 
a certain number of generations (populations). 



Therefore, several basic steps are required to
implement a GA: codification of the problem, which
results in a population structure; initialisation of the first
population; definition of a fitness function to evaluate how
good each individual is at solving the problem; and, finally,
a cyclic procedure of reproductions and replacements
(Michalewicz, 1999; Goldberg, 2002). 
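
The steps above can be sketched as a minimal GA over bit strings. The operators chosen here (binary tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness function are assumptions for illustration; they are not the configuration used in this chapter.

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=30, max_generations=200,
                      target_error=0.0, crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Minimise `fitness` over bit strings; stop at target_error or max_generations."""
    rng = random.Random(seed)
    # 1) Codification and initialisation: a random population of bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for generation in range(max_generations):
        # 2) Evaluation: one fitness value per individual guides the search.
        scored = sorted(pop, key=fitness)
        if fitness(scored[0]) <= target_error:       # stopping criterion: threshold error
            return scored[0], generation

        def pick():                                   # binary tournament selection
            a, b = rng.choice(pop), rng.choice(pop)
            return a if fitness(a) <= fitness(b) else b

        # 3) Reproduction and replacement.
        new_pop = [scored[0][:]]                      # elitism: keep the best individual
        while len(new_pop) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)     # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=fitness), max_generations     # stopping criterion: generations

# Toy problem: the error is the number of zero bits (all ones is optimal).
best, generations = genetic_algorithm(lambda c: c.count(0), n_genes=20)
print(best, generations)
```

Both stopping criteria mentioned in the text appear here: the loop ends either when the best error reaches the threshold or when the generation limit is exhausted.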



DATA DESCRIPTION 

In the present practical application, the spectral range
measured by IR spectrometry (wavenumbers from 1250
cm⁻¹ to 900 cm⁻¹) provided 176 absorbances (which
measured light absorption) (Gómez-Carracedo, Gestal,
Dorado & Andrade, 2007).

The main goal of the application was to predict
the amount of pure juice in a sample
using the absorbance values returned by the IR
measurements. But the amount of data obtained for a sample
by IR spectrometry is huge, so the direct application
of mathematical and/or computational methods
(although possible) requires a lot of time. Accordingly, it
is important to establish whether all raw data provide
relevant information for sample differentiation. Hence,
the problem was an appropriate case for the use of
variable selection techniques.

Prior to variable selection, data sets had to be
constructed for both model development and validation.
Thus, samples with different amounts of pure
apple juice were prepared at the laboratory. In addition,
23 apple juice-based beverages sold in Spain were
analysed (the declared amount of juice printed on
their labels was used as input data). The samples were
distributed in 2 ranges: samples containing less than
20% of pure juice and samples with more than 20%
of pure apple juice (Table 1). IR spectra were obtained
for all the samples.

Table 1. Low and high concentrations dataset

Juice Concentration    2%    4%    6%    8%   10%   16%   20%   Total
Training               19    17    16    22    21    20    19     134
Validation              1     1    13     6     6     6     6      39
Commercial              -     -     -     -     1     1     -       2

Juice Concentration   20%   25%   50%   70%   100%   Total
Training               20    19    16    14      7      86
Validation              6    18    13     1      6      44
Commercial              -     2     -     -     19      21

These data were split into two datasets, one to extract
the rules (ANN training) and one to validate them. The
commercial samples were used to further check the
performance of the model. It is worth noting that whenever
a predicted value does not match that given on the labels
of the commercial products, this might be due either to a
genuine failure of the model (classification error) or to
inaccurate labelling of the commercial beverage.

Classification Test Considering all the 
Original Variables 

The first test involved all variables given by IR
spectroscopy. A dedicated ANN used all the absorbances of the
training data to obtain a classification model. Later, the
classification results obtained with this model will be
used as a reference to compare the performance of the
proposals over the same data.

Different parametric classification techniques (PLS,
SIMCA, Potential Curves, etc.) were used too (Gestal,
Gomez-Carracedo, Andrade, Dorado, Fernandez,
Prada, & Pazos, 2004), with very similar results. But
the best results were achieved using ANN, which will
be very useful to address the variable selection issue
employing GA with fitness functions based on ANN.
The accuracy of the reference method is shown in
Table 2.

An exhaustive study of the results of the different
classification models allowed us to conclude that the
set of samples with low (2-20%) percentages of juice
was far more complex and difficult to classify than the
samples with higher percentages (25-100%). Indeed,
the number of errors was usually higher for the 2-20%
range both in calibration and validation. Classification
of the commercial samples agreed quite well with the
percentages of juice declared on the labels, except for
one particular sample. When that sample was studied in
detail, it was observed that its spectrum was slightly
different from the usual ones. This suggested that the
juice contained an unusually high amount of added
sugar(s).



VARIABLE SELECTION APPROACHES 

Variable selection processes were performed to optimize 
the ANN classifications. 

First, two simple approaches will be briefly
described: Pruned and Fixed Search. Both approaches are
based on a traditional GA and both use ANN to
evaluate fitness. As the results will show, both techniques
offer good solutions but present a common problem:
each execution of either method provides only one solution
(discarding any other possible optimal one). This was
addressed with the two more advanced approaches
described in the next sections.

Regardless of the variable selection approach, a
GA will guide the search by evaluating the prediction
capabilities of each ANN model developed employing
different sets of IR variables. The problem to be
solved is to find a small set of IR variables
that, when combined with an ANN model, classifies
apple juice-based beverages properly (according to the
amount of pure apple juice they contain).

Table 2. Classification with ANNs using all the variables

Low Concentrations

Training (134)    Validation (39)    Commercial (2)
134 (100%)        35 (89.74%)        2 (100%)

ANN configuration for low concentrations
Topology: 176/50/80/5    learning rate: 0.0005    stop criterion: mse=5 or epochs=500.000

High Concentrations

Training (86)     Validation (44)    Commercial (21)
86 (100%)         43 (97.72%)        21 (100%)

ANN configuration for high concentrations
Topology: 176/8/5/5    learning rate: 0.0005    stop criterion: mse=1 or epochs=500.000

Any time a subset of IR variables is proposed, the
associated absorbance values are used as input patterns
to the ANN, so the ANN has as many input
processing elements (PEs) as variables. The output
layer has one PE per category (6 for the lower range
and 5 for the higher one). After several preliminary trials
considering several hidden layers (from 1 to 4), each
with a different number of PEs (from 1 to 50), a compromise
was established between the final fitness level reached by
the ANN and the time required for its training. This
compromise was essential because, although better
results were obtained with more hidden layers, the time
required for training was also much higher. It was
therefore decided not to train the net extensively but to
get a good approximation of its real performance and
elucidate whether the input variables are really suitable
for classifying the samples accurately.

The goal is to determine which solutions, among
those provided by the GA, represent good starting
points for more exhaustive training. Therefore,
it is enough to extend the ANN learning up to
the point where it starts to converge. For this particular
problem, convergence started after 800 cycles; to
guarantee it, the number of iterations was fixed at 1000.

Next, each of the most promising solutions is used
as input to an external ANN intended to provide the
final classification results.

Pruned Search 

This approach starts by considering all the variables, from
which groups of variables are gradually discarded.
The GA steadily reduces the number of variables
that characterise the objects until an optimal subset
that allows an overall satisfactory classification is
obtained. This subset is used to classify the samples, and
the results are used to determine how relevant the discarded
wavenumbers were for the classification. The process
can be continued as long as the classification results are
equal, or at least similar, to those obtained using the
overall set of variables. Therefore, the GA determines
how many and which wavenumbers will be selected
for the classification.

In this approach, each individual in the genetic
population is described by n genes, each representing
one variable. With a binary encoding, each gene is
either 0 or 1, indicating whether the gene is active
and, therefore, whether the variable should be considered
for classification.

The evaluation function has to guide the pruning
process towards individuals with a low number of
variables. To achieve this, the function should favour
those individuals that, besides classifying accurately,
make use of fewer variables. In this particular case, a
factor proportional to the percentage of active genes
was defined to multiply the MSE obtained by the
ANN, so that individuals with fewer active genes, and
with a similar classification performance, will have
a higher fitness and consequently a higher probability
of survival.
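
The penalised evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: the proportionality constant is taken as 1 (the factor is simply the fraction of active genes), and `toy_mse` is a hypothetical stand-in for the short ANN training run, which the chapter does not specify in code form.

```python
from typing import Callable, Sequence


def pruned_fitness(mask: Sequence[int],
                   mse_of: Callable[[Sequence[int]], float]) -> float:
    """Penalised error of one binary individual; lower values mean fitter.

    mask   -- one gene per variable, 1 = variable active, 0 = discarded
    mse_of -- returns the MSE of a model trained on the active variables
              (hypothetical callback standing in for the ANN)
    """
    active = [i for i, bit in enumerate(mask) if bit == 1]
    if not active:
        return float("inf")  # no variables left: nothing to classify with
    fraction_active = len(active) / len(mask)  # assumed factor, constant = 1
    return mse_of(active) * fraction_active


def toy_mse(active: Sequence[int]) -> float:
    """Hypothetical error model: variable 3 carries all the information."""
    return 0.10 if 3 in active else 0.40


few = pruned_fitness([0, 0, 0, 1, 0, 0, 0, 0], toy_mse)   # 0.10 * 1/8
many = pruned_fitness([1, 1, 0, 1, 0, 0, 0, 0], toy_mse)  # 0.10 * 3/8
print(few < many)  # same MSE, fewer active genes, so the first one is fitter
```

With equal classification error, the individual using one variable obtains a lower penalised error than the one using three, which is the pressure towards small subsets that the text describes.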

Table 3 shows the results of several runs of the Pruned
Search approach. It is worth noting that each solution
was extracted from a different execution and that, within
the same execution, only one solution provided valid
classification rates. As can be seen, the classification
results were very similar although the variables used
to perform the classifications were different.

The ANN models obtained are slightly worse
than those obtained using 176 wavenumbers, although
the generalization capabilities of the best ANN model
were quite satisfactory, as there was only one error when
the commercial beverages were classified.

Fixed Search 

This approach uses a real encoding in the chromosome
of the genetic individuals. The genetic population
consists of individuals with n genes, where n
is the number of wavenumbers that are considered
sufficient for the classification according to some
external criterion. Each gene represents one of the
176 wavenumbers considered in the IR spectra. The
GA will find the n-variable subsets yielding the
best classification models. The number of variables is
predefined in the genotype.

As the final number of variables has to be decided
in advance, some external criterion is needed. In order
to simplify comparison of the results, the final number
of variables was defined as the minimum number of
principal components which can describe our data set,
which was two.

Since the amount of wavenumbers remains constant
in the course of the selection process, this approach
defines the fitness of each genetic individual as the
mean square error reached by the ANN at the end of
the training process.

Table 3. Classification with pruned search

Low Concentrations
  Selected variables    Training (134)    Validation (39)    Commercial (2)
  Run1: [42 77]         129 (96.27%)      23 (58.97%)        1 (50%)
  Run2: [52 141]        115 (85.82%)      22 (56.41%)        0 (0%)
  Run3: [102 129]       124 (92.54%)      25 (64.10%)        0 (0%)
  ANN configuration for low concentrations
  Topology: 2/10/60/7   learning rate: 0.0001   stop criterion: mse=5 or epochs=500.000

High Concentrations
  Selected variables    Training (86)     Validation (43)    Commercial (21)
  Run1: [42 77]         83 (96.51%)       34 (79.07%)        21 (100%)
  Run2: [52 141]        82 (95.35%)       35 (81.40%)        20 (95.24%)
  Run3: [102 129]       83 (96.51%)       33 (76.74%)        21 (100%)
  ANN configuration
  Topology: 2/10/60/5   learning rate: 0.0001   stop criterion: mse=1 or epochs=500.000




Table 4 shows the results of several runs of the Fixed
Search approach. Again, note that each run provides
only one solution.

A problem was that the genetic individuals contain
only two genes. Hence, any crossover operator will use
the only available crossing point (between the genes)
and only half of the information from each parent will
be transmitted to its offspring. This turns the Fixed
Search approach into a "random search" when the
chromosome consists of only two genes.
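
The limitation of crossover on two-gene chromosomes can be seen in a few lines. The sketch below assumes a standard one-point crossover and uses two variable pairs from the tables purely as example parents; it is not taken from the authors' implementation.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes at the given cut point."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


parent_a = [42, 77]     # example two-gene individuals (variable indices)
parent_b = [102, 129]

# Interior cut points of a chromosome of length 2: only position 1 exists.
points = list(range(1, len(parent_a)))
children = [one_point_crossover(parent_a, parent_b, p) for p in points]

print(points)    # [1]
print(children)  # [([42, 129], [102, 77])]
```

Every mating therefore produces the same two offspring, each inheriting exactly one gene from each parent, so the operator cannot recombine larger building blocks.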

Hybrid Two-Population Genetic Algorithm

The main disadvantage of non-multimodal approaches
is that they discard locally optimal solutions because
a final or global solution is preferred. But there are
situations where the final model has to be extracted
after analysing different similar solutions to the same
problem.

For example, after analysing the different solutions
provided by one execution of the classification task with
the Hybrid Two-Population Genetic Algorithm (Rabunal,
Dorado, Gestal & Pedreira, 2005), three valid (and
similar) models were obtained (Table 5). Furthermore, the
results were clearly superior to those obtained with the
previous alternatives. Most importantly, it was
observed that the solutions concentrated in specific
spectral areas (around wavenumber 88), which would not
have been possible with the previous approaches. This
approach is studied in depth in a dedicated chapter (Finding
multiple solutions with GA in multimodal problems)
and no more details will be presented here.






Evolutionary Approaches to Variable Selection 



Table 4. Classification with fixed search

Low Concentrations
  Selected variables    Training (134)    Validation (39)    Commercial (2)
  Run1: [12 159]        125 (93.28%)      23 (58.97%)        0 (0%)
  Run2: [23 67]         129 (96.27%)      26 (66.66%)        1 (50%)
  Run3: [102 129]       129 (96.27%)      25 (64.10%)        0 (0%)
  ANN configuration for low concentrations
  Topology: 2/10/60/7   learning rate: 0.0001   stop criterion: mse=5 or epochs=500.000

High Concentrations
  Selected variables    Training (86)     Validation (43)    Commercial (21)
  Run1: [12 159]        82 (95.35%)       31 (72.09%)        19 (90.47%)
  Run2: [23 67]         81 (94.18%)       31 (72.09%)        19 (90.47%)
  Run3: [102 129]       81 (94.18%)       33 (76.74%)        21 (100%)
  ANN configuration
  Topology: 2/10/60/5   learning rate: 0.0001   stop criterion: mse=1 or epochs=500.000



FUTURE TRENDS

A natural next stage would be to consider multimodal
approaches. Evolutionary Computation provides useful
tools, such as fitness sharing and crowding, which should
be compared to the hybrid two-population approach.
Research is needed to implement criteria that allow the
GA to stop when a satisfactorily low number of variables
is found.

Another suitable option would be to include more
scientific information within the system to guide the
search. For example, a final user could provide a
description of the ideal solution in terms of efficiency, data
acquisition cost, simplicity, etc. All these parameters
may be used as targets in a multiobjective Genetic
Algorithm intended to provide the best variable subset
complying with all the requirements.



CONCLUSION

Several conclusions can be drawn for variable selection
tasks:

First, satisfactory classification results can be obtained
from reduced sets of variables extracted using
quite different techniques, all based on the combination
of GA and ANN.

The best results were obtained using a multimodal GA
(the hybrid two-population approach), as expected,
owing to its ability to keep the genetic individuals
homogeneously distributed over the search space. Such
diversity not only induces the appearance of optimal
solutions, but also prevents the search from stopping at a
local minimum. This option provides not just one solution
but a group of them with similar fitness. This allows
scientists to select a solution with a sound chemical
background and extract additional information.

Table 5. Classification with hybrid two-population genetic algorithm

Low Concentrations
  Selected variables    Training (134)    Validation (39)    Commercial (2)
  Run1: [89 102]        127 (95.77%)      29 (74.36%)        0 (0%)
  Run1: [87 102]        130 (97.01%)      28 (71.79%)        0 (0%)
  Run1: [88 89]         120 (89.55%)      29 (74.36%)        0 (0%)
  ANN configuration for low concentrations
  Topology: 2/10/60/7   learning rate: 0.001   stop criterion: mse=2 or epochs=500.000

High Concentrations
  Selected variables    Training (86)     Validation (43)    Commercial (21)
  Run1: [89 102]        83 (96.51%)       35 (81.39%)        21 (100%)
  Run1: [87 102]        83 (96.51%)       36 (83.72%)        21 (100%)
  Run1: [88 89]         82 (95.35%)       39 (90.69%)        21 (100%)
  ANN configuration
  Topology: 2/10/60/5   learning rate: 0.001   stop criterion: mse=2 or epochs=500.000




REFERENCES 

Darwin, C. (1859). On the Origin of Species by Means 
of Natural Selection. 

Gestal, M., Gomez-Carracedo, M.P., Andrade, J.M.,
Dorado, J., Fernandez, E., Prada, D., Pazos, A. (2005).
Selection of variables by Genetic Algorithms to Classify
Apple Beverages by Artificial Neural Networks.
Applied Artificial Intelligence. 181-198.

Gestal, M., Gomez-Carracedo, M.P., Andrade, J.M., 
Dorado, J., Fernandez, E., Prada, D., Pazos, A. (2004). 
Classification of Apple Beverages using Artificial 
Neural Networks with Previous Variable selection. 
Analytica Chimica Acta. 225-234. 

Goldberg, D. (1989). Genetic Algorithms in Search, 
Optimization and Machine Learning. Kluwer Academic 
Publishers, Boston. 



Goldberg, D. (2002). The Design of Innovation: Lessons
from and for Competent Genetic Algorithms. Addison-Wesley,
Reading, MA.

Gomez-Carracedo, M.P., Gestal, M., Dorado, J., &
Andrade, J.M. (2007). Linking chemical knowledge
and Genetic Algorithms using two populations and
focused multimodal search. Chemometrics and Intelligent
Laboratory Systems. (87) 173-184.

Haykin, S. (1999). Neural Networks: A Comprehensive
Foundation, 2nd Edition. Prentice Hall.

Holland, J. (1975). Adaptation in Natural and Artificial
Systems. University of Michigan Press, Ann Arbor.

Jamin, E., Gonzalez, J., Remaud, G., Naulet, N. &
Martin, G. (1997). Detection of exogenous sugars or
organic acids addition in pineapple juices and concentrates
by 13C IRMS analysis. Journal of Agricultural
and Food Chemistry. (45), 3961-3967.






Michalewicz, Z. (1999). Genetic Algorithms + Data
Structures = Evolution Programs. Springer-Verlag.

Rabunal, J.R., Dorado, J., Gestal, M., & Pedreira, N.
(2005). Diversity and Multimodal Search with a Hybrid
Two-Population GA: an Application to ANN Development.
Lecture Notes in Computer Science. 382-390.

Rodriguez-Saona, L.E., Fry, F.S., McLaughlin, M.A.,
& Calvey, E.M. (2001). Rapid analysis of sugars in
fruit juices by FT-NIR spectroscopy. Carbohydrate
Research. (336) 63-74.

Saavedra, L., Garcia, A., Barbas, C. (2000). Development
and validation of a capillary electrophoresis method
for direct measurement of isocitric, citric, tartaric
and malic acids as adulteration markers in orange juice.
Journal of Chromatography. (881)1-2, 395-401.

Stober, P., Martin, G.G., & Peppard, T.L. (1998).
Quantitation of the undeclared addition of industrially
produced sugar syrups to fruit juices by capillary gas
chromatography. Deutsche Lebensmittel-Rundschau.
(94) 309-316.

Yuan, J.P. & Chen, F. (1999). Simultaneous separation
and determination of sugars, ascorbic acid and furanic
compounds by HPLC-dual detection. Food Chemistry.
(64) 423-427.



KEY TERMS 

Absorbance: Function (usually logarithmic) of the
percentage of transmission of a wavelength of light
through a liquid.

Artificial Neural Network: Interconnected group
of artificial neurons that uses a mathematical or
computational model for information processing. It is
based on the function of biological neurons and involves
a group of simple processing elements (the neurons)
which can exhibit complex global behaviour as a result
of the connections between the neurons.

Evolutionary Technique: Technique which provides
solutions for a problem guided by biological
principles such as the survival of the fittest. These
techniques start from a randomly generated population
which evolves by means of crossover and mutation
operations to provide the final solution.

Knowledge Extraction: Making explicit the internal
knowledge of a system or set of data in a way that is
easily interpretable by the user.

Search Space: Set of all possible situations that the
problem we want to solve could ever be in. Combination
of all the possible values for all the variables
related to the problem.

Spectroscopy (Spectrometry): Production, measurement
and analysis of electromagnetic spectra
produced as a result of the interactions between
electromagnetic radiation and matter, such as emission or
absorption of energy.

Spectrum: Intensity of an electromagnetic radiation
across a range of wavelengths. It represents the intensity
of emitted or transmitted energy versus the energy of
the received light.

Variable Selection: Selection of a subset of relevant
variables (features) which can describe a set of data.



INTRODUCTION

Wireless ad-hoc networks are infrastructureless net- 
works in which heterogeneous capable nodes assemble 
together and start communicating without any backbone 
support. These networks can be made truly dynamic 
and the nodes in these networks can move about freely 
while connecting and disconnecting with other nodes 
in the network. This property of ad-hoc networks to 
self-organize and communicate without any extrinsic 
support gives them tremendous flexibility and makes 
them perfect for applications such as emergencies, 
crisis-management, military and healthcare. 

For example, in case of emergencies such as earth- 
quakes, often most of the existing wired network 
infrastructure gets destroyed. In addition, since most 
of the wireless networks such as GSM and IEEE 
802.11 wireless LAN use wired infrastructure as their 
backbone, often they are also rendered useless. In such 
scenarios, ad-hoc networks can be deployed swiftly 
and used for coordinating relief and rescue operations. 
Ad-hoc networks can be used for communication be- 
tween various stations in the battle-field, where setting 
up a wired or an infrastructure-based network is often 
considered impractical. 

Though a lot of research has been done on ad-hoc 
networks, a lot of problems such as security, quality- 
of-service (QoS) and multicasting need to be addressed 
satisfactorily before ad-hoc networks can move out of 
the labs and provide a flexible and cheap networking 
solution. 

Evolutionary computing algorithms are a class of
bio-inspired computing algorithms. Bio-inspired computing
refers to the collection of algorithms that use
techniques learnt from natural biological phenomena
and implement them to solve a mathematical problem
(Olario & Zomaya, 2006). Natural phenomena such as
evolution, genetics, the collective behavior of social
organisms and the functioning of a mammalian brain teach
us a variety of techniques that can be effectively employed
to solve problems in computer science which
are inherently tough.

In this chapter and the chapter entitled "Swarm
Intelligence Approach for Wireless Ad Hoc Networks"
of this book, we present some of the currently available
important implementations of bio-inspired computing
in the field of ad-hoc networks. This chapter looks at
the problem of optimal clustering in ad-hoc networks
and its solution using a Genetic Programming (GP)
approach. The chapter entitled "Swarm Intelligence
Approaches for Wireless Ad Hoc Networks" of this
book continues in the same spirit and explains the use
of the principles underlying Ant Colony Optimization
(ACO) for routing in ad-hoc networks.



BACKGROUND 

The first infrastructureless network was implemented
as packet radio (Toh, 2002). It was initiated by the
Defense Advanced Research Projects Agency (DARPA)
in the 1970s. By this time the ALOHA project (McQuillan
& Walden, 1977) at the University of Hawaii had
demonstrated the feasibility of using broadcasting
for sending and receiving data packets in single-hop
radio networks. ALOHA later led to the development
of the Packet Radio Network (PRNET), which
was a multi-hop multiple-access network, under the
sponsorship of the Advanced Research Projects Agency
(ARPA). PRNET had design objectives similar to
those of current-day ad-hoc networks, such as flow and error
control over multi-hop communication routes, deriving
and maintaining network topology information,
mechanisms to handle router mobility, and power and
size requirements, among others. However, since
electronic devices were huge then, the packet radios
were not easily movable, leading to limited mobility.
In addition, the network coverage was slow and, since
Bellman-Ford's shortest path algorithm was used for
routing, transient loops were present. Since then, a lot
of research has been done on ad-hoc networks and a
number of routing algorithms have been developed
which provide far greater performance and are loop
free. The rapid development of silicon technology
has also led to ever-shrinking devices with increasing
computation power. Ad-hoc networks are being considered
for use in medical, relief-and-rescue, office
environments, personal networking, and many other daily life
applications.

Bio-inspired algorithms, to which evolutionary
computing approaches such as genetic algorithms
belong, have been around for more than 50 years.
In fact, there is evidence to suggest that
Artificial Neural Networks are rooted in the unpublished
works of A. Turing (related to the popular Turing
Machine) (Paun, 2004). Finite Automata Theory was
developed about a decade after that, based on neural
modeling. This ultimately led to the area that is currently
known as Neural Computing, which can effectively be
regarded as the beginning of bio-inspired computing. Since
then, techniques such as GP, Swarm Intelligence,
Ant Colony Optimization and DNA computing have
been researched and developed as nature continues to
inspire us and show us the way to solve the most
complex problems known to man.



MAIN FOCUS OF THE CHAPTER 

This chapter and the chapter entitled "Swarm Intelligence
Approaches for Wireless Ad Hoc Networks"
of this book, in combination, present an introduction
to various bio-inspired algorithms and describe their
implementation in the area of wireless ad-hoc networks.
This chapter primarily presents the GP approach to
ad-hoc networks. We first give a general introduction to GP
and explain the concepts of genes and chromosomes.
We also explain the stochastic nature of GP and the
processes of mutation and crossover used to provide an
optimal solution to the problem using the GP approach. We
then present the Weighted Clustering Algorithm given
by Chatterjee, Das & Turgut (2002), which is used for
clustering of nodes in mobile ad-hoc networks, as an
instantiation of this approach.



GP 



GP is a popular bio-inspired computing method. The 
concepts in genetic algorithms are inspired by the 
phenomenon of life and evolution itself. Life is a 
problem whose solution includes retaining those who 
have strong enough characteristics to survive in the 
environment and discarding the others. This exquisite 
process can provide solutions to complex analytical 
problems awaiting the most "fitting" result. 

The basics include the role of chromosomes and
genes. Chromosomes carry the genes which contain the
parameters/characteristics that need to be optimized.
Hence, GP starts with the declaration of data structures that
form the digital chromosomes. These digital chromosomes
contain genetic information where each gene
represents a parameter to be optimized. The gene could
be represented as a single bit, which could be '1' (ON)
or '0' (OFF). So, a chromosome is a sequence of 1's and
0's and a parameter is either totally present or totally
absent. Other abstractions could represent the presence
of a parameter in relative levels. For instance, a gene
could be represented using 5 bits where the magnitude
of the binary number tells about the magnitude of the
presence of a parameter in the range '0' (00000) to
'31' (11111).
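
The two encodings just described can be illustrated with a small decoder. The function name `decode_genes` is introduced here only for illustration; the sketch assumes a chromosome stored as a string of '0' and '1' characters with equally sized genes.

```python
from typing import List


def decode_genes(chromosome: str, bits_per_gene: int) -> List[int]:
    """Split a bit string into genes and read each gene as an integer."""
    if len(chromosome) % bits_per_gene != 0:
        raise ValueError("chromosome length must be a multiple of the gene size")
    return [
        int(chromosome[i:i + bits_per_gene], 2)
        for i in range(0, len(chromosome), bits_per_gene)
    ]


# One bit per gene: every parameter is either present (1) or absent (0).
print(decode_genes("1010", 1))        # [1, 0, 1, 0]

# Five bits per gene: every parameter takes a level between 0 and 31.
print(decode_genes("0000011111", 5))  # [0, 31]
```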

First, these digital chromosomes are created using
stochastic means. Then their fitness is tested, either
by static calculation of fitness using some method or
dynamically by modelling fights between the chromosomes.
The chromosomes with a set level of fitness
are retained and allowed to produce a new generation
of chromosomes. This can be done either by genetic
recombination, that is, new chromosomes are produced
by combining present chromosomes, or by
mutation, that is, new chromosomes are produced by
randomly changing present chromosomes.
This process of testing for fitness and creating new
generations is repeated until the fittest chromosomes
are deemed optimized enough for the task for which
the genetic algorithm was created. The process is
described in Figure 1.

Figure 1. Genetic algorithm (values shown in the figure: a
population 11001100, 11101101, 11111110, 10101010, ...; fitness
values F(11001100) = 0.98, F(11101101) = 0.89, F(11111110) = 0.76,
F(10101010) = 0.52; new chromosomes 11001101, 11101100,
11101110, ... labelled "Update to population"; and the
chromosomes 11111110 and 11101110 shown separately)
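
The create, evaluate, retain and reproduce cycle described above can be sketched as a short loop. This is a minimal illustration under several assumptions that are not in the source: "a set level of fitness" is approximated by keeping the fitter half of the population, recombination is one-point crossover, mutation flips single bits, and the example fitness simply counts the 1 bits.

```python
import random


def evolve(fitness, n_bits=8, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Tiny genetic algorithm over fixed-length bit lists (higher fitness wins)."""
    rng = random.Random(seed)
    # Chromosomes are created by stochastic means.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Test fitness and retain the fitter half (assumed retention rule).
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randint(1, n_bits - 1)       # recombination point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = evolve(fitness=sum)  # example fitness: number of 1 bits
print(best, sum(best))
```

Because the retained chromosomes are copied unchanged into the next generation, the best fitness in the population never decreases from one generation to the next in this sketch.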



Genetic algorithms begin with a stochastic process
and arrive at an optimized solution, but they are time
consuming. Hence, they are generally used for solving
complex problems. As mentioned by Ashby (1962),
self-organization is one of these complex problems.
Self-organization is the problem where the components of
a multi-component system achieve (or try to achieve)
a common goal without any centralized or distributed
control. The organization is generally done by changing
the direct environment, which can be adapted by
the various system components and hence affects the
behaviour of these components.

As has been mentioned, "Self-organization is
especially important in ad-hoc networking because of
the spontaneous interaction of multiple heterogeneous
components over wireless radio connections without
human interaction" (Murthy & S., 2004).



Dressler (2006) gives the following list of
self-organization capabilities:

Self-healing: The system should be able to detect
and repair failures caused by overloading, component
malfunctioning or system breakdown.

Self-configuration: The system should be able
to generate adequate configurations, including
connectivity, quality of service, etc., as required
in the existing scenario.

Self-management: The system should be able to
maintain devices, and in turn the network, depending
on the set configuration parameters.

Self-optimization: The system should be able to make
an optimal choice of methods depending on the
system behaviour.

Adaptation: The system should dynamically
adapt to changing environmental conditions,
for example, a change in node positions or a change
in the number of nodes in the network.

Genetic algorithms have been used extensively in
robotics, electronic circuit design, natural language
processing, game theory, multimodal search and
computer network topology design, among many other
applications.

In the context of wireless ad-hoc networks, genetic
algorithms have been used for solving the shortest path
routing problem (Ahn & Ramakrishna, 2002), QoS
path discovery (Fu, Li & Zhang, 2005), developing
a broadcasting strategy (Alba, Dorronsoro, Luna, Nebro
& Bouvry, 2005) and QoS routing (Barolli, Koyama &
Shiratori, 2003), among others. In the interest of brevity,
we present below only one representative application
of the use of genetic algorithms to solve problems in
ad-hoc networks.

Weighted Clustering Using Genetic Algorithm

Nodes in ad-hoc networks are sometimes grouped into
various clusters, with each cluster having a cluster-head.
Clustering aims at introducing a kind of organisation
in ad-hoc networks. This leads to better scalability of
networks, resulting in better utilisation of resources in
larger networks. A cluster-head is responsible for the
formation and maintenance of clusters in
ad-hoc networks. Several clustering mechanisms have
been proposed for ad-hoc networks, such as Lowest-ID
(Ephremides, Wieselthier & Baker, 1987), Highest
Connectivity (Gerla & Tsai, 1995), Distributed
Mobility-Adaptive Clustering (DMAC) (Basagni, 1999),
Distributed Dynamic Clustering Algorithm (McDonald
& Znati, 1999) and Weight-Based Adaptive Clustering
Algorithm (WBACA) (Dhurandher & Singh, 2007).
Weighted Clustering Algorithm (WCA) (Chatterjee,
Das and Turgut, 2002) is a popular clustering
algorithm which selects a cluster-head on the basis of
node mobility, battery-power, connectivity, distance
from neighbour and degree of connectivity. Weights are
assigned to each parameter and a combined weighted
metric $W_v$ is calculated as shown by Equation (1)
(Chatterjee, Das and Turgut, 2002). Within certain
constraints, nodes with minimum $W_v$ are selected as
the cluster-head.

$$W_v = w_1 \Delta_v + w_2 D_v + w_3 M_v + w_4 P_v \qquad (1)$$

In Equation (1),

$\Delta_v$ signifies the difference between optimal and actual
connectivity of a node.

$D_v$ signifies the sum of distances with all the
neighbouring nodes.

$M_v$ is the running average of the speed of the node.

$P_v$ signifies the time that the node has acted as a
cluster-head.

$w_1$, $w_2$, $w_3$ and $w_4$ are relative weights given to
different parameters.

It should be noted that (Chatterjee, Das and Turgut,
2002):

$$w_1 + w_2 + w_3 + w_4 = 1 \qquad (2)$$
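
As a worked illustration of Equations (1) and (2), the sketch below computes the combined weight for three hypothetical nodes and picks the one with the minimum value. The node statistics and the weight values are invented for the example and are not taken from the chapter; the additional constraints that WCA applies when electing cluster-heads are left out.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    degree_difference: float  # Delta_v: optimal vs. actual connectivity
    distance_sum: float       # D_v: sum of distances to neighbours
    mobility: float           # M_v: running average of node speed
    time_as_head: float       # P_v: time already served as cluster-head


def combined_weight(node: NodeStats, weights) -> float:
    """Equation (1); the weights must satisfy Equation (2)."""
    w1, w2, w3, w4 = weights
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (w1 * node.degree_difference + w2 * node.distance_sum
            + w3 * node.mobility + w4 * node.time_as_head)


WEIGHTS = (0.7, 0.2, 0.05, 0.05)  # illustrative values only

nodes = [
    NodeStats("a", 2, 10.0, 1.0, 0.0),
    NodeStats("b", 0, 12.0, 2.0, 4.0),
    NodeStats("c", 3, 6.0, 0.5, 1.0),
]

head = min(nodes, key=lambda n: combined_weight(n, WEIGHTS))
print(head.name)  # b  (W = 2.7, against 3.45 for a and 3.375 for c)
```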



A genetic algorithm-based approach was presented
by the designers of WCA (Chatterjee, Das, & Turgut,
2002). The sub-optimal solutions are mapped to chromosomes
and given as input to produce the best solutions
using genetic techniques. This leads to better performance
and more evenly balanced load sharing. The basic
building blocks of the algorithm are given below:

Initial Population: A candidate solution, which acts
as a chromosome, can be represented as shown
in Figure 2. The initial population set is generated
randomly by arranging the nodes in strings
and then traversing each string. A node which
is not a cluster-head and is not a neighbour of
a cluster-head (a neighbour would already be part of
an existing cluster) is chosen as a cluster-head if it has
fewer than δ (a pre-defined constant) neighbours. δ is
chosen so as to prevent a cluster-head from having more
than the optimal number of neighbours, which would
cause over-loading at cluster-heads.

Objective Fitness Function: The fitness value of a
chromosome can be calculated as the sum of the $W_v$ values
of the contained genes. All nodes present at a gene [1]
are analysed. If a node is neither a cluster-head nor a
member of a cluster and has a node degree
less than MAXDEGREE, it is added to the cluster-head
list and its $W_v$ value is added to the total
sum. For the remaining nodes in the network, if the
node is neither a cluster-head nor a member of another
cluster, its $W_v$ value is added to the sum and
the node is added to the cluster-head list. The lower
the sum of the $W_v$ values of the genes, the higher
the fitness value of the chromosome.

Crossover: The crossover rate is 80%. The authors
used a technique called X_Over1 (Chatterjee,
Das, & Turgut, 2002) as the crossover technique
for the genetic implementation.

Mutation: Mutation introduces randomness into the
solution space. If the mutation rate is low, there is a
chance of the solution converging to a non-optimal
solution space. Hence, mutation is a very important
step in genetic computations. In the stated
algorithm, the authors used swapping for
mutation: two genes of a chromosome
were selected randomly and swapped. The mutation
rate used is 10%.

Selection Criteria: As mentioned earlier, chromosomes
with lower values of $W_v$ are considered
fitter. The Roulette wheel method is used for selection
in accordance with the fitness values of these
chromosomes.

Elitism: If the new generation produced has a fitness value
better than the best fitness value of the previous
generation, then the best solution is replaced by
the new generation. Since the best solutions of
a generation are replaced, this step helps avoid
the local maxima of the solution space and move
towards the global maxima.

Replacement: This method states that, during replacement,
the best solution of a generation is appended
to the solution set of the next generation. This
step helps in preserving the best solution during
genetic operations.

Using these building blocks, the genetic algorithm
for the weighted clustering algorithm was given. This
algorithm is given in Figure 3.

Figure 2. Candidate solution as chromosome and its single gene (Based On: Turgut, Das, Elmasri, & Turgut,
2002, Fig. 3)

Figure 3. Genetic algorithm for weighted clustering in ad-hoc networks 1

Algo_genetic_cluster {
    Generate initial population randomly, with population size = number of nodes
    do {
        while (new_pool_size < old_pool_size) {
            select chromosomes using Roulette wheel method;
            apply crossover using X_Over1;
            apply mutation by swapping genes;
            find fitness value of all the chromosomes;
            replace through appending;
        }
    } while (...);
}
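
Two of the building blocks above, swap mutation and Roulette wheel selection, can be sketched as follows. The sketch makes an assumption the text does not spell out: because a lower sum of $W_v$ means a fitter chromosome, the selection probability is taken to be proportional to the inverse of that sum. The X_Over1 crossover is not reproduced because its details are not given here, and the function names are introduced only for illustration.

```python
import random


def swap_mutation(chromosome, rng):
    """Return a copy of the chromosome with two randomly chosen genes swapped."""
    mutated = list(chromosome)
    i, j = rng.sample(range(len(mutated)), 2)  # two distinct positions
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def roulette_select(population, weight_sums, rng):
    """Pick one chromosome; a lower W sum gives a larger slice of the wheel.

    Assumption: slice size is 1 / (sum of W values), so sums must be positive.
    """
    slices = [1.0 / total for total in weight_sums]
    return rng.choices(population, weights=slices, k=1)[0]


rng = random.Random(1)

original = [3, 1, 4, 0, 2]            # node identifiers in visiting order
mutated = swap_mutation(original, rng)
print(original, mutated)

population = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
weight_sums = [10.0, 2.0]             # second chromosome is fitter
picks = [roulette_select(population, weight_sums, rng) for _ in range(1000)]
print(picks.count(population[1]) > picks.count(population[0]))
```

With these example sums the second chromosome occupies five sixths of the wheel, so it is chosen far more often than the first while the weaker chromosome still keeps a chance of reproducing.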



CONCLUSION 

This article presented an overview of an evolutionary
computing approach using GP and its application to
wireless ad-hoc networks, both of which are currently
"hot" topics in the computer science and networking
research community. The intention of this
article was to show how one could "marry"
concepts from GP with ad-hoc networks to arrive at
interesting results. In particular, we reviewed the WCA
algorithm (Chatterjee, Das and Turgut, 2002), which
uses GP for clustering of nodes in ad-hoc networks.
KEY TERMS

Bio-Inspired Algorithms: Group of algorithms 
modelled on the observed natural phenomena which 
are employed to solve mathematical problems. 

Chromosome: A proposed solution to a problem 
which is represented as a string of bits and can mutate 
and cross-over to create a new solution. 

Clustering: Grouping the nodes of an ad hoc network
such that each group is a self-organized entity
having a cluster-head which is responsible for formation
and management of its cluster.

Cross-Over: Genetic operation which produces a
new chromosome by combining the genes of two or
more parent chromosomes.

Fitness Function: A function which maps the sub- 
jective property of a fitness of a solution to an objective 
value which can be used to arrange different solutions 
in the order of their suitability as final or intermediate 
solution. 

Genes: Genes are building blocks of chromo- 
somes and represent the parameters that need to be 
optimized. 

Genetic Algorithms: The algorithms that are 
modelled on the natural process of evolution. These 
algorithms employ methods such as crossover, muta- 
tion and natural selection and provide the best possible 
solutions after analyzing a group of sub-optimal solu- 
tions which are provided as inputs. 

Initial Population: Set of sub-optimal solutions 
which are provided as inputs to a genetic algorithm 
and from which an optimal solution evolves. 

Mobile Ad-Hoc Network: A multi-hop network
formed by a group of mobile nodes which cooperate
with each other to achieve communication, without
requiring any supporting infrastructure.

Mutation: Genetic operation which randomly alters
a chromosome to produce a new chromosome, adding
a new solution to the solution set.



ENDNOTE 

1 Based on Chatterjee, Das and Turgut, 2002 
INTRODUCTION

Grammatical Inference (also known as grammar 
induction) is the problem of learning a grammar for 
a language from a set of examples. In a broad sense, 
some data is presented to the learner that should return 
a grammar capable of explaining to some extent the 
input data. The grammar inferred from data can then 
be used to classify unseen data or provide some suit- 
able model for it. 

The classical formalization of Grammatical Inference
(GI) is known as Language Identification in the
Limit (Gold, 1967). Here, there is a finite set $S^+$ of
strings known to belong to the language $L$ (the positive
examples) and another finite set $S^-$ of strings not
belonging to $L$ (the negative examples). The language
$L$ is said to be identifiable in the limit if there exists a
procedure to find a grammar $G$ such that $S^+ \subseteq L(G)$,
$S^- \cap L(G) = \emptyset$ and, in the limit, for sufficiently large $S^+$
and $S^-$, $L = L(G)$. The disjoint sets $S^+$ and $S^-$ are given
to provide clues for the inference of the production
rules $P$ of the unknown grammar $G$ used to generate
the language $L$.

Applications of grammatical inference include such diverse
fields as speech and natural language processing, gene
analysis, pattern recognition, image processing, sequence
prediction, information retrieval, cryptography, and
many more. An excellent state-of-the-art
overview of the subject is provided in (de la Higuera,
2005).

Traditionally, most work in GI has focused
on the inference of regular grammars, trying to induce
finite-state automata, which can be efficiently learned.
For context-free languages, some recent approaches have
shown limited success (Starkie, Coste & van Zaanen,
2004), because the search space of possible grammars
is infinite. The parenthesis and palindrome
languages are common test cases for the effectiveness
of grammatical inference methods. Both languages are
context-free. The parenthesis language is deterministic
but the palindrome language is nondeterministic (de
la Higuera, 2005).

The use of evolutionary methods for context-free
grammatical inference is not new, but only a few
attempts have been successful.

Wyard (1991) used a Genetic Algorithm (GA) to
infer grammars for the language of correctly balanced
and nested parentheses with success, but failed on the
language of sentences containing the same number of
a's and b's (the $a^n b^n$ language). In another attempt (Wyard,
1994), he obtained positive results on the inference
of two classes of context-free grammars: the class of
n-symbol palindromes with $2 \le n \le 4$ and a class of
small natural language grammars.

Sen and Janakiraman (1992) applied a GA using a
pushdown automaton to the inference and successfully
learned the $a^n b^n$ language and the parentheses balancing
problem. However, their approach does not scale well.

Huijsen (1994) applied a GA to infer context-free
grammars for the parentheses balancing problem, the
language of equal numbers of a's and b's and the even-length
2-symbol palindromes. Huijsen used a "marker-based"
encoding scheme, which has the main advantage of
allowing variable-length chromosomes. The inference
of regular grammars was successful but the inference
of context-free grammars failed.

The results obtained in these earlier attempts at using GA
for context-free grammatical inference were limited.
The first attempt to use Genetic Programming (GP)
for grammatical inference used a pushdown automaton
(Dunay, 1994) and successfully learned the parenthesis
language, but failed for the $a^n b^n$ language.

Korkmaz and Ucoluk (2001) also presented a GP 
approach using a prototype theory, which provides a 
way to recognize similarity between the grammars in 
the population. With this representation, it is possible to 
recognize the so-called building blocks but the results 
are preliminary. 
Javed and his colleagues (2004) proposed a Genetic
Programming (GP) approach with grammar-specific
heuristic operators and non-random construction of the
initial grammar population. Their approach succeeded
in inducing small context-free grammars.

More recently, Rodrigues and Lopes (2006) pro- 
posed a hybrid GP approach that uses a confusion 
matrix to compute the fitness. They also proposed a 
local search mechanism that uses information obtained 
from the sentence parsing to generate a set of useful 
productions. The system was used for the parenthesis
and palindrome languages with success.



BACKGROUND 

A formal language is usually defined as follows. Given
a finite alphabet $\Sigma$ of symbols, we define the set of all
strings (including the empty string $\epsilon$) over $\Sigma$ as $\Sigma^*$.
Thus, we want to learn a language $L \subseteq \Sigma^*$. The alphabet
$\Sigma$ could be a set of characters or a set of words. The
most common way to define a language is based on
grammars, which give rules for combining symbols
to produce all the sentences of a language.

A grammar is defined by a quadruple $G = (N, \Sigma, P, S)$,
where $N$ is an alphabet of nonterminal symbols, $\Sigma$
is an alphabet of terminal symbols such that $N \cap \Sigma = \emptyset$,
and $P$ is a finite set of production rules of the form $\alpha \to \beta$
for $\alpha, \beta \in (N \cup \Sigma)^*$, where $^*$ denotes the set of
strings that can be formed by taking any number of
symbols, possibly with repetitions. $S$ is a special
nonterminal symbol called the start symbol.

The language $L(G)$ produced from grammar $G$
is the set of all strings consisting only of terminal
symbols that can be derived from the start symbol $S$
by the application of production rules. The process of
deriving strings by applying productions requires the
definition of a new relation symbol $\Rightarrow$. Let $\alpha X \beta$ be a
string of terminals and nonterminals, where $X$ is a
nonterminal. That is, $\alpha$ and $\beta$ are strings in $(N \cup \Sigma)^*$, and
$X \in N$. If $X \to \varphi$ is a production of $G$, we can say
$\alpha X \beta \Rightarrow \alpha \varphi \beta$. It is important to say that one derivation step
can replace any nonterminal anywhere in the string.
We may extend the $\Rightarrow$ relationship to represent zero, one
or many derivation steps, using a $*$ to denote this: $\Rightarrow^*$.
Therefore, we formally define the language
$L(G)$ produced from grammar $G$ as
$L(G) = \{ w \mid w \in \Sigma^*, S \Rightarrow^* w \}$.
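
To make the definition of $L(G)$ concrete, the following sketch enumerates the short sentences of a small grammar by repeatedly applying derivation steps. It is only an illustration under stated assumptions: the grammar `PARENS` (balanced parentheses) and the function name `derive` are chosen for the example, the grammar is assumed to be context-free, and it is assumed to have no empty right-hand sides, so sentential forms never shrink and forms longer than `max_len` can be discarded.

```python
from collections import deque

# Hypothetical example grammar: S -> ( S ) | S S | ( )
PARENS = {"S": [("(", "S", ")"), ("S", "S"), ("(", ")")]}

def derive(productions, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start.

    `productions` maps each nonterminal to a list of right-hand sides
    (tuples of symbols). Only the leftmost nonterminal is expanded, which
    is sufficient for context-free grammars.
    """
    nonterminals = set(productions)
    seen = {(start,)}
    queue = deque(seen)
    language = set()
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal; if there is none, form is a sentence.
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            language.add("".join(form))
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return language
```

With this grammar, `derive(PARENS, "S", 4)` yields the three balanced strings of at most four symbols: `()`, `(())` and `()()`.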



More details about formal languages and grammars
can be found in textbooks such as Hopcroft et
al. (2001).

The Chomsky Hierarchy 

Grammars are classified according to the form of the 
production rules used. They are commonly grouped 
into a hierarchy of four classes, known as the Chomsky 
hierarchy (Chomsky, 1957). 

Recursively enumerable languages: a grammar
is unrestricted, and its productions may replace
any number of grammar symbols by any other
number of grammar symbols. The productions
are of the form $\alpha \to \beta$ with $\alpha, \beta \in (N \cup \Sigma)^*$.

Context-sensitive languages: they have grammars
with productions that replace a single nonterminal
by a string of symbols, whenever the nonterminal
occurs in a specific context, i.e., has certain left
and right neighbors. These productions are of the
form $\alpha A \gamma \to \alpha \beta \gamma$, with $A \in N$ and
$\alpha, \beta, \gamma \in (N \cup \Sigma)^*$. $A$ is replaced by $\beta$ if it occurs
between $\alpha$ and $\gamma$.

Context-free languages: in this type, grammars
have productions that replace a single nonterminal
by a string of symbols, regardless of this
nonterminal's context. The productions are of the form
$A \to \alpha$ for $A \in N$ and $\alpha \in (N \cup \Sigma)^*$; thus $A$ has
no context.

Regular languages: they have grammars in which
a production may only replace a single nonterminal
by another nonterminal and a terminal. The
productions are of the form $A \to B\alpha$ or $A \to \alpha B$
for $A, B \in N$ and $\alpha \in \Sigma^*$.

It is sometimes useful to write a grammar in a
particular form. The most commonly used in
grammatical inference is the Chomsky Normal Form. A
CFG $G$ is in Chomsky Normal Form (CNF) if all
production rules are of the form $A \to BC$ or $A \to a$ for
$A, B, C \in N$ and $a \in \Sigma$.

The Cocke-Younger-Kasami Algorithm 

To determine whether a string can be generated by
a given context-free grammar in CNF, the
Cocke-Younger-Kasami (CYK) algorithm can be used. This
algorithm is efficient and has complexity $O(n^3)$, where
$n$ is the sentence length.

In the CYK algorithm, first a triangular table that tells
whether the string $w$ is in $L(G)$ is constructed. The
horizontal axis corresponds to the positions of the string
$w = a_1 a_2 \ldots a_n$. The table entry $V_{r,s}$ is the set of variables
$A \in N$ such that $A \Rightarrow^* a_r a_{r+1} \ldots a_{r+s-1}$. We are interested in
whether the start symbol $S$ is in the set $V_{1,n}$ because
that is the same as saying $S \Rightarrow^* a_1 a_2 \ldots a_n$ or $S \Rightarrow^* w$,
i.e., $w \in L(G)$.

To fill the table, we work row-by-row upwards. Each 
row corresponds to one length of substrings; the bottom 
row is for strings of length 1, the second-from-bottom 
row for strings of length 2 and so on, until the top row 
corresponds to the one substring of length n which is 
w itself. The pseudocode is in Figure 1. 
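
The table-filling procedure described above can be sketched in Python as follows. The sketch is an illustration, not the authors' code: the names `cyk_table`, `recognizes`, `UNARY` and `BINARY`, the encoding of productions as tuples, and the small CNF grammar (S → AS | b, A → SA | a) are assumptions chosen for the example. The entry `V[(r, s)]` holds the nonterminals deriving the substring of length `s` that starts at position `r` (1-based).

```python
# Hypothetical CNF grammar: S -> A S | b ; A -> S A | a
UNARY = {("S", "b"), ("A", "a")}                # productions A -> a
BINARY = {("S", "A", "S"), ("A", "S", "A")}     # productions A -> B C

def cyk_table(word, unary, binary):
    """Build the CYK table for `word` under a grammar in CNF."""
    n = len(word)
    V = {}
    # Bottom row: substrings of length 1.
    for r in range(1, n + 1):
        V[(r, 1)] = {A for (A, a) in unary if a == word[r - 1]}
    # Upper rows: substrings of length s, split after the first k symbols.
    for s in range(2, n + 1):
        for r in range(1, n - s + 2):
            cell = set()
            for k in range(1, s):
                left, right = V[(r, k)], V[(r + k, s - k)]
                cell |= {A for (A, B, C) in binary if B in left and C in right}
            V[(r, s)] = cell
    return V

def recognizes(word, unary, binary, start="S"):
    """True if the start symbol derives the whole (non-empty) word."""
    n = len(word)
    return n > 0 and start in cyk_table(word, unary, binary)[(1, n)]
```

The three nested loops over `s`, `r` and `k` are what give the cubic dependence on the sentence length. For the grammar above, `recognizes("aab", UNARY, BINARY)` is true while `recognizes("ba", UNARY, BINARY)` is false.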



Genetic Programming 

Genetic Programming (GP) is an evolutionary technique
used to search over a huge state space of structured
representations (computer programs). Each program
represents a possible solution written in some language.
The GP algorithm is summarized in Figure 2
(Koza, 1992).

The evaluation of a solution is accomplished by
using a set of training examples known as fitness cases,
which, in turn, are composed of sets of input and output
data. Usually, the fitness is a measure of the deviation
between the expected output for each input and the
computed value given by GP (Banzhaf, Nordin, Keller
& Francone, 2001).



Figure 1. The CYK algorithm 



For r = 1 to n do
    $V_{r,1} = \{ A \mid A \to a_r \in P \}$
For s = 2 to n do
    For r = 1 to n-s+1 do
        $V_{r,s} = \emptyset$
        For k = 1 to s-1 do
            $V_{r,s} = V_{r,s} \cup \{ A \mid A \to BC \in P,\ B \in V_{r,k} \text{ and } C \in V_{r+k,s-k} \}$



Figure 2. The GP algorithm 



Generate Initial Population Randomly
While not (Stopping Condition)
begin
    Evaluate the fitness of each individual
    Select individuals according to their fitness
    Modify them by applying genetic operators
end
Return the best individual found
There are two main selection methods used in GP:
fitness proportionate and tournament selection. In
fitness proportionate selection, programs are selected
randomly with probability proportional to their fitness. In
tournament selection, a fixed number of programs
are taken randomly from the population and the one
with the best fitness in this group is chosen. In this
work, we use tournament selection.
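
The two selection methods can be contrasted with a short sketch. The function names and the convention that a higher fitness value is better are assumptions made for the example; `fitness` is any callable that maps an individual to a non-negative number.

```python
import random

def tournament_select(population, fitness, k, rng):
    """Draw k individuals at random and return the fittest of them."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)

def proportionate_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    scores = [fitness(individual) for individual in population]
    return rng.choices(population, weights=scores, k=1)[0]
```

The tournament size `k` controls the selection pressure: with `k` equal to the population size the best individual always wins, while with `k = 1` selection is uniformly random.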

Reproduction is a genetic operator that simply 
copies a program to the next generation. Crossover, 
on the other hand, combines parts of two individuals 
to create two new ones. Mutation changes randomly 
a small part of an individual. 

Each run of the main loop of GP creates a new 
generation of computer programs that substitutes the 
previous one. The evolution is stopped when a satis- 
factory solution is achieved or a predefined maximum 
number of generations is reached. 



A GRAMMAR GENETIC 
PROGRAMMING APPROACH 

We present how a GP approach can be applied to the
inference of context-free grammars. First, we discuss
the representation of the grammars. The modifications
needed in the genetic operators are also presented. In the
last section, the grammar evaluation is discussed.

Initial Population 

It is possible to represent a CFG as a list of structured
trees. Each tree represents a production with its
left-hand side as a root and the derivations as leaves.
Figure 3 shows the grammar $G = (N, \Sigma, P, S)$ with
$\Sigma = \{a, b\}$, $N = \{S, A\}$ and
$P = \{S \to AS;\ S \to b;\ A \to SA;\ A \to a\}$.

The initial population can be created with random
productions, provided that all the productions are
reachable, directly or indirectly, starting from S.

Genetic Operators 

The crossover operator is applied over a pair of
grammars and works as follows. First, a production of the
first grammar is chosen using a tournament selection. If the
second grammar has no production with the same left-hand
side as the production chosen, crossover is rejected.
Otherwise, the productions are swapped.

The mutation operation is applied to a single selected
grammar. A production is then chosen using the same
mechanism as in crossover. A new production, with the
same left-hand side and a randomly generated right-hand
side, replaces the production chosen.

The crossover probability is usually high (≈90%)
and the mutation probability is usually low (≈10%).
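
A minimal sketch of these two operators is given below. Several simplifications are assumptions of the example, not part of the description above: a grammar is a list of `(left_hand_side, right_hand_side)` pairs, productions are kept in CNF, and the production to operate on is drawn uniformly at random instead of by tournament selection.

```python
import random

def crossover(g1, g2, rng):
    """Swap one production of g1 with a production of g2 that has the same
    left-hand side; return both grammars unchanged if no such production exists."""
    i = rng.randrange(len(g1))
    lhs = g1[i][0]
    matches = [j for j, (other_lhs, _) in enumerate(g2) if other_lhs == lhs]
    if not matches:
        return list(g1), list(g2)  # crossover rejected
    j = rng.choice(matches)
    child1, child2 = list(g1), list(g2)
    child1[i], child2[j] = g2[j], g1[i]
    return child1, child2

def mutate(grammar, nonterminals, terminals, rng):
    """Replace one production by a new one with the same left-hand side
    and a randomly generated CNF right-hand side."""
    i = rng.randrange(len(grammar))
    lhs = grammar[i][0]
    if rng.random() < 0.5:
        rhs = (rng.choice(terminals),)                             # A -> a
    else:
        rhs = (rng.choice(nonterminals), rng.choice(nonterminals))  # A -> B C
    child = list(grammar)
    child[i] = (lhs, rhs)
    return child
```

Both functions return new lists and leave their arguments untouched, which makes it easy to apply them with the probabilities quoted above and keep the parents when an operator is not applied.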

Unfortunately, using only the genetic operators
mentioned, the convergence of the algorithm is not
guaranteed. In our recent work, we demonstrated
that the use of two local search operators is needed:
an incremental learning operator (Rodrigues & Lopes,
2006) and an expansion operator (Rodrigues & Lopes,
2007). The first uses the information obtained from a
CYK table to discover which production is missing
to cover the sentence. The latter can expand the set of
productions dynamically, providing diversity.




Figure 3. An example of a CFG represented as a list of structured trees 







Evolutionary Grammatical Inference 



The Incremental Learning Operator 

This operator is applied before the evaluation of each
grammar in the population. It uses the CYK table
obtained from the parsing of positive examples to
allow the creation of a useful new production. The
pseudocode is in Figure 4.

Once this process is completed with success,
hopefully there will be a set of positive examples (possibly
all) recognized by the grammar. However, there is no
guarantee that all negative examples will still be
rejected by the grammar.

The Expansion Operator 

This operator adds a new nonterminal to the grammar
and generates a new production with this new
nonterminal as its left-hand side. This new approach allows
grammars to grow dynamically in size. To avoid creating a
useless production, a second production with another
nonterminal on the left-hand side and the new nonterminal on
the right-hand side is also generated. It is important to
emphasize that the new operator adds two productions to the
grammar.

This operator promotes diversity in the population 
that is required in the beginning of the evolutionary 
process. 
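
The expansion step can be sketched as below. This is an illustration under assumptions: the grammar is a list of `(left_hand_side, right_hand_side)` pairs in CNF, the new nonterminal is named `X<number>` (a hypothetical naming scheme), and the right-hand sides are drawn at random. The linking production only makes the new nonterminal reachable if its own left-hand side is reachable from the start symbol.

```python
import random

def expand(grammar, nonterminals, terminals, rng):
    """Return (grammar, nonterminals) grown by one nonterminal and two productions."""
    index = len(nonterminals)
    while f"X{index}" in nonterminals:   # pick a name that is not in use
        index += 1
    fresh = f"X{index}"
    # Production 1: the new nonterminal on the left-hand side.
    defining = (fresh, (rng.choice(terminals),))
    # Production 2: an existing nonterminal whose right-hand side uses the
    # new nonterminal, so that the first production is not useless.
    owner = rng.choice(nonterminals)
    partner = rng.choice(nonterminals)
    pair = (fresh, partner) if rng.random() < 0.5 else (partner, fresh)
    linking = (owner, pair)
    return grammar + [defining, linking], nonterminals + [fresh]
```

Each call grows the grammar by exactly two productions and the nonterminal alphabet by one symbol, matching the description of the operator above.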



Grammar Evaluation 

In grammatical inference, we need to train the system
with both positive and negative examples to avoid
overgeneralization. Usually the evaluation is done by
counting the positive examples covered by a grammar
in proportion to the total number of positive examples. If the
grammar covers some negative examples, it is
penalized in some way.

In our recent work, we use a confusion matrix
that is typically used in supervised learning (Witten &
Frank, 2005). Each column of the matrix represents
the number of instances predicted either positively or
negatively, while each row represents the real classification
of the instances. The entries in the confusion matrix have
the following meaning in the context of our study:

TP is the number of positive instances recognized by the grammar.

TN is the number of negative instances rejected by the grammar.

FP is the number of negative instances recognized by the grammar.

FN is the number of positive instances rejected by the grammar.



Figure 4. The incremental learning operator pseudocode 



For each positive example
    Construct the CYK table with $V_{r,s}$
    If the example is not recognized
        If $V_{1,n}$ is not empty
            Then clone the root production changing
                 the left-hand side by S.
        Else
            If $V_{2,n-1}$ is not empty
                Then add production $S \to V_{1,1} V_{2,n-1}$
            For s = n-1 to n/2 do
            begin
                If $V_{1,s}$ is not empty
                    Then If $V_{s+1,n-s}$ is not empty
                        Then add production $S \to V_{1,s} V_{s+1,n-s}$
            end
There are several measures that can be obtained
from the confusion matrix. The most common is total
accuracy, which is obtained from the total number of correctly
classified examples divided by the total number of
instances. In this paper we used two other measures:
specificity (Equation 1) and sensitivity (Equation 2).
These measures evaluate how well positive and negative
examples are recognized by the classifier.



$$\text{specificity} = \frac{TN}{TN + FP} \qquad (1)$$

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$



The fitness is computed as the product of these
measures, leading to a balanced heuristic. This
fitness measure was proposed by Lopes, Coutinho &
Lima (1998) and is widely used in many classification
problems.

The use of a confusion matrix provides a better
evaluation of the grammars in the population, because
grammars with the same accuracy rate usually have
different values for specificity and sensitivity.
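
The fitness computation can be sketched as follows. The function names are chosen for the example, `accepts` stands for any recognizer of a candidate grammar (for instance a CYK parser), and returning 0 when a denominator is zero is an assumption of this sketch.

```python
def confusion_counts(accepts, positives, negatives):
    """Count (TP, TN, FP, FN) for a recognizer over labelled examples."""
    tp = sum(1 for sentence in positives if accepts(sentence))
    fn = len(positives) - tp
    fp = sum(1 for sentence in negatives if accepts(sentence))
    tn = len(negatives) - fp
    return tp, tn, fp, fn

def fitness(tp, tn, fp, fn):
    """Product of specificity and sensitivity."""
    # Assumption: an empty class contributes 0 rather than raising an error.
    specificity = tn / (tn + fp) if tn + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return specificity * sensitivity
```

As a worked example with 10 positive and 10 negative sentences, a grammar with TP = 10, TN = 6, FP = 4, FN = 0 and a grammar with TP = 8, TN = 8, FP = 2, FN = 2 both have an accuracy of 16/20 = 0.8, but their fitness values are 0.6 × 1.0 = 0.60 and 0.8 × 0.8 = 0.64 respectively, so the product favours the more balanced grammar.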



FUTURE TRENDS 

The GP approach for the grammatical inference is based 
on the CYK algorithm and the confusion matrix. The 
preliminary results are promising but there are two 
problems that must be addressed. 

The first is that the solution found is not
necessarily the smallest one. Depending on the run, the
grammar inferred varies in size and, sometimes, it
can be difficult to understand and may have useless
or redundant production rules. Further work will focus
on devising a mechanism able to favor shorter partial
solutions.

The second is called "bloat", the uncontrolled growth 
of the size of an individual in the population (Monsieurs 
& Flerackers, 2001). The use of an expansion operator 
may cause this undesirable behavior. Nevertheless, this 
behavior was not detected in the experiments because all 
useless productions are eliminated during the search. 



CONCLUSION 

This article proposes a GP approach for context-free
grammar inference. In this approach, an individual is
a list of structured trees representing its productions,
with their left-hand side as the root and the derivations
as leaves. It uses a local search operator, named Incre- 
mental Learning, capable of adjusting each grammar 
according to the positive examples. It also uses an 
expansion operator which adds a new production to 
the grammar allowing the grammars to grow in size. 
This operator promotes diversity in the population that 
is required in the earlier generations. 

The use of a local search mechanism that is capable 
of learning from examples promotes a fast convergence. 
The preliminary results demonstrated that the approach 
is promising. 
KEY TERMS

CYK: The Cocke-Younger-Kasami algorithm, used
to determine whether a sentence can be generated
by a grammar in Chomsky Normal Form.

Evolutionary Computation: Large and diverse 
class of population-based search algorithms that is 
inspired by the process of biological evolution through 
selection, mutation and recombination. They are itera- 
tive algorithms that start with an initial population of 
candidate solutions and then repeatedly apply a series 
of the genetic operators. 

Finite Automata: A model of behavior composed 
of a finite number of states, transitions between those 
states, and actions. They are used to recognize regular 
languages. 

Genetic Algorithm: A type of evolutionary
computation algorithm in which candidate solutions are
typically represented by vectors of integers or bit strings,
that is, by vectors of binary values 0 and 1.

Heuristic: Function used for making certain deci- 
sions within an algorithm; in the context of search algo- 
rithms, typically used for guiding the search process. 

Local Search: A type of search method that starts at
some point in search space and iteratively moves from
position to neighbouring position using heuristics.

Pushdown Automata: Finite automata that can
make use of a stack containing data. They are used to
recognize context-free languages.

Search Space: Set of all candidate solutions of a 
given problem instance. 
INTRODUCTION

Evolutionary Robotics is a field of Autonomous
Robotics where the controllers that implement
behaviours are obtained through some kind of Evolutionary
Algorithm. The aim behind this technique is to obtain
controllers while minimizing human intervention. This is
very interesting in order to achieve complex behaviours
without introducing a "human bias". Sensors, body
and actuators are usually different for a human being
and for a robot, so it is reasonable to think that the best
strategy obtained by the human designer is not
necessarily the best one for the robot. This article will briefly
describe Evolutionary Robotics and its advantages over
other approaches to Autonomous Robotics as well as
its problems and drawbacks.



BACKGROUND 

The first modern attempts to obtain a robot that could
be called "autonomous", that is, with the ability to
adapt to a non-predefined environment and perform
its tasks adequately, date from the late sixties, and they
basically tried to reproduce human reasoning in the
robot. The reasoning process was divided into several
steps (input data interpretation, environment modelling,
planning and execution) that were performed
sequentially. As time passed, robots got better thanks
to better design and construction, more computational
capabilities and improvements in the Artificial
Intelligence techniques employed. But some problems also
appeared and persisted: lack of real-time
reaction, inability to handle dynamic environments and
unmanaged complexity as desired behaviours became
more complex.

In the late eighties a new approach, called Behaviour
Based Robotics, was introduced. It emphasized
the behaviour, no matter how it was obtained, as
opposed to traditional (knowledge based) Autonomous
Robotics where the emphasis was on modelling the
knowledge needed to perform the behaviour. This new
approach proposes a direct connection between sensors
and actuators with no explicit environment modelling.
Behaviour Based Robotics has proven to be very useful
when implementing low level behaviours, but it has
also shown problems when scaling to more complex
behaviours. Phil Husbands (Husbands et al., 1994)
and Dave Cliff (Cliff et al., 1993a) have shown that
it is not easy to design a system that connects sensors
and actuators in order to achieve complex behaviours.
Regardless of whether the system is monolithic
(designing a complex system in just one step is never easy)
or modular, the design problem is difficult, basically
because the possible interactions between
modules grow exponentially. An additional problem is
that human designed controllers for autonomous robots
are not necessarily the best choice; sometimes they are
simply not a good choice. A human designer cannot
avoid perceiving the world with his or her own sensors and
developing solutions for problems taking into account
the perceptions and the actuations he or she can perform.
Furthermore, humans tend to simplify and modularize
problems and this is not always possible in complex
environments.

Due to these drawbacks, in the early nineties some
researchers started to use Evolutionary Algorithms in
order to automatically obtain controllers for autonomous
robots, leading to a new robotics field: Evolutionary
Robotics. Some examples of this research line are
the papers by Inman Harvey (Harvey et al., 1993), Phil
Husbands (Husbands et al., 1994), Dave Cliff (Cliff
et al., 1993a) and Randall Beer and John Gallagher
(Beer and Gallagher, 1992). The idea is
very simple and very promising and, again, has proven
very effective with simple behaviours. But, even if
it solves some problems, it also has its own problems
when dealing with complex behaviours. In the next
section we will talk about those problems and, in general,
about the main aspects to take into account when using
evolution in Autonomous Robotics.



EVOLUTIONARY ROBOTICS 

The basis of Evolutionary Robotics is to use evolution- 
ary algorithms to automatically obtain robot control- 
lers. In order to do that, there are many decisions to 
be made. First of all, one must decide what to evolve 
(controllers, morphology, both?). Then, whatever is 
to be evolved has to be encoded in chromosomes. An 
evolutionary algorithm must be chosen. It has to be 
decided where and how to evaluate each individual, 
etc. These issues will be addressed in the following 
sections. 

What to Evolve 

The first decision to be made is what to evolve. The most 
common choice is to evolve controllers for a given robot, 
but we can also evolve the morphology or both things 
together. If we choose to evolve only the controllers, 
we also have to decide how they will be implemented. 
The most usual choices are artificial neural networks, 
fuzzy logic systems and classifier systems. 

Classifier systems are made up of rules (the
classifier set). Each rule consists of a set of conditions
and a message. If the conditions are satisfied, the
message can produce an action on an actuator and is
stored in a message list. Sensor values are also stored
in this message list. Messages in the message list may
change the state of conditions, leading to a different set
of activated rules. There is an apportionment of credit
system that changes the strength of each rule and a rule
discovery system, where a genetic algorithm generates
new rules using existing rules in the classifier set and
their strength. An example of classifier systems is the
work of Dorigo and Colombetti (Colombetti et al., 1996),
(Dorigo and Colombetti, 1993, 1995, 1998).

Fuzzy logic has also been used to encode
controllers. Possible sensed values and acting values are
encoded into predefined fuzzy sets and the rules that
relate both things can be evolved. Examples: (Cooper,
1995), (Hoffmann and Pfister, 1994), (Matellán
et al., 1998).

Artificial neural networks are the most common way
of implementing controllers in evolutionary robotics.
On one hand, they are noise and failure tolerant and,
on the other, they can be used as universal function
approximators and can be easily integrated with an
evolutionary algorithm to obtain a controller from
scratch. Many researchers have used ANNs; to
mention just some of them: (Beer and Gallagher, 1992), (Cliff
et al., 1992), (Floreano and Mondada, 1998), (Harvey
et al., 1993), (Kodjabachian and Meyer, 1995), (Lund
and Hallam, 1996), (Nolfi et al., 1994) and (Santos and
Duro, 1998).

How to Encode What We are Evolving 

When encoding a controller into the chromosome, the 
most obvious choice, and the most common one, is to 
make a direct encoding. That is, each controller param- 
eter becomes a gene in the chromosome. For instance, 
if the controller is an ANN, each synaptic weight as 
well as the biases and other possible parameters that 
describe the ANN topology correspond to a gene, (Ma- 
taric and Cliff, 1996), (Miglino et al, 1995a). This can 
lead to very large chromosomes, as the chromosome 
size grows proportional to the square of the network 
size (in case of feedforward networks), increasing 
the dimensionality of the search space and making it 
more difficult to obtain a solution in reasonable time. 
Another problem is that the designer has to predefine 
the full topology (size, number of neurons, etc.) of 
the ANN, which is, in general, not obvious usually 
leading to a trial and error procedure. To address this 
problem, some researchers employ encoding schemes 
where the chromosome length may vary in time (Cliff 
et al., 1993b). 
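
The growth of the chromosome under direct encoding can be illustrated with a short sketch. It assumes a fully connected feedforward network described by a list of layer sizes, with one gene per synaptic weight and one per bias; the function names and the gene layout (each neuron's incoming weights followed by its bias) are assumptions made for the example.

```python
def gene_count(layer_sizes):
    """Number of genes for a fully connected feedforward network
    (one gene per weight plus one per bias)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def decode(chromosome, layer_sizes):
    """Split a flat chromosome into per-layer lists of neuron parameters.

    Each neuron receives n_in weights followed by its bias.
    """
    layers, pos = [], 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        neurons = []
        for _ in range(n_out):
            neurons.append(chromosome[pos:pos + n_in + 1])
            pos += n_in + 1
        layers.append(neurons)
    return layers
```

For example, a 4-4-2 network needs 5·4 + 5·2 = 30 genes, while doubling every layer to 8-8-4 needs 9·8 + 9·4 = 108 genes, which shows the roughly quadratic growth of the search space mentioned above.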

Another possibility is to encode elements that,
following a set of rules, encode the development of
the individual (Guillot and Meyer, 1997), (Cangelosi
et al., 1994), (Kodjabachian and Meyer,
1998). Some authors even use this system to evolve
both the controller and the morphology simultaneously,
but mostly for virtual organisms (Sims, 1994) or very
simplified real robots.

Where to Carry Out the Evolution Process

To determine how good an individual is, it is necessary
to evaluate this individual in an environment during a
given time interval. This evaluation has to be performed
more than once in order to make the process
independent of the initial conditions. The more complex the
behaviour is, the more time is required to evaluate
an individual. The evaluation of the individual is usually
the most time consuming phase, by far, in the whole
evolutionary process for evolutionary robotics. The
evaluation can be carried out in a real environment, in
a simulated environment or both.

Evaluation in a real environment has the obvious
advantage that the controllers obtained will work
without problems in the real robot. But it is much slower
than evaluation in simulated environments, it presents
the danger of harming the robot and it imposes many
limitations on the evaluation functions that may be used. This
is why researchers that consider evaluation in a real
environment mostly use small robots in controlled
environments (Floreano and Mondada,
1995, 1996, 1998).

The alternative is to perform the evaluation in a
simulated environment (Beer and Gallagher, 1992),
(Cliff et al., 1993a), (Meeden, 1996), (Miglino et al.,
1995a, b). This is faster and it permits parallelizing the
algorithm and using fitness functions that are impossible
in a real environment. Nevertheless, it has the additional
problem of how to carry out this simulation in order to
obtain controllers that work in the real world. Jakobi
(Jakobi, 1997) formalized this problem and established
the conditions to be taken into account in order to
successfully transfer the controllers obtained in simulation
to the real robot.

Some researchers choose to perform a simulated
evaluation for most generations of the evolutionary
process, except for the last generations, where
a real evaluation is performed to make sure that the
controllers work in the real robot. Nevertheless, this
approach presents the same problems, although somewhat
reduced, as the case of evolution in a real environment
(Miglino et al., 1995a).



How to Evaluate

Once the previous choices have been made, it is
necessary to decide how the fitness will be calculated.
There are two different perspectives for doing this: a
local perspective or a global perspective (Mondada and
Floreano, 1995). The first one consists in establishing,
for each step of the robot's life, a fitness of its actions in
relation to its goal. The final fitness will be the sum of the
fitness values corresponding to each step. This strategy
presents two main drawbacks. Except in toy problems,
it is very difficult to decide beforehand the fitness of
each action towards a final objective. Sometimes the
same action may be good or bad depending on what
happened before or after in the context. In addition, this
approach implies an external determination of goodness,
as it is the designer who imposes, through these
action-fitness pairs, how the robot must act.

The global approach, on the other hand, implies
defining a fitness criterion based on how good the robot
was at achieving its final goal. The designer does not
specify how good each action is in order to maximize
fitness, and the evolutionary algorithm has a lot more
freedom for discovering the best final controller. The
main problem here is that there is a lot less knowledge
injected into the evolutionary process, so the evolution
may find loopholes in the problem specification
and maximize fitness without really achieving the
function we seek.

There are two main ways in which global fitness
can be obtained: external and internal. By external we
mean a fitness assigned by someone or some process
outside the robot. An extreme example of external
global evaluation of the robot behaviour is presented
in (Lund et al., 1998).

The other possible approach is to employ an internal
representation of fitness, that is, to employ clues
in the environment that the robot is aware of and that
it can use in order to judge its level of fitness without
the help of any external evaluator (apart from the
environment itself). A concept often used in order to
implement this approach is that of internal energy.
The robot has an internal energy level and this energy
level increases or decreases according to a set of rules
or functions related to the robot's perceptions. The final
fitness of the robot is given by the level of energy at
the end of its life.



FUTURE TRENDS



The main problem of the evolutionary approach to
robotics is that, although it is easy to obtain simple
behaviours, it does not scale well to really complex
behaviours without introducing some human knowledge
into the system. An obvious solution to this problem is the
reduction of evaluation time, which is slowly happening
thanks to increasing computational capabilities.
Another way of controlling this problem is to try to
obtain a better encoding scheme using development.

Another problem that arises when trying to minimize
human intervention is that the evolutionary algorithm
can lead to a behaviour that optimizes the defined fitness
function but does not correspond to the
desired behaviour, due to an incomplete or inadequate
fitness function. This problem must be addressed from
a theoretical and formal point of view so that
appropriate fitness functions can be methodologically
obtained.

Finally, there is a problem in common with Au- 
tonomous Robotics: benchmarking. It is not easy to 
compare different approaches and usually we can only 
say if a behaviour is satisfactory or not. It is hard to 
compare it with other similar behaviours or to know 
if the results would be better by changing something 
in the approach followed. 



CONCLUSION 

Since its birth at the beginning of the nineties,
Evolutionary Robotics has quickly developed into the most
common way of obtaining behaviours in Behaviour
Based Robotics. It has shown that it is the easiest way
to automatically obtain controllers for autonomous
robots while trying to minimize the human factor, at
least when the behaviours are not too complex, because
it is easy to encode every tool used to
implement controllers and obtain a solution by just
defining a fitness function. Within Evolutionary Robotics,
simulated evaluation is the preferred way of evaluating
how good a candidate solution is, and Artificial Neural
Networks are the preferred tool to implement controllers
due to their noise and failure tolerance and their
nature as universal function approximators. Due to the
time required to evaluate individuals and the number of
individuals that must be evaluated, the selection of the
correct evolutionary algorithm and its parallelization
are critical factors.

KEY TERMS

Artificial Neural Network: An interconnected
group of artificial neurons, which are elements that use
a mathematical model that reproduces, through a great
simplification, the behaviour of a real neuron, used for
distributed information processing. They are inspired
by nature in order to achieve some characteristics
present in real neural networks, such as error and
noise tolerance, generalization capabilities, etc.

Autonomous Robotics: The field of Robotics that 
tries to obtain controllers for robots so that they are tol- 
erant and may adapt to changes in the environment. 

Behaviour Based Robotics: The field of Autonomous
Robotics that proposes not to pay attention to
the knowledge that leads to behaviours, but just to
implement them somehow. It also proposes a direct
connection between sensors and actuators for every
controller running in the robot, eliminating the typical
sensor interpretation, world modelling and planning
stages.

Evolutionary Algorithm: Stochastic population-based
search algorithm inspired by natural evolution.
The problem is encoded in an n-dimensional search
space where individuals represent candidate solutions.
Better individuals have higher reproduction probabilities
than worse individuals, thus allowing the fitness of
the population to increase through the generations.

Evolutionary Robotics: The field of Autonomous 
Robotics, usually also considered as a field of Behaviour 
Based Robotics, that obtains the controllers using some 
kind of evolutionary algorithm. 



Knowledge Based Robotics: The field of Autono- 
mous Robotics that tries to achieve "intelligent" behav- 
iours through the modelling of the environment and a 
process of planning over that model, that is, modelling 
the knowledge that generates the behaviour. 

MacroEvolutionary Algorithm: Evolutionary
algorithm using the concept of species instead of
individuals. Thus, low fitness species become extinct and
new species appear to fill their place. Its evolution is
smoother, slower but less inclined to fall into local
optima as compared to other evolutionary algorithms.



INTRODUCTION

Traditionally physical systems have been designed 
by engineers using complex collections of rules and 
principles. The design process is top-down in nature 
and begins with a precise specification. This contrasts 
very strongly with the mechanisms which have pro- 
duced the extraordinary diversity and sophistication of 
living creatures. In this case the "designs" are evolved 
by a process of natural selection. The design starts as 
a set of instructions encoded in the DNA whose cod- 
ing regions are first transcribed into RNA in the cell 
nucleus and then later translated into proteins in the 
cell cytoplasm. The DNA carries the instructions for 
building molecules using sequences of amino acids. 
Eventually after a number of extraordinarily complex 
and subtle biochemical reactions an entire living or- 
ganism is created. The survivability of the organism 
can be seen as a process of assembling a larger system 
from a number of component parts and then testing the 
organism in the environment in which it finds itself 
(Miller, 2000). 

The main goal of evolvable hardware is to build
a digital circuit using bio-inspired methods such as genetic
algorithms. Here the potential solutions are coded as
configuration vectors which control the interconnections
between logic cells inside the reconfigurable circuit.
The configuration vectors represent the genotype, and
each single configuration vector is an individual with
its own characteristics (like a chromosome).

The individuals are generated by genetic operators
such as crossover or mutation. Each individual yields one
candidate circuit, which is tested in the evaluation module.
The circuit obtained from the individual constitutes the
phenotype. The circuit behavior is compared with the target
function that we wish to implement. The result is the
fitness: if the circuit approximates the behavior of the
target function, the individual that generated the circuit
receives a good fitness. Each individual then enters,
with its fitness, the selection module, where the future
parents for crossover and mutation are decided. Finally,
we obtain a circuit that implements the target
function. This is an evolved synthesis of a digital circuit
- a method of the assemble-and-test kind.

This method can be useful because it explores the design
space beyond the limits imposed by traditional design
methods. Two research directions have been developed in
evolvable hardware. In extrinsic evolvable hardware the
individuals are obtained from a software implementation
on a computer, and the phenotype consists of high-level
abstract circuits such as SPICE object files or FPGA
configuration files (.bit). Intrinsic evolution, on the other
hand, assumes that the entire evolution process takes place
inside one or more chips (FPGAs): the hardware
implementation of evolved hardware.

The challenge is to design an intrinsic evolution,
because it can be used for applications such as robot
control systems. But this involves implementing the
software-based algorithms in hardware modules.



BACKGROUND 

The areas of dynamically reconfigurable hardware and
evolvable hardware have evolved quickly in recent years.
Ten years ago, the implementation of digital circuits
with a high degree of complexity involved many
problems, caused especially by technological limits. The
market consisted mostly of complex programmable gate
arrays or low-grain field programmable gate
arrays, whose main problems were the number
of Boolean cells available on chip and the delay time.
The fast evolution of the technology has since increased
the performance of programmable circuits. Thus,
today it is possible to implement a high-speed central
processing unit core which is comparable with
application-specific integrated circuit implementations.
Moreover, the low product costs mean that a modern
programmable digital circuit can be purchased by end
users such as students and researchers. Thus an evolution
in design, synthesis and implementation techniques
for programmable logic circuits is required. One very
attractive direction of research is the implementation of
bio-inspired hardware systems, such as neural networks
or evolutionary algorithms (e.g. genetic algorithms), on
programmable logic circuits.

The first research direction in this area is to find ways
to improve genetic algorithm performance through
hardware implementation. Software implementations
have the advantage of flexibility and easy configuration.
However, the convergence speed is slow because
the steps of the algorithm are executed serially.
To increase the speed, a parallel implementation of
the modules is required (Goldberg, 1995). This is done
by hardware implementation on programmable logic
circuits. Many investigations have been carried out in
this area.

The second research direction joins the concept
of assemble-and-test with an evolutionary
algorithm to gradually improve the quality of a design.
This approach has largely been adopted in the nascent
field of evolvable hardware, where the task is to build
an electronic circuit.

Thompson (Thompson, 1999) carried out the first
research in this direction: using a reconfigurable platform,
he showed that it is possible to evolve a circuit which can
discriminate between two square wave signals. He
demonstrated that it is possible to design a digital circuit
using an evolutionary algorithm. The evolutionary process
exploited physical properties of the underlying silicon
substrate to produce an efficient circuit. He argued that
artificial evolution can produce designs for electronic
circuits which lie outside the scope of conventional
methods. Koza (Koza, 1997) pioneered the extrinsic
evolution of analogue circuits using SPICE and
automatically generated circuits which are competitive
with those of human designers.



Most workers are content with extrinsic evolution
because the evolutionary algorithm is software-based.
But Scott and others have achieved different solutions for
the hardware implementation of evolutionary algorithms.
In (Scott, 1997) a hardware genetic algorithm is designed
and implemented as a pipelined hardware structure in an
FPGA. This work demonstrates that a fully integrated
evolved hardware (intrinsic) solution can be implemented.
The hardware modules for crossover, selection and
mutation are designed using combinational networks such
as the systolic arrays presented in (Bland, 2001).

Miller et al. (Miller, 2000) provide a reference
work on the evolution of combinational
digital circuits that implement arithmetic functions. They
first use gate networks which evolve into arithmetic
circuits, and then demonstrate that it is possible to use
the evolution of some sub-circuits to achieve more complex
circuits.

Another example of extrinsic evolved synthesis
is given by Martin's work (Martin, 2001). Here the
phenotype is given by sequences in a hardware description
language (Handel-C).

But genetic algorithms are used in hardware
design in many more areas. In (Shaaban, 2001) one is used
for integrated circuit design in semiconductor detectors.
Yasunaga (Yasunaga, 2001) uses hardware synthesis
of digital circuits in a reconfigurable programmable logic
array structure to implement a speech recognition system.
Recently, evolvable synthesis has been used for sequential
circuit design (Ali, 2004) and for digital filter design
(Iana, 2006).

Evolved combinational and sequential synthesis
of digital circuits is included as the control module of a
self-contained mobile system that executes several tasks,
such as obstacle avoidance and target hitting, in
(Sharabi, 2006). In fact, the solving of multi-objective
(multi-task) problems in hardware systems by evolutionary
methods is treated by Coello (Coello, 2002) and used in
more recent works (Zhao, 2006).

The question is: is it possible to design digital circuits
using an evolutionary algorithm such as a genetic algorithm
(GA)?

To answer this question, the design of a
reconfigurable circuit which can be programmed by an
evolutionary algorithm is first required. One solution is a
reconfigurable multilayer gate network, in which each gate
in layer x can be connected or not with a gate in layer
x+1.



RECONFIGURABLE GATES NETWORK
AND HARDWARE GENETIC 
ALGORITHM 

Models Used 

This section elaborates models for the two
components of evolvable hardware microstructures:
the hardware genetic algorithm and the dynamically
reconfigurable circuit.

We first present the concept of
generic dynamically reconfigurable structures. Here
each cell of the network consists of a computing unit
and a configurable network unit. To run the algorithm,
each cell performs a local computation (arithmetic and
logic operations) and a network computation
(each cell can configure its connections with the other
cells in the network).

This concept can be extended to a multilayer gate
network by replacing the local computation cells with
elementary gates and the network connections with switches
commanded by the genetic algorithm (Ionescu, 2004). In this
particular case the dynamically reconfigurable structure
becomes a hardware reconfigurable structure.



Figure 1 presents the cell of the reconfigurable
hardware: the generic digital gate.

Each generic gate can be configured with a local
elementary Boolean function and several switch connections
to other gates. Thus, the first coding scheme
for the genetic algorithm is conceived: the individual is
a string of connection states and Boolean function
codes, as in Fig. 1b.

Each algorithm contains functions which can be
executed by a central processing unit. In computing
systems with a single processor, only one function can
be executed at a time. In a hardware structure, each
function can be implemented as a combinational network.
There are two main kinds of combinational networks:
the sorting network, composed of min/max elementary
circuits, and the permutation network, composed of
permutation elementary circuits. Figure 2 presents
the recursive sorting circuit and the recursive
permutation circuit.




Figure 1. a) The generic gate - cell of the hardware reconfigurable structure, b) gates network coded as configuration vector

Figure 2. The recursive sorting and permutation circuit used for hardware implementation of the genetic algorithm modules

Design of Hardware Genetic Algorithm,
Dynamic Reconfigurable Circuit and
Application Description

This section describes the techniques used for the design
of evolvable hardware microstructures. First it presents
the design of the hardware genetic algorithm using the
models from the preceding section. Each module is
designed individually and described with combinational
networks. The hardware genetic algorithm is a circuit
which connects all modules in a fully hardware solution.
The block diagram of the HGA (hardware genetic
algorithm) is presented in Figure 3.

The main feature of this solution is that the modules
can be used in another algorithm and can be interconnected
in another way without being redesigned. Each
module can work independently; therefore the structure
can process several generations at the same time. In
addition, each module was designed as a network of
elementary functions such as min/max or permutation.
Thus any required change, such as increasing the size
or the number of individuals, is very easy to make by
adding elementary function blocks. In the selection
module, the individuals (bit strings), with their fitness
already computed, enter at the left side of the array. Each
cell compares the fitness values from its two inputs x_in
and y_in. The output x_out gets the individual with the
smallest fitness value from the inputs and the output
y_out the individual with the biggest one. So, the
individuals with poor (small value) fitness cross the array
horizontally, from left to right, and the individuals with
good fitness cross the array vertically, from top to bottom.
Finally, the outputs on the left side hold the individuals
with the best fitness.
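
The behaviour of one selection cell, and of an array built
from such cells, can be sketched in Python as follows. The
odd-even transposition arrangement used here is an
assumption that stands in for the article's array, whose
exact wiring is given only in the figures; the names
compare_cell and selection_array are illustrative.

```python
def compare_cell(x_in, y_in):
    """One cell: returns (x_out, y_out) = (lower fitness, higher fitness)."""
    return (x_in, y_in) if x_in[0] <= y_in[0] else (y_in, x_in)

def selection_array(population):
    """Sort (fitness, bits) pairs with a network of compare cells.

    Each pass is a column of independent cells that hardware
    could evaluate in parallel; n passes sort n individuals.
    """
    items = list(population)
    n = len(items)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):
            items[i], items[i + 1] = compare_cell(items[i], items[i + 1])
    return items[::-1]      # best fitness first

ranked = selection_array([(3, "0011"), (7, "1110"), (1, "0001"), (5, "1010")])
```

Because every cell only compares two neighbours, the array
can grow with the population simply by adding cells, which
matches the modularity argument made above.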

The same concept is used for the crossover circuit.
Some of the individuals sorted by the selection module
enter the crossover module. The operator is applied to a
certain number of pairs of individuals. Usually, the
individuals with the best fitness are the "parents" of
generation 1 of offspring which results from this module.

For the mutation module we also use one single column
of logical structures (mutation cells). The array is
designed with the following restriction: mutation must
affect only one single bit of an individual and one single
individual of the generation (all individuals from one
iteration step). The structure can easily be changed if we
want another behavior.
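
A software analogue of these two modules is sketched
below. One-point crossover is an assumption, since the
article does not name the crossover variant; the mutation
function follows the stated restriction of one bit in one
individual per generation. All function names are
illustrative.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def crossover_module(sorted_population, n_pairs, rng):
    """Pair up the best individuals (population is sorted best first)."""
    offspring = []
    for i in range(n_pairs):
        a, b = sorted_population[2 * i], sorted_population[2 * i + 1]
        offspring.extend(one_point_crossover(a, b, rng))
    return offspring

def mutation_module(generation, rng):
    """Flip exactly one bit of exactly one individual in the generation."""
    mutated = [list(ind) for ind in generation]
    who = rng.randrange(len(mutated))
    where = rng.randrange(len(mutated[who]))
    mutated[who][where] ^= 1
    return mutated
```

Passing an explicit random.Random instance keeps runs
reproducible, which is convenient when a software model is
compared against a hardware implementation.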

Evaluation is done by comparing the response of the
partial solution provided by each individual with the
desired response. Here a target Boolean function is
defined, the function which must be implemented in
hardware. Another evaluation criterion is given by the
minimization of the digital resources used to implement
the target function. The fitness value is computed taking
both evaluation criteria into account.
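
One way to combine the two criteria is sketched below.
The weighted sum, the weight value and the function
signature are assumptions for illustration; the article
does not state how the criteria are merged.

```python
from itertools import product

def fitness(circuit, used_gates, target, n_inputs, max_gates, weight=0.1):
    """Combine functional correctness with a small reward for using fewer gates.

    The weight should stay below 1 / 2**n_inputs so that a circuit
    matching one more truth-table row always scores higher.
    """
    rows = list(product((0, 1), repeat=n_inputs))
    correct = sum(circuit(*row) == target(*row) for row in rows)
    correctness = correct / len(rows)          # 1.0 means the truth table matches
    economy = 1.0 - used_gates / max_gates     # 1.0 means no gates used
    return correctness + weight * economy

target = lambda a, b: a ^ b
exact_small = fitness(lambda a, b: a ^ b, 1, target, 2, 4)
exact_big = fitness(lambda a, b: (a | b) & (1 - (a & b)), 3, target, 2, 4)
wrong = fitness(lambda a, b: a | b, 1, target, 2, 4)
```

Here both correct circuits outrank the incorrect one, and
among the correct circuits the one using fewer gates is
preferred.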






Evolved Synthesis of Digital Circuits 



Figure 3. Hardware genetic algorithm

Three dynamic reconfigurable circuits were designed
and tested. All are based on the hardware reconfigurable
structure presented in Figure 1. The first scheme, the
min-max terms reconfigurable circuit, uses the same
principles as a programmable logic array. The scheme is
composed of three layers: an INV layer, an AND layer and
an OR layer. The genetic algorithm commands the
connections between the INV layer and the AND layer. This
reconfigurable circuit has the fastest convergence speed
and the individuals with the smallest size, but it explores
only the traditional solution space, and its size grows
exponentially with the number of inputs and linearly with
the number of outputs. The second circuit is the
reconfigurable INV-AND-OR circuit. Like the first circuit,
it has three layers: an INV layer, an AND layer and an OR
layer. In this case the genetic algorithm configures the
connections between the INV and AND layers and between
the AND and OR layers. This scheme reduces the growth in
size with the number of outputs, but the growth remains
exponential with the number of inputs, and the size of the
individuals is bigger than in the first circuit.

The last reconfigurable circuit is the elementary
functions reconfigurable circuit (e-reconfigurable). It
contains several layers. Each layer contains a number of
generic gates. A generic gate can implement elementary
Boolean functions (AND, OR, XOR) and more
complex circuits such as a MUX. This solution increases
the size of the individuals and the complexity of the
reconfigurable circuit, but it is almost invariant with the
number of inputs and outputs.

This last reconfigurable circuit explores the largest
solution space, beyond the bounds of traditional
design methods.

The evolvable hardware is used in three applications.
In the first, the target function is static and the algorithm
must find a hardware solution that implements it. Each
individual here represents a potential solution for the
hardware implementation of the target function. The
evolution loop is repeated until an optimal solution is
found. The hardware solution found in this way is called
evolved hardware. At this point the evolution loop is
stopped.

In the second application the target function is also
static. But here the individual encodes only a sub-circuit
- one generic gate or one layer of gates. The individuals
evolve, the offspring replace the parents in different
positions, and the entire circuit is evaluated until the
new solution is better than the old solution. The evolution
loop is repeated until an optimal solution is found.

This approach is used to design circuits with a large
number of inputs, outputs and sub-circuits.



Figure 4. Reconfigurable elementary functions circuit

Figure 5. Application schema: Finding optimal solution for target function implementation

The last application uses dynamic target functions.
Here each individual represents a complete solution for the
circuit. The evolution loop here has two steps. The first
step is the same as in the first application: loop until a
solution is found. After the solution is found in an
individual, called the main individual, the evolution
continues for the other individuals. The goal of the second
step of the evolution loop is to obtain individuals that
differ from the main individual. When the target function
changes, the evolution loop returns to the first step and
the individuals, with their high degree of dispersion,
evolve towards a new solution.
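
The second step can be sketched in Python by rewarding
distance from the main individual. Using the Hamming
distance as the measure of dispersion is an assumption, as
are the function names and the threshold parameter; the
article does not state which measure is used.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def step_two_fitness(individual, main_individual):
    """Second step: reward individuals that differ from the main individual."""
    return hamming(individual, main_individual) / len(main_individual)

def current_step(main_fitness, solved_threshold):
    """Return 1 while a solution is still sought, 2 once the main individual solves the target."""
    return 2 if main_fitness >= solved_threshold else 1
```

When the target function changes, the fitness of the main
individual drops below the threshold, current_step returns
1 again, and the dispersed population provides varied
starting points for the new search.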



CONCLUSIONS 

In this paper we have presented the concept of
evolvable hardware and shown a practical implementation
of a hardware genetic algorithm and a reconfigurable
hardware structure.

The hardware genetic algorithm increases the convergence
speed towards solutions which represent configurations
for the reconfigurable circuit.

It can be used for the evolvable synthesis of digital
circuits in intrinsic evolvable hardware. The bit string
solutions produced by the genetic algorithm can serve as
connection configurations for a dynamic reconfigurable
hardware circuit.

We have presented three architectures of reconfigurable
circuits which can be dynamically programmed by the same
hardware genetic algorithm module. The structure was
implemented on a Xilinx Spartan 3 FPGA.



FUTURE TRENDS 

There are several directions for further research.
The first is the design of the reconfigurable circuit using
FPGA primitives. The new generation of FPGAs (Virtex5)
allows dynamic reconfiguration using primitives. In
this case the generic gate is replaced by physical cells
of the FPGA. Another direction is the implementation of a
hybrid neuro-genetic structure. A hardware implementation
of a neural network can be used to store the best solutions
from the genetic algorithm. This configuration can
be used to improve the convergence of the genetic algorithm.
Evolved hardware can also be used to design analog
circuits. In this case, the Boolean reconfigurable circuit
can be replaced by an analog reconfigurable circuit (such
as a Field Programmable Transistor Array).

KEY TERMS

Evolvable Hardware: Reconfigurable circuit which
is programmed by an evolutionary algorithm such as a GA.
In extrinsic evolvable hardware the evolutionary algorithm
runs on a host station (PC) outside the reconfigurable
circuit. In intrinsic evolvable hardware the evolutionary
algorithm runs inside the same system as the
reconfigurable circuit (even on the same chip).

Genetic Algorithms (GA): A genetic algorithm
(or GA) is a search technique used in computing to
find true or approximate solutions to optimization and
search problems. Genetic algorithms are categorized as
a stochastic local search technique. Genetic algorithms
are a particular class of evolutionary algorithms that
use techniques inspired by evolutionary biology such
as inheritance, mutation, selection, and crossover (also
called recombination). The individuals, encoded according
to the coding scheme, are initialized with random values.
In the first step of the algorithm, all individuals are
evaluated to obtain their fitness values. In
the next step, they are sorted by fitness and selected
for the genetic operators. The parents are the individuals
involved in genetic operators such as crossover or
mutation. The resulting offspring are evaluated together
with the parents and the algorithm resumes from the first
step. The loop is repeated until the solution is found or
the number of generations reaches the limit given by the
programmer.

Genotype: Describes the genetic constitution of
an individual, that is, the specific allelic makeup of an
individual. In evolvable hardware it consists of a vector
of configuration bits.

HGA: A hardware genetic algorithm is a hardware
implementation of a genetic algorithm. Hardware
implementation increases the performance of the algorithm
by replacing serial software modules with parallel
hardware.

Microstructure: Integration of a structure in the same
chip. An evolvable hardware microstructure is an intrinsic
evolvable hardware with all modules in the same chip.

Phenotype: Describes one of the traits of an individual
that is measurable and that is expressed in only
a subset of the individuals within that population. In
evolvable hardware the phenotype consists of the circuit
coded by an individual.

Reconfigurable Circuit: Hardware structure consisting
of a network of logic cells which allows configuration
of the interconnections between cells.



INTRODUCTION

One of the most successful tools in the Artificial
Intelligence (AI) world is Artificial Neural Networks (ANNs).
This technique is a powerful tool used in many different
environments, with many different purposes, such as
classification, clustering, signal modelling, or regression
(Haykin, 1999). Although they are very easy to use,
their creation is not a simple task, because the expert
has to devote much effort and time to it.

The development of ANNs can be divided into 
two parts: architecture development and training and 
validation. The architecture development determines 
not only the number of neurons of the ANN, but also 
the type of the connections among those neurons. The 
training determines the connection weights for such 
architecture. 

The architecture design task is usually performed 
by means of a manual process, meaning that the expert 
has to test different architectures to find the one able to 
achieve the best results. Each architecture trial means 
training and validating it, which can be a process that 
needs many computational resources, depending on the 
complexity of the problem. Therefore, the expert has 
much participation in the whole ANN development, 
although techniques for relatively automatic creation 
of ANNs have been recently developed. 



BACKGROUND 

ANN development is a research topic that has attracted
many researchers from the world of evolutionary
algorithms (Nolfi & Parisi, 2002) (Cantú-Paz & Kamath,
2005). These techniques follow the general strategy of
an evolutionary algorithm: an initial population with
different types of genotypes, each encoding different
parameters - commonly, the connection weights and/
or the architecture of the network and/or the learning
rules - is randomly created and repeatedly induced to
evolve.

The most direct application of evolutionary computation
(EC) tools in the ANN world is to perform the evolution of
the weights of the connections. This process starts from an
ANN with an already determined topology. In this case, the
problem to be solved is the training of the connection
weights, attempting to minimise the network error.
Most training algorithms, such as the backpropagation (BP)
algorithm (Rumelhart, Hinton & Williams, 1986), are
based on gradient minimisation, which presents several
drawbacks. The main one is
that, quite frequently, the algorithm gets stuck in a
local minimum of the fitness function and is unable
to reach a global minimum. One of the options for
overcoming this situation is the use of an evolutionary
algorithm, so that the training process is done by means
of the evolution of the connection weights within the
environment defined by both the network architecture
and the task to be solved. In such cases, the weights
can be represented either as a concatenation of binary
values or of real numbers in a genetic algorithm (GA)
(Greenwood, 1997).
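
The real-valued representation can be sketched in Python
as follows. The feed-forward layout with one bias per
neuron, the logistic transfer function and all function
names are assumptions made for this illustration rather
than details of the cited works.

```python
import math

def decode_weights(chromosome, layer_sizes):
    """Cut a flat list of real-valued genes into one weight matrix per layer."""
    matrices, pos = [], 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        rows = []
        for _ in range(n_out):
            rows.append(chromosome[pos: pos + n_in + 1])   # n_in weights plus one bias
            pos += n_in + 1
        matrices.append(rows)
    return matrices

def forward(matrices, inputs):
    """Feed-forward pass with a logistic transfer function."""
    activations = list(inputs)
    for rows in matrices:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + row[-1])))
            for row in rows
        ]
    return activations

def network_error(chromosome, layer_sizes, samples):
    """Sum of squared errors; a GA would minimise this value."""
    matrices = decode_weights(chromosome, layer_sizes)
    return sum(
        (out - tgt) ** 2
        for x, target in samples
        for out, tgt in zip(forward(matrices, x), target)
    )
```

For a 2-2-1 network the chromosome has 9 genes (6 for the
hidden layer and 3 for the output neuron), and the genetic
operators act directly on that flat list.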

The evolution of architectures consists of the
generation of the topological structure, i.e., establishing
the connectivity and the transfer function of each
neuron. To achieve this goal with an evolutionary
algorithm, it is necessary to choose how to encode the
genotype of a given network so that it can be used by the
genetic operators.

The most typical approach is called direct encoding.
In this technique there is a one-to-one correspondence
between each gene and a specific part of the network.
A binary matrix represents an architecture, where every
element indicates the presence or absence of a
connection between two nodes (Alba, Aldana & Troya,
1993).
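
A direct encoding of this kind can be sketched as follows. The flat bit
list, the row-major layout and the function name are assumptions made
for illustration; they are not taken from the cited work.

```python
def decode_direct(genome, n_nodes):
    """Decode a flat list of bits into an n x n connectivity matrix.

    Bit i * n_nodes + j is 1 when there is a connection from node i to
    node j (row-major layout, assumed here for illustration).
    """
    if len(genome) != n_nodes * n_nodes:
        raise ValueError("genome length must be n_nodes squared")
    matrix = [genome[i * n_nodes:(i + 1) * n_nodes] for i in range(n_nodes)]
    connections = [(i, j)
                   for i in range(n_nodes)
                   for j in range(n_nodes)
                   if matrix[i][j] == 1]
    return matrix, connections

# Three nodes: node 0 feeds nodes 1 and 2, node 1 feeds node 2.
matrix, connections = decode_direct([0, 1, 1,
                                     0, 0, 1,
                                     0, 0, 0], n_nodes=3)
```

Each gene maps to exactly one potential connection, which is what makes
the encoding direct; the chromosome length grows with the square of the
number of nodes.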

In contrast to direct encoding, there are several
indirect encoding methods, in which only some
characteristics of the architecture are encoded in the
chromosome. These methods use several types of
representation.

Firstly, parametric representations describe the
network as a group of parameters such as the number of
hidden layers, the number of nodes in each layer, the
number of connections between two layers, etc. (Harp,
Samad & Guha, 1989). Another indirect representation
type is based on grammatical rules (Kitano, 1990),
shaped as production rules that generate a matrix
representing the network.

Another type of encoding is the growing methods. In
this case, the genotype contains a group of instructions
for building up the network (Nolfi & Parisi, 2002).

All of these methods evolve architectures, either
alone (most commonly) or together with the weights.
The transfer function of every node of the architecture
is assumed to have been previously fixed by a human
expert and to be the same for all the nodes of the
network or, at least, for all the nodes of the same
layer. Only a few methods that also evolve the transfer
function have been developed (Hwang, Choi & Park,
1997).



ANN DEVELOPMENT WITH GENETIC 
PROGRAMMING 

This section briefly shows an example of how to
develop ANNs using an AI tool, Genetic Programming
(GP), which performs an evolutionary algorithm, and
how it can be applied to Data Mining tasks.

Genetic Programming

GP (Koza, 1992) is based on the evolution of a given
population and works similarly to a GA. In this
population, every individual represents a solution to
the problem to be solved. Evolution is achieved by
selecting the best individuals (although the worst ones
also have a small chance of being selected) and
combining them to create new solutions. After several
generations, the population is expected to contain some
good solutions to the problem.



The GP encoding of the solutions is tree-shaped, so
the user must specify the terminals (leaves of the
tree) and the functions (nodes capable of having
descendants) that the evolutionary algorithm will use
to build complex expressions.

The wide application of GP to various environments
and its consequent success are due to its capability
of being adapted to numerous different problems.
Although its main and most direct application is the
generation of mathematical expressions (Rivero,
Rabunal, Dorado & Pazos, 2005), GP has also been used
in other fields such as filter design (Rabunal, Dorado,
Puertas, Pazos, Santos & Rivero D., 2003), knowledge
extraction, image processing (Rivero, Rabunal, Dorado
& Pazos, 2004), etc.

Model Overview 

This work uses a graph-based codification to represent
ANNs in the genotype. These graphs do not contain any
cycles. Because of this type of codification, the
genetic operators had to be changed in order to be able
to use the GP algorithm. The operators were changed as
follows:

The creation algorithm must allow the creation of
graphs. This means that, when a node's child is
created, the algorithm must allow not only the creation
of a new node but also a link to an existing node in
the same graph, without introducing cycles.

The crossover algorithm must allow the crossing of
graphs. It works very similarly to the existing one for
trees, i.e. a node is chosen in each individual and the
whole subgraph it represents is exchanged with the
other individual. Special care has to be taken with
graphs, because before the crossover there may be links
from outside this subgraph to nodes inside it. In this
case, after the crossover these links are updated to
point to random nodes in the new subgraph.

The mutation algorithm has been changed too, and also
works very similarly to the tree-based GP mutation
algorithm. A node is chosen from the individual, and
its subgraph is deleted and replaced with a new one.
Before the mutation occurs, there may be nodes in the
individual pointing to other nodes in the subgraph.
These links are updated and made to point to random
nodes in the new subgraph.

Table 1. Summary of the operators to be used in the tree

Node       Type     Num. children   Children type
ANN        ANN      Num. outputs    NEURON
n-Neuron   NEURON   2*n             n NEURON, n REAL (weights)
n-Input    NEURON   -               -
+,-,*,%    REAL     2               REAL
[-4, 4]    REAL     -               -

These algorithms must also respect two GP restrictions:
typing and maximum height. The GP typing property
(Montana, 1995) means that each node has a type and
also specifies the type of each of its children. This
property makes it possible to develop structures that
follow a specific grammar.



In order to use GP to develop any kind of system, it is
necessary to specify the set of operators that will be
in the tree. With them, the evolutionary system must be
able to build correct trees that represent ANNs. An
overview of the operators used can be seen in Table 1,
which summarises the terminals and functions used to
build a tree that represents an ANN. Although these
sets are not explained in the text, Fig. 1 shows an
example of how they can be used to represent an ANN.

Figure 1. GP graph and its resulting network

These operators are used to build GP trees. Once a
tree has been evaluated, the genotype turns into the
phenotype; in other words, it is converted into an ANN
with its weights already set (so it does not need to be
trained), which can therefore be evaluated. The
evolutionary process requires a fitness value to be
assigned to every genotype. This value is the result of
evaluating the network on the pattern set that
represents the problem: the Mean Square Error (MSE)
between the network outputs and the desired outputs.
This value has nevertheless been modified in order to
induce the system to generate simple networks, by
adding a penalization value multiplied by the number of
neurons of the network. Since the evolutionary system
has been designed to minimise an error value, a larger
network receives a worse (higher) fitness value, so
simple networks are preferred: the added penalization
is proportional to the number of neurons in the ANN.
The final fitness is calculated as follows:

fitness = MSE + N*P

where N is the number of neurons of the network and
P is the penalization value applied per neuron.
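
The fitness formula above can be sketched in a few lines. The function
names are hypothetical, and the default penalization of 0.00001 is the
value listed among the experimental parameters of this work.

```python
def mean_squared_error(outputs, targets):
    """MSE between network outputs and desired outputs."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("outputs and targets must be non-empty and equally long")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

def penalised_fitness(outputs, targets, n_neurons, penalty=0.00001):
    """fitness = MSE + N * P; lower is better, larger networks are penalised."""
    return mean_squared_error(outputs, targets) + n_neurons * penalty

# Two networks with the same error: the smaller one gets the better fitness.
small = penalised_fitness([0.5], [0.0], n_neurons=2)
large = penalised_fitness([0.5], [0.0], n_neurons=10)
```

With equal errors the comparison is decided purely by the neuron count,
which is how the penalization steers evolution towards simple networks.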

Example of Applications 

This technique has been used for solving problems of
different complexity taken from the UCI repository
(Mertz & Murphy, 2002). All of them are
knowledge-extraction problems from databases where,
taking certain features as a basis, the aim is to
predict another attribute of the database. A short
description of the problems to be solved can be seen in
Table 2, along with other ANN parameters used later in
this work.

All these databases have been normalised between 0
and 1 and divided into two parts, taking 70% of the
database for training and using the remaining 30% for
testing.
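
The preprocessing just described can be sketched as below. The text
does not say whether the rows were shuffled before splitting or how
constant columns were handled, so the seeded shuffle and the mapping of
constant columns to 0 are assumptions, as are the function names.

```python
import random

def min_max_normalise(rows):
    """Rescale every column of a numeric table to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0 (assumed)
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle (assumed) and split the rows into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

data = [[float(i), 10.0 - i] for i in range(10)]
train, test = split_train_test(min_max_normalise(data))
```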



Results and Comparison with Other 
Methods 

Several experiments have been performed in order to 
evaluate the system performance. The values taken 
for the parameters at these experiments were the fol- 
lowing: 

Population size: 1000 individuals. 

Crossover rate: 95%. 

Mutation probability: 4%. 

Selection algorithm: 2-individual tournament. 

Graph maximum height: 5. 

Maximum inputs for each neuron: 9. 

Penalization value: 0.00001. 

Several preliminary experiments were needed to find
values for these parameters that return good results on
all of the problems. Since these problems differ widely
in complexity, the same parameters are expected to give
good results on many other problems.
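
The 2-individual tournament listed among the parameters can be sketched
as follows. Drawing the contenders with replacement is an assumption
made here so that even the worst individual keeps a small chance of
being selected, as described for GP above; the function name is
hypothetical.

```python
import random

def tournament_select(population, fitness, rng, size=2):
    """Pick `size` random contenders and return the one with the lowest fitness.

    Fitness is an error to be minimised. Contenders are drawn with
    replacement (an assumption), so a poor individual wins whenever it
    is only matched against itself or against worse individuals.
    """
    contenders = [rng.choice(population) for _ in range(size)]
    return min(contenders, key=fitness)

rng = random.Random(1)
population = [5.0, 3.0, 8.0, 1.0]  # each individual stands for its own error here
parents = [tournament_select(population, lambda ind: ind, rng) for _ in range(10)]
```

A tournament of size 2 applies only mild selection pressure, which helps
to keep diversity in a population as large as the 1000 individuals used
here.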

In order to evaluate its performance, the system
presented here has been compared with other ANN
generation and training methods.

The 5x2cv method was used by Cantú-Paz and Kamath
(2005) to compare different ANN generation and
training techniques based on EC tools. That work
reports as results the average precisions obtained over
the 10 test results generated by this method. Those
values are the basis for comparing the technique
described here with other well-known ones, described in
detail by Cantú-Paz and Kamath (2005). That work also
reports the average times needed to achieve the
results. Since the same processor was not available for
this work, the computational effort needed to achieve
the results is estimated instead. This effort
represents the number of times that the pattern file
was evaluated. The computational effort of every
technique can be measured using the population size,
the number of generations, the number of times that the
BP algorithm was applied, etc., and the calculation
varies for every algorithm used. All the techniques
compared with this work use evolutionary algorithms for
ANN design. Five iterations of a 2-fold
cross-validation test were performed for all these
techniques in order to evaluate the accuracy of the
networks. These techniques are connectivity matrix,
pruning, parameter search and graph-rewriting grammar.
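
For reference, a 5x2cv protocol produces its 10 test results as in the
sketch below: the data are shuffled five times and, each time, both
halves are used once for training and once for testing. This is a
minimal illustration with hypothetical names, not the code used in the
compared works.

```python
import random

def five_by_two_cv(n_samples, seed=0):
    """Return 10 (train_indices, test_indices) pairs: 5 repetitions of 2-fold CV."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        half = n_samples // 2
        first, second = indices[:half], indices[half:]
        splits.append((first, second))  # train on the first half, test on the second
        splits.append((second, first))  # then swap the roles
    return splits

splits = five_by_two_cv(150)  # e.g. the 150 instances of the Iris Flower problem
```

The reported precision of a technique is then the average of the 10
test accuracies obtained over these splits.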
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Table 2 shows a summary of the number of neurons 
used by Cantú-Paz and Kamath (2005) in order to solve
the problems with the connectivity matrix and pruning
techniques. The number of epochs of the BP algorithm,
when used, is also indicated there.

Table 3 shows the parameter configuration used by
these techniques. The execution was stopped after 5
generations with no improvement or after 50 total
generations.

The results obtained with these 4 methods are shown
in Table 4. Every cell of the table contains three
values: the precision obtained by Cantú-Paz and Kamath
(2005) (left), the computational effort needed to
obtain that value with that technique (below) and the
precision obtained with the technique described here
for that same computational effort (right).

Table 2. Summary of the problems to be solved

                Number of  Number of  Number of   ANN configuration        BP
Problem         inputs     instances  outputs     Inputs  Hidden  Outputs  Epochs
Breast Cancer   9          699        1           9       5       1        20
Iris Flower     4          150        3           4       5       3        80
Heart Disease   13         303        1           26      5       1        40
Ionosphere      34         351        1           34      10      1        40

Table 3. Parameters of the techniques used for the comparison

Parameter              Matrix   Pruning  Parameters  Grammar
Chromosome length (L)  N        N        36          256
Population size        ⌊3√L⌋    ⌊3√L⌋    25          64
Crossover points       L/10     L/10     2           L/10
Mutation rate          1/L      1/L      0.04        0.004

N = (hidden + output) * input + output * hidden

Table 4. Comparison with other methods

Problem          Matrix          Pruning         Parameters      Grammar
Breast Cancer    96.77  96.27    96.31  95.79    96.69  96.27    96.71  96.31
                 92000           4620            100000          300000
Iris Flower      92.40  95.49    92.40  81.58    91.73  95.52    92.93  95.66
                 320000          4080            400000          1200000
Heart Cleveland  76.78  81.11    89.50  78.28    65.89  81.05    72.8   80.97
                 304000          7640            200000          600000
Ionosphere       87.06  88.34    83.66  82.37    85.58  87.81    88.03  88.36
                 464000          11640           200000          600000
Average          88.25  90.30    90.46  84.50    84.97  90.16    87.61  90.32

Table 4 shows that the results obtained with the
method proposed here are not only similar to the ones
presented by Cantú-Paz and Kamath (2005), but better
in most cases. The reason is that those methods carry
a high computational load, since every evaluation of a
network (individual) requires training it, which is
time-consuming. In the work described here, design and
training are performed simultaneously, so the times
needed for designing and for evaluating the network are
combined.



the error given by the rest of the ANN development
systems used for the comparison. Only one technique
(pruning) performs better than the one described here.
However, that technique still requires some work from
the expert, who must design the initial network.

Most of the techniques used for ANN development are
quite costly, in some cases because they combine
training with architecture evolution. The technique
described here achieves good results at a low
computational cost and has the added advantage that not
only the architecture and the connectivity of the
network are evolved, but the network itself also
undergoes an optimization process.

FUTURE TRENDS

A future line of work in this area would be the study
of the system parameters in order to evaluate their
impact on the results for different problems.

Another interesting line consists of combining this
graph evolution algorithm with a GA that performs an
optimization process on the weight values. With this
modification, the whole system will have two levels:



The experiments described in this work were performed
with equipment belonging to the Super Computation
Center of Galicia (CESGA). The Cleveland heart disease
database was available thanks to Robert Detrano, M.D.,
Ph.D., V.A. Medical Center, Long Beach and Cleveland
Clinic Foundation.

KEY TERMS

Artificial Neural Networks: Interconnected set of
many simple processing units, commonly called neurons,
that use a mathematical model representing an
input/output relation.

Back-Propagation Algorithm: Supervised learning
technique used by ANNs that iteratively modifies the
weights of the connections of the network so that the
error obtained when comparing the network outputs with
the desired ones decreases.

Evolutionary Computation: Set of Artificial
Intelligence techniques used in optimization problems,
which are inspired by biological mechanisms such as
natural evolution.

Genetic Programming: Machine learning technique that
uses an evolutionary algorithm in order to optimise a
population of computer programs according to a fitness
function which determines the capability of a program
for performing a given task.

Genotype: The representation of an individual as an
entire collection of genes, to which the crossover and
mutation operators are applied.

Phenotype: Expression of the properties coded by the
individual's genotype.

Population: Pool of individuals exhibiting equal or
similar genome structures, which allows the application
of genetic operators.

Search Space: Set of all possible situations that the
problem we want to solve could ever be in.

INTRODUCTION

Facial expression plays an important role in cognition 
of human emotions (Fasel, 2003 & Yeasin, 2006). The 
recognition of facial expressions in image sequences 
with significant head movement is a challenging 
problem. It is required by many applications such as 
human-computer interaction and computer graphics 
animation (Canamero, 2005 & Picard, 2001). To clas- 
sify expressions in still images many techniques have 
been proposed such as Neural Nets (Tian, 2001), Gabor 
wavelets (Bartlett, 2004), and active appearance models 
(Sung, 2006). Recently, more attention has been given 
to modeling facial deformation in dynamic scenarios. 
Still image classifiers use feature vectors related to 
a single frame to perform classification. Temporal 
classifiers try to capture the temporal pattern in the 
sequence of feature vectors related to each frame such 
as the Hidden Markov Model based methods (Cohen, 
2003, Black, 1997 & Rabiner, 1989) and Dynamic 
Bayesian Networks (Zhang, 2005). The main contribu- 
tions of the paper are as follows. First, we propose an 
efficient recognition scheme based on the detection of 
keyframes in videos where the recognition is performed 
using a temporal classifier. Second, we use the proposed 
method for extending the human-machine interaction 
functionality of a robot whose response is generated 
according to the user's recognized facial expression. 
Our proposed approach has several advantages.
First, unlike most expression recognition systems that
require a frontal view of the face, our system is view-
and texture-independent. Second, its learning phase is
simple compared to other techniques (e.g., Hidden
Markov Models and Active Appearance Models): we only
need to fit second-order Auto-Regressive models to
sequences of facial actions. As a result, even when the
imaging conditions change, the learned Auto-Regressive
models need not be recomputed.



The rest of the paper is organized as follows. Section 2 
summarizes our developed appearance-based 3D face 
tracker that we use to track the 3D head pose as well 
as the facial actions. Section 3 describes the proposed 
facial expression recognition based on the detection of 
keyframes. Section 4 provides some experimental re- 
sults. Section 5 describes the proposed human-machine 
interaction application that is based on the developed 
facial expression recognition scheme. 



SIMULTANEOUS HEAD AND FACIAL 
ACTION TRACKING 

In our study, we use the Candide 3D face model
(Ahlberg, 2001). This 3D deformable wireframe model is
given by the 3D coordinates of n vertices. Thus, the 3D
shape can be fully described by the 3n-vector q, the
concatenation of the 3D coordinates of all vertices.
The vector q can be written as:

$q = q_s + A \tau_a$    (1)

where $q_s$ is the static shape of the model, $\tau_a$
is the facial action vector, and the columns of A are
the Animation Units. In this study, we use six modes for
the facial Animation Units (AUs) matrix A, that is, the
dimension of $\tau_a$ is 6. These modes are all included
in the Candide model package. We have chosen the
following six AUs: lower lip depressor, lip stretcher,
lip corner depressor, upper lip raiser, eyebrow lowerer
and outer eyebrow raiser. A cornerstone problem in
facial expression recognition is the ability to track
the local facial actions/deformations. In our work, we
track the head and facial actions using our face tracker
(Dornaika & Davoine, 2006). This appearance-based
tracker simultaneously computes the 3D head pose and
the facial actions $\tau_a$ by minimizing a distance
between the incoming warped frame and the current
appearance of the face. Since the facial actions,
encoded by the vector $\tau_a$, are highly correlated
with the facial expressions, their time series
representation can be used for inferring the facial
expression in videos. This will be explained in the
sequel.
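
Equation (1) is a plain matrix-vector product followed by a vector
addition. The sketch below shows it with nested lists; the function
name and the toy dimensions (a single vertex and two action modes
instead of six) are illustrative assumptions, not Candide data.

```python
def synthesise_shape(q_static, animation_units, tau_a):
    """Compute q = q_s + A * tau_a.

    q_static        -- list of 3n coordinates (static shape q_s)
    animation_units -- matrix A as a list of 3n rows, one column per action mode
    tau_a           -- facial action vector, one entry per column of A
    """
    if len(animation_units) != len(q_static):
        raise ValueError("A must have one row per shape coordinate")
    if any(len(row) != len(tau_a) for row in animation_units):
        raise ValueError("A must have one column per facial action")
    return [qs + sum(a * t for a, t in zip(row, tau_a))
            for qs, row in zip(q_static, animation_units)]

# Toy example: one vertex (3 coordinates), two action modes.
q = synthesise_shape([0.0, 0.0, 0.0],
                     [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                     [0.5, 0.25])
```

With all facial actions at zero the function returns the static shape
unchanged, which corresponds to the neutral face.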



EFFICIENT FACIAL EXPRESSION 
DETECTION AND RECOGNITION 

In (Dornaika & Raducanu, 2006), we proposed a facial
expression recognition method based on the time-series
representation of the tracked facial actions $\tau_a$.
An analysis-synthesis scheme based on learned
auto-regressive models was proposed. In this paper, we
introduce a process able to detect keyframes in videos.
Once a keyframe is detected, the temporal recognition
scheme described in (Dornaika & Raducanu, 2006) is
invoked on the detected keyframe. The proposed scheme
has two advantages. First, the CPU time corresponding
to the recognition part is considerably reduced, since
only a few keyframes are considered. Second, since a
keyframe and its neighbor frames characterize the
expression, the discrimination performance of the
recognition scheme is boosted. In our case, the
keyframes are defined as the frames where the facial
actions change abruptly. Thus, a keyframe can be
detected by looking for a local positive maximum in the
temporal derivatives of the facial actions. To this
end, two entities are computed from the sequence of
facial actions $\tau_a$, which arrive in a sequential
fashion: (i) the L1 norm $\|\tau_a\|_1$, and (ii) the
temporal derivative given by:

$D_t = \frac{\partial \|\tau_a\|_1}{\partial t} = \sum_{i=1}^{6} \frac{\partial \tau_{a(i)}}{\partial t}$    (2)

In the above equation, we have used the fact that
the facial actions are positive. Let W be the size of a
temporal segment defining the temporal granularity of
the system. In other words, the system will detect and
recognize at most one expression every W frames. In
practice, W belongs to [0.5 s, 1 s]. The whole scheme
is depicted in Figure 1.

Figure 1. Efficient facial expression detection and recognition based on keyframes

In this figure, we can see that the system has three
levels: the tracking level, the keyframe detection
level, and the recognition level. The tracker provides
the facial actions for every frame. Whenever the
current video segment size reaches W frames, the
keyframe detection is invoked to select a keyframe in
the current segment, if any. A given frame is
considered a keyframe if it meets three conditions:
(1) the corresponding $D_t$ is a positive local maximum
(within the segment), (2) the corresponding norm
$\|\tau_a\|_1$ is greater than a predefined threshold,
and (3) it is at least W frames away from the previous
keyframe. Once a keyframe is found in the current
segment, the dynamical classifier described in
(Dornaika & Raducanu, 2006) is invoked.

Figure 2 shows the results of applying the proposed
detection scheme to a 1600-frame sequence containing
23 played expressions. Some images are shown in Figure
4. The solid curve corresponds to the norm
$\|\tau_a\|_1$, the dotted curve to the derivative
$D_t$, and the vertical bars to the detected keyframes.
In this example, the value of W is set to 30 frames. As
can be seen, out of 1600 frames only 23 keyframes are
processed by the expression classifier.

Figure 2. Keyframe detection and recognition applied on a 1600-frame sequence
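
The three keyframe conditions can be sketched as follows. This is an
illustration under stated assumptions rather than the authors'
implementation: the derivative is approximated by a backward finite
difference, the strongest qualifying frame of a segment is kept when
several qualify, and a trailing segment shorter than W frames is
ignored. The function name and the threshold value are hypothetical.

```python
def detect_keyframes(actions, window, norm_threshold):
    """Return the indices of the keyframes in a sequence of facial action vectors.

    actions        -- list of per-frame facial action vectors (positive values)
    window         -- segment size W in frames
    norm_threshold -- minimum L1 norm for a frame to qualify
    """
    norms = [sum(abs(v) for v in frame) for frame in actions]
    # D_t: sum over the components of the backward finite difference
    deriv = [0.0] + [sum(c - p for c, p in zip(actions[t], actions[t - 1]))
                     for t in range(1, len(actions))]
    keyframes = []
    for start in range(0, len(actions) - window + 1, window):
        best = None
        for t in range(max(start, 1), min(start + window, len(actions) - 1)):
            is_positive_local_max = (deriv[t] > 0
                                     and deriv[t] >= deriv[t - 1]
                                     and deriv[t] >= deriv[t + 1])
            far_enough = not keyframes or t - keyframes[-1] >= window
            if is_positive_local_max and norms[t] > norm_threshold and far_enough:
                if best is None or deriv[t] > deriv[best]:
                    best = t
        if best is not None:
            keyframes.append(best)  # at most one keyframe per segment
    return keyframes

# One synthetic expression (onset, apex, release) played twice.
profile = [0, 0, 0.1, 0.6, 0.9, 1.0, 1.0, 1.0, 0.9, 0.5, 0.1, 0]
frames = [[v] for v in profile + profile]
keyframes = detect_keyframes(frames, window=12, norm_threshold=0.3)
```

Only the frames returned by this step would be passed on to the
temporal classifier, which is where the saving in recognition time
comes from.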



EXPERIMENTAL RESULTS 

Recognition results: We used a 300-frame video
sequence. For this sequence, we asked a subject to
display several expressions arbitrarily (see Figure
3). The middle of this figure shows the normalized
similarities associated with each universal expression,
where the recognition is performed for every frame in
the sequence. As can be seen, the temporal classifier
(Dornaika & Raducanu, 2006) has correctly detected the
presence of the surprise, joy, and sadness expressions.
Note that the mixture of expressions at transitions is
normal, since the recognition is performed in a
framewise manner. The lower part of this figure shows
the results of applying the proposed keyframe detection
scheme. On a 3.2 GHz PC, non-optimized C code
implementing the developed approach carries out the
tracking and recognition in about 60 ms.

Performance study: In order to quantify the
recognition rate, we used 35 test videos retrieved from
the CMU database. Table 1 shows the confusion matrix
associated with the 35 test videos featuring 7 persons.
As can be seen, although the recognition rate was good
(80%), it is not equal to 100%. This can be explained
by the fact that expression dynamics are highly
subject-dependent. Recall that the auto-regressive
models used are built using data associated with one
subject. Notice that the human 'ceiling' in correctly
classifying facial expressions into the six basic
emotions has been established at 91.7%.




Table 1. Confusion matrix for the facial expression classifier associated with 35 test videos (CMU data). The
model is built using one unseen person

Figure 3. Top: Four frames (50, 110, 150, and 250) associated with a 300-frame test sequence. Middle: The
similarity measure computed for each universal expression and for each non-neutral frame of the sequence (the
framewise recognition). Bottom: The recognition based on keyframe detection.

HUMAN-MACHINE INTERACTION

Interpreting non-verbal face gestures is used in a wide
range of applications. An intelligent user interface
should interpret not only the face movements but also
the user's emotional state (Breazeal, 2002). Knowing
the emotional state of the user allows machines to
communicate and interact with humans in a natural way:
intelligent entertainment systems for kids, interactive
computers, intelligent sensors and social robots, to
mention a few. In the sequel, we show how our proposed
technique lends itself nicely to such applications.
Without loss of generality, we use the AIBO robot,
which has the advantage of being especially designed
for Human Computer Interaction. The input to the system
is a video stream capturing the user's face.

The AIBO robot: AIBO is a biologically-inspired
robot that is able to show its emotions through an
array of LEDs situated in the frontal part of the head.
In addition to the LEDs' configuration, the robot
response contains some small head and body movements.
By design, AIBO's affective states are triggered by
the Emotion Generator engine. This occurs as a response
to its internal state representation, captured through
multi-modal interaction (vision, audio and touch). For
instance, it can display the 'happiness' feeling when
it detects a face (through the vision system) or hears
a voice. But it does not possess a built-in system for
vision-based automatic facial-expression recognition.
For this reason, with the scheme proposed in this paper
(see Section 3), we created an application for AIBO
whose purpose is to provide it with this capability.

Figure 4. Top: Some detected keyframes associated with the 1600-frame video. Middle: The recognized
expression. Bottom: The corresponding robot's response.

This application is a very simple one, in which the
robot just imitates the expression of a human subject.
Usually, the response of the robot occurs slightly
after the apex of the human expression. The results
of this application were recorded in a 2-minute video
which can be downloaded from the following address:
http://www.cvc.uab.es/-bogdan/AIBO-emotions.avi.
In order to display the correspondence between the
subject's and the robot's expressions simultaneously in
the video, we put them side by side.

Figure 4 illustrates five detected keyframes from the
1600-frame video depicted in Figure 2. These are shown
in correspondence with the robot's response. The middle
row shows the recognized expression. The bottom row
shows a snapshot of the robot head when it interacts
with the detected and recognized expression.



CONCLUSION 

This paper described a view- and texture-independent 
approach to facial expression analysis and recognition. 
The paper presented two contributions. First, we pro- 
posed an efficient facial expression recognition scheme 
based on the detection of keyframes in videos. Second, 
we applied the proposed method in a Human Computer 
Interaction scenario, in which an AIBO robot mirrors
the user's recognized facial expression.

KEY TERMS

3D Deformable Model: A model which is able to 
modify its shape while being acted upon by an external 
influence. In consequence, the relative position of any 
point on a deformable body can change. 

Active Appearance Models (AAM): Computer Vision
algorithm for matching a statistical model of object
shape and appearance to a new image. The approach is
widely used for matching and tracking faces.

AIBO: One of several types of robotic pets de- 
signed and manufactured by Sony. Able to walk, "see" 
its environment via camera, and recognize spoken 
commands, they are considered to be autonomous 
robots, since they are able to learn and mature based 
on external stimuli from their owner or environment, 
or from other AIBOs. 

Autoregressive Models: Group of linear prediction 
formulas that attempt to predict the output of a system 
based on the previous outputs and inputs. 

Facial Expression Recognition System: Computer-driven
application for automatically identifying a person's
facial expression from a digital still or video image.
It does so by comparing selected facial features in the
live image with a facial database.

Hidden Markov Model (HMM): Statistical model in which
the system being modeled is assumed to be a Markov
process with unknown parameters, and the challenge is
to determine the hidden parameters from the observable
parameters. The extracted model parameters can then be
used to perform further analysis, for example for
pattern recognition applications.

Human-Computer Interaction (HCI): The study 
of interaction between people (users) and computers. 
It is an interdisciplinary subject, relating computer 
science with many other fields of study and research 
(Artificial Intelligence, Psychology, Computer Graph- 
ics, Design). 



Social Robot: An autonomous robot that interacts 
and communicates with humans by following the social 
rules attached to its role. This definition implies that a 
social robot has a physical embodiment. A consequence 
of the previous statements is that a robot that only 
interacts and communicates with other robots would 
not be considered to be a social robot. 

Wireframe Model: The representation of all sur- 
faces of a three-dimensional object in outline form. 
INTRODUCTION

Many scientific disciplines use modelling and simulation
processes and techniques in order to implement a
non-linear mapping between the input and the output
variables for a given system under study. Any variable
that helps to solve the problem may be considered as
input. Ideally, any classifier or regressor should be able
to detect important features and discard irrelevant
features, and consequently, a pre-processing step to reduce
dimensionality should not be necessary. Nonetheless,
in many cases, reducing the dimensionality of a problem
has certain advantages (Alpaydin, 2004; Guyon
& Elisseeff, 2003), as follows:

Performance improvement. The complexity of
most learning algorithms depends on the number
of samples and features (curse of dimensionality).
Reducing the number of features lowers the
dimensionality, which may save computational
resources, such as memory and time, and shorten
training and testing times.

Data compression. There is no need to retrieve
and store a feature that is not required.

Data comprehension. Dimensionality reduction
facilitates the comprehension and visualisation
of data.

Simplicity. Simpler models tend to be more robust
when small datasets are used.

There are two main methods for reducing
dimensionality: feature extraction and feature selection.
In this chapter we present a review of different feature
selection (FS) algorithms, covering the main
approaches: filter, wrapper and hybrid (a filter/wrapper
combination).


BACKGROUND

Feature extraction and feature selection are the main 
methods for reducing dimensionality. In feature ex- 
traction, the aim is to find a new set of f dimensions 
that are a combination of the n original ones. The best 
known and most widely used unsupervised feature 
extraction method is principal component analysis 
(PCA); commonly used as supervised methods are 
linear discriminant analysis (LDA) and partial least 
squares (PLS). 
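
For linear methods such as PCA, the "combination of the n original ones" can be written as a projection. The symbols x, W and z below are introduced here for illustration only and do not come from the chapter:

$$
z = W^{\top} x, \qquad x \in \mathbb{R}^{n},\quad W \in \mathbb{R}^{n \times f},\quad z \in \mathbb{R}^{f},\quad f < n
$$

In PCA, assuming x has been centred, the columns of W are the eigenvectors of the sample covariance matrix associated with its f largest eigenvalues. Feature selection corresponds to the special case in which each column of W contains a single 1 and zeros elsewhere.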

In feature selection, a subset of r relevant features
is selected from the set of n original features, and the
remaining features are ignored. According to the
evaluation function used, FS approaches can be mainly
classified as filter or wrapper
models (Kohavi & John, 1997). Filter models rely on
the general characteristics of the training data to select 
features, whereas wrapper models require a predeter- 
mined learning algorithm to identify the features to be 
selected. Wrapper models tend to give better results, 
but when the number of features is large, filter models 
are usually chosen because of their computational ef- 
ficiency. In order to combine the advantages of both 
models, hybrid algorithms have recently been proposed 
(Guyon et al., 2006). 



FEATURE SELECTION 

The advantages described in the Introduction section
underline the importance of dimensionality reduction.
Feature selection is also useful when the following
assumptions are made:

There are inputs that are not required to obtain 
the output. 

There is a high correlation between some of the 
input features. 

A feature selection algorithm (FSA) looks for an
optimal set of features, and consequently an FSA can
be described as a heuristic search. Since each
state of the search space is a subset of features, an FSA
can be characterised in terms of the following four
properties (Blum & Langley, 1997):

The initial state. This can be the empty set of
features, the whole set or any random state.

The search strategy. Although an exhaustive search
leads to an optimal set of features, the associated
computational and time costs are high when the
number of features is high. Consequently, different
search strategies are used so as to identify a
good set of features within a reasonable time.

The evaluation function used to determine the
quality of each set of features. The goodness of a
feature subset depends on the measure chosen.
According to the literature, the following measures have
been employed: information measures, distance
measures, dependence measures, consistency
measures, and accuracy measures.

The stop criterion. An end point needs to be
established; for example, the process should finish
if the evaluation function has not improved after
a new feature has been added/removed.

In terms of search method complexity, there are
three main sub-groups (Salappa et al., 2007):

Exponential strategies involving an exhaustive
search of all feasible solutions. Exhaustive search
guarantees identification of an optimal feature
subset but has a high computational cost. Examples
are the branch and bound algorithms.

Sequential strategies based on a local search for
solutions defined by the current solution state.
Sequential search does not guarantee an optimal
result, since the optimal solution could be in a
region of the search space that is not searched.
However, compared with exponential searching,
sequential strategies have a considerably reduced
computational cost. The best known strategies
are sequential forward selection and sequential
backward selection (SFS and SBS, respectively).
SFS starts with an empty set of features and adds
features one by one, while SBS begins with a full
set and removes features one by one. Features are
added or removed on the basis of improvements
in the evaluation function. These approaches
do not consider interactions between features,
i.e., a feature may not reduce error by itself, but
improvement may be achieved by the feature's
link to another feature. Floating search (Pudil et
al., 1994) solves this problem partially, in that
the number of features included and/or removed
at each stage is not fixed. Another approach
(Sanchez et al., 2006) uses sensitivity indices
(the importance of each feature is given in terms
of the variance) to guide a backward elimination
process, with several features discarded in one
step.

Random algorithms that employ randomness to
avoid local optimal solutions and enable temporary
transition to other states with poorer solutions.
Examples are simulated annealing and genetic
algorithms.
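
The four properties listed earlier (initial state, search strategy, evaluation function, stop criterion) can be made concrete with a minimal sketch of sequential forward selection. The function name, the `evaluate` callback and the toy scoring function are assumptions made for illustration; they are not taken from any of the cited works.

```python
from typing import Callable, FrozenSet, Iterable


def sequential_forward_selection(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy SFS: start from the empty set and add one feature at a time."""
    remaining = set(features)
    selected: FrozenSet[str] = frozenset()      # initial state: empty set
    best_score = evaluate(selected)
    while remaining:
        # Search strategy: try every one-feature extension of the current
        # subset (sorted() only makes tie-breaking deterministic).
        candidate = max(sorted(remaining),
                        key=lambda f: evaluate(selected | {f}))
        score = evaluate(selected | {candidate})
        if score <= best_score:                 # stop criterion: no improvement
            break
        selected = selected | {candidate}
        best_score = score
        remaining.remove(candidate)
    return selected


# Hypothetical evaluation function: features "a" and "b" are useful and
# every selected feature carries a small penalty.
def toy_score(subset: FrozenSet[str]) -> float:
    useful = {"a": 0.5, "b": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)


chosen = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
print(sorted(chosen))
```

Sequential backward selection would follow the same pattern, starting from the full set and removing the feature whose removal gives the best score. Because the search is greedy, a pair of features that is only useful jointly can be missed, which is the limitation noted above.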

The most popular FSA classification, which refers
to the evaluation function, distinguishes three groups
(Blum & Langley, 1997), or only the last two of them
(Kohavi & John, 1997), as follows:

Embedded methods. The induction algorithm is
simultaneously an FSA. Examples of this method
are decision trees, such as classification and
regression trees (CART), and artificial neural networks
(ANN).

Filter methods. Selection is carried out as a
pre-processing step with no induction algorithm
(Figure 1). The general characteristics of the
training data are used to select features (for
example, distances between classes or statistical
dependencies). This model is faster than the
wrapper approach (described below) and results
in a better generalisation because it acts
independently of the induction algorithm. However,
it tends to select subsets with a high number of
features (even all the features) and so a threshold
is required to choose a subset.

Wrapper methods. Wrapper models use the
induction algorithm to evaluate each subset of
features, i.e., the induction algorithm is part of the
evaluation function in the wrapper model, which
means this model is more precise than the filter
model. It also takes account of techniques, such as
cross-validation, that avoid over-fitting. However,
wrapper models are very time consuming, which
restricts application with some datasets. Moreover,
although they may obtain good results with the
inherent induction algorithm, they may perform
poorly with an alternative algorithm.

Figure 1. Filter algorithm

Hybrid methods that combine filter and wrapper 
methods have recently been attracting a great deal of 
attention in the FS literature (Liu & Motoda, 1998; 
Guyon et al., 2006). Although the following sections 
of this chapter are mainly devoted to filter and wrap- 
per methods, a brief review of the most recent hybrid 
methods is also included. 

Filter Methods 

A number of representative filter algorithms are
described in the literature, such as the chi-squared
statistic, information gain, or correlation-based feature
selection (CFS). For the sake of completeness, we will
refer to two classical algorithms (FOCUS and RELIEF)
and will describe two recently developed filter methods
(FCBF and INTERACT). An exhaustive discussion of filter
methods is provided in Guyon et al. (2006), including
methods such as Random Forests (RF), an ensemble
of tree classifiers.

FOCUS 

In FOCUS (Almuallim & Dietterich, 1991) all feature
subsets of increasing size are evaluated until a
suitable subset is encountered. Feature subset q is said
to be suitable if there is no pair of examples that have
different class values and the same values for all the
features in q. The successor of this algorithm is
FOCUS2 (Almuallim & Dietterich, 1992), which prunes
the search space, thereby evaluating only promising
subsets. FOCUS2 is therefore much faster than
FOCUS. However, using either algorithm in domains with
a large number of features may be computationally
unfeasible. Consequently, search heuristics are used in
different versions of the algorithm, resulting in good
but not necessarily optimal solutions.
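
The suitability test and the search by increasing subset size can be sketched as follows. This is an illustrative reading of the description above, assuming discrete feature values; the function names and the toy data are hypothetical and the code does not reproduce the pruning used by FOCUS2.

```python
from itertools import combinations


def is_suitable(rows, labels, subset):
    """True if no two examples agree on `subset` but differ in class."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        # setdefault stores the first label seen for this combination of
        # values; any later disagreement makes the subset unsuitable.
        if seen.setdefault(key, label) != label:
            return False
    return True


def focus(rows, labels):
    """Return the first suitable subset found, smallest sizes first."""
    n = len(rows[0])
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_suitable(rows, labels, subset):
                return subset
    # Reached only when identical examples carry different labels.
    return tuple(range(n))


# Toy data: the class is the XOR of features 0 and 2; feature 1 is irrelevant.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 1, 0), (1, 0, 0)]
labels = [a ^ c for a, _, c in rows]
print(focus(rows, labels))
```

The nested loops visit up to 2^n subsets, which illustrates why the exhaustive version becomes unfeasible when the number of features is large.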

RELIEF 

The RELIEF algorithm (Kira & Rendell, 1992)
estimates the quality of attributes according to how well
their values distinguish between instances that are
near to each other. For this purpose, given a randomly
selected instance, x_s = {x_1s, x_2s, ..., x_ns}, RELIEF
searches for its two nearest neighbours: one from the
same class, called nearest hit H, and the other from a
different class, called nearest miss M. It then updates the
quality estimate for all the features, depending on the
values for x_s, M, and H. RELIEF can deal with discrete
and continuous features but is limited to two-class
problems. An extension, ReliefF, not only deals with
multiclass problems but is also more robust and capable
of dealing with incomplete and noisy data. ReliefF was
subsequently adapted for continuous class (regression)
problems, resulting in the RReliefF algorithm
(Robnik-Sikonja & Kononenko, 2003).
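
A minimal sketch of the update based on nearest hit and nearest miss is given below. It makes several simplifying assumptions that are not part of the description above: every instance is visited once instead of drawing a random sample, features are numeric and already scaled to [0, 1], absolute differences are used in the update, and each class contains at least two instances. The function name and data are hypothetical.

```python
def relief_weights(rows, labels):
    """Two-class RELIEF-style feature weights (illustrative sketch)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    for i, (x, y) in enumerate(zip(rows, labels)):
        others = [j for j in range(len(rows)) if j != i]
        # Nearest hit: closest instance of the same class.
        hit = min((j for j in others if labels[j] == y),
                  key=lambda j: distance(x, rows[j]))
        # Nearest miss: closest instance of the other class.
        miss = min((j for j in others if labels[j] != y),
                   key=lambda j: distance(x, rows[j]))
        for f in range(n_features):
            # Reward features that differ on the miss and agree on the hit.
            weights[f] += (abs(x[f] - rows[miss][f])
                           - abs(x[f] - rows[hit][f])) / len(rows)
    return weights


# Feature 0 separates the two classes; feature 1 is noise.
rows = [(0.0, 0.2), (0.1, 0.9), (0.2, 0.5),
        (0.8, 0.3), (0.9, 0.8), (1.0, 0.4)]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(rows, labels)
print(weights)
```

Features with larger weights are better at separating neighbouring instances of different classes; a threshold on the weights would then be used to pick the final subset.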

FCBF and INTERACT 

The fast correlation-based filter (FCBF) method (Yu &
Liu, 2003) is based on symmetrical uncertainty (SU),
which is defined as the ratio between the information
gain and the entropy of two features, x and y:

$$
SU(x, y) = 2 \left[ \frac{IG(x \mid y)}{H(x) + H(y)} \right]
$$

This method was designed for high-dimensionality
data and has been shown to be effective in removing 
both irrelevant and redundant features. However, it 
fails to take into consideration the interaction between 
features. The INTERACT algorithm (Zhao & Liu, 
2007) uses the same goodness measure, SU, but also 
includes the consistency contribution (c-contribution). 
It can thus handle feature interaction, and efficiently 
selects relevant features. 
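
The symmetrical uncertainty used by both methods can be computed directly from value counts. The sketch below assumes discrete features and uses the identity IG(x|y) = H(x) + H(y) - H(x, y); the function names and the convention of returning 0 for two constant features are choices made here for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * IG(x|y) / (H(x) + H(y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both features constant; convention assumed here
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(x, y)
    info_gain = h_x + h_y - h_xy         # IG(x|y) = H(x) - H(x|y)
    return 2.0 * info_gain / (h_x + h_y)


print(symmetrical_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # identical features
print(symmetrical_uncertainty([0, 0, 1, 1], [0, 1, 0, 1]))  # independent features
```

A value of 1 indicates that either feature completely predicts the other, and a value of 0 indicates independence, which is why SU can be used to detect both irrelevant features (low SU with the class) and redundant ones (high SU with another feature).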

Wrapper Methods 

The idea of the wrapper approach is to select a
feature subset using a learning algorithm as part of the
evaluation function (Figure 2). Instead of using
subset sufficiency, entropy or another explicitly defined
evaluation function, a kind of "black box" function is
used to guide the search. The evaluation function for
each candidate feature subset returns an estimate of
the quality of the model that is induced by the learning
algorithm. This can be rather time consuming, since,
for each candidate feature subset evaluated during the
search, the target learning algorithm is usually
applied several times (e.g., in the case of 10-fold cross
validation being used to estimate model quality). Here
we briefly describe several feature subset selection
algorithms, developed in machine learning, that
are based on the wrapper approach. The literature is
vast in this area and so we will just focus on the most
representative wrapper models.
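
To illustrate why the wrapper evaluation is costly, the sketch below scores one candidate subset by k-fold cross-validation. A 1-nearest-neighbour classifier stands in for the induction algorithm; this choice, the interleaved fold assignment, the function names and the toy data are all assumptions made for the example.

```python
def one_nn_predict(train_x, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]


def wrapper_score(rows, labels, subset, k=3):
    """Cross-validated accuracy of 1-NN using only the features in `subset`."""
    projected = [tuple(r[i] for i in subset) for r in rows]
    correct = 0
    for fold in range(k):
        # Interleaved folds: instance i belongs to fold i % k.
        test_idx = range(fold, len(rows), k)
        train_idx = [i for i in range(len(rows)) if i % k != fold]
        train_x = [projected[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        # The learner is "induced" once per fold for every candidate subset.
        correct += sum(
            one_nn_predict(train_x, train_y, projected[i]) == labels[i]
            for i in test_idx
        )
    return correct / len(rows)


# Feature 0 tracks the class; feature 1 is noise on a larger scale.
rows = [(0.0, 5.0), (1.0, 1.0), (0.1, 0.0),
        (0.9, 6.0), (0.2, 3.0), (1.1, 2.0)]
labels = [0, 1, 0, 1, 0, 1]
score_informative = wrapper_score(rows, labels, (0,))
score_noise = wrapper_score(rows, labels, (1,))
print(score_informative, score_noise)
```

A search procedure would call `wrapper_score` for every candidate subset it visits, so the total cost is the number of candidates multiplied by k training runs.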

An interesting study of the wrapper approach was
conducted by Kohavi & John (1997). Besides
introducing the notion of strong and weak feature relevance,
these authors showed the results achieved by different
induction algorithms (ID3, C4.5, and naive Bayes) with
several search methods (best first, hill-climbing, etc.).
Aha & Bankert (1995) used a wrapper approach in
instance-based learning and proposed a new search
strategy that performs beam search using a kind of
backward elimination; that is, instead of starting with
an empty feature subset, the search randomly selects a
fixed number of feature subsets and starts with the best
among them. Caruana & Freitag (1994) developed a
wrapper feature subset selection method for decision
tree induction, proposing bidirectional hill-climbing
for the feature space as more effective than either
forward or backward selection. Genetic algorithms
have been broadly adopted to perform the search for
the best subset of features in a wrapper way (Liu &
Motoda, 1998; Huang et al., 2007). The feature selection
methods using support vector machines (SVMs) have
obtained satisfactory results (Weston et al., 2001). SVMs
are also combined with other techniques to implement
feature selection (different approaches are described
in Guyon et al., 2006). Kim et al. (2003) use artificial
neural networks (ANNs) for customer prediction and
ELSA (Evolutionary Local Selection Algorithm) to
search for promising subsets of features.

Figure 2. Wrapper algorithm (diagram blocks: Training
data; Feature Search; Feature Evaluation; Induction
Algorithm; Test data; Evaluation; Accuracy. Arrow labels:
"Set of features", "Measure of goodness", "Hypothesis")

Hybrid Methods 

The computational cost associated with the
wrapper model makes it unfeasible when the number
of features is high, whereas the performance of the
filter model is often less than satisfactory. The hybrid
model is a good combination of the two approaches
that overcomes these problems. Hybrid methods use a
filter to generate a ranked list of features. On the basis
of the order thus defined, nested subsets of features are
generated and evaluated by a learning machine, i.e.
following a wrapper approach (Guyon et al., 2006).
The main features of the hybrid model are depicted in
Figure 3. One of the first hybrid approaches proposed
was that of Yuan et al. (1999). Since then, the hybrid
model has attracted the attention of the research
community and, by now, numerous hybrid models have
been developed to solve a variety of problems, such
as intrusion detection, text categorisation, etc.

As a combination of filter and wrapper models,
there exist a great number of hybrid methods, so it is
not possible to include all of them and therefore we will
refer to some interesting ones. Some hybrid methods
involving SVMs are presented in Guyon et al. (2006),
chapters 20 and 22. Shazzad & Park (2005) investigate
a fast hybrid method (a fusion of Correlation-based
Feature Selection, Support Vector Machine and Genetic
Algorithm) to determine an optimal feature set. A
feature selection model based both on information theory
and statistical tests is presented by Sebban & Nock
(2002). Zhu et al. (2007) incorporate a filter ranking
method in a genetic algorithm to improve classification
performance and accelerate the search process.
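
The generic hybrid scheme described in this section (a filter ranks the features, then a wrapper evaluates the nested subsets defined by that ranking) can be sketched as follows. The function name, the two callbacks and the toy scores are hypothetical; none of the cited hybrid methods is reproduced here.

```python
def hybrid_select(n_features, filter_score, wrapper_score):
    """Rank features with a filter, then pick the best nested subset."""
    # Filter step: one cheap score per feature, highest first.
    ranked = sorted(range(n_features), key=filter_score, reverse=True)
    best_subset, best_score = (), float("-inf")
    # Wrapper step: only n nested subsets are evaluated, not 2**n.
    for size in range(1, n_features + 1):
        subset = tuple(ranked[:size])
        score = wrapper_score(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical scores standing in for a real filter measure and for the
# cross-validated accuracy of a real learning algorithm.
relevance = {0: 0.9, 1: 0.1, 2: 0.6}
accuracy = {(0,): 0.70, (0, 2): 0.85, (0, 2, 1): 0.80}

subset, score = hybrid_select(3, relevance.get, accuracy.get)
print(subset, score)
```

The design trades optimality for cost: the wrapper only sees subsets that respect the filter's ordering, so a feature ranked low by the filter can never be selected ahead of one ranked higher.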



FUTURE TRENDS 

Feature selection is a huge topic that cannot be
fully discussed in a short chapter. To pinpoint new topics in
this area we refer the reader to the suggestions given
by Guyon et al. (2006), summarised as follows:

Unsupervised variable selection. Although this
chapter has focused on supervised feature
selection, several authors have attempted to implement
feature selection for clustering applications (see,
for example, Dy & Brodley, 2004). For supervised
learning tasks, one may want to pre-filter a set
of the most significant variables with respect to a
criterion which does not make use of the target y,
so as to minimise the problem of over-fitting.

Selection of examples. Mislabelled examples may
induce a choice of wrong variables, so it may be
preferable to jointly select both variables and
examples.

System reverse engineering. This chapter focuses
on the problem of selecting features useful to build
a good predictor. Unravelling the causal
dependencies between variables and reverse engineering
the system that produced the data is a far more
challenging task that is beyond the scope of this
chapter (but see, for example, Pearl, 2000).

Figure 3. Hybrid algorithm



CONCLUSION 

Feature selection for classification and regression is a
major research topic in machine learning. It covers many
different fields, such as text categorisation,
intrusion detection, and micro-array data. This
study reviews key algorithms used for feature selection,
including filter, wrapper and hybrid approaches. The
review is not exhaustive and is merely designed to give
an idea of the state of the art in the field. Most feature
selection algorithms lead to significant reductions in
the dimensionality of the data without sacrificing the
performance of the resulting models. Choosing between
approaches depends on the problem in hand. Adopting
a filtering approach is computationally acceptable,
but the more complex wrapper approach tends to
produce greater accuracy in the final result. The filtering
approach is very flexible, since any target learning
algorithm can be used, and it is also faster than the
wrapper approach. The latter, on the other hand, is more
dependent on the learning algorithm, but its selection
process is better. The hybrid approach offers promise
both for improving classification accuracy and for
identifying the attributes relevant to the analysis.



REFERENCES 

Aha, D.W., and Bankert, R. L. (1995). A comparative 
evaluation of sequential feature selection algorithms. 
Proceedings of the Fifth International Workshop on 
Artificial Intelligence and Statistics, 1-7. Springer- 
Verlag. 



Almuallim, H. & Dietterich, T. G. (1991). Learning
with many irrelevant features. Proceedings of the 9th
National Conference on Artificial Intelligence,
547-552, AAAI Press.

Almuallim, H. & Dietterich, T. G. (1992). Efficient
algorithms for identifying relevant features. Proceedings of
the 9th Canadian Conference on Artificial Intelligence,
38-45, Vancouver.

Alpaydin, E. (2004). Introduction to Machine Learn- 
ing. MIT Press. 

Blum, A. L. & Langley, P. (1997). Selection of relevant
features and examples in machine learning. Artificial
Intelligence, (97)1-2, 245-271.

Caruana, R. & Freitag, D. (1994). Greedy attribute 
selection. Proceedings of the Eleventh International 
Conference on Machine Learning. Morgan Kaufmann 
Publishers, Inc., 28-36. 

Dy, J. G. & Brodley, C. E. (2004). Feature Selection for 
Unsupervised Learning. Journal of Machine Learning 
Research, (5), 845-889. 

Guyon, I. & Elisseeff, A. (2003). An introduction to 
variable and feature selection. Journal of Machine 
Learning Research, (3), 1157-1182. 

Guyon, I., Gunn, S., Nikravesh, M. & Zadeh, L. A.
(2006). Feature Extraction. Foundations and
Applications. Springer.

Huang, J., Cai, Y. & Xu, X. (2007). A hybrid genetic
algorithm for feature selection wrapper based on
mutual information. Pattern Recognition Letters, (28)13,
1825-1844.

Kim, Y., Street, W. N. & Menczer, F. (2003). Feature
selection in data mining. Data mining: opportunities
and challenges, 80-105. IGI Publishing.

Kira, K. & Rendell, L. (1992). The feature selection 
problem: traditional methods and new algorithm. Proc. 
AAAI'92, San Jose, CA. 

Kohavi, R. & John, G. (1997). Wrappers for feature 
subset selection. Artificial Intelligence, (97)1-2, 273- 
324. 

Liu, H. & Motoda, H. (1998). Feature extraction, 
construction and selection. A data mining perspective. 
Kluwer Academic Publishers. 







Feature Selection 



Pearl, J. (2000). Causality. Cambridge University
Press.

Pudil, P. and Novovicova, J. and Kittler, J. (1994). 
Floating search methods in feature-selection. Pattern 
Recognition Letters, (15) 11, 1119-1125. 

Robnik-Sikonja, M. & Kononenko, I. (2003).
Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, (53), 23-69, Kluwer Academic
Publishers.

Salappa, A., Doumpos, M. & Zopounidis, C. (2007). 
Feature selection algorithms in classification problems: 
an experimental evaluation. Optimization Methods and 
Software, (22)1, 199-212. 

Sanchez-Marono, N., Caamano-Fernandez, M., Cas- 
tillo, E & Alonso-Betanzos, A.(2006). Functional 
networks and analysis of variance for feature selection. 
Proceedings of International Conference on Intel- 
ligent Data Engineering and Automated Learning, 
1031-1038. 

Shazzad, K. M. & Park, J. S. (2005). Optimization of
Intrusion Detection through Fast Hybrid Feature
Selection. International Conference on Parallel and
Distributed Computing, Applications and
Technologies, 264-267.

Sebban, M. & Nock, R. (2002). A hybrid filter/wrapper
approach of feature selection using information theory.
Pattern Recognition, (35)4, 835-846.

Yu, L. and Liu, H. (2003). Feature selection for high- 
dimensional data: A Fast Correlation-Based Filter 
Solution. Proceedings of The Twentieth International 
Conference on Machine Learning, 856-863. 

Yuan, H., Tseng, S.S., Gangshan, S. and Fuyan, Z. 
(1999). Two-phase feature selection method using both 
filter and wrapper. Proceedings of IEEE International 
Conference on Systems, Man, and Cybernetics, (2) 
132-136. 

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M.,
Poggio, T. and Vapnik, V. (2001). Feature selection for
SVMs. Advances in Neural Information Processing
Systems, (13). MIT Press.

Zhao, Z. and Liu, H. (2007). Searching for interacting
features. Proceedings of International Joint Conference
on Artificial Intelligence, 1157-1161.

Zhu, Z., Ong, Y., Dash, M. (2007). Wrapper-Filter
Feature Selection Algorithm Using a Memetic Framework.
IEEE Transactions on Systems, Man and Cybernetics,
Part B, (37)1, 70-76.



KEY TERMS 

Dimensionality Reduction: The process of reduc- 
ing the number of features under consideration. The 
process can be classified in terms of feature selection 
and feature extraction. 

Feature Extraction: A dimensionality reduction 
method that finds a reduced set of features that are a 
combination of the original ones. 

Feature Selection: A dimensionality reduction 
method that consists of selecting a subset of relevant 
features from a complete set while ignoring the remain- 
ing features. 

Filter Method: A feature selection method that 
relies on the general characteristics of the training data 
to select and discard features. Different measures can 
be employed: distance between classes, entropy, etc. 

Hybrid Method: A feature selection method that
combines the advantages of wrapper and filter
methods to deal with high dimensionality data.

Sequential Backward (Forward) Selection
(SBS/SFS): A search method that starts with all the
features (an empty set of features) and removes (adds)
a single feature at each step with a view to improving
(or minimally degrading) the cost function.

Wrapper Method: A feature selection method that 
uses a learning machine as a "black box" to score subsets 
of features according to their predictive value. 

The answer to the theoretical question: "Can a machine
be built capable of doing what the brain does?" is yes,
provided you specify in a finite and unambiguous way
what the brain does.

Warren S. McCulloch 



INTRODUCTION 

The class of adaptive systems known as Artificial Neural 
Networks (ANN) was motivated by the amazing parallel 
processing capabilities of biological brains (especially 
the human brain). The main driving force was to re-cre- 
ate these abilities by constructing artificial models of 
the biological neuron. The power of biological neural 
structures stems from the enormous number of highly 
interconnected simple units. The simplicity comes 
from the fact that, once the complex electro-chemical 
processes are abstracted, the resulting computation 
turns out to be conceptually very simple. 

Nowadays these artificial neurons have little in
common with their biological counterpart in the ANN
paradigm. Rather, they are primarily used as
computational devices, clearly intended for problem solving:
optimization, function approximation, classification,
time-series prediction and others. In practice few
elements are connected and their connectivity is low. This
chapter focuses on supervised feed-forward networks.
The field has become so vast that a complete and
clear-cut description of all the approaches is an enormous
undertaking; we refer the reader to Fiesler & Beale
(1997) for a comprehensive exposition.



BACKGROUND 

Artificial Neural Networks (Bishop, 1995), (Haykin,
1994), (Hertz, Krogh & Palmer, 1991), (Hecht-Nielsen,
1990) are information processing structures without
global or shared memory, where each of the computing
elements operates only when all its incoming
information is available, a kind of data-flow architecture.
Each element is a simple processor with internal and
adjustable parameters. The interest in ANN is
primarily related to the finding of satisfactory solutions for
problems cast as function approximation tasks and
for which there is little or no knowledge about the
process itself, but a (limited) access to examples of
response. They have been widely and most fruitfully
used in a variety of applications (see Fiesler & Beale,
1997, for a comprehensive review), especially after the
influential works of (Hopfield, 1982), (Rumelhart, Hinton
& Williams, 1986) and (Fukushima, 1980).

The most general form for an ANN is a labelled 
directed graph, where each of the nodes (called units 
or neurons) has a certain computing ability and is 
connected to and from other nodes in the network 
via labelled edges. The edge label is a real number 
expressing the strength with which the two involved 
units are connected. These labels are called weights. 
The architecture of a network refers to the number of 
units, their arrangement and connectivity. 

In its basic form, the computation of a unit i is
expressed as a function $F_i$ of its input (the transfer
function), parameterized with its weight vector or
local information. The whole system is thus a collection
of interconnected elements, and the transfer function
performed by a single one (i.e., the neuron model) is
the most important fixed characteristic of the system.

There are two basic types of neuron models in the
literature used in practice. Both express the overall
computation of the unit as the composition of two
functions, as has classically been done since the early
model proposed by McCulloch & Pitts (1943):

$$
F_i(x) = \{\, g(h(x, w_i)),\ w_i \in \mathbb{R}^n \,\}, \quad x \in \mathbb{R}^n \qquad (1)
$$

where $w_i$ is the weight vector of neuron i,
$h: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$
is called the net input or aggregation function, and
$g: \mathbb{R} \to \mathbb{R}$ is called the activation
function. All neuron parameters are included in its
weight vector.

The choice $h(x, w_i) = x \cdot w_i + \theta$, where
$\theta \in \mathbb{R}$ is an offset
term that may be included in the weight vector, leads
to one of the most widely used neuron models. When
neurons of this type are arranged in a feed-forward
architecture, the obtained neural network is called
MultiLayer Perceptron (MLP) (Rumelhart, Hinton
& Williams, 1986). Usually, a smooth non-linear and
monotonic function is used as activation. Among them,
the sigmoids are a preferred choice.

Figure 1. A classification problem. Left: Separation by
spherical RBF units (R-neurons). Right: Separation by
straight lines (P-neurons) in the MLP.

The choice $h(x, w_i) = \|x - w_i\| / \theta$ (or other
distance measure), with $\theta > 0 \in \mathbb{R}$ a
smoothing term, plus an
activation g with a monotonically decreasing response
from the origin, leads to the wide family of localized
Radial Basis Function networks (RBF) (Poggio &
Girosi, 1989). Localized means that the units give a
significant response only in a neighbourhood of their
centre $w_i$. A Gaussian $g(z) = \exp(-z^2/2)$ is a preferred
choice for the activation function.
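
The two neuron models differ only in the aggregation function h; the activation g is applied to its result in both cases. The sketch below illustrates this composition with a logistic sigmoid and a Gaussian as default activations. The function names and default arguments are choices made for this example.

```python
import math


def scalar_product_unit(x, w, theta,
                        g=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """g(h(x, w)) with h(x, w) = x . w + theta (MLP-type unit)."""
    h = sum(xi * wi for xi, wi in zip(x, w)) + theta
    return g(h)


def distance_unit(x, w, theta, g=lambda z: math.exp(-z * z / 2.0)):
    """g(h(x, w)) with h(x, w) = ||x - w|| / theta, theta > 0 (RBF-type unit)."""
    h = math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w))) / theta
    return g(h)


w = (1.0, 1.0)
# Points with the same scalar product with w give the same activation.
print(scalar_product_unit((1.0, 0.0), w, 0.0),
      scalar_product_unit((0.0, 1.0), w, 0.0))
# The response of the distance-based unit is largest at its centre w
# and decays with the distance to it.
print(distance_unit(w, w, 1.0), distance_unit((2.0, 1.0), w, 1.0))
```

The first unit responds identically along directions orthogonal to w (a global response), whereas the second responds significantly only near w (a localized response).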

The previous choices can be extended to take into
account extra correlations between input variables. The
inner product (containing no cross-product terms) can be
generalized to a real quadratic form (a homogeneous
polynomial of second degree with real coefficients) or
even further to higher degrees, leading to the so-called
higher-order units (or sigma-pi units). A higher-order unit
of degree k includes all possible cross-products of at
most k input variables, each with its own weight.
Conversely, basic Euclidean distances can be generalized
to completely weighted distance measures, where all
the (quadratic) cross-products are included. These full
expressions are not commonly used because of the high
numbers of free parameters they involve.
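
The growth in free parameters can be quantified under one explicit assumption: that a unit of degree k has one weight per distinct monomial of degree 1 to k in the n inputs, with repeated variables allowed (as in a quadratic form, which contains squared terms). The count below follows from that assumption and the function name is hypothetical.

```python
from math import comb


def higher_order_weights(n, k):
    """Number of distinct monomials of degree 1..k in n variables."""
    # comb(n + k, k) counts all monomials of degree 0..k; the constant
    # term (degree 0) is excluded here.
    return comb(n + k, k) - 1


for n, k in [(10, 1), (10, 2), (10, 3), (100, 3)]:
    print(n, k, higher_order_weights(n, k))
```

With 100 inputs, a single unit of degree 3 would already need more than 10^5 weights under this counting, which is consistent with the remark that full expressions are rarely used.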

These two basic neuron models have traditionally
been regarded as completely separate, both from a
mathematical and a conceptual point of view. To a
certain degree, this is true: the local vs. global
approximation approaches to a function that they carry
out make them apparently quite opposite methods (see
Fig. 1). Mathematically, under certain conditions, they
can be shown to be related (Dorffner, 1995). These
conditions (basically, that both input and weight
vectors are normalized to unit norm) are difficult to fulfil
in practice.

A layer is defined as a collection of independent units
(not connected with one another) sharing the same input,
and of the same functional form (same $F_i$ but different
$w_i$). Multilayer feed-forward networks take the form of
directed acyclic graphs obtained by concatenation of
a number of layers. All the layers but the last (called
the output layer) are labelled as hidden. Networks of
this kind (shown in Fig. 2) compute a parameterized
function $F_w(x)$ of their input vector x by evaluating
the layers in order, giving as final outcome the output
of the last layer. The vector w represents the collection
of all the weights (free parameters) in the network.
For simplicity, we are not considering connections
between non-adjacent layers (skip-layer connections)
and assume otherwise total connectivity. The set of
input variables is not counted as a layer.

Output neurons take the form of a scalar product (a
linear combination), possibly followed by an
activation function g. For example, assuming a single output
neuron, a one-hidden-layer neural network with h
hidden units computes a function
$F_w: \mathbb{R}^n \to \mathbb{R}$ of the form:

$$
F_w(x) = g\left( \sum_{i=1}^{h} c_i F_i(x) + \theta \right) \qquad (2)
$$

where $\theta \in \mathbb{R}$ is an offset term (called the
bias term), $c_i \in \mathbb{R}$
and g can be set as desired, including the choice $g(z) = z$.
Such a feed-forward network has
$\dim(w) = (n + 1)h + h + 1$
parameters to be adjusted.

Figure 2. A two-hidden-layer example of ANN, mapping a
three-dimensional input space x = (x1, x2, x3) to a
two-dimensional output space (y1, y2) = F_w(x). The network
has four and three units in the first and second hidden
layers, respectively, and two output neurons. The vector w
represents the collection of all the weights in the
network.
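
A minimal sketch of the one-hidden-layer computation in equation (2) and of the parameter count follows. It assumes hidden units based on the scalar product with a logistic activation and uses the identity as default output activation; the function names and the representation of the weights as plain lists are choices made for this example.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden, c, theta, g=lambda z: z):
    """Compute g(sum_i c_i * F_i(x) + theta) for one output neuron.

    hidden: list of (w_i, theta_i) pairs, one per hidden unit.
    c:      list of output coefficients c_i, one per hidden unit.
    """
    activations = [
        logistic(sum(wj * xj for wj, xj in zip(w_i, x)) + theta_i)
        for w_i, theta_i in hidden
    ]
    return g(sum(ci * ai for ci, ai in zip(c, activations)) + theta)


def n_parameters(n, h):
    """(n + 1) weights per hidden unit, h output coefficients, one bias."""
    return (n + 1) * h + h + 1


# Two hidden units with all-zero weights output 0.5 each.
hidden = [((0.0, 0.0, 0.0), 0.0), ((0.0, 0.0, 0.0), 0.0)]
output = forward((1.0, 2.0, 3.0), hidden, c=[1.0, 1.0], theta=0.0)
print(output, n_parameters(3, 2))
```

Evaluating networks with more hidden layers would repeat the same pattern, feeding the activations of one layer as the input vector of the next.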



FEED-FORWARD NEURAL NETWORKS 

The RBF and MLP networks provide parameterized
families of functions suitable for function approximation
on multidimensional spaces. A sigmoid neuron defines
a hyperplane that divides its input space into two halves.
In other words, the points of equal neuron activation
(with fixed weights) are hyperplanes. This behaviour
is not caused by the sigmoid, but by the scalar product.
The isoactivation contours for an RBF unit (in case of
an unweighted Euclidean norm) are hyperspheres. The
radially symmetric and centered response is not caused
by the activation function (e.g., Gaussian or exponential)
but by the norm. In both cases, the activation function
acts as a non-linear monotonic distortion of its argument
as computed by the aggregation function.

Definition (Isoactivation set). Given a real function
$f: \mathbb{R}^n \to (a,b)$, define $I_f^{\alpha}$ for $\alpha \in (a,b)$ as the
set of isoactivation points
$I_f^{\alpha} = \{x \in \mathbb{R}^n \mid f(x) = \alpha\}$.



Definition (P-neuron). A neuron model $F_i$ of the
form:

$$F_i(x) = \{\, g(w_i \cdot x + \theta_i),\ w_i \in \mathbb{R}^n,\ \theta_i \in \mathbb{R} \,\} \qquad (3)$$

with $g$ a bounded, non-linear and increasing function
for which $\lim_{z \to +\infty} g(z) = g_{max} \in \mathbb{R}$ and
$\lim_{z \to -\infty} g(z) = g_{min} \in \mathbb{R}$ will be denoted P-neuron
(from Perceptron). For these neurons, the sets $I_{F_i}^{\alpha}$ are
(n-1)-dimensional hyperplanes for constant values of $\alpha$,
parallel with one another for different $\alpha$. In practice,
the $g$ are usually the well-behaved sigmoids, though
other activation functions are sometimes found in the
literature (e.g., sinusoid). The latter are not included
in the above Definition.

Definition (R-neuron). A neuron model $F_i$ of the
form:

$$F_i(x) = \{\, g(\|x - w_i\|_q / \theta_i),\ w_i \in \mathbb{R}^n,\ \theta_i \neq 0 \in \mathbb{R},\ q \geq 1 \in \mathbb{R} \,\} \qquad (4)$$

where $\|\cdot\|_q$ is a norm and $g$ is a symmetric function
such that $g(|z|)$ is monotonic, with a maximum $g_{max}$
at $F_i(w_i)$ and a (possibly asymptotically reached)
minimum $g_{min}$, will be denoted R-neuron (from Radial).
For these neurons, the sets $I_{F_i}^{\alpha}$ are (n-1)-dimensional
hypersurfaces (centered at $w_i$) for constant values of
$\alpha$ (e.g., cross-polytopes, i.e. rotated squares in two
dimensions, for $q=1$; hyperspheres for $q=2$),
concentric with one another for different $\alpha$.

Figure 3. The logistic function $l(z) = g^{log}_{1.5}(z)$ and its first
derivative $l'(z) > 0$. The derivative is maximum at the origin,
corresponding to a medium activation $l(0) = 0.5$. This point acts
as an initial "neutral" value around a quasi-linear slope.
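The isoactivation property of both neuron models can be checked numerically. The sketch below assumes a logistic $g$ for the P-neuron and a Gaussian $g(z)=\exp(-z^2)$ for the R-neuron; the function names and the test points are illustrative only.

```python
import numpy as np

def p_neuron(x, w, theta):
    # P-neuron as in (3): logistic activation of the scalar product w.x + theta
    return 1.0 / (1.0 + np.exp(-(w @ x + theta)))

def r_neuron(x, w, theta, q=2):
    # R-neuron as in (4): Gaussian activation of the scaled q-norm distance
    return np.exp(-(np.linalg.norm(x - w, ord=q) / theta) ** 2)

w = np.array([1.0, 2.0])
theta = 0.5
x = np.array([0.3, -0.7])

# Moving along a direction orthogonal to w stays on the same hyperplane.
d = np.array([2.0, -1.0])                      # w @ d == 0
p_same = np.isclose(p_neuron(x, w, theta), p_neuron(x + 3.0 * d, w, theta))

# Two points at the same Euclidean distance from the centre w.
radius = 1.5
a = w + radius * np.array([1.0, 0.0])
b = w + radius * np.array([0.0, -1.0])
r_same = np.isclose(r_neuron(a, w, theta), r_neuron(b, w, theta))
```

In both cases the activation function only rescales a quantity (scalar product or distance) that is already constant on the isoactivation set.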

The norm used can be any Minkowskian norm of
the form:

$$\|z\|_q = \left( \sum_{i=1}^{n} |z_i|^q \right)^{1/q}, \quad z_i \in \mathbb{R} \qquad (5)$$

In practice, typical choices are $q=2$ and $g$ a Gaussian
function.

Due to their widespread use, we present two of the
most popular sigmoids, and show how they are tightly
related. A sigmoid function $g$ can be defined as a
monotonically increasing function exhibiting smoothness
and asymptotic properties. The two most commonly
found representatives are the logistic:

$$g^{log}_{\beta}(z) = \frac{1}{1+\exp(-\beta(z-\theta))} \in (0,1) \qquad (6)$$

and the hyperbolic tangent:

$$g^{tanh}_{\beta}(z) = \frac{\exp(\beta(z-\theta)) - \exp(-\beta(z-\theta))}{\exp(\beta(z-\theta)) + \exp(-\beta(z-\theta))} \in (-1,1) \qquad (7)$$

The offset $\theta$ is in practice set to zero, because its
function is the same as that of the bias term in the
aggregation function in (3). These two families of
functions can be made exactly the same shape (assuming
$\theta=0$) by making the $\beta$ in (6) be twice the value of the
$\beta$ in (7). For instance, for $\beta=0.5$ in (7):

$$g^{tanh}_{0.5}(z) = g^{tanh}_{1}(z/2) = \frac{1-\exp(-z)}{1+\exp(-z)} = 2\,g^{log}_{1}(z) - 1 \qquad (8)$$

is the bipolar version of $g^{log}_{1}(z) = \frac{1}{1+\exp(-z)}$. These
functions are chosen because of their simple analytic
behaviour, especially concerning differentiability,
which is of great importance for learning algorithms
relying on derivative information (Fletcher, 1980). In
particular,

$$(g^{log}_{\beta})'(z) = \beta\, g^{log}_{\beta}(z)\,(1 - g^{log}_{\beta}(z)) \qquad (9)$$
The interest in sigmoid functions also lies in the
behaviour of their derivatives. Consider, for example,
(6) with $\beta=1.5$ and $\theta=0$, plotted in Fig. 3. The
derivative of a sigmoid is always positive. For $\theta=0$, all the
functions are centred at $z=0$. At this point, the function
has a medium activation, and its derivative is maximum,
allowing for maximum weight updates.
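The relation between the logistic and the hyperbolic tangent, and the closed form of the logistic derivative, can be verified numerically. This is a small sketch assuming $\theta=0$ by default; the function names `g_log` and `g_tanh` are illustrative.

```python
import numpy as np

def g_log(z, beta=1.0, theta=0.0):
    # Logistic sigmoid, range (0, 1)
    return 1.0 / (1.0 + np.exp(-beta * (z - theta)))

def g_tanh(z, beta=1.0, theta=0.0):
    # Hyperbolic tangent sigmoid, range (-1, 1)
    return np.tanh(beta * (z - theta))

z = np.linspace(-6.0, 6.0, 241)

# tanh with beta = 0.5 is the bipolar version of the logistic with beta = 1.
bipolar_ok = np.allclose(g_tanh(z, beta=0.5), 2.0 * g_log(z, beta=1.0) - 1.0)

# Logistic derivative: beta * g * (1 - g), compared with finite differences.
beta = 1.5
analytic = beta * g_log(z, beta) * (1.0 - g_log(z, beta))
numeric = np.gradient(g_log(z, beta), z)
derivative_ok = np.allclose(analytic[1:-1], numeric[1:-1], atol=1e-3)

# The derivative is positive everywhere and peaks at z = 0 with value beta / 4.
peak_at_origin = np.isclose(z[np.argmax(analytic)], 0.0)
```

The peak value $\beta/4$ follows directly from the derivative formula evaluated at the medium activation 0.5.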

Types of Artificial Neural Networks 

A fundamental distinction to categorize a neural network
relies on the kind of architecture, basically divided into
feed-forward (for which the graph contains no cycles)
and recurrent (all other cases). A very common
feed-forward architecture contains no intra-layer
connections and all possible inter-layer connections
between adjacent layers.

Definition (Feed-forward neural network: structure).
A bipartitioned graph is a graph $G$ whose nodes
$V$ can be partitioned into two disjoint and proper sets $V_1$
and $V_2$, $V_1 \cup V_2 = V$, in such a way that no pair of nodes
in $V_1$ is joined by an edge, and the same property holds
for $V_2$. We then write $G_{n_1,n_2}$, with $n_1 = |V_1|$, $n_2 = |V_2|$. A
bipartitioned graph $G_{n_1,n_2}$ is complete if every node in
$V_1$ is connected to every node in $V_2$. These concepts
can be generalized to an arbitrary number of partitions,
as follows: a k-partitioned graph $G_{n_1,\ldots,n_k}$ is a graph
whose nodes $V$ can be partitioned into $k$ disjoint and
proper sets $V_1,\ldots,V_k$ such that

$$\bigcup_{i=1}^{k} V_i = V$$

in a way that no pair of nodes in $V_i$ is joined by an edge,
for all $1 \leq i \leq k$. Under these conditions, a feed-forward fully
connected neural network with $c$ hidden layers and $h_l$
units per layer $l$, $1 \leq l \leq c+1$, takes the form of a directed
complete (c+1)-partitioned graph $G_{h_1,\ldots,h_{c+1}}$.

Definition (Feed-forward neural network: function).
A feed-forward neural network consisting of
$c$ hidden layers, denoted FFNN(n,c,m), is a function
$F_w: \mathbb{R}^n \to \mathbb{R}^m$ made up of pieces of the form
$y^{(l)} = (F_1^l(y^{(l-1)}),\ldots,F_{h_l}^l(y^{(l-1)}))$, representing the
output of layer $l$, for $1 \leq l \leq c+1$. The $F_i^l$ denote the
neuron model of layer $l$ and $h_l \in \mathbb{N}^+$ their number,
and each neuron $F_i^l$ has its own parameters $w_i^l$ as
in Definitions 2 and 3 (P-neuron and R-neuron), which are
collectively grouped in the network parameters $w$. The first
output is defined as $y^{(0)} = x$. For the last (output) layer,
$h_{c+1} = m$ and the $F_i^{c+1}$, $1 \leq i \leq h_{c+1}$, are P-neurons or
linear units (obtained by removing the activation function in
a P-neuron). The final outcome for $F_w(x)$ is the value of
$y^{(c+1)}$.

Definition (MLPNN). A MultiLayer Perceptron Neural
Network is a FFNN(n,c,m) for which $c \geq 1$ and all the
$F_i^l$ are P-neurons, $1 \leq l \leq c$.

Definition (RBFNN). A Radial Basis Function Neural
Network is a FFNN(n,c,m) for which $c=1$ and all the
$F_i^1$ are R-neurons.
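To make the layer-by-layer composition concrete, here is a minimal sketch of an RBFNN, that is, a FFNN(n,1,m) with one hidden layer of R-neurons and linear output units. It assumes a Gaussian $g$ and the Euclidean norm ($q=2$); `rbf_layer`, `rbfnn`, `centres`, `widths`, `C` and `c0` are illustrative names.

```python
import numpy as np

def rbf_layer(x, centres, widths):
    # Hidden layer y^(1): one Gaussian R-neuron per row of `centres`
    dist = np.linalg.norm(centres - x, axis=1)
    return np.exp(-(dist / widths) ** 2)

def rbfnn(x, centres, widths, C, c0):
    # Output layer y^(2): linear units (P-neurons without activation)
    y1 = rbf_layer(x, centres, widths)
    return C @ y1 + c0

centres = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # h_1 = 3, n = 2
widths = np.array([1.0, 0.5, 2.0])                          # theta_i per unit
C = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]])           # m = 2 outputs
c0 = np.array([0.1, -0.2])
y = rbfnn(np.array([0.0, 0.0]), centres, widths, C, c0)
```

Replacing `rbf_layer` by a layer of P-neurons (scalar product plus sigmoid) would give the corresponding one-hidden-layer MLPNN with the same output layer.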



LEARNING IN ARTIFICIAL NEURAL 
NETWORKS 

A system can be said to learn if its performance on a
given task improves with respect to some measure as
a result of experience (Rosenblatt, 1962). In ANNs the
"experience" is the result of exposure to a training set
of data, accompanied by weight modifications. The
main problem tackled in supervised learning is regression,
the approximation of an n-dimensional function
$f: X \subset \mathbb{R}^n \to \mathbb{R}^m$ by finite superposition (composition
and addition) of known parameterized base functions,
like those in (3) or (4). Their combination gives rise
to expressions of the form $F_w(x)$. The interest is in
finding a parameter vector $w^*$ of size $s$ such that $F_{w^*}(x)$
optimizes a cost functional $L(f, F_w)$ called the loss:

$$w^* = \arg\min_{w \in \mathbb{R}^s} L(f, F_w) \qquad (10)$$

The only information available is a finite set $D$ of
$p$ noisy samples of $f$,
$D = \{\langle x_i, y_i \rangle \mid f(x_i) + \delta_i = y_i\}$, where
$x_i \in \mathbb{R}^n$ is the stimulus, $y_i \in \mathbb{R}^m$ is the target, $\delta_i$ is the
noise (assumed additive) and $|D| = p$. An estimation of
$L(f, F_w)$ can be obtained as $L(D, F_w)$, the apparent loss,
computed separately for each sample in $D$:

$$L(D, F_w) = \sum_{\langle x_i, y_i \rangle \in D} \lambda(y_i, F_w(x_i)) \qquad (11)$$
A common form for $\lambda$ is an error function, such as the
squared error $\lambda(a,b) = (a-b)^2$. This results from the
assumption that the noise follows a homoscedastic
Gaussian distribution with zero mean. When using
this error, the expression (11) can be viewed as the
(squared) Euclidean norm in $\mathbb{R}^p$ of the p-dimensional
error vector $e = (e_1,\ldots,e_p)$, known as the sum-of-squares
error, with $e_i = y_i - F_w(x_i)$, as:

$$L(D, F_w) = \sum_{\langle x_i, y_i \rangle \in D} (y_i - F_w(x_i))^2 = e \cdot e = \|e\|^2 \qquad (12)$$
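A minimal numerical sketch of (12), assuming scalar targets stored in NumPy arrays; the values of `y` and `y_hat` are made up for illustration, and dividing by the number of samples gives the per-sample average of the same quantity.

```python
import numpy as np

def sum_of_squares(y, y_hat):
    """Sum-of-squares error ||e||^2 with e_i = y_i - F_w(x_i)."""
    e = y - y_hat
    return float(e @ e)

y = np.array([1.0, 2.0, 3.0, 4.0])        # targets y_i
y_hat = np.array([1.5, 2.0, 2.0, 4.5])    # network outputs F_w(x_i)
sse = sum_of_squares(y, y_hat)            # 0.25 + 0 + 1 + 0.25 = 1.5
mse = sse / len(y)                        # 1.5 / 4 = 0.375
```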

The usually reported quantity $\|e\|^2/p$ is called mean
square error (MSE), and is a measure of the empirical
error (as opposed to the unknown true error). We shall
denote the error function $E(w) = L(D, F_w)$. In a training
process, the network builds an internal representation
of the target function by finding ways to combine the
set of base functions $\{F_i(x)\}_i$. The validity of a solution
is mainly determined by an acceptably low and
balanced $L(D, F_w)$ and $L(D_{out}, F_w)$, for any
$D_{out} \subset X \setminus D$ (where $D_{out}$ is not used in the learning
process), to ensure that $f$ has been correctly estimated from
the data. Network models that are too inflexible or simple or,
on the contrary, too flexible or complex will generalize
inadequately. This is reflected in the bias-variance
tradeoff: the expected loss for finite samples can be
decomposed into two opposing terms called error bias
and error variance (Geman, Bienenstock & Doursat,
1992). The expectation for the sum-of-squares error
function, averaged over the complete ensemble of data
sets $D$, is written as (Bishop, 1995):

$$E_D\{(F_w(x) - \langle y|x \rangle)^2\} = (E_D\{F_w(x) - \langle y|x \rangle\})^2 + E_D\{(F_w(x) - E_D\{F_w(x)\})^2\} \qquad (13)$$

where $E_D$ is the expectation operator taken over every
data set of the same size as $D$ and $\langle y|x \rangle$ denotes the
conditional average of the target $y=f(x)$ (which expresses
the optimal network mapping), given by:

$$\langle y|x \rangle = \int y\, p(y|x)\, dy$$



The first term in the right hand side of (13) is the
(squared) bias and the second is the variance. The bias
measures the extent to which the average (over all $D$)
of $F_w(x)$ differs from the desired target function $\langle y|x \rangle$.
The variance measures the sensitivity of $F_w(x)$ to the
particular choice of $D$. Models that are too inflexible or
simple will have a large bias, while models that are too
flexible or complex will have a large variance. These are
complementary quantities that have to be minimized
simultaneously; both can be shown to decrease with increasing
availability of larger data sets $D$.
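The decomposition can be illustrated with a small Monte Carlo experiment. The sketch below is not a neural network: as a stand-in for $F_w$ it assumes polynomial least-squares fits of a chosen degree, a sine target playing the role of $\langle y|x \rangle$, and Gaussian noise. All names and settings are illustrative assumptions.

```python
import numpy as np

def bias_variance(degree, n_sets=200, p=20, noise=0.3, seed=0):
    """Estimate squared bias, variance and expected squared error over many data sets."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(2.0 * np.pi * x)      # plays the role of <y|x>
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_sets, x_test.size))
    for k in range(n_sets):                         # one data set D per iteration
        x = rng.uniform(0.0, 1.0, p)
        y = target(x) + rng.normal(0.0, noise, p)
        coeffs = np.polyfit(x, y, degree)           # "training" on D
        preds[k] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)                  # E_D{F_w(x)}
    bias2 = np.mean((mean_pred - target(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    total = np.mean((preds - target(x_test)) ** 2)  # left-hand side, averaged over x
    return bias2, variance, total

b1, v1, t1 = bias_variance(degree=1)
b3, v3, t3 = bias_variance(degree=3)
```

For each degree the returned `total` equals `bias2 + variance` up to rounding, mirroring the two terms of the decomposition. A straight line cannot follow the sine, so its squared bias is expected to be clearly larger than that of the cubic.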

The expressions in (13) are functions of an input
vector $x$. The average values for bias and variance
can be obtained by weighting with the corresponding
density $p(x)$:

$$\int E_D\{(F_w(x) - \langle y|x \rangle)^2\}\, p(x)\, dx = \int (E_D\{F_w(x) - \langle y|x \rangle\})^2\, p(x)\, dx + \int E_D\{(F_w(x) - E_D\{F_w(x)\})^2\}\, p(x)\, dx \qquad (14)$$

Key conditions for acceptable performance on
novel data are given by a training set $D$ as large and
representative as possible of the underlying distribution,
and a set $D_{out}$ of previously unseen data which
should not contain examples exceedingly different from
those in $D$. An important consideration is the use of
a net with minimal complexity, given by the number
of free parameters (the number of components in $w$).
This requirement can be realized in various ways. In
regularization theory, the solution is obtained from
a variational principle including the loss and prior
smoothness information, defining a smoothing functional
$\phi$ such that lower values correspond to smoother
functions. A solution of the approximation problem is
then given by minimization of the functional (Girosi,
Jones & Poggio, 1993):

$$H(F_w) = L(D, F_w) + \eta\, \phi(F_w) \qquad (15)$$

where $\eta$ is a positive scalar controlling the tradeoff
between fitness to the data and smoothness of the solution.
A common choice is the second derivative $P(f) = f''$, of
which the (squared) Euclidean norm is taken:

$$\phi(F_w) = \|P(F_w)\|^2 = \int |F_w''(t)|^2\, dt \qquad (16)$$
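The effect of $\eta$ in (15) can be seen on a toy one-dimensional example. The sketch below assumes the sum-of-squares loss for the data term and approximates the integral in (16) with second finite differences on a grid; `regularized_loss`, the two candidate functions and the three data points are illustrative assumptions.

```python
import numpy as np

def regularized_loss(F, x, y, eta, t):
    """H(F) = sum_i (y_i - F(x_i))^2 + eta * integral of F''(t)^2 dt (approximate)."""
    data_term = np.sum((y - F(x)) ** 2)
    dt = t[1] - t[0]
    second = np.diff(F(t), n=2) / dt ** 2       # finite-difference estimate of F''
    smooth_term = np.sum(second ** 2) * dt      # Riemann sum for the integral
    return data_term + eta * smooth_term

t = np.linspace(0.0, 1.0, 1001)                 # integration grid
x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 0.25, 1.0])                  # samples of u**2
line = lambda u: u                              # smooth, but misses the middle point
parabola = lambda u: u ** 2                     # fits the data, but has curvature

h_line_strong = regularized_loss(line, x, y, 0.1, t)
h_parabola_strong = regularized_loss(parabola, x, y, 0.1, t)
h_line_weak = regularized_loss(line, x, y, 0.001, t)
h_parabola_weak = regularized_loss(parabola, x, y, 0.001, t)
```

With the larger $\eta$ the straight line has the lower value of $H$, while with the smaller $\eta$ the parabola wins, which is the tradeoff between fit and smoothness described above.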

CONCLUSION 

Artificial Neural Networks are information processing 
structures evolved as an abstraction of known principles 
of how the brain might work. The computing elements, 
called neurons, are linked to one another with a certain 
strength, called weight. In their simplest form, each unit 
computes a function of its inputs — which are either the 
outputs from other units or external signals — influenced 
by the weights of the links conveying these inputs. The 
network is said to learn when the weights of all the 
units are adapted to represent the information present 
in a sample, in an optimal sense given by an error 
function. The network relies upon the representation 
capacity of the neuron model as the cornerstone for a 
good approximation. 
KEY TERMS

Architecture: The number of artificial neurons, their
arrangement and connectivity.

Artificial Neural Network: Information processing 
structure without global or shared memory that takes the 
form of a directed graph where each of the computing 
elements ("neurons") is a simple processor with internal 
and adjustable parameters, that operates only when all 
its incoming information is available. 
Bias-Variance Tradeoff: The mean square error (to
be minimized) decomposes into a sum of two non-negative
terms, the squared bias and the variance. When an
estimator is modified so that one term decreases, the
other term will typically increase.

Feed-Forward Artificial Neural Network: Artifi- 
cial Neural Network whose graph has no cycles. 

Learning Algorithm: Method or algorithm by vir- 
tue of which an Artificial Neural Network develops a 
representation of the information present in the learning 
examples, by modification of the weights. 

Neuron Model: The computation of an artificial 
neuron, expressed as a function of its input and its 
weight vector and other local information. 

Weight: A free parameter of an Artificial Neural 
Network, that can be modified through the action of 
a Learning Algorithm to obtain desired responses to 
input stimuli. 
INTRODUCTION



MOTIVATION



Traditionally, Evolutionary Computation (EC) techniques,
and more specifically Genetic Algorithms
(GAs) (Goldberg & Wang, 1989), have proved to be
efficient at solving a variety of problems; however, as a
possible shortcoming, GAs tend to provide a single solution
to the problem to which they are applied. Some non-global
solutions discarded during the search for the best
one could be acceptable under certain circumstances.
Most real-world problems involve
a search space with one or more global solutions and
multiple local solutions; that is, they are multimodal
problems (Harik, 1995). Therefore, if multiple solutions
are to be obtained with GAs, their classic operation
must be modified to adapt it to the multimodality
of such problems.



This chapter tries to establish the basis for understanding
multimodality. First, multimodal problems are
characterised. Next, an overview is given of some
of the approaches proposed for adapting the
classic operation of GAs to the search for multiple
solutions. Lastly, the contributions of the authors are
presented.



BACKGROUND: CHARACTERIZATION 
OF MULTIMODAL PROBLEMS 

Multimodal problems can be briefly defined as
those problems that have multiple global optima
or multiple local optima.

For this type of problem, it is interesting to obtain
the greatest possible number of solutions, for several
reasons. On the one hand, when the problem is not fully
known, the solution obtained might not be the best
one, since it cannot be stated that no better solution could
be found in the part of the search space that has not been
explored yet. On the other hand, even when it is certain that
the best solution has been found, there might be other
equally fit or slightly worse solutions that might be
preferred due to different factors (easier application,
simpler interpretation, etc.) and therefore considered
globally better.

One of the most characteristic multimodal functions
used in lab problems is the Rastrigin function
(see Fig. 1), which offers an excellent graphical view
of what multimodality means.

Figure 1. Rastrigin function
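For reference, the Rastrigin function in its usual form, $f(x) = An + \sum_{i=1}^{n} (x_i^2 - A\cos(2\pi x_i))$ with $A=10$, can be written in a few lines. The sketch below uses this standard formulation, which the text does not spell out; the global minimum is at the origin and local minima lie close to the other integer grid points.

```python
import numpy as np

def rastrigin(x, A=10.0):
    """Standard Rastrigin function; global minimum f(0, ..., 0) = 0."""
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x ** 2 - A * np.cos(2.0 * np.pi * x))

at_global = rastrigin([0.0, 0.0])     # global optimum
near_local = rastrigin([1.0, 1.0])    # close to a local optimum
on_ridge = rastrigin([0.5, 0.5])      # between the two, on a ridge
```

Because the many local optima have values only slightly worse than the global one, the function is a common test bed for methods that aim to keep several solutions.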

Providing multiple optimal (and valid) solutions,
and not only the unique global solution, is crucial in
many environments. Often the best solution is very complex
to implement in practice and may pose several
problems: excessive computational cost, complex
interpretation, etc.

In these situations it is useful to have a
range of valid solutions from which one can
choose a solution that, while not being the best solution
to the problem, offers an acceptable level of fit and is
simpler to implement or to understand than the ideal
global one.



EVOLUTIONARY TECHNIQUES AND 
MULTIMODAL PROBLEMS 

As mentioned above, the difficulty of applying EC
techniques to multimodal problems is that
this type of technique tends to provide only the best of the
solutions found and to discard possible local optima that
might have been found throughout the search. Many
modifications of the traditional operation of the GA
have been proposed in order to achieve good results
with multimodal problems.

A crucial aspect when obtaining multiple solutions
is keeping the diversity of the genetic
population, distributing the individuals as widely as
possible throughout the search space.



CLASSICAL APPROACHES 

Niching methods allow GAs to maintain a genetic
population of diverse individuals, so that it is possible
to locate multiple optimal solutions within a single
population.

In order to minimise the impact of homogenisation,
or to ensure that it only affects later stages of the search
phase, several alternatives have been designed, most of
them based on heuristics. One of the first alternatives
for promoting diversity was the application of scaling
methods to the population in order to emphasize
the differences among the individuals. Another
direct route for avoiding the loss of diversity involves
focusing on the elimination of duplicate partial
high-fitness solutions (Bersano, 1997) (Langdon, 1996).

Other approaches try to solve this
problem by means of the dynamic variation of crossover
and mutation rates (Ursem, 2002). When diversity
decreases, more mutations are performed in order to
increase exploration of the search space; when diversity
increases, mutations decrease and crossovers increase with
the aim of improving exploitation in the search for optimal
solutions. There are also proposals of
new genetic operators or variations of existing ones.
For example, some of the crossover algorithms that
improve diversity and that should be highlighted are
BLX (Blend Crossover) (Eshelman & Schaffer, 1993),
SBX (Simulated Binary Crossover) (Deb & Agrawal,
1995), PCX (Parent Centric Crossover) (Deb, Anand
& Joshi, 2002), CIXL2 (Confidence Interval Based
Crossover using L2 Norm) (Ortiz, Hervas & Garcia,
2005) and UNDX (Unimodal Normally Distributed
Crossover) (Ono & Kobayashi, 1999).

Regarding replacement algorithms, schemes that
can keep population diversity have also been sought.
An example of this type of scheme is crowding
(DeJong, 1975) (Mengshoel & Goldberg, 1999). Here,
a newly created individual is compared to a randomly
chosen subset of the population and the most similar
individual is selected for replacement. Crowding
techniques are inspired by Nature, where similar members
of natural populations compete for limited resources.
Likewise, dissimilar individuals tend to occupy different
niches and are unlikely to compete for the same
resource, so different solutions are provided.
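The replacement step of crowding can be sketched as follows. This is a minimal illustration, assuming individuals are lists of numbers and that a user-supplied `distance` function measures similarity; the function name and the `crowd_size` parameter are illustrative, not taken from the cited works.

```python
import random

def crowding_replace(population, child, distance, crowd_size=3, rng=random):
    """Replace, in place, the most similar individual of a random sample by `child`."""
    sample = rng.sample(range(len(population)), crowd_size)
    closest = min(sample, key=lambda i: distance(population[i], child))
    population[closest] = child
    return closest

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

population = [[0.0], [1.0], [5.0], [9.0]]
replaced = crowding_replace(population, [4.8], euclidean, crowd_size=4)
```

Because the child displaces a similar individual rather than the worst one, individuals sitting on other peaks are left untouched.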

Fitness sharing was first implemented by Goldberg
& Richardson for use on multimodal
functions (Goldberg & Richardson, 1999). The basic
idea involves determining, from the fitness of each
solution, the maximum number of individuals that can
remain around it, rewarding the individuals that exploit
unique areas of the domain. Dynamic fitness sharing
(Miller & Shaw, 1995), with two components, was
proposed in order to correct the dispersion of the final
distribution of the individuals into niches: the distance
function, which measures the overlap between individuals,
and the comparison function, which returns "1" if
the individuals are identical and values closer to "0"
the more different they are.
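A common way to realise fitness sharing divides each raw fitness by a niche count. The sketch below assumes the widely used triangular sharing function with a niche radius `sigma_share`; these names and the exact form are assumptions for illustration and are not specified in the text.

```python
def sharing(d, sigma_share=1.0, alpha=1.0):
    """Comparison function: 1 for identical individuals, 0 beyond the niche radius."""
    return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0

def shared_fitness(population, raw_fitness, distance, sigma_share=1.0):
    """Divide each raw fitness by its niche count (sum of sharing values)."""
    shared = []
    for i, ind in enumerate(population):
        niche_count = sum(sharing(distance(ind, other), sigma_share)
                          for other in population)   # includes itself, so >= 1
        shared.append(raw_fitness[i] / niche_count)
    return shared

population = [0.0, 0.1, 5.0]
fitness = shared_fitness(population, [1.0, 1.0, 1.0], lambda a, b: abs(a - b))
```

The two crowded individuals near 0 end up with a lower shared fitness than the isolated one at 5, so selection pressure moves towards unexplored regions.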

The clearing method (Petrowski, 1996) is quite
different from the previous ones, as the resources are
not shared, but assigned to the best individuals, which
are then kept in every niche.
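The idea of clearing can be sketched as a winner-takes-all rule inside each niche. The code below is a simplified illustration assuming positive fitness values, a niche `radius` and a niche `capacity`; these parameter names are assumptions, not the notation of the cited paper.

```python
def clearing(population, fitness, distance, radius, capacity=1):
    """Keep the fitness of the best `capacity` individuals per niche; zero the rest."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    cleared = list(fitness)
    for pos, i in enumerate(order):
        if cleared[i] <= 0.0:
            continue                      # already cleared by a better neighbour
        winners = 1                       # individual i dominates its niche
        for j in order[pos + 1:]:
            if cleared[j] > 0.0 and distance(population[i], population[j]) < radius:
                if winners < capacity:
                    winners += 1
                else:
                    cleared[j] = 0.0
    return cleared

population = [0.0, 0.2, 5.0, 5.1]
result = clearing(population, [3.0, 2.0, 1.0, 4.0], lambda a, b: abs(a - b), radius=1.0)
```

Here each of the two niches keeps only its best individual, so `result` preserves the fitness of the individuals at 0.0 and 5.1 and clears the other two.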

The main drawback of the techniques described
above is that they add new parameters
that must be configured for each run
of the GA. The run may be disturbed by
the interactions among those parameters (Ballester &
Carter, 2003).



OWN PROPOSALS 

Once the existing problems have been detected, they should
be resolved or, at least, minimized. With this goal, the
Artificial Neural Network and Adaptive System (RNASA)
group has developed two proposals that use EC
techniques for this type of problem. Both proposals try
to find the final solution while keeping partial solutions
within the final population.

The main ideas of the two proposals, together with
the problems used for the tests, are explained in the
following sections.



Hybrid Two-Population Genetic 
Algorithm 

Introduction 

To force a homogeneous search throughout the search
space, the approach proposed here adds a new population
(the genetic pool) to a traditional GA (the secondary
population). The genetic pool divides the search space
into sub-regions. Each individual of the genetic pool has
its own bounded range for gene variation, so each of these
individuals represents a specific sub-region within
the global search space. At the same time, the union
of the individual ranges covers all the values that a gene
may take. Therefore, the genetic
pool samples the whole of the search space.

It should be borne in mind that a traditional GA
performs its search considering only one sub-region
(the whole of the search space). Here the search space
is divided into different sub-regions or intervals
according to the number of individuals in the
genetic pool.

Since the individuals in the genetic pool have restrictions
on their viable gene values, a single one of these
individuals cannot provide a valid solution. Therefore
another population (the secondary population) is used
in addition to the genetic pool. Here, a classical GA
evolves its individuals in interaction
with the individuals of the genetic pool.

Unlike in the genetic pool, the genes of the individuals of the
secondary population may take values throughout the
whole of the search space, so the secondary population
contributes the solutions, whereas the genetic pool acts as a
support, keeping the search space homogeneously explored.

Next, both populations, which are graphically represented
in Fig. 2, will be described in detail.

Figure 2. Structure of populations of hybrid two-population genetic algorithm

The Genetic Pool 

As previously mentioned, each individual in the
genetic pool represents a sub-region
of the global search space. Therefore, these individuals have
the same structure or gene sequence as those of
a traditional GA. The difference lies in the range of
values that their genes may take.

When offering a solution, a gene in a traditional GA may take
any valid value, whereas in the proposed GA the range
of possible values is restricted. The total value range is
divided into as many parts as there are individuals
in the genetic pool, so that a sub-range of values is allotted
to each individual. The values that a given gene may
take remain within that range for the whole
run of the proposed GA.

In addition, every individual
in the genetic pool keeps track of which of its
genes correspond to the best solution found so far
(that is, whether they belong to the best individual
in the secondary population). This Boolean value
is used to avoid modifying those genes
that, at a given phase of the run, are part of the best
solution to the problem.

Furthermore, every gene in an individual
has an associated I value, which indicates the relative
increment applied to the gene during
a mutation operation that is based only on increments and
is applied solely to individuals of the genetic pool.
Obviously, this incremental value must be
lower than the maximum range in which gene values
may vary. The structure of the individuals in the genetic
pool is shown in Fig. 3.
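The structure just described can be sketched as a small data class. This is only an illustrative reading of the text: it assumes that all genes of an individual share the same sub-range, that the total range is split into equal parts, and that the initial I value is half the sub-range width; the names `PoolIndividual` and `build_genetic_pool` are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class PoolIndividual:
    low: float          # lower bound of this individual's sub-range
    high: float         # upper bound of this individual's sub-range
    genes: list         # gene values, all inside [low, high]
    is_best: list       # Boolean per gene: part of the best solution so far?
    increment: list     # I value per gene, used by the mutation operator

def build_genetic_pool(n_individuals, n_genes, lim_inf, lim_sup, rng=random):
    """Split [lim_inf, lim_sup] into equal sub-ranges, one per pool individual."""
    width = (lim_sup - lim_inf) / n_individuals
    pool = []
    for k in range(n_individuals):
        low, high = lim_inf + k * width, lim_inf + (k + 1) * width
        genes = [rng.uniform(low, high) for _ in range(n_genes)]
        pool.append(PoolIndividual(low, high, genes,
                                   [False] * n_genes,       # nothing is "best" yet
                                   [width / 2.0] * n_genes))  # assumed initial I
    return pool

pool = build_genetic_pool(4, 3, -5.0, 5.0, random.Random(1))
```

No fitness field is stored, since pool individuals only describe sub-regions and are not complete candidate solutions.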

As these individuals do not represent global solutions
to the problem that has to be solved, their
fitness value does not need to be computed. This reduces
the complexity of the algorithm and increases
the computational efficiency of the final
implementation.

The Secondary Population 

The individuals of the secondary population are quite
different from the previous ones. In this case, the genes of the
individuals of the secondary population can take any
value throughout the whole space of possible solutions.
This allows all individuals of the secondary population
to offer global solutions to the problem.
This is not possible in the genetic pool because its genes
are restricted to different sub-ranges.

The evolution of the individuals of the secondary population
is carried out following the rules of a traditional GA. The main
difference lies in the crossover operator. In this case a
modified crossover is used. Because the information
is stored in separate populations, the two parents
that produce the new offspring do not belong
to the same population. Instead, the genetic pool and the
secondary population are combined. In this
way, information from both populations is merged
to produce the fittest offspring.

The Crossover Operator 

As pointed out before, the crossover operator recombines
the genetic material of individuals of both
populations. This recombination involves a random
individual from the secondary population and a
representative of the genetic pool.

This representative stands for a potential solution
offered by the genetic pool. As a single individual
cannot meet this requirement, the representative is
formed by a subset of genes of different individuals
of the genetic pool. Gathering information from different
partial solutions makes it possible to produce a valid
global solution.

Figure 3. Structure of the genetic pool individuals

Figure 4. Hybrid two-population genetic algorithm: Crossover
(diagram labels: Genetic Pool, Representative (First Parent),
Secondary Population, Uniform Crossover)

Therefore, the value for every gene of the representative
is randomly chosen among all the individuals
in the genetic pool. After a value has been assigned to all the
genes, this new individual represents a global solution,
unlike each of the pool individuals separately, which
represent only partial ones.

Next, the crossover operator is applied. This
crossover function maintains the diversity of the secondary
population, since the offspring contain values from the
genetic pool. Therefore the genetic algorithm is
able to maintain multiple solutions in the same
population. The crossover operator does not change
the genetic pool, because the latter only acts as an
engine to keep the diversity.

This process is summarized in Fig. 4.
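The crossover step can be sketched as follows. The sketch assumes that pool individuals and secondary individuals are plain lists of gene values and that a uniform crossover with probability 0.5 per gene is used, as the labels of Fig. 4 suggest; the function names are illustrative.

```python
import random

def build_representative(pool_genes, rng=random):
    """For each gene position, copy the value from a randomly chosen pool individual."""
    n_genes = len(pool_genes[0])
    return [rng.choice(pool_genes)[g] for g in range(n_genes)]

def uniform_crossover(parent_a, parent_b, rng=random):
    """Each offspring gene comes from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def hybrid_crossover(pool_genes, secondary, rng=random):
    representative = build_representative(pool_genes, rng)   # first parent
    mate = rng.choice(secondary)                              # second parent
    return uniform_crossover(representative, mate, rng)

rng = random.Random(0)
pool_genes = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]    # one row per pool individual
secondary = [[9, 9, 9]]
child = hybrid_crossover(pool_genes, secondary, rng)
```

The genetic pool is only read, never written, which matches the statement that the crossover operator leaves it unchanged.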

The Mutation Operator 

The mutation operator increments the value of individual
genes in the genetic pool. It introduces new information
into the genetic pool, so that the representative can use it and
finally, by means of the crossover operator, introduce
it into the secondary population.

It should be noted that the new value has an upper
limit, so when this limit is reached the gene value is
reset to the lower limit.

As generations advance, the increment is
reduced, so the increments applied to the individuals in
the genetic pool take lower values. The
increments between iterations are calculated taking into
account the lower limit for a gene (LIMINFIND), the
upper limit for that gene (LIMSUPIND) and the total
number of individuals in the genetic pool (IND_POOL),
as Fig. 5 summarizes. In this way, the first generations
explore the search space coarsely (a coarse-grain search),
and a more exhaustive pass through
all the values that a given gene may take (a fine-grain
search) is performed as the search process advances.
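The pseudocode of Fig. 5 is not reproduced here, so the following is only a hypothetical sketch of an increment-based mutation consistent with the description: the gene grows by its increment, wraps to the lower limit when the upper limit is exceeded, and the increment shrinks over time. The initial value of the increment and the decay factor are assumptions.

```python
def initial_delta(lim_inf_ind, lim_sup_ind, ind_pool):
    # Assumed initialisation: the sub-range width divided by the pool size.
    return (lim_sup_ind - lim_inf_ind) / ind_pool

def mutate_gene(value, delta, lim_inf_ind, lim_sup_ind):
    """Increment the gene; reset to the lower limit when the upper limit is passed."""
    new_value = value + delta
    if new_value > lim_sup_ind:
        new_value = lim_inf_ind
    return new_value

def decay_delta(delta, factor=0.9):
    # Assumed schedule: shrink the increment each generation (coarse to fine search).
    return delta * factor

delta = initial_delta(0.0, 1.0, 5)                 # 0.2 under these assumptions
stepped = mutate_gene(0.5, delta, 0.0, 1.0)        # stays inside the sub-range
wrapped = mutate_gene(0.9, delta, 0.0, 1.0)        # exceeds 1.0, reset to 0.0
smaller = decay_delta(delta)
```

Large early increments sweep each sub-range quickly, while the smaller later increments visit intermediate values that the first passes skipped.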

Genetic Algorithm with Division into 
Species 

Another proposed solution is an adaptation of the niching
technique. This adaptation consists of dividing
the genetic population into different and independent
subspecies. In this case, the species to which a
specific individual belongs is determined
according to genotype similarities (similar genotypes
form isolated species). This classical concept has
been provided with some improvements in order
not only to decrease the number of iterations needed for
obtaining solutions, but also to increase the number of
solutions kept within the genetic population.

Several iterations of the GA are executed on
every species of the genetic population to speed
up the convergence towards the solution that exists
near each species. The individuals generated during
this execution that have the genotype of a different species
are discarded.

Crossover operations between the species are
then applied, similarly to what happens in biology.
On the one hand, crossovers between
similar individuals are preferred (as was done in the
previous step using GAs) and, on the other, crossovers
between different species are allowed, although
at a lower rate.
Figure 5. Pseudocode for mutation and Delta initialization

The individuals generated after these crossovers
can either be incorporated into an already existing
species or, if they explore a new area of the search
space, create a new species themselves.

Finally, the GA provides as many solutions as
species remain active over the search space.
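The assignment of an individual to a species by genotype similarity can be sketched as below. This is a hypothetical illustration: it assumes a distance threshold and uses the first member of each species as its reference genotype, neither of which is specified in the text.

```python
def assign_species(individual, species, distance, threshold):
    """Add `individual` to the nearest species, or found a new one; return its index."""
    if species:
        nearest = min(range(len(species)),
                      key=lambda i: distance(individual, species[i][0]))
        if distance(individual, species[nearest][0]) <= threshold:
            species[nearest].append(individual)
            return nearest
    species.append([individual])          # explores a new area: new species
    return len(species) - 1

species = []
dist = lambda a, b: abs(a - b)
first = assign_species(0.0, species, dist, threshold=1.0)    # founds species 0
second = assign_species(0.3, species, dist, threshold=1.0)   # joins species 0
third = assign_species(5.0, species, dist, threshold=1.0)    # founds species 1
```

At the end of a run, the number of lists in `species` corresponds to the number of solutions the GA can report.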



the final user to decide which of them is the most suit- 
able in any particular case. 

The final decision will depend on several factors, not 
only the global error reached for a particular method. 
Other factors also depend on the economic impact, the 
difficulty to implement it, the quality of the knowledge 
provided for their analysis, and so on. 



FUTURE TRENDS 

Since there are not any methods that provide the best 
results in all the possible situations, new approaches 
would be developed. 

New fitness functions would help to locate a great 
number of valid solutions within the search space. 
In the described approaches this functions remains 
constants over the method execution. Another option 
would be allow dynamical fitness functions that vary 
along the execution stage. These kind of functions will 
try to adapt their output with the knowledge extracted 
from the search space while the crossover and mutation 
operators explore new arenas. 

If different techniques offer acceptable solutions,
another interesting approach consists of putting them
together. For example, such hybrid models would
integrate statistical methods (with a strong mathematical
background) with other heuristics.



CONCLUSION 

This article gives an overview of the different methods
related to evolutionary techniques that are used to address
the problem of multimodality. It presented several
approaches that provide not only a global solution but
multiple solutions to the same problem. It would help
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KEY TERMS 

Crossover: Genetic operation included in evolutionary
techniques and used to generate the offspring from the
current population. There are many different methods
to perform crossover, but the general idea is to merge
the genetic information of the parents within the
offspring with the aim of producing better solutions
as generations advance.

Evolutionary Technique: Technique which tries
to provide solutions for a problem guided by biological
principles such as the survival of the fittest. This
kind of technique starts from a randomly generated
population which evolves by means of crossover and
mutation operations to provide the final solution.

Genetic Algorithm: A special type of evolutionary
technique which represents the potential solutions of
a problem within chromosomes (usually a collection
of binary, natural or real values).

Multimodal Problems: A special kind of problem
where a unique global solution does not exist. Several
global optima, or one global optimum with several
local optima (or peaks), can be found in the
search space.

Mutation: The other genetic operation included in
evolutionary techniques to perform the reproduction
stage. The mutation operator introduces new information
into the system through random changes applied to the
genetic individuals.

Search Space: Set of all possible situations that the
problem we want to solve could ever be in. Combination
of all the possible values for all the variables
related to the problem.

Species: Within the context of genetic algorithms,
a subset of genetic individuals with similar genotype
(genetic values) which explore the same, or a similar,
area of the search space.
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INTRODUCTION 

Current databases are able to store several terabytes of
free-text documents. The main purpose of a database
from the user's viewpoint is efficient information
retrieval. In the case of textual data, information
retrieval mostly concerns the selection and the ranking
of documents. The selection criteria can contain
elements that apply to the content or the grammar of
the language. In traditional database management
systems (DBMS), text manipulation is restricted to
the usual string manipulation facilities, i.e. the exact
matching of substrings. Although the SQL:1999
standard enables the use of more powerful regular
expressions, this traditional approach has some major
drawbacks. The traditional string-level operations are
very costly for large documents as they work without
task-oriented index structures.

The required full-text management operations belong
to text mining, an interdisciplinary field of natural
language processing and data mining. As the traditional
DBMS engine is inefficient for these operations, database
management systems are usually extended with
a special full-text search (FTS) engine module. We
present here the particular solution of Oracle, in which
a special engine was developed to make full-text querying
more efficient; it performs the preparation
of full-text queries and provides a set of
language-specific and semantic query operators.



on the market for the use of free text and text mining
operations, since information is often stored as free
text. Typical application areas are, e.g., text analysis
in medical systems, analysis of customer feedback,
and bibliographic databases. In these cases, simple
character-level string matching would retrieve only
a fraction of the related documents, thus an FTS engine
is required that can identify the semantic similarities
between terms.

There are several alternatives for implementing an
FTS engine. In some DBMS products, such as Oracle,
Microsoft SQL Server, PostgreSQL, and MySQL, a built-in
FTS engine module is implemented. Other vendors
extend the DBMS configuration with
a DBMS-independent FTS engine. In this segment the
main products are: SPSS LexiQuest (SPSS, 2007), SAS
Text Miner (SAS, 2007), dtSearch (dtSearch, 2007),
and Statistica Text Miner (Statsoft, 2007).

The market for FTS engines is very promising since
the amount of textual information stored in databases
rises steadily. According to a study by Merrill Lynch
(Blumberg & Atre, 2003), 85% of business information
consists of text documents (e-mails, business and research
reports, memos, presentations, advertisements, news,
etc.) and this proportion is still increasing. In 2006,
there were more than 20 billion documents available
on the Internet (Chang, 2006). The estimated size of
the pool increases to 550 billion documents when the
documents of the hidden (or deep) web, e.g.
dynamically generated ones, are also considered.



BACKGROUND 

Traditional DBMS engines are not adequate to meet
the users' requirements on the management of free-text
data as they handle the whole text field as an atom
(Codd, 1985). A special extension to the DBMS engine
is needed for the efficient implementation of text
manipulating operations. There is a significant demand



TEXT MINING 

The subfield of document management that aims at 
processing, searching, and analyzing text documents 
is text mining. The goal of text mining is to discover 
the non-trivial or hidden characteristics of individual 
documents or document collections. Text mining is an 
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Figure 1. The text mining module 
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application oriented interdisciplinary field of machine 
learning which exploits tools and resources from com- 
putational linguistics, natural language processing, 
information retrieval, and data mining. 

The general application schema of text mining is
depicted in Figure 1 (Fan, Wallace, Rich & Zhang,
2006). To give a brief summary of text mining, four
main areas are presented here: information extraction,
text categorization/classification, document clustering,
and summarization.

Information Extraction 

The goal of information extraction (IE) is to collect the
text fragments (facts, places, people, etc.) from documents
that are relevant to the given application. The extracted
information can be stored in structured databases. IE
is typically applied in processes where statistics,
analyses, summaries, etc. need to be retrieved from
texts. IE includes the following subtasks:

• named entity recognition - recognition of specified
types of entities in free text, see e.g. Borthwick,
1999; Sibanda & Uzuner, 2006;
• co-reference resolution - identification of text
fragments referring to the same entity, see e.g.
Ponzetto & Strube, 2006;
• identification of roles and their relations -
determination of roles defined in event templates, see
e.g. Ruppenhofer et al., 2006.

Text Categorization 

Text categorization (TC) techniques aim at sorting
documents into a given category system (see Sebastiani,
2002 for a good survey). In TC, usually, a classifier
model is built based on the content of a set of sample
documents; this model is then used to classify unseen
documents. Typical application examples of TC include,
among many others:

• document filtering - e.g. spam filtering
or newsfeed filtering (Lewis, 1995);
• patent document routing - determination of experts
in the given fields (Larkey, 1999);
• assisted categorization - helping domain experts
in manual categorization with valuable suggestions
(Tikk et al., 2007);
• automatic metadata generation (Liddy et al.,
2002).

Document Clustering 

Document clustering (DC) methods group elements
of a document collection based on their similarity.
Here again, documents are usually clustered based on
their content. Depending on the nature of the results,
one can distinguish partitioning and hierarchical clustering
methods. In the former case, there is no explicit relation
among the clusters, while in the latter case a hierarchy
of clusters is created. DC is applied for, e.g.:

• clustering the results of (internet) search to
help users locate information (Zamir et
al., 1997);
• improving the speed of vector space based information
retrieval (Manning et al., 2007);
• providing a navigation tool when browsing a
document collection (Käki, 2005).
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Summarization 

Text summarization aims at the automatic generation
of short and comprehensible summaries of documents.
Text extraction algorithms create a summary by extracting
relevant descriptive phrases (typically sentences)
from the original text, while summaries generated by
abstraction methods may contain synthesized text as
well. The typical application areas of summarization
range from internet search to arbitrary document
management systems (Ganapathiraju, 2002; Radev et
al., 2001).



FULL-TEXT SEARCH (FTS) ENGINES 
Full-Text Search 

Based on the literature (Maier, 2001; Curtmola, 2005),
an effective FTS engine should support several query
functionalities. The simplest operation is the string-based
query, which retrieves texts that exactly match
the query string. In some cases, the position of the
keywords within the document is also an important
factor. The simplest form of similarity-based matching
uses the edit-distance function. The next operation is
the content-based query, where similarity is defined on
the semantic level. An FTS engine should also support
grammar-specific (and therefore language-specific) operators
(e.g. stemming). The highest level of text search operates
with semantic-based matching (thesaurus-based
neighborhood, generalization of a word, specialization,
synonyms). From the practical viewpoint, the
efficient execution of queries is also very important.
Due to the heterogeneity of the source pool, support
for different document formats is a key requirement.
Minimal reliance on other resources yields
an independent, flexible solution. From the aspect of
software development, an open, standardized interface
is a good investment. To provide a manageable,
easy-to-understand response, efficient ranking of the
result set is crucial (Chakrabarti, 2006). The products
and test systems currently available only partially meet
the above requirements.
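
As an illustration of similarity-based matching, the edit distance
mentioned above can be computed with the classical dynamic
programming scheme. The sketch below uses unit costs for insertion,
deletion and substitution (the Levenshtein variant); the function
names and the max_distance threshold are assumptions made for this
example and do not describe any particular FTS product.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(query, vocabulary, max_distance=1):
    """Return the vocabulary terms within max_distance edits of the query."""
    return [t for t in vocabulary if edit_distance(query, t) <= max_distance]
```

With this sketch, a misspelled query term such as "databse" still
matches the indexed term "database", since one insertion suffices.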

Structure of a General FTS Engine 

FTS engines are structurally similar to database systems:
they store data and metadata, and their purpose is to
provide efficient information retrieval (Microsoft,
2007; Oracle Text, 2007). As the processing of a full-text
query requires several distinct steps, FTS engines
typically have a modular structure (see also Figure 2).

The loader module loads the documents into a
common staging area, in a common representation.
In further steps, data items are transformed into
a common format, too. The loaded documents are
stored in the datastore unit. Document processing has
several steps. The sectioner unit has to discover the
larger internal logical structure of the documents. The
word-breaker parses the text into smaller syntactical
units like paragraphs, sentences and terms (words). To
reduce the length and complexity of the text, several
preprocessing steps are executed. First, a filter module
is applied that discards irrelevant words (stop-words,
noise words). Next, the stemmer unit generates the stem
form of every word. In the background, the language
lexicon supports the language-specific reduction steps.
This lexicon contains the grammar of the supported
languages and the list of stop-words. The thesaurus
is a special lexicon, which stores the terms organized
in a graph based on their semantic relationships. To
provide efficient term management, several kinds
of indexes are created. The indexer unit manages the
different document-term indices that enable efficient
access to term occurrences. On the front-end side, the
query preprocessor transforms the user's query into
an internal format. This format is processed by the
query matcher, resulting in a set of matching documents.
The search engine may be extended with a text
mining module that performs data mining operations,
like clustering or classification. In order to provide a
more accurate response, the query refinement engine
performs the processing of relevance feedback. The
list of matching documents is pipelined to the ranking
module. The exporter module generates the final format
of the ranked document set.
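
The preprocessing chain described above (word-breaker, stop-word
filter, stemmer) can be illustrated with a deliberately simplified
sketch. The stop-word list and the suffix-stripping rule below are
toy assumptions chosen for this example; a real engine relies on a
language lexicon and a proper stemming algorithm.

```python
import re

# Toy stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to"}


def break_words(text):
    """Word-breaker: split lower-cased text into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stop_words(tokens):
    """Filter module: discard stop-words (noise words)."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Stemmer stand-in: strip one common English suffix, keeping at
    least three characters. The result need not be a dictionary word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in the order described in the text."""
    return [naive_stem(t) for t in filter_stop_words(break_words(text))]
```

For the input "The engines are indexing documents" this sketch
produces the terms engin, index and document, which would then be
handed to the indexer.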

As mentioned, database systems use indices for
fast access to data items. For full-text search, the
inverted index is the most efficient index structure (Zobel,
2006). In the simple inverted index, the key of the index
is the term. Each key is associated with a pair (df, dl).
Here df is the number of documents containing the key,
and dl is the list of documents that contain the key. Each
entry in the list contains a document identifier and the
frequency of the key in that document. The position-based
inverted index differs from the simple version in that
the list entry corresponding to a document also contains the
positions of the given term in the text.
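
A minimal in-memory sketch of the position-based inverted index is
given below. It follows the (df, dl) layout described above, with dl
stored as a mapping from document identifier to the list of term
positions, so that the term frequency is the length of that list.
The function names and the proximity query near are assumptions made
for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps a document id to its list of (preprocessed) terms.
    Returns a dict: term -> (df, dl), where dl maps doc id -> positions."""
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            postings[term].setdefault(doc_id, []).append(pos)
    return {term: (len(dl), dl) for term, dl in postings.items()}


def term_frequency(index, term, doc_id):
    """Frequency of the term in one document (0 if absent)."""
    _, dl = index.get(term, (0, {}))
    return len(dl.get(doc_id, []))


def near(index, term1, term2, max_gap):
    """Ids of documents where the two terms occur at most max_gap
    positions apart; this is only possible with the positional index."""
    _, dl1 = index.get(term1, (0, {}))
    _, dl2 = index.get(term2, (0, {}))
    hits = []
    for doc_id in dl1:
        if doc_id in dl2 and any(abs(p - r) <= max_gap
                                 for p in dl1[doc_id] for r in dl2[doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

Dropping the position lists and keeping only their lengths would
give the simple inverted index, at the cost of losing proximity
queries.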
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Figure 2. Modules of an FTS engine 
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FTS Engine Interface in Oracle Text 

The FTS functionality in Oracle Text (Oracle, 2007)
can be activated with some extensions to SQL and
with procedural SQL packages. Oracle Text supports
four index types:

• CONTEXT-type index: inverted index for long
documents;
• CTXCAT-type index: to support content- and
attribute-based indexing for shorter documents;
• CTXRULE-type index: rules for document
classification;
• CTXXPATH-type index: indexing of XML documents.

The stemming module supports only two languages:
English and French. In the queries, the CONTAINS
operator supports the following matching modes:

• keyword: exact matching;
• AND, OR, NOT: Boolean operators;
• NEAR(keyword1, keyword2): the keywords
should occur at nearby positions in the document;
• BT(keyword): generalization of the keyword;
• NT(keyword): specialization of the keyword;
• REL(keyword): words in the thesaurus related
to the keyword;
• SYN(keyword): the synonyms of the keyword;
• $keyword: words having the same stem;
• !keyword: words having the same pronunciation;
• ABOUT keywords: words belonging to the given
topic;
• FUZZY(keyword): words that are close to the
keyword in terms of the edit distance;
• WITHIN(section): the matching is restricted to
a given section of the documents.

The example below retrieves the documents containing
words that are specializations (narrower terms, one
level down in the thesaurus) of "food":

SELECT description FROM books
WHERE CONTAINS(description, 'NT(food,1)') > 0;

Oracle Text supports three methods for document
partitioning (categorization and clustering). Manual
categorization allows the user to enter keyword-category
pairs. Automatic categorization works if a training
set of document-category pairs is given. The clustering
method automatically determines the clusters in a
set of documents based on their similarity. To provide
semantic-based matching for any arbitrary domain, the
users can create their own thesaurus.
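
The idea behind thesaurus-based specialization can be illustrated
independently of Oracle Text. The sketch below stores a user-defined
thesaurus as a mapping from a broader term to its narrower terms and
expands a keyword down to a given number of levels. The data
structure and the function name narrower_terms are assumptions made
for this example; they do not describe Oracle's internal
implementation.

```python
def narrower_terms(thesaurus, term, levels=1):
    """Collect the narrower terms of `term`, descending at most `levels`
    steps in the thesaurus (a dict: broader term -> list of narrower terms).
    Terms are returned in breadth-first order without duplicates."""
    found = []
    frontier = [term]
    for _ in range(levels):
        next_frontier = []
        for current in frontier:
            for child in thesaurus.get(current, []):
                if child not in found:
                    found.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return found
```

A query for the narrower terms of "food" with one level would then
be rewritten into a search for any of the returned terms.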



operators. 



CONCLUSION 



FUTURE TRENDS 

In our view, there are three main areas where the role
of FTS engines should be improved in the future: web
search engines, ontology-based information retrieval,
and management of XML documents. The main standard
for querying XML documents is nowadays the
XQuery language. This standard is very flexible for
selecting structured data elements, but it has no special
features for the unstructured part. In Botev (2004) and
Curtmola (2005), an extension of XQuery with full-text
functionality is proposed. The proposed extensions
are called TeXQuery and GalaTex. The language
contains a rich set of composite full-text primitives such
as phrase matching, proximity distance, stemming and
thesauri. The combination of structure- and content-based
queries is investigated in depth from a theoretical
viewpoint in Amer (2004).

The efficiency of information retrieval can be
improved by adding semantic
information. The ALVIS project (Luu, 2006) aims at
building a distributed, peer-to-peer semantic search
engine. The peer-to-peer network is a self-organizing
system for decentralized data management in distributed
environments. During a query operation, a peer
broadcasts search requests in the network. A peer may
be assigned to a subset of data items. The key element
in the cost reduction is the application of a special index
type at the nodes. In addition to the single-keyword
entries, the index also contains entries for compound keys
with high discriminative value.

A very important application area of full-text
search is the Web. A special feature of Web search is
that users mostly apply simple queries. Only 10%
of queries use complex full-text primitives like
Boolean operators, stemming or fuzzy matching. Eastman
(2003) investigated the reasons for omitting the
complex operators and concluded that the application
of complex full-text operators does not significantly
improve the search results. Efficiency is a key factor
in web search engines (Silvestri, 2004). The goal of
this research is to upgrade the indexing mechanism of
web search engines to provide efficient full-text search



Information is stored on the web and in computers
mostly in free-text format. Current databases are
able to store and manage huge document collections.
Free-text data sources require specific search operations.
Database management systems usually contain
a separate full-text search engine to perform full-text
search primitives. In general, current FTS engines
support the following functionalities: exact matching,
position-based matching, similarity-based matching
(fuzzy matching), grammar-based matching (stemming)
and semantic-based matching (synonym- and
thesaurus-based matching). It has been shown that the average
user requires additional help to exploit the benefits of
these extra operators. Current research focuses on
covering new document formats,
adapting the query to the user's behavior, and providing
an efficient FTS engine implementation.
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KEY TERMS 

Full-Text Search (FTS) Engine: A module within
a database management system that supports efficient
search in free texts. The main operations supported by
the FTS engine are exact matching, position-based
matching, similarity-based matching, grammar-based
matching and semantic-based matching.

Fuzzy Matching: A special type of matching where
the similarity of two terms is calculated as the cost
of the transformation from one into the other. The
most widely used cost calculation method is the edit
distance method.

Indexer: It builds one or more indices to speed
up information retrieval from free text. These indices
usually contain the following information: terms
(words), occurrences of the terms, format attributes.

Inverted Index: An index structure where every
key value (term) is associated with a list of object
identifiers (representing documents). The list contains
the objects that include the given key value.

Query Refinement Engine: A component of the
FTS engine that generates new, refined queries from the
initial query in order to improve the efficiency of the
retrieval. The refined queries can be generated using
the users' responses or some typical patterns in the
query history.

Ranking Engine: A module within the FTS engine
that ranks the documents of the result set based on their
relevance to the query.

Sectioner: A component of the FTS engine which
breaks the text into larger units called sections. The
types of extracted sections are usually determined by
the document type.

Stemmer: A language-dependent module that
determines the stem form of a given word. The stem
form is usually identical to the morphological root. It
requires a language dictionary.

Thesaurus: A special repository of terms, which
contains not only the words themselves but also the
similarity, generalization and specialization relationships
among them. It describes the context of a word but does
not give an explicit definition of the word.

Word-Breaker: A component of the full-text engine
whose function is to break the text into words and
phrases.
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INTRODUCTION 

High dimensional data are becoming more and more
common in data analysis. This is especially true in
fields that are related to spectrometric data, such as
chemometrics. Due to the development of more accurate
spectrometers, one can obtain spectra of thousands of
data points. Such high dimensional data are problematic
in machine learning due to increased computational
time and the curse of dimensionality (Haykin, 1999;
Verleysen & François, 2005; Bengio, Delalleau, & Le
Roux, 2006).

It is therefore advisable to reduce the dimensionality 
of the data. In the case of chemometrics, the spectra are 
usually rather smooth and low on noise, so function fit- 
ting is a convenient tool for dimensionality reduction. 
The fitting is obtained by fixing a set of basis functions 
and computing the fitting weights according to the least 
squares error criterion. 

This article describes an unsupervised method for
finding a good function basis that is specifically built
to suit the data set at hand. The basis consists of a set
of Gaussian functions that are optimized for an accurate
fitting. The obtained weights are further scaled using a
Delta Test (DT) to improve the prediction performance.
A Least Squares Support Vector Machine (LS-SVM)
model is used for estimation.



BACKGROUND 

The approach where multivariate data are treated as
functions instead of traditional discrete vectors is
called Functional Data Analysis (FDA) (Ramsay &
Silverman, 1997). A crucial part of FDA is the choice
of basis functions that allows the functional
representation. Commonly used bases are B-splines (Alsberg
& Kvalheim, 1993), Fourier series or wavelets (Shao,
Leung, & Chau, 2003). However, it is appealing to build
a problem-specific basis that employs the statistical
properties of the data at hand.

In the literature, there are examples of finding the
optimal set of basis functions that minimize the fitting
error, such as Functional Principal Component Analysis
(Ramsay et al., 1997). The basis functions obtained by
Functional PCA usually have global support (i.e. they
are non-zero throughout the data interval). Thus these
functions are not good for encoding spatial information
of the data. The spatial information, however, may play
a major role in many fields, such as spectroscopy. For
example, the measured spectra often contain spikes
at certain wavelengths that correspond to certain
substances in the sample. Therefore these areas are
bound to be relevant for estimating the quantity of
these substances.

We propose that locally supported functions, such
as Gaussian functions, can be used to encode this sort
of spatial information. In addition, variable selection
can be used to separate the relevant functions from the
irrelevant ones. Selecting important variables directly
on the raw data is often difficult due to the high
dimensionality of the data; the computational cost of variable
selection methods, such as Forward-Backward Selection
(Benoudjit, Cools, Meurens, & Verleysen, 2004; Rossi,
Lendasse, François, Wertz, & Verleysen, 2006), grows
exponentially with the number of variables. Therefore,
wisely placed Gaussian functions are proposed as a tool
for encoding spatial information while reducing data
dimensionality so that other, more powerful information
processing tools become feasible. Delta Test (DT)
(Jones, 2004) based scaling of variables is suggested
for improving the prediction performance.

A typical problem in chemometrics deals with predicting
some chemical quantity directly from a measured
spectrum. Due to the additivity of absorption spectra, the
problem is assumed to be linear and therefore linear
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models, such as Partial Least Squares (Hardle, Liang, 
& Gao, 2000) have been widely used for the prediction
task. However, it has been shown that the additivity
assumption does not always hold and that environmental
conditions may introduce further non-linearity into the
data (Wülfert, Kok, & Smilde, 1998). We therefore
propose that in order to address a general prediction
problem, a non-linear method should be used. LS-SVM
is a relatively fast and reliable non-linear model which
has been applied to chemometrics as well (Chauchard,
Cogdill, Roussel, Roger, & Bellon-Maurel, 2004).



USING GAUSSIAN BASIS WITH
SPECTROMETRIC DATA

Consider a problem where the goal is to estimate a
certain quantity $p \in \mathbb{R}$ from a measured absorption
spectrum $X$ based on the set of $N$ training examples
$(X_j, p_j)_{j=1}^{N}$. In practice, each spectrum $X_j$ is
available only as a set of discretized measurements
$(x_i^j, y_i^j)_{i=1}^{m}$, where $x_i^j \in [a,b] \subset \mathbb{R}$
stands for the observation wavelength and
$y_i^j \in \mathbb{R}$ is the response.

Adopting the FDA framework (Ramsay et al.,
1997), our goal is to build a prediction model $F$ so
that $\hat{p} = F(X)$. Here, the argument $X$ is a real-world
spectrum, i.e. a continuous function that maps wavelengths
to responses. Without much loss of generality
it can be assumed that $X$ belongs to $L^2([a,b])$, the space
of square integrable functions on the interval $[a,b]$.
However, since the spectrum $X$ is unknown and infinite
dimensional, it is impossible to build the model $F(X)$
in practice. Therefore $X$ must be approximated with a
$q$-dimensional representation $\omega = P(X)$,
$P : L^2([a,b]) \to \mathbb{R}^q$, and our prediction model
becomes $\hat{p} = F(\omega)$. Naturally, in order to obtain
dimensionality reduction, we require that $q$ is smaller
than the number of points in the spectra.

Figure 1 presents a graph of the overall prediction
method. Gaussian fitting is used for the approximation
of $X$. The obtained vectors $\omega$ are further scaled by a
diagonal matrix $A$ before the final LS-SVM modeling.
The following sections explain these steps in greater
detail.

Figure 1. Outline of the prediction method

Gaussian Fitting: Approximating
Spectral Function X

Because the space $L^2([a,b])$ is an infinite dimensional
function space, it is necessary to consider some finite
dimensional subspace $V \subset L^2([a,b])$ in order to obtain
a feasible function approximation. We define $V$ by a
set of Gaussian functions

$$\varphi_k(x) = \exp\left(-\frac{(x - t_k)^2}{\sigma_k^2}\right), \quad k = 1, \ldots, q, \tag{1}$$

where $t_k$ is the center and $\sigma_k$ is the width parameter. The
set $\varphi_k(x)$ spans a $q$-dimensional normed vector space
and we can write $V = \mathrm{span}\{\varphi_k(x)\}$. A natural choice for
the norm is the $L^2$ norm:

$$\|f\|_V = \left(\int_a^b f(x)^2 \, dx\right)^{1/2}.$$

Now $X$ can be approximated using the basis representation
$\hat{X}(x) = \omega^T \varphi(x)$, where
$\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_q(x)]^T$.
For a single spectrum with measurements $(x_i, y_i)_{i=1}^{m}$,
the weights $\omega$ are chosen to minimize the square
error:

$$\min_{\omega} \sum_{i=1}^{m} \left(y_i - \omega^T \varphi(x_i)\right)^2. \tag{2}$$

In other words, we simply fit a function to the points
$(x_i, y_i)_{i=1}^{m}$ using the basis functions $\varphi_k(x)$. Now, any
function $\hat{X} \in V$ is uniquely determined by the weight
vector $\omega$. This suggests that it is equivalent to analyze
the discrete weight vectors $\omega$ instead of the continuous
functions $\hat{X}$.
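
The fitting step can be sketched as a small least squares problem.
The code below is an illustration under stated assumptions rather
than the authors' implementation: it assumes the Gaussian form
exp(-(x - t)^2 / s^2), solves the normal equations by Gaussian
elimination (a standard route for small, well-conditioned problems),
and all function names are hypothetical.

```python
import math


def gaussian(x, t, s):
    """Gaussian basis function with center t and width s (assumed form)."""
    return math.exp(-((x - t) ** 2) / s ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting. a is a list of rows, b a list of values."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_weights(xs, ys, centers, widths):
    """Least squares weights of the Gaussian basis for one spectrum:
    minimizes sum_i (y_i - w^T phi(x_i))^2 via the normal equations."""
    g = [[gaussian(x, t, s) for t, s in zip(centers, widths)] for x in xs]
    q = len(centers)
    gtg = [[sum(row[i] * row[j] for row in g) for j in range(q)]
           for i in range(q)]
    gty = [sum(row[i] * y for row, y in zip(g, ys)) for i in range(q)]
    return solve(gtg, gty)
```

Each spectrum of m points is thereby reduced to q weights, which is
the dimensionality reduction discussed above.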

Orthonormalization

Radially symmetric models (such as the LS-SVM) depend
only on the distance metric $d(\cdot,\cdot)$ in the input space.
Thus, we require that the mapping from $V$ to $\mathbb{R}^q$ is
isometric, i.e. $d_V(f,g) = d_q(\alpha,\beta)$ for any functions
$f(x) = \alpha^T \varphi(x)$ and $g(x) = \beta^T \varphi(x)$. The first distance
is calculated in the function space and the latter one in
$\mathbb{R}^q$. In the space $V$, distances are defined by the norm,
$d_V(f,g) = \|f - g\|_V$. Now a simple calculation gives

$$\|f - g\|_V^2 = (\alpha - \beta)^T \Phi (\alpha - \beta),$$

where

$$\Phi_{i,j} = \int_a^b \varphi_i(x) \varphi_j(x) \, dx.$$

This implies that if the basis is orthonormal, the matrix
$\Phi$ becomes an identity matrix and the distances
become equal, i.e.

$$\|f - g\|_V = \|\alpha - \beta\|_q = \left((\alpha - \beta)^T (\alpha - \beta)\right)^{1/2}.$$

Unfortunately this is not the case with the Gaussian basis,
and a linear transformation $\tilde{\omega} = U\omega$ needs to be applied.
Here the matrix $U$ is the Cholesky factor of
$\Phi = U^T U$. In fact, the transformed weights $\tilde{\omega}$ are related
to a set of new basis functions $\tilde{\varphi} = U^{-T} \varphi$ that are both
optimized to fit the data and orthonormal.

Finding an Optimal Gaussian Basis

When the basis functions are fixed, the weights $\omega$ are
obtained easily by solving the problem (2). The solution
is given by the pseudoinverse, $\omega = (G^T G)^{-1} G^T y$ (Haykin,
1999), where $y = [y_1, y_2, \ldots, y_m]^T$ are the values to be
fitted and $[G]_{i,k} = \varphi_k(x_i)$.

Since the Gaussian functions are differentiable, the
locations and widths can be optimized for a better fit.
The average fitting error of all functions is obtained
by averaging Eq. (2) over all of the sample inputs j =
1, ..., N. Using the matrix notation given above, it can
be formulated as

$$E = \frac{1}{N} \sum_{j=1}^{N} \left(G\omega^{(j)} - y^{(j)}\right)^T \left(G\omega^{(j)} - y^{(j)}\right),$$

where $y^{(j)}$ and $\omega^{(j)}$ denote the measured values and the
weights of the $j$-th spectrum. This error can be differentiated
with respect to $t_k$ and $\sigma_k$ (Kärnä & Lendasse, 2007).

Knowing the partial derivatives, the locations and
the widths can be optimized using unconstrained
non-linear optimization. In this article, the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method
with line search is suggested. The formulation of the
BFGS algorithm can be found in Bazaraa, Sherali and
Shetty (1993).

An example of spectral data and the optimized basis
functions is presented in Figure 2. This application is
related to the prediction of fat content in meat samples using
NIR absorption spectra (Kärnä et al., 2007; Rossi et al.,
2006; Thodberg, 1996). It can be seen that the basis has
adapted to the data: there are narrow functions in the
center where there is more variance in the data.

Figure 2. Above: NIR absorption spectra. Below: 13
optimized basis functions
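
The orthonormalization step can be checked numerically. The sketch
below is an illustration only: the Gram matrix of the Gaussian basis
is approximated with the trapezoid rule (an assumption, since the
integrals could also be evaluated in closed form), the Gaussian form
exp(-(x - t)^2 / s^2) is assumed, and all function names are
hypothetical. It verifies that distances between transformed weight
vectors agree with distances between the corresponding functions.

```python
import math


def gaussian(x, t, s):
    """Gaussian basis function with center t and width s (assumed form)."""
    return math.exp(-((x - t) ** 2) / s ** 2)


def gram_matrix(centers, widths, a, b, steps=2000):
    """Approximate Phi[i][j] = integral over [a, b] of phi_i * phi_j
    with the trapezoid rule on steps + 1 equally spaced points."""
    q = len(centers)
    h = (b - a) / steps
    phi = [[0.0] * q for _ in range(q)]
    for n in range(steps + 1):
        x = a + n * h
        w = h / 2 if n in (0, steps) else h
        vals = [gaussian(x, t, s) for t, s in zip(centers, widths)]
        for i in range(q):
            for j in range(q):
                phi[i][j] += w * vals[i] * vals[j]
    return phi


def cholesky_upper(phi):
    """Return the upper-triangular U with phi = U^T U."""
    q = len(phi)
    low = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(phi[i][i] - s)
            else:
                low[i][j] = (phi[i][j] - s) / low[j][j]
    return [[low[j][i] for j in range(q)] for i in range(q)]  # transpose


def matvec(m, v):
    """Matrix-vector product for lists of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]
```

After the transformation, a radially symmetric model applied to the
vectors U times omega effectively measures distances in the function
space V.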

Variable Scaling 

Variable scaling can be seen as a generalization of 
variable selection; in variable selection variables are 
either included in the training set (corresponding to 
multiplication by 1 ) or excluded from it (corresponding 
to multiplication by 0), while in variable scaling the 
entire range [0,1] of scalars is allowed. In this article, 
we present a method for choosing the scaling using 
Delta Test (DT) (Lendasse, Corona, Hao, Reyhani, & 
Verleysen, 2006). 

The scalars are generated by iterative Forward-Backward Selection (FBS) (Benoudjit et al., 2004; Rossi et al., 2006). FBS is usually used for variable selection, but it can be extended to scaling as well: instead of turning scalars from 0 to 1 or vice versa, increases by 1/h (in the case of forward selection) or decreases by 1/h (in the case of backward selection) are allowed. The integer h is a constant grid parameter. Starting from an initial scaling, the FBS algorithm changes each of the scalars by ±1/h and accepts the change that results in the best improvement. The process is repeated until no improvement is found. The process is initialized with several sets of random scalars.
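
The search described above can be sketched as follows. This is a minimal illustration under stated assumptions: `criterion` stands for any function to be minimized (the article uses the Delta Test), only a single starting point is used instead of several random restarts, and the function names and the toy criterion are hypothetical.

```python
def fbs_scaling(criterion, init, h=10):
    """Greedy forward-backward search over scalars on the grid {0, 1/h, ..., 1}.

    criterion: callable taking a list of scalars and returning a value to minimize.
    init: initial scalars, assumed to lie on the grid.
    """
    steps = [round(s * h) for s in init]      # integer grid indices avoid float drift
    best = criterion([k / h for k in steps])
    improved = True
    while improved:
        improved = False
        best_move = None
        for i in range(len(steps)):
            for d in (+1, -1):                # forward (+1/h) and backward (-1/h) moves
                k = steps[i] + d
                if not 0 <= k <= h:
                    continue
                cand = steps[:i] + [k] + steps[i + 1:]
                val = criterion([c / h for c in cand])
                if val < best:                # keep the single best change of this sweep
                    best, best_move, improved = val, cand, True
        if improved:
            steps = best_move
    return [k / h for k in steps], best

# Toy criterion (an assumption for the demo): squared distance to a known scaling.
target = [1.0, 0.0, 0.5]

def toy_criterion(scalars):
    return sum((s - t) ** 2 for s, t in zip(scalars, target))

scalars, value = fbs_scaling(toy_criterion, [0.5, 0.5, 0.5], h=10)
print(scalars, value)
```

In practice the search would be repeated from several random initial scalings and the best result kept, since the greedy search can stop in a local minimum of the criterion.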

DT is a method for estimating the variance of the noise within a data set. Having a set of general input-output pairs $(x_i, y_i)_{i=1}^{N} \in \mathbb{R}^m \times \mathbb{R}$ and denoting the nearest neighbor of $x_i$ by $x_{NN(i)}$, the DT variance estimate is

$$\delta = \frac{1}{2N} \sum_{i=1}^{N} \left| y_{NN(i)} - y_i \right|^2$$

where $y_{NN(i)}$ is the output of $x_{NN(i)}$. Thus, $\delta$ is equivalent to the residual (i.e. prediction error) of a first-nearest-neighbor model. DT is useful in evaluating the dependence of random variables and therefore it can be used for scaling: the set of scalars that gives the smallest $\delta$ is selected.
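
A brute-force version of the Delta Test is short enough to write out. The sketch below is an illustration only: the optional `scaling` argument, the function name and the toy data (one relevant input, one irrelevant input) are assumptions made for the example.

```python
def delta_test(xs, ys, scaling=None):
    """Delta Test: half the mean squared output difference between nearest neighbors.

    xs: list of input vectors, ys: list of outputs,
    scaling: optional per-variable scalars applied before the neighbor search.
    """
    n = len(xs)
    if scaling is None:
        scaling = [1.0] * len(xs[0])
    total = 0.0
    for i in range(n):
        best_d, best_j = float("inf"), None
        for j in range(n):
            if j == i:
                continue
            d = sum((s * (a - b)) ** 2 for s, a, b in zip(scaling, xs[i], xs[j]))
            if d < best_d:
                best_d, best_j = d, j
        total += (ys[best_j] - ys[i]) ** 2
    return total / (2 * n)

# Toy data: the output depends only on the first variable; the second is a distractor.
x1 = [i / 10 for i in range(10)]
x2 = [0.0, 0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5]
xs = [[a, b] for a, b in zip(x1, x2)]
ys = x1[:]

delta_full = delta_test(xs, ys, scaling=[1.0, 1.0])      # both variables used
delta_selected = delta_test(xs, ys, scaling=[1.0, 0.0])  # distractor scaled to zero
print(delta_full, delta_selected)
```

Scaling the distractor to zero gives the smaller estimate here, which is the behavior the scaling search relies on. The double loop costs O(N²) distance evaluations; a spatial index would be needed for large data sets.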

LS-SVM 

LS-SVM is a least squares modification of the Support Vector Machine (SVM) (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002). The quadratic optimization problem of SVM is simplified so that it reduces to a linear set of equations. Moreover, regression SVM usually involves three unknown parameters while LS-SVM has only two: the regularization parameter $\gamma$ and the width parameter $\theta$.

Given a set of N training examples $(x_i, y_i)_{i=1}^{N} \in \mathbb{R}^m \times \mathbb{R}$, the LS-SVM model is $y = w^T\psi(x) + b$, where $\psi: \mathbb{R}^m \to \mathbb{R}^n$ is a mapping from the input space onto a higher dimensional hidden space, $w \in \mathbb{R}^n$ is a weight vector and b is a bias term. The optimization problem is formulated as

$$\min_{w,b} J(w, b) = \frac{1}{2}\|w\|^2 + \frac{1}{2}\gamma \sum_{i=1}^{N} e_i^2$$

so that $y_i = w^T\psi(x_i) + b + e_i$,

where $e_i$ is the prediction error and $\gamma > 0$ is a regularization parameter. The dual problem is derived using Lagrangian multipliers, which leads to a linear KKT system that is easy to solve (Suykens et al., 2002). Using the dual solution, the original model can be reformulated as

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b$$

where the kernel $K(x, x_i) = \psi(x)^T\psi(x_i)$ is a continuous and symmetric mapping from $\mathbb{R}^m \times \mathbb{R}^m$ to $\mathbb{R}$ and $\alpha_i$ are the Lagrange multipliers. A widely used choice for K is the standard Gaussian kernel $K(x_1, x_2) = e^{-\|x_1 - x_2\|^2/\theta^2}$.
The LS-SVM prediction is the final step in the pro- 
posed method where spectral data is compressed by the 
Gaussian fitting and the fitting weights are normalized 
and scaled before the prediction. More elaborate discus- 
sion and applications to real-world data are presented 
in Kama et al. (2007). 
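
To make the "linear set of equations" concrete, the following sketch trains an LS-SVM regressor by solving the standard KKT system from the LS-SVM literature, in which the first row enforces that the multipliers sum to zero and the remaining rows use the kernel matrix plus 1/gamma on the diagonal. The function names, the plain Gaussian elimination solver, the toy data and the parameter values are assumptions for illustration, not the authors' implementation.

```python
import math

def gaussian_kernel(x1, x2, theta):
    """Gaussian kernel exp(-||x1 - x2||^2 / theta^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x1, x2)) / theta ** 2)

def solve_linear(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lssvm_fit(xs, ys, gamma, theta):
    """Return (b, alphas) from the KKT system [[0, 1^T], [1, K + I/gamma]]."""
    n = len(xs)
    a = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [gaussian_kernel(xs[i], xs[j], theta) for j in range(n)]
        row[i + 1] += 1.0 / gamma
        a.append(row)
    sol = solve_linear(a, [0.0] + list(ys))
    return sol[0], sol[1:]

def lssvm_predict(x, xs, alphas, b, theta):
    """Evaluate y(x) = sum_i alpha_i K(x, x_i) + b."""
    return b + sum(al * gaussian_kernel(x, xi, theta) for al, xi in zip(alphas, xs))

# Toy one-dimensional regression problem (made-up data).
xs = [[0.0], [0.5], [1.0], [1.5], [2.0]]
ys = [math.sin(x[0]) for x in xs]
gamma, theta = 100.0, 1.0
b, alphas = lssvm_fit(xs, ys, gamma, theta)
print(lssvm_predict([0.75], xs, alphas, b, theta))
```

Because the training errors satisfy e_i = alpha_i / gamma, a larger gamma forces the model closer to the training outputs, while a smaller gamma regularizes more strongly.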



FUTURE TRENDS 

The only unknown parameter in the proposed method is the number of basis functions, which is selected by validation. In the future, other methods for determining a good basis size should be developed in order to speed up the process. Moreover, the methodology should be tested with various data sets, including data other than spectra. The LS-SVM predictor could also be replaced with another model.

Although the proposed Gaussian fitting combined 
with LS-SVM model seems to be fairly robust, the 
relation between the basis functions and the prediction 
performance should be studied in detail. It would be 
desirable to optimize the basis directly for best possible 
prediction performance (instead of good data fitting), 
although it seems difficult due to over-fitting and high 
computational costs. 



CONCLUSION 

This article deals with the problem of finding a good set 
of basis functions for dimension reduction of spectral 
data. We have proposed a method based on Gaussian 
basis functions where the locations and the widths of 
the functions are optimized to fit the data as accurately 
as possible. The basis indeed tends to follow the nature 
of the data and provides a good tool for dimension 
reduction. Other methods, such as the proposed DT 
scaling, will benefit from the smaller data dimension 
and help to achieve even better data compression. The 
LS-SVM model is a robust and fast method to be used 
in the final prediction. 



KEY TERMS 

Chemometrics: Application of mathematical or 
statistical methods to chemical data. Closely related 
to monitoring of chemical processes and instrument 
design. 

Curse of Dimensionality: A theoretical result in 
machine learning that states that the lower bound of 
error that an adaptive machine can achieve increases 
with data dimension. Thus performance will degrade 
as data dimension grows. 

Delta Test: A Non-parametric Noise Estimation 
method. Estimates the amount of noise within a data 
set, i.e. the amount of information that cannot be 
explained by any model. Therefore Delta Test can be 
used to obtain a lower bound of learning error which 
can be achieved without risk of over-fitting. 

Functional Data Analysis: A statistical approach 
where multivariate data are treated as functions instead 
of discrete vectors. 



Least Squares Support Vector Machine: A least squares modification of the Support Vector Machine which leads to solving a linear set of equations. It also bears a close resemblance to Gaussian Processes.

Machine Learning: An area of Artificial Intel- 
ligence dealing with adaptive computational methods 
such as Artificial Neural Networks and Genetic Al- 
gorithms. 

Over-Fitting: A common problem in Machine 
Learning where the training data can be explained well 
but the model is unable to generalize to new inputs. 
Over-fitting is related to the complexity of the model: 
any data set can be modelled perfectly with a model 
complex enough, but the risk of learning random fea- 
tures instead of meaningful causal features increases. 

Support Vector Machine: A kernel-based supervised learning method used for classification and regression. The data points are projected into a higher dimensional space where they are linearly separable. The projection is determined by the kernel function and a set of specifically selected support vectors. The training process involves solving a Quadratic Programming problem.

Variable Selection: Process where unrelated input variables are discarded from the data set. Variable selection is usually based on correlation or noise estimators of the input-output pairs and can lead to significant improvement in performance.



INTRODUCTION 

Functional networks are a generalization of neural 
networks, which is achieved by using multiargument 
and learnable functions, i.e., in these networks the 
transfer functions associated with neurons are not fixed 
but learned from data. In addition, there is no need 
to include parameters to weigh links among neurons 
since their effect is subsumed by the neural functions. 
Another distinctive characteristic of these models is that the specification of the initial topology for a functional network can be based on the features of the problem at hand. Therefore, knowledge about the problem can guide the development of a network structure, although in the absence of this knowledge a general model can always be used.

In this article we present a review of the field of 
functional networks, which will be illustrated with 
practical examples. 



BACKGROUND

Artificial Neural Networks (ANN) are a powerful tool to build systems able to learn and adapt to their environment, and they have been successfully applied in many fields. Their learning process consists of adjusting the values of their parameters, i.e., the weights connecting the network's neurons. This adaptation is carried out through a learning algorithm that tries to fit some training data representing the problem to be learnt. This algorithm is guided by the minimization of some error function that measures how well the ANN is fitting the training data (Bishop, 1995). This process is called parametric learning. One of the most popular neural network models is the Multilayer Perceptron (MLP), for which many learning algorithms can be used: from the brilliant backpropagation (Rumelhart, Hinton & Williams, 1986) to the more complex and efficient Scaled Conjugate Gradient (Moller, 1993) or Levenberg-Marquardt algorithms (Hagan & Menhaj, 1994).

In addition, the topology of the network (number of layers, neurons, connections, activation functions, etc.) also has to be determined. This is called structural learning and it is carried out mostly by trial and error.

As a result, there are two main drawbacks in dealing with neural networks:

1. The resulting function lacks the possibility of a physical or engineering interpretation. In this sense, Neural Networks act as black boxes.

2. There is no guarantee that the weights provided by the learning algorithm correspond to a global optimum of the error function; it can be a local one.



Models like Generalized Linear Networks (GLN) present a unique global optimum that can be obtained by solving a set of linear equations. However, their mapping function is limited, as this model consists of a single layer of adaptive weights ($w_j$) to produce a linear combination of non-linear functions ($\phi_j$):

$$y(x) = \sum_{j} w_j \phi_j(x)$$



Some other popular models are Radial Basis Function Networks (RBF), whose hidden units use distances to a prototype vector ($\mu_j$) followed by a transformation with a localized function like the Gaussian:

$$y(x) = \sum_{j} w_j \phi_j(x) = \sum_{j} w_j \exp\left(-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\right)$$

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.



The resulting architecture is simpler than that of the MLP, therefore reducing the complexity of structural learning and favoring the possibility of physical interpretation. However, these networks present some other limitations, such as their inability to distinguish non-significant input variables (Bishop, 1995) or to learn some logic transformations (Moody & Darken, 1989), or the need for a large number of nodes even for a linear map if the precision requirement is high (Youssef, 1993).

Due to these limitations, some models have appeared that extend the original ANN, such as fuzzy neural networks (Gupta & Rao, 1994), growing neural networks, or probabilistic neural networks (Specht, 1990). Nowadays, the majority of these models still act as black boxes. Functional networks (Castillo, 1998; Castillo, Cobo, Gutierrez, & Pruneda, 1998), a relatively new extension of neural networks, take into account the functional structure and properties of the process being modeled, which naturally determine the initial network's structure. Moreover, the estimation of the network's weights is often based on an error function that can be minimized by solving a system of linear equations, therefore leading faster to a unique and global solution.



DESCRIPTION OF FUNCTIONAL NETWORKS

Functional networks (FN) are a generalization of neural networks, which is achieved by using multiargument and learnable functions (Castillo, 1998; Castillo, Cobo, Gutierrez, & Pruneda, 1998), i.e., the shape of the functions associated with neurons is not fixed but learned from data. In this case, it is not necessary to include weights to weigh links among neurons since their effect is subsumed by the neural functions. Figure 1 shows an example of a general FN for I=N explanatory variables.

Figure 1. Generalized model for functional networks

Functional networks consist of the following elements:

a. Several layers of storing units (represented in Figure 1 by small filled circles). These units are used for the storage of both the input and the output of the network, or to store intermediate information (see units $y_i^{(k)}$ in Figure 1).

b. One or more layers of functional units or neurons (represented by open circles with the name of each of the functional units inside). These neurons include a function that can be multivariate and that can have as many arguments as inputs. These arguments, and therefore the form of the neural functions, are learnt during training. By applying their functions, neurons evaluate a set of input values in order to return a set of output values to the next layer of storing units. In this general
model each neural function $f_j^{(m)}$ is defined as the following composition:

$$f_j^{(m)}\left(y_1^{(m-1)}, \ldots, y_{N_{m-1}}^{(m-1)}\right) = g_j^{(m)}\left(h_{j1}^{(m)}\left(y_1^{(m-1)}\right), \ldots, h_{jN_{m-1}}^{(m)}\left(y_{N_{m-1}}^{(m-1)}\right)\right)$$

where the superscript (m) is the number of the layer. The functions $g_j^{(m)}$ are known and fixed before training, for example to be the sum or product. In contrast, the functions $h_{jk}^{(m)}$ are linear combinations of other known functions $\phi_{jkz}^{(m)}$ (for example, polynomials, cosines, etc.), i.e.

$$h_{jk}^{(m)}\left(y_k^{(m-1)}\right) = \sum_{z=1}^{n_{jk}^{(m)}} a_{jkz}^{(m)} \phi_{jkz}^{(m)}\left(y_k^{(m-1)}\right)$$

where the coefficients $a_{jkz}^{(m)}$ implied in this linear combination are the model parameters to be learned. As can be observed, MLPs, GLNs and RBFs are particular cases of this generalized model.
c. A set of directed links that connect the functional 
units and the storing units. These connections 
indicate the direction of the flow of information. 
The general FN in Figure 1 does not have arrows 
that converge in the same storing unit, but if it did, 
this would indicate that the neurons from which 
they emanate must produce identical outputs. 
This is an important feature of FNs that is not 
available for neural networks. These converging 
arrows represent constraints which can arise from 
physical and/or theoretical characteristics of the 
problem under consideration. 

Learning in Functional Networks 

Functional networks combine knowledge about the 
problem to determine the network, and training data 
to estimate the unknown neural functions. Therefore, 
in contradistinction to neural networks, FNs include 
two types of learning: 

1. Structural learning. The specification of the initial topology for an FN can be based on the features of the problem at hand (Castillo, Cobo, Gutierrez, & Pruneda, 1998). Usually knowledge about the problem can be used in order to develop a network structure. An important feature of FNs is that they allow managing functional restrictions determined by some known properties of the model to be estimated. These restrictions can be represented by forcing the outputs of some neurons to coincide in a unique storage unit. Later on, the network can be translated into a system of functional equations that usually can be simplified in order to obtain a simpler but equivalent architecture. Finally, in the absence of knowledge about the problem, the general model shown in Figure 1 can always be used.

2. Parametric learning. This second stage refers to the estimation of the neurons' functions. Often these neural functions are considered to be linear combinations of functional families, and therefore the parametric learning consists of estimating both the arguments of the neural functions and the parameters of the linear combination using the available training data. It is important to remark that this type of learning generalizes the idea of estimating the weights of a neural network.

An Example Of A Functional Network 

In this section the use of FNs is illustrated by means of a simple artificial example. Suppose a problem of engine diagnosis for which three continuous variables (x = 'vibrations', y = 'oil density', z = 'temperature') are being monitored. The problem is to estimate the probability P of a given diagnosis based on these variables, i.e., P(x, y, z). Moreover, we know that the information provided by the monitored variables is accumulative. Therefore, it is possible, for example, to calculate first the probability $P_1(x, y)$ of a diagnosis based only on variables x and y, and later on, when variable z is available, combine the value provided by $P_1$ with the new information z to obtain P(x, y, z). That is, there exist some functions such that:

$$P(x, y, z) = F[P_1(x, y), z] = K[P_2(y, z), x] = L[P_3(x, z), y] \qquad (1)$$

This situation suggests the structure of the FN shown in Figure 2a, where I is the identity function. The coincident connections in the output storing unit, or equivalently eq. 1, establish strong restrictions on the functions $P_1, P_2, P_3, F, K, L$. Methods for functional equations make it possible to deal with eq. 1 in order to obtain the corresponding functional conditions, from which it is possible to derive a new equation for function P:

$$P(x, y, z) = k[p(x) + q(y) + r(z)]$$

This leads to the new, simpler FN represented in Figure 2b, which is equivalent to that of Figure 2a.
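
A quick numerical check can make the equivalence of the two architectures more tangible. The sketch below picks arbitrary strictly monotonic functions (a logistic k and linear p, q, r, all of them illustrative assumptions) and verifies that the additive form satisfies the accumulative decompositions of eq. 1.

```python
import math

# Hypothetical choices of the component functions, for illustration only.
def k(t):
    return 1.0 / (1.0 + math.exp(-t))   # logistic, maps sums to (0, 1)

def k_inv(u):
    return math.log(u / (1.0 - u))      # inverse of the logistic

def p(x):
    return 0.8 * x

def q(y):
    return -0.5 * y

def r(z):
    return 0.3 * z

def P(x, y, z):
    """Simplified network: P = k[p(x) + q(y) + r(z)]."""
    return k(p(x) + q(y) + r(z))

# Partial diagnoses and combiners of the original network.
def P1(x, y):
    return k(p(x) + q(y))

def F(u, z):
    return k(k_inv(u) + r(z))

def P2(y, z):
    return k(q(y) + r(z))

def K(u, x):
    return k(k_inv(u) + p(x))

def P3(x, z):
    return k(p(x) + r(z))

def L(u, y):
    return k(k_inv(u) + q(y))

x, y, z = 1.2, 0.7, 2.5
print(P(x, y, z), F(P1(x, y), z), K(P2(y, z), x), L(P3(x, z), y))
```

All four printed values coincide up to rounding: the order in which the three measurements are accumulated does not matter once P has the additive form.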

A Comparison Between Functional and 
Neural Networks 

Although FNs are extensions of neural networks, there are some main features that distinguish the two models:

1. Neural networks are derived only from data about the problem. However, FNs can also use knowledge about the problem to derive their topology, incorporating properties of the function to be modeled.

2. During learning in neural networks the shape of the neural functions is fixed, usually to be a sigmoid-type function, and only the weights can be adapted. In FNs, neural functions are also learnt.

3. Neural functions that can be employed in neural networks are limited and belong to some known family. Also, for each layer the same function is used for every neuron. In FNs any arbitrary function can be used for each neuron.

4. These functions can be multiargument and multivariate. In neural networks activation functions have only one argument (a combination of several input data).

5. In FNs it is possible to force the output of some neurons to coincide by connecting them to the same storing unit. These connections are restrictions on the model that sometimes can be used to derive a simpler model.

Some Functional Network Models 

In this section some typical FN models are presented, which can be used to solve several real problems.

The Uniqueness Model 

This is a simple but very powerful model for which the output z of the corresponding FN architecture can be written as a function of the inputs x and y,

$$z = F(x, y) = f_3^{-1}(f_1(x) + f_2(y)) \qquad (2)$$

Uniqueness of Representation. For this model to have uniqueness of solution it is only required to fix the functions $f_1, f_2, f_3$ at a point (see explanation in Castillo, Cobo, Gutierrez, & Pruneda, 1998).

Learning the model. Learning the function F(x, y) in eq. 2 is equivalent to learning the functions $f_1, f_2, f_3$ from a data set $\{(x_j, y_j, z_j);\ j = 1, \ldots, n\}$ where $z_j$ is the desired output for the given inputs. To estimate $f_1, f_2, f_3$ we can employ the non-linear and linear methods:

1. The Non-Linear Method. We approximate each of the functions $f_1$, $f_2$, $f_3^{-1}$ in eq. 2 by considering them to be a linear combination of known functions from a given family (e.g., polynomial). Finally, the following sum of squared errors is minimized,

$$Q = \sum_{j=1}^{n} \left( z_j - \sum_{k=1}^{m_3} a_{3k}\phi_{3k}\left( \sum_{i=1}^{m_1} a_{1i}\phi_{1i}(x_j) + \sum_{i=1}^{m_2} a_{2i}\phi_{2i}(y_j) \right) \right)^2$$

2. Linear Method. A simplification of the non-linear method can be done by considering the following equivalence:

$$f_3(z_j) = f_1(x_j) + f_2(y_j); \quad j = 1, \ldots, n$$

Again the functions $f_s$ can be approximated as a linear combination of known functions from a given family. Finally, the following sum of squared errors,

$$Q = \sum_{j=1}^{n} \left( \sum_{i=1}^{m_1} a_{1i}\phi_{1i}(x_j) + \sum_{i=1}^{m_2} a_{2i}\phi_{2i}(y_j) - \sum_{i=1}^{m_3} a_{3i}\phi_{3i}(z_j) \right)^2$$

can be minimized by solving a system of linear equations, where the unknowns are the coefficients $a_{si}$, as demonstrated in (Castillo, Cobo, Gutierrez, & Pruneda, 1998).
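
The linear method can be sketched on a toy problem. The example below is a simplified illustration, not the procedure of the cited reference: instead of fixing the functions at a point, it removes the trivial all-zero solution by fixing the coefficient of z² in f_3 to 1 (a normalization assumed here for brevity), uses small polynomial families, and solves the resulting normal equations. Data, basis choices and function names are all hypothetical.

```python
import math

def solve_linear(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_uniqueness_linear(data):
    """Fit f1(x) = a1*x + a2*x^2, f2(y) = b1*y + b2*y^2, f3(z) = c1*z + z^2.

    Minimizes sum_j (f1(x_j) + f2(y_j) - f3(z_j))^2, which is linear in
    (a1, a2, b1, b2, c1) once the z^2 coefficient is fixed to 1.
    """
    rows = [[x, x * x, y, y * y, -z] for x, y, z in data]
    target = [z * z for _, _, z in data]
    m = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    atb = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(m)]
    return solve_linear(ata, atb)

# Toy data generated from z = sqrt(x^2 + y^2), i.e. f3(z) = z^2 = x^2 + y^2.
grid = [0.5, 1.0, 1.5, 2.0]
data = [(x, y, math.sqrt(x * x + y * y)) for x in grid for y in grid]
coeffs = fit_uniqueness_linear(data)
print([round(c, 6) for c in coeffs])
```

For this data the recovered coefficients should be close to (0, 1, 0, 1, 0), i.e. f_1(x) = x², f_2(y) = y² and f_3(z) = z², so a prediction is obtained as the inverse of f_3 applied to f_1(x) + f_2(y).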

The Generalized Associativity Model 

Figure 3a shows a generalized associativity FN of three inputs, where the nodes I represent the identity function. This model is based on the generalized associative property, that is, the output of this network can be obtained as a function of G(x, y) and the input z, or as a function of the input x and N(y, z). This property is represented by the links converging on the output node u, which leads to the functional equation

$$F[G(x, y), z] = K[x, N(y, z)] \qquad (3)$$

Figure 3. The generalized associativity functional network



Simplification of the model. It can be shown that the general solution of eq. 3 is:

$$F(x, y) = k[f(x) + r(y)], \qquad G(x, y) = f^{-1}[p(x) + q(y)],$$
$$K(x, y) = k[p(x) + n(y)], \qquad N(x, y) = n^{-1}[q(x) + r(y)] \qquad (4)$$

where f, r, k, n, p, q are arbitrary continuous and strictly monotonic functions. Substituting eq. 4 in eq. 3, the following result is obtained

$$F[G(x, y), z] = K[x, N(y, z)] = u = k[p(x) + q(y) + r(z)] \qquad (5)$$

Thus, the FN in Figure 3b is equivalent to the FN in Figure 3a.
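
The substitution can also be checked numerically. In the sketch below the six functions are arbitrary strictly monotonic choices made up for the demonstration; the code builds F, G, K and N from them and confirms that both sides of the functional equation collapse to the same additive expression.

```python
import math

# Hypothetical strictly monotonic functions (illustrative choices only).
def f(t):
    return 2.0 * t

def f_inv(t):
    return t / 2.0

def n(t):
    return 3.0 * t - 1.0

def n_inv(t):
    return (t + 1.0) / 3.0

def k(t):
    return math.tanh(t)

def p(x):
    return x

def q(y):
    return 0.5 * y

def r(z):
    return 0.25 * z

# The four two-argument functions of the general solution.
def F(a, b):
    return k(f(a) + r(b))

def G(a, b):
    return f_inv(p(a) + q(b))

def K(a, b):
    return k(p(a) + n(b))

def N(a, b):
    return n_inv(q(a) + r(b))

x, y, z = 0.3, -0.4, 1.1
lhs = F(G(x, y), z)            # left-hand side of the functional equation
rhs = K(x, N(y, z))            # right-hand side of the functional equation
simplified = k(p(x) + q(y) + r(z))
print(lhs, rhs, simplified)
```

The three printed values agree up to rounding, which is why only k, p, q and r need to be learned in the simplified network.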

Uniqueness of Representation. By employing func- 
tional equations for the generalized associativity model 
it can be demonstrated that uniqueness of solution 
requires fixing the functions k, p, q, r at a point (see 
Castillo, Cobo, Gutierrez, & Pruneda, 1998). 

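Learning the model. The problem of learning the FN in Figure 3b involves estimating the functions k, p, q, r in eq. 5, which can be rewritten as:

$$k^{-1}(u) = p(x) + q(y) + r(z)$$

Given the observed training sample of size n, $\{(x_{1i}, x_{2i}, x_{3i}, x_{4i}) \mid i = 1, \ldots, n\}$ with $(x_1, x_2, x_3, x_4) = (x, y, z, u)$, we can define the error

$$e_i = p(x_{1i}) + q(x_{2i}) + r(x_{3i}) - k^{-1}(x_{4i}); \quad i = 1, \ldots, n$$

Suppose that each of the functions is a linear combination of known functions from given families (e.g. polynomial), with the sign of the $k^{-1}$ term absorbed into its coefficients. Then, the sum of squared errors is defined as

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( \sum_{k=1}^{4} \sum_{j=1}^{m_k} a_{kj}\phi_{kj}(x_{ki}) \right)^2$$

Employing the Lagrange multipliers technique, with $Q_\lambda$ denoting Q augmented by the multiplier terms that fix each function at a point $\alpha_k$ to a value $\beta_k$, the minimum is obtained by solving the following system of linear equations:

$$\frac{\partial Q_\lambda}{\partial a_{kr}} = 2\sum_{i=1}^{n} e_i \phi_{kr}(x_{ki}) + \lambda_k \phi_{kr}(\alpha_k) = 0; \quad \forall k, r$$

$$\frac{\partial Q_\lambda}{\partial \lambda_k} = \sum_{j=1}^{m_k} a_{kj}\phi_{kj}(\alpha_k) - \beta_k = 0; \quad \forall k$$

where the unknowns are the multipliers $\lambda_1, \ldots, \lambda_4$ and the coefficients in the set $\{a_{kj} \mid j = 1, \ldots, m_k;\ k = 1, 2, 3, 4\}$, which are the parameters of the FN.

Figure 4. Separable functional network architecture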

The Separable Model 

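Consider the equation

$$z = F(x, y) = \sum_{i=1}^{n} f_i(x) g_i(y) = \sum_{j=1}^{m} h_j(x) k_j(y)$$

which can be written as

$$\sum_{i=1}^{n+m} f_i(x) g_i(y) = 0 \qquad (6)$$

where

$$f_i(x) = h_{i-n}(x), \quad g_i(y) = -k_{i-n}(y); \quad i = n+1, \ldots, n+m$$

This suggests the FN in Figure 4a.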

Simplification of the model. Assuming that $\{f_1(x), \ldots, f_r(x)\}$ and $\{g_{r+1}(y), \ldots, g_k(y)\}$ are two sets of linearly independent functions, the general solution of eq. 6, written as

$$\sum_{i=1}^{k} f_i(x) g_i(y) = 0$$

is

$$f_j(x) = \sum_{s=1}^{r} a_{js} f_s(x); \quad j = r+1, \ldots, k$$

$$g_s(y) = -\sum_{j=1}^{k-r} a_{r+j,s}\, g_{r+j}(y); \quad s = 1, \ldots, r$$

By replacing these terms in eq. 6 we obtain

$$z = F(x, y) = \sum_{i=1}^{r} \sum_{j=1}^{s-r} c_{ij} f_i(x) g_j(y) \qquad (7)$$

where $c_{ij}$ are the parameters of the model, which leads to the simplified FN in Figure 4b.

Uniqueness of Representation. In this case the 
uniqueness of representation is given without the need 
of fixing the implied functions at any point. 

Learning the model. In this case a simple least squares method allows obtaining the optimal coefficients $c_{ij}$ using the available data $\{(x_{0i}, x_{1i}, x_{2i}) \mid i = 1, \ldots, n\}$ with $(x_0, x_1, x_2) = (z, x, y)$. In this way, the error can be obtained as

$$e_k = x_{0k} - \sum_{i=1}^{r} \sum_{j=1}^{s-r} c_{ij} f_i(x_{1k}) g_j(x_{2k}); \quad k = 1, \ldots, n$$

Thus, to find the optimum coefficients we minimize the sum of squared errors

$$Q = \sum_{k=1}^{n} e_k^2$$

In this case, the parameters are not constrained by extra conditions, so the minimum can be obtained by solving the following system of linear equations, where the unknowns are the coefficients $c_{ij}$:

$$\frac{\partial Q}{\partial c_{pq}} = -2\sum_{k=1}^{n} e_k f_p(x_{1k}) g_q(x_{2k}) = 0; \quad p = 1, \ldots, r;\ q = 1, \ldots, s-r$$
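
Since the separable model is linear in its coefficients, learning reduces to ordinary least squares. The sketch below fits the coefficients for a two-by-two polynomial basis by solving the normal equations; the basis families, the toy data and the function names are illustrative assumptions.

```python
def solve_linear(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Hypothetical basis families: f = {1, x} and g = {1, y}.
F_BASIS = [lambda x: 1.0, lambda x: x]
G_BASIS = [lambda y: 1.0, lambda y: y]

def design_row(x, y):
    """All products f_i(x) * g_j(y), ordered by i and then j."""
    return [f(x) * g(y) for f in F_BASIS for g in G_BASIS]

def fit_separable(data):
    """Least squares estimate of c_ij in z = sum_ij c_ij f_i(x) g_j(y)."""
    rows = [design_row(x, y) for x, y, _ in data]
    targets = [z for _, _, z in data]
    m = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    atz = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    flat = solve_linear(ata, atz)
    width = len(G_BASIS)
    return [flat[i * width:(i + 1) * width] for i in range(len(F_BASIS))]

# Toy data generated from z = 1 + 2x + 3y + 4xy.
data = [(x, y, 1 + 2 * x + 3 * y + 4 * x * y)
        for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0, 3.0)]
c = fit_separable(data)
print(c)
```

With noise-free data the estimated matrix is close to [[1, 3], [2, 4]], the coefficients of 1, y, x and xy respectively. For larger bases an orthogonal factorization is usually preferred to the normal equations for numerical stability.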



Examples of Applications 

In this section, illustrative examples for two different models of FN are presented. These models were applied to a regression and a classification problem. In all cases, the functions of each layer were approximated by considering a linear combination of known functions from a polynomial family.

Classification Problem 

The first example shows the performance of an FN solving a classification problem: the Wine data set. This database can be obtained from the UCI Machine Learning Repository (see the endnote). The aim of this problem is to determine the origin of wines using a chemical analysis of 13 continuous attributes. The set contains 178 instances that must be classified into three different classes.

For this case, the Separable Model (Figure 4) with three output units was employed. Moreover, its performance is compared to other standard methods: a Multilayer Perceptron (MLP), a Radial Basis Function Network (RBF) and Support Vector Machines (SVM).

Figure 5 shows the comparative results. The first subfigure contains the mean accuracy obtained using a leave-one-out cross-validation method. As can be observed, the FN obtains a very good performance for the test set. Regarding the time required for the learning process, the second subfigure shows that the FNs are comparable with the other methods.

Figure 5. Accuracy and training time obtained by different models for the wine data set

Regression Problem 

In this case the aim of the network is to predict the 
failure shear effort in concrete beams based on several 
geometrical, longitudinal and transversal parameters 
of the beam (Alonso-Betanzos, Castillo, Fontenla- 
Romero, & Sanchez-Marono, 2004). 

An FN, corresponding to the Associative Model (Figure 3), and also an MLP were trained employing ten-fold cross-validation, running 30 simulations using different initial parameter values. A set with 12 samples was kept for further validation of the trained systems. The mean normalized Mean Squared Errors over 30 simulations obtained by the FN were 0.1789 and 0.8460 for test and validation, respectively, while the MLP obtained 0.1361 and 2.9265.



FUTURE TRENDS 

Functional networks are being successfully employed in many different real applications. In engineering problems they have been applied, for instance, to surface reconstruction (Iglesias, Galvez, & Echevarria, 2006). Other works have used these networks for recovering missing data (Castillo, Sanchez-Marono, Alonso-Betanzos, & Castillo, 2003) and for general regression and classification problems (Lacruz, Perez-Palomares & Pruneda, 2006).

Another recent research line is related to the investi- 
gation of measures of fault tolerance (Fontenla-Romero, 
Castillo, Alonso-Betanzos, & Guijarro-Berdinas, 2004), 
in order to develop new learning methods. 



CONCLUSION 

This article presents a review of functional networks. Functional networks are inspired by neural networks and functional equations. This model offers all the advantages of ANNs, such as noise tolerance and generalisation capacity, while adding new ones. One of them is the possibility of using knowledge about the problem to be modeled to derive the initial network topology, thus resulting in a model that admits a physical or engineering interpretation. Another main advantage is that the initially proposed model can be simplified, using functional equations, and learnt by solving a system of linear equations, which speeds up the learning process and prevents it from getting stuck in a local minimum. Finally, the shape of the neural functions does not have to be fixed; they can be fitted from data during training, therefore widening the modeling ability of the network.



KEY TERMS 



Functional Equation: An equation for which its 
unknowns are expressed in terms of both independent 
variables and functions. 

Functional Network: A structure consisting of 
processing units and storing units. These units are 
organized in layers and linked by connections. Each 
processing unit contains a multivariate and multiargu- 
ment function to be learnt during a training process. 

Lagrange Multiplier: Given the function $f(x_1, x_2, \ldots, x_n)$, the Lagrange multiplier $\lambda$ is used to find the extremum of f subject to a constraint $g(x_1, x_2, \ldots, x_n)$ by solving

$$\frac{\partial f}{\partial x_k} + \lambda \frac{\partial g}{\partial x_k} = 0, \quad \forall k = 1, \ldots, n$$



Learning Algorithm: A process that, based on 
some training data representing the problem to be 
learnt, adapts the free parameters of a given model, 
such as a neural network, in order to obtain a desired 
functionality. 

Linear Equation: An algebraic equation involving 
only a constant and first-order (linear) terms. 

Uniqueness: Property of being the only possible 
solution. 



Error Function: When talking about learning, 
this is a function that quantifies how much a system 
has learnt. One of the most popular error functions is 
the mean squared error that measures the differences 
between the answers provided by a system and the 
correct answer. 



ENDNOTE 



Web page: www.ics.uci.edu/~mlearn/MLRepository.html



INTRODUCTION 

State estimation of dynamic systems is a resort often 
used when only a subset of the state variables can be 
directly measured; observers are the entities comput- 
ing the system state from the knowledge of its internal 
structure and its (partially) measured behaviour. The 
problem of discrete event systems (DES) estimation has 
been addressed in (Ramirez, 2003) and (Giua 2003); in 
these works the marking of a Petri net (PN) model of 
a partially observed event driven system is computed 
from the evolution of its inputs and outputs. 

The state of a system can also be inferred using knowledge of the duration of activities. However, this task becomes complex when, besides the absence of sensors, the durations of the operations are uncertain; in this situation the observer obtains and revises a belief that approximates the current system state. Consequently, this approach is useful for non-critical applications of state monitoring and feedback in which an approximate computation is allowed.

The uncertainty of activities duration in DES can 
be handled using fuzzy PN (FPN) (Murata, 1996), 
(Cardoso, 1999), (Hennequin, 2001), (Pedrycz, 2003), 
(Ding, 2005); this PN extension has been applied to 
knowledge modelling (Chen, 1990), (Koriem, 2000), 
(Shen, 2003), planning (Cao, 1996), reasoning (Gao, 
2003) and controller design (Andreu, 1997), (Leslaw, 
2004). 

In these works the proposed techniques include the computation of imprecise markings; however, the class of models dealt with does not include strongly connected PN for the modelling of cyclic behaviour. In this article we address the problem of state estimation of DES by calculating the fuzzy marking of a Fuzzy Timed Petri Net (FTPN); for this purpose a set of matrix expressions for recursively computing the current fuzzy marking is developed. The article focuses on FTPN whose structure is a Marked Graph (called Fuzzy Timed Marked Graph, FTMG) because it allows showing intuitively the problems of marking estimation in systems exhibiting cyclic behaviour.



BACKGROUND 
Possibility Theory 

In possibility theory, a fuzzy set $\tilde{a}$ is used for delimiting ill-known values or for representing values characterized by symbolic expressions. The set is defined as $\tilde{a} = (a_1, a_2, a_3, a_4)$ such that $a_1, a_2, a_3, a_4 \in \mathbb{R}$, $a_1 \le a_2$ and $a_3 \le a_4$. The fuzzy set $\tilde{a}$ delimits the run time as follows:

The values $x_b$, $x_a$ in the ranges $(a_1, a_2)$, $(a_3, a_4)$, respectively, indicate that the activity is possibly executed, with $\tilde{a}(x) \in (0, 1)$. For $x_b$ the function $\tilde{a}(x)$ grows towards 1, which means that the possibility of stopping increases. For $x_a$ the membership function $\tilde{a}(x)$ decreases towards 0, representing a reduction of the possibility of stopping.

The values $(0, a_1]$ mean that the activity is running.

The values $[a_4, +\infty)$ mean that the activity is stopped.

The values $x \in [a_2, a_3]$, $a_2 \le a_3$, represent full possibility, that is $\tilde{a}(x) = 1$; this represents that it is certain that the activity is stopped.

The support of $\tilde{a}$ is the range $x \in [a_1, a_4]$ where $\tilde{a}(x) > 0$.

A fuzzy set $\tilde{a}$ is referred to interchangeably by the function
$\tilde{a}(\tau)$ or the characterization $(a_1, a_2, a_3, a_4)$. For
simplicity, in this work the fuzzy possibility distribution
of the time is described with trapezoidal or triangular
forms. For example, Fig. 1 shows the fuzzy set that
is expressed in natural language as "the activity will
stop about 2.5".

Figure 1. Fuzzy set (axes: time, membership; annotations: uncertainty degree, label, characterization $\tilde{a} = (a_1, a_2, a_3, a_4)$)

Fuzzy extension principle. The fuzzy extension 
principle plays a fundamental role because we can 
extend functions defined on crisp sets to functions on 
fuzzy sets. An important application of this principle 
is a mechanism to operate arithmetically with fuzzy 
numbers. 

Definition. Let $X_1, \dots, X_n$ be crisp sets and let $f$ be a function
such that $f : X_1 \times \dots \times X_n \to Y$. If $\tilde{a}_1, \dots, \tilde{a}_n$ are fuzzy sets
on $X_1, \dots, X_n$, respectively, then $f(\tilde{a}_1, \dots, \tilde{a}_n)$ is the fuzzy
set on $Y$ such that:

$f(\tilde{a}_1, \dots, \tilde{a}_n) = \bigcup_{(x_1,\dots,x_n) \in X_1 \times \dots \times X_n} \left[ \tilde{a}_1(x_1) \wedge \dots \wedge \tilde{a}_n(x_n) \,/\, f(x_1, \dots, x_n) \right]$

If $\tilde{b} = f(\tilde{a}_1, \dots, \tilde{a}_n)$ then $\tilde{b}$ is the fuzzy set on $Y$ such
that:

$\tilde{b}(y) = \bigvee_{(x_1,\dots,x_n) \in X_1 \times \dots \times X_n,\ f(x_1,\dots,x_n) = y} \left[ \tilde{a}_1(x_1) \wedge \dots \wedge \tilde{a}_n(x_n) \right]$

The fuzzy set was characterized as:
$\tilde{a} = \{ \tilde{a}(x_1)/x_1, \dots, \tilde{a}(x_n)/x_n \}$

With the extension principle we can define a simplified
addition operation for fuzzy sets.

Definition. Let $\tilde{a} = (a_1, a_2, a_3, a_4)$ and $\tilde{b} = (b_1, b_2, b_3, b_4)$
be two trapezoidal fuzzy sets. The fuzzy sets addition
operation is: $\tilde{a} \oplus \tilde{b} = (a_1 + b_1, a_2 + b_2, a_3 + b_3, a_4 + b_4)$
(Klir, 1995).
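The component-wise addition above can be sketched in a few lines of Python. This is only an illustrative sketch: the name `fuzzy_add` and the tuple representation of a trapezoid are assumptions made here, not notation from the article.

```python
from typing import Tuple

# A trapezoidal fuzzy set is represented (by assumption) as (a1, a2, a3, a4).
Trapezoid = Tuple[float, float, float, float]


def fuzzy_add(a: Trapezoid, b: Trapezoid) -> Trapezoid:
    """Add two trapezoidal fuzzy sets component by component."""
    return tuple(x + y for x, y in zip(a, b))


# An immediate firing date (0,0,0,0) plus an activity lasting "about 1".
firing_date = (0.0, 0.0, 0.0, 0.0)
duration = (0.9, 1.0, 1.0, 1.1)
print(fuzzy_add(firing_date, duration))
```

Adding the crisp instant (0,0,0,0) leaves the duration unchanged, which is why a token created by an immediate firing simply inherits the fuzzy set of its place.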



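Definition. The intersection and union of fuzzy sets are
defined in terms of the min and max operators:

$(\tilde{a} \cap \tilde{b}) = \min(\tilde{a}, \tilde{b}) = \min(\tilde{a}(\tau), \tilde{b}(\tau)) \mid \tau \in$ support of $\tilde{a} \cap \tilde{b}$

and

$(\tilde{a} \cup \tilde{b}) = \max(\tilde{a}, \tilde{b}) = \max(\tilde{a}(\tau), \tilde{b}(\tau)) \mid \tau \in$ support of $\tilde{a} \cup \tilde{b}$

We use these operators, intersection and union, as a
t-norm and an s-norm, respectively.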

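Definition. The distributions of possibility before
and after $\tilde{a}$ are the fuzzy sets $\tilde{a}^b = (-\infty, a_2, a_3, a_4)$ and
$\tilde{a}^a = (a_1, a_2, a_3, +\infty)$, respectively; they are defined in
(Andreu, 1997) as the functions $\tilde{a}_{(-\infty,\tilde{a}]}(\tau) = \sup_{\tau' \ge \tau} \tilde{a}(\tau')$ and
$\tilde{a}_{[\tilde{a},+\infty)}(\tau) = \sup_{\tau' \le \tau} \tilde{a}(\tau')$, respectively.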



Petri Nets Theory 

Definition. An ordinary PN structure $G$ is a bipartite
digraph represented by the 4-tuple $G = (P, T, I, O)$
where $P = \{p_1, p_2, \dots, p_n\}$ and $T = \{t_1, t_2, \dots, t_m\}$ are
finite sets of vertices called respectively places and
transitions, and $I\,(O) : P \times T \to \{0,1\}$ is a function
representing the arcs going from places to transitions
(transitions to places).

Pictorially, places are represented by circles, transitions
are represented by rectangles, and arcs are depicted
as arrows. The symbol ${}^\bullet t_j$ ($t_j^\bullet$) denotes the set
of all places $p_i$ such that $I(p_i, t_j) \neq 0$ ($O(p_i, t_j) \neq 0$).
Analogously, ${}^\bullet p_i$ ($p_i^\bullet$) denotes the set of all transitions
$t_j$ such that $O(p_i, t_j) \neq 0$ ($I(p_i, t_j) \neq 0$).

The pre-incidence matrix of $G$ is $C^- = [c_{ij}^-]$ where
$c_{ij}^- = I(p_i, t_j)$; the post-incidence matrix of $G$ is
$C^+ = [c_{ij}^+]$ where $c_{ij}^+ = O(p_i, t_j)$; the incidence matrix
of $G$ is $C = C^+ - C^-$.

A marking function $M : P \to \mathbb{Z}^+$ represents the
number of tokens or marks (depicted as dots) residing
inside each place. The marking of a PN is usually
expressed as an n-entry vector.

Definition. A Petri Net system or Petri Net (PN) is the
pair $N = (G, M_0)$, where $G$ is a PN structure and $M_0$ is
an initial token distribution.

In a PN system, a transition $t_j$ is enabled at the
marking $M_k$ if $\forall p_i \in P$, $M_k(p_i) \ge I(p_i, t_j)$; an enabled
transition $t_j$ can be fired reaching a new marking $M_{k+1}$
which can be computed using the PN state equation:

$M_{k+1} = M_k + C^+ v_k - C^- v_k$   (1)

where $v_k(i) = 0,\ i \neq j$, and $v_k(j) = 1$.
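As an illustration of the enabling rule and of state equation (1), the following Python sketch fires transitions of a small net. The three-place, two-transition net and the function names are hypothetical and chosen only for this example; matrices are indexed as `[place][transition]`.

```python
def is_enabled(marking, c_minus, j):
    """t_j is enabled if every place holds at least I(p_i, t_j) tokens."""
    return all(marking[i] >= c_minus[i][j] for i in range(len(marking)))


def fire(marking, c_plus, c_minus, j):
    """Apply M_{k+1} = M_k + C+ v_k - C- v_k, with v_k selecting t_j."""
    if not is_enabled(marking, c_minus, j):
        raise ValueError(f"transition {j} is not enabled")
    return [marking[i] + c_plus[i][j] - c_minus[i][j]
            for i in range(len(marking))]


# Hypothetical net: t0 moves the token of p0 into p1 and p2 (fork),
# t1 consumes p1 and p2 and returns a token to p0 (join).
C_MINUS = [[1, 0], [0, 1], [0, 1]]
C_PLUS = [[0, 1], [1, 0], [1, 0]]

m0 = [1, 0, 0]
m1 = fire(m0, C_PLUS, C_MINUS, 0)
m2 = fire(m1, C_PLUS, C_MINUS, 1)
print(m1, m2)
```

Because the firing vector has a single non-zero entry, the matrix products reduce to reading column j of each incidence matrix, which is what the list comprehension does.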

The reachability set of a PN is the set of all markings
reachable from $M_0$ by firing only enabled
transitions; this set is denoted by $R(G, M_0)$.

A structural conflict is a PN sub-structure in which 
two or more transitions share one or more input places; 
such transitions are simultaneously enabled and the fir- 
ing of one of them may disable the others, Fig. 3(b). 

Definition. A transition $t_k \in T$ is live, for a marking
$M_0$, if $\forall M_k \in R(G, M_0)$, $\exists M_n \in R(G, M_k)$ such that
$t_k$ is enabled at $M_n$.

A PN is live if all its transitions are live.

Definition. A PN is said to be 1-bounded, or safe, for a
marking $M_0$, if $\forall p_i \in P$ and $\forall M_j \in R(G, M_0)$, it holds
that $M_j(p_i) \le 1$.

In this work we deal with live and safe PN. 



Definition. A p-invariant $Y_i$ (t-invariant $X_i$) of a PN is a
positive integer solution of the equation $Y_i^T C = 0$ ($C X_i = 0$).
The support of the p-invariant $Y_i$ (t-invariant $X_i$) is the set
$\|Y_i\| = \{p_j \mid Y_i(p_j) > 0\}$ ($\|X_i\| = \{t_j \mid X_i(t_j) > 0\}$).

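Definition. Let $Y_i$ be a p-invariant of a Petri net $(G, M_0)$ and
$\|Y_i\|$ the support of $Y_i$; then the subnet induced by $Y_i$ is

$PC_i = (P_i = \|Y_i\|,\ T_i = \{t_k \in {}^\bullet p_j,\ t_l \in p_j^\bullet \mid p_j \in \|Y_i\|\},\ I_i,\ O_i)$

named p-component, where $I_i = P_i \times T_i \cap I$ and $O_i = P_i \times T_i \cap O$.

Definition. Let $X_i$ be a t-invariant of a PN, and $\|X_i\|$ be
the support of $X_i$; then the subnet induced by $X_i$ is

$TC_i = (P_i = \{p_k \in {}^\bullet t_j,\ p_l \in t_j^\bullet \mid t_j \in \|X_i\|\},\ T_i = \|X_i\|,\ I_i,\ O_i)$

named t-component, where $I_i = P_i \times T_i \cap I$ and $O_i = P_i \times T_i \cap O$.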



Definition. An invariant $Z_i$ is minimal if no invariant
$Z_j$ satisfies $\|Z_j\| \subset \|Z_i\|$, where $Z_i, Z_j$ are p-invariants or
t-invariants and $\forall z \in Z_i : z \ge 0$.

Definition. Let $Z = \{Z_1, \dots, Z_q\}$ be the set of minimal
invariants (Silva, 1982) of a PN; then $Z$ is called the
invariants base. The cardinality of $Z$ is represented
as $|Z|$.



FUZZY TIMED PETRI NETS 
Basic Operators 

We introduce first some useful operators. 

Definition. In order to get the fuzzy set between $\tilde{f}$ and
$\tilde{g}$, the lmax function is defined as:

$lmax(\tilde{f}, \tilde{g}) = \min(\tilde{f}^a, \tilde{g}^b)$   (2)



Definition. The latest (earliest) operation selects the
latest (earliest) fuzzy set among n fuzzy sets; they are
calculated as follows:

$latest(\tilde{f}_1, \dots, \tilde{f}_n) = \min(\max(\tilde{f}_1^b, \dots, \tilde{f}_n^b),\ \min(\tilde{f}_1^a, \dots, \tilde{f}_n^a))$   (3)

$earliest(\tilde{f}_1, \dots, \tilde{f}_n) = \min(\min(\tilde{f}_1^b, \dots, \tilde{f}_n^b),\ \max(\tilde{f}_1^a, \dots, \tilde{f}_n^a))$   (4)
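The operators (2)-(4) can be tried out numerically with the Python sketch below. It is a hedged illustration, not the authors' implementation: the helper names are invented here, and the "before" and "after" distributions are assumed to be the usual possibility-theory ones (membership 1 up to $a_3$ then falling to 0 at $a_4$, and rising from $a_1$ to 1 at $a_2$ then staying at 1, respectively).

```python
def before(a):
    """Assumed 'before a' distribution: 1 up to a3, falling to 0 at a4."""
    _, _, a3, a4 = a

    def mu(t):
        if t <= a3:
            return 1.0
        if t >= a4:
            return 0.0
        return (a4 - t) / (a4 - a3)
    return mu


def after(a):
    """Assumed 'after a' distribution: 0 up to a1, rising to 1 at a2."""
    a1, a2, _, _ = a

    def mu(t):
        if t >= a2:
            return 1.0
        if t <= a1:
            return 0.0
        return (t - a1) / (a2 - a1)
    return mu


def lmax(f, g):
    """Equation (2): possibility of being after f and before g."""
    return lambda t: min(after(f)(t), before(g)(t))


def latest(*sets):
    """Equation (3)."""
    return lambda t: min(max(before(s)(t) for s in sets),
                         min(after(s)(t) for s in sets))


def earliest(*sets):
    """Equation (4)."""
    return lambda t: min(min(before(s)(t) for s in sets),
                         max(after(s)(t) for s in sets))


# Token created at (0,0,0,0) and leaving at about 1: membership is 1 on
# [0, 1] and falls to 0 at 1.1, i.e. the trapezoid (0, 0, 1, 1.1).
stay = lmax((0.0, 0.0, 0.0, 0.0), (0.9, 1.0, 1.0, 1.1))
print(stay(0.5), stay(1.2))
```

Under these assumptions `lmax` of (0,0,0,0) and (0.9,1,1,1.1) behaves like the trapezoid (0,0,1,1.1), which agrees with the timestamp reported for $p_2$ in the modeling example.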



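Definition. The fuzzy conjugation operator is defined
as $arg1 \overset{oper}{\bullet} arg2$, where arg1, arg2 are arguments that
can be matrices of fuzzy sets; $\bullet$ is the fuzzy and operation
and oper is any operation referred to as +, -, latest, min,
etc. For some row $i = 1, \dots, m$ and some column $j = 1, \dots, n$,
the products $and(f_{ik}, g_{kj}) \mid k = 1, \dots, r$ are computed
as $\underset{k=1,\dots,r}{oper}\,(and(f_{ik}, g_{kj}))$. For example:

$\begin{bmatrix} f_{11} & \cdots & f_{1r} \\ \vdots & & \vdots \\ f_{m1} & \cdots & f_{mr} \end{bmatrix} \overset{+}{\bullet} \begin{bmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & & \vdots \\ g_{r1} & \cdots & g_{rn} \end{bmatrix} = \begin{bmatrix} \sum_{k} f_{1k} g_{k1} & \cdots & \sum_{k} f_{1k} g_{kn} \\ \vdots & & \vdots \\ \sum_{k} f_{mk} g_{k1} & \cdots & \sum_{k} f_{mk} g_{kn} \end{bmatrix}$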
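The conjugation operator is a matrix product in which both the element-wise combination and the row-column reduction are parameters. A minimal Python sketch follows; `conjugate`, `oper` and `and_op` are names assumed for this illustration, and plain numbers stand in for fuzzy sets.

```python
from functools import reduce
from operator import add, mul


def conjugate(F, G, oper, and_op):
    """Generalized product: entry (i, j) is oper over k of and_op(F[i][k], G[k][j])."""
    inner, cols = len(G), len(G[0])
    return [[reduce(oper, (and_op(row[k], G[k][j]) for k in range(inner)))
             for j in range(cols)]
            for row in F]


# With oper = + and and_op = *, the operator is the ordinary matrix product.
print(conjugate([[1, 2], [3, 4]], [[5, 6], [7, 8]], add, mul))

# With oper = max and and_op = min, it becomes a max-min composition.
print(conjugate([[0.2, 0.8], [1.0, 0.0]], [[0.5], [0.9]], max, min))
```

Passing the reduction as an argument is what lets a single incidence matrix drive the latest, earliest, min or lmax computations in the matrix formulation.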



Formalism Description of the FTPN 

Definition. A fuzzy timed Petri net structure is a 3-tuple
$FTPN = (N, \Gamma, \xi)$; where $N = (G, M_0)$ is a PN,
$\Gamma = \{\tilde{a}_1, \tilde{a}_2, \dots, \tilde{a}_n\}$ is a collection of fuzzy sets, and $\xi : P \to \Gamma$
is a function that associates a fuzzy set $\tilde{a}_i \in \Gamma$ to each
place $p_i \in P$.

• Fuzzy timing of places

The fuzzy set $\tilde{a} = (a_1, a_2, a_3, a_4)$, Fig. 2(b), represents
the static possibility distribution $\tilde{a}(\tau) \in [0,1]$ of the
instant at which a token leaves a place $p \in P$, starting
from the instant when $p$ is marked. This set does not
change during the FTPN execution.

• Fuzzy timing of tokens

The fuzzy set $\tilde{b} = (b_1, b_2, b_3, b_4)$, Fig. 2(c), represents
the dynamic possibility distribution $\tilde{b}(\tau) \in [0,1]$
associated to a token residing within a $p \in P$; it also
represents the instant at which such a token leaves
the place, starting from the instant when $p$ is marked.
$\tilde{b}$ is computed from $\tilde{a}$ every time the place is marked
during the marking evolution of the FTPN. A token
begins to be available for enabling output transitions
at $b_1$. Thus $\tilde{b}^a = (b_1, b_2, b_3, +\infty)$ represents the
possibility distribution of available tokens. The fuzzy set
$\tilde{c} = (c_1, c_2, c_3, c_4)$, known as fuzzy timestamp, Fig. 2(d),
is a dynamic possibility distribution $\tilde{c}(\tau) \in [0,1]$
that represents the duration of a token within a place
$p \in P$.

Figure 2. (a) Fuzzy timed Petri net, (b) The fuzzy set associated to places, (c) Fuzzy set associated to a place or mark, (d) Fuzzy timestamp

Enabling and Firing of Transitions 

Fuzzy enabling date

The fuzzy enabling date $\tilde{e}_{t_k}(\tau)$ of the transition $t_k$
at the instant $\tau$ is a possibility distribution of the latest
leaving instant among the leaving instants $\tilde{b}_{p_i}$ of all
tokens within the $p_i \in {}^\bullet t_k$, Fig. 3(a).

$\tilde{e}_{t_k}(\tau) = latest(\tilde{b}_{p_i}) \quad \forall p_i \in {}^\bullet t_k$   (5)

The latest operation obtains the latest date at which
the input places $p_i$ of $t_k$ have a token.

Fuzzy firing date

The firing date $\tilde{o}_{t_k}(\tau)$ of a transition $t_k$ is
determined with respect to the set of transitions $\{t_j\}$
simultaneously enabled, Fig. 3(b). This date, expressed
as a possibility distribution, is computed as follows:

$\tilde{o}_{t_k}(\tau) = \min(\tilde{e}_{t_k}(\tau),\ earliest(\tilde{e}_{t_j}(\tau))) \quad \forall t_k \in p_n^\bullet;\ p_n \in {}^\bullet t_j$   (6)

The earliest operation obtains the earliest date
at which the transitions in a structural conflict are
enabled.



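Fuzzy timestamp

For a given place $p_s$, the possibility distribution $\tilde{b}_{p_s}$
may be computed from $\tilde{a}_{p_s}$ and the firing dates $\tilde{o}_{t_j}(\tau)$
of the $t_j \in {}^\bullet p_s$ using the following expression:

$\tilde{b}_{p_s} = lmax(\tilde{o}_{t_j}(\tau)) \oplus \tilde{a}_{p_s} \quad \forall t_j \in {}^\bullet p_s$   (7)

Tokens do not disappear from ${}^\bullet t$ and appear in $t^\bullet$
instantaneously. The fuzzy timestamp $\tilde{c}_{p_s}$ is the possibility
distribution of the time elapsed while a token is in a place $p_s \in P$. The
possibility distribution $\tilde{c}_{p_s}$ is computed from the occurrence
dates of both ${}^\bullet p_s$ and $p_s^\bullet$, see Fig. 3(c).

$\tilde{c}_{p_s} = lmax(earliest(\tilde{o}_{t_i}(\tau)),\ latest(\tilde{o}_{t_j}(\tau))) \quad \forall t_i \in {}^\bullet p_s,\ t_j \in p_s^\bullet$   (8)

Actually, $\tilde{c}_{p_s}$ represents the fuzzy marking at the
instant $\tau$.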

Matrix Formulation 

Now, we reformulate the expressions (5), (6), (7) and (8),
allowing a more general and compact representation.

$\tilde{B} = [C^{+} \overset{lmax}{\bullet} \tilde{O}] \oplus \tilde{A}$   (9)

$\tilde{E} = [C^{-}]^{T} \overset{latest}{\bullet} \tilde{B}$   (10)

$\tilde{O} = [C^{-}]^{T} \overset{min}{\bullet} [C^{-} \overset{earliest}{\bullet} \tilde{E}]$   (11)

$\tilde{C} = lmax(C^{+} \overset{earliest}{\bullet} \tilde{O},\ C^{-} \overset{latest}{\bullet} \tilde{O})$   (12)

where $\tilde{B}$, $\tilde{E}$, $\tilde{O}$ and $\tilde{C}$ denote the vectors composed of
$\tilde{b}_{p_s}$, $\tilde{e}_{t_k}$, $\tilde{o}_{t_k}$ and $\tilde{c}_{p_s}$, respectively.

Figure 3. (a) Conjunction transition, (b) Structural conflict, (c) Attribution and selection place

Modeling Example 

Now we will illustrate the previous matrix formulation
through a simple example.

Example

Consider the system shown in Fig. 4(a); it consists
of two cars, car1 and car2, which move along independent
and dependent ways executing the set of
activities Op = {Right_car1, Right_car2, Charge_car1,
Charge_car2, Left_car1,2, Discharge_car1,2}. The
operation of the system is automated following the
sequence described in the FTPN of Fig. 4(b), in which
the activities are associated to places. The ending time
possibility $\tilde{a}_p$ for every activity is given in the model.
We consider that there are no sensors detecting
the tokens in the system; thus the behavior is
analyzed through the estimated state.

a. Initial conditions: Initially, $M_0 = \{p_1\}$; therefore,
the enabling date $\tilde{e}_{t_1}(\tau)$ of transition $t_1$
is immediate, i.e., (0,0,0,0). Since $|{}^\bullet t_1| = 1$, then
$\tilde{o}_{t_1}(\tau) = \tilde{e}_{t_1}(\tau)$.

b. Matrix equations: To obtain the fuzzy sets
we solve (9)-(12) as follows:



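$\tilde{B} = [\tilde{b}_{p_1} \cdots \tilde{b}_{p_6}]^T$, with entries of the form $\tilde{o}_{t} \oplus \tilde{a}_{p}$   (13)

$\tilde{E} = [\tilde{e}_{t_1} \cdots \tilde{e}_{t_5}]^T$, with entries of the form $latest(\tilde{b}_{p}, \dots)$   (14)

$\tilde{O} = [\tilde{o}_{t_1} \cdots \tilde{o}_{t_5}]^T = [\tilde{e}_{t_1} \cdots \tilde{e}_{t_5}]^T$   (15)

$\tilde{C} = [\tilde{c}_{p_1} \cdots \tilde{c}_{p_6}]^T$, with entries of the form $lmax(\tilde{o}_{t}, \tilde{o}_{t})$   (16)

Figure 4. (a) Two cars system, (b) Fuzzy timed Petri net model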



c. Firing $t_1$: When $t_1$ is fired, the token is removed
from $p_1$; $p_2$ and $p_3$ get one token each.

$\tilde{B} = [0 \;\; (0.9,1,1,1.1) \;\; (0.8,1,1,1.2) \;\; 0]^T$

The possibility sets $\tilde{b}_{p_2}$, $\tilde{b}_{p_3}$ coincide with $\tilde{a}_{p_2}$
and $\tilde{a}_{p_3}$, respectively.

d. Firing $t_2$: The fuzzy enabling time and the fuzzy
occurrence time are computed by (14) and (15),
respectively.

$\tilde{O}, \tilde{E} = [0 \;\; (0.9,1,1,1.1) \;\; 0]^T$

The set $\tilde{c}_{p_2}$ is the possibility distribution of the
time at which $p_2$ is marked. So, we compute
(16).

$\tilde{C} = [0 \;\; (0,0,1,1.1) \;\; 0 \;\; 0 \;\; 0]^T$

The set $\tilde{b}_{p_4}$ is the possibility distribution of the
instant at which place $p_4$ loses the token and it
can be calculated by (13).

$\tilde{B} = [0 \;\; (2.6,3,3,3.4) \;\; 0]^T$

e. Firing $t_3$: Again, using (14), (15) and (16) we
obtain:

$\tilde{O}, \tilde{E} = [0 \;\; (0.8,1,1,1.2) \;\; 0]^T$
$\tilde{C} = [0 \;\; (0,0,1,1.2) \;\; 0]^T$
$\tilde{B} = [0 \;\; (2.6,3,3,3.4) \;\; 0]^T$

Figure 5(a) presents the marking evolution of one
cycle and some more steps. $\tilde{C}$ is represented by the
dashed line and $\tilde{B}$ is represented by the shadowed area.
Notice that $\tilde{O}$ coincides sometimes with $\tilde{B}$.



FUZZY STATE EQUATION 

We analyzed equation (1) in order to obtain the fuzzy
marking equation. $C^+ v_k$ provides information about the
places that get tokens. Also, we must consider that in
FTPN the transition firing possibility evolves continuously.
The variation of $\tilde{o}(\tau_b)$ during $\tau \in \tau_b$ modifies
the possibility of tokens residing in the output places
of the firing transitions; thus the term corresponding
to $v_k$ in FTPN is rather a variation denoted by $\Delta_{\tilde{o}(\tau_b)}$,
and the marking variation is $C^+ \Delta_{\tilde{o}(\tau_b)}$. By a similar
reasoning, the term $C^- v_k$ corresponds to $C^- \Delta_{\tilde{o}(\tau_a)}$ in
FTPN. The operation $C^+ \Delta_{\tilde{o}(\tau_b)} - C^- \Delta_{\tilde{o}(\tau_a)}$ represents the
possible marking change. Considering the marking
after a time elapse $\Delta\tau$ we obtain:

$M(\tau) = M(\tau - \Delta\tau) + C^+ \Delta_{\tilde{o}(\tau_b)} - C^- \Delta_{\tilde{o}(\tau_a)}$   (17)

Here

$\Delta_{\tilde{o}(\tau_b)} = \tilde{o}(\tau) - \tilde{o}(\tau - \Delta\tau) \mid (\tau - \Delta\tau, \tau) \in \tau_b$

and

$\Delta_{\tilde{o}(\tau_a)} = \tilde{o}(\tau - \Delta\tau) - \tilde{o}(\tau) \mid (\tau - \Delta\tau, \tau) \in \tau_a$

The marking possibility obtained in (17) can be greater
than 1; then, since FTPN are safe, we use the min function
to obtain $M(\tau) \le 1$.

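The new marking is denoted by $\widehat{M}(\tau)$, i.e.,

$\widehat{M}(\tau) = \min(M(\tau - \Delta\tau) + C^+ \Delta_{\tilde{o}(\tau_b)} - C^- \Delta_{\tilde{o}(\tau_a)},\ \mathbf{1})$   (18)

where $\mathbf{1}$ is an n-entry vector containing 1 in each entry.

Initially $M(0) = M_0$. If $\tau \neq 0$ then (18) is solved
in three steps:

$\Delta M(\tau) = C^+ \Delta_{\tilde{o}(\tau_b)} - C^- \Delta_{\tilde{o}(\tau_a)}$
$M(\tau) = M(\tau - \Delta\tau) + \Delta M(\tau)$
$\widehat{M}(\tau) = \min(M(\tau), \mathbf{1})$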



Remark.

If $\Delta_{\tilde{o}(\tau_b)}, \Delta_{\tilde{o}(\tau_a)} \in \{0,1\}$, the behaviour is that of an ordinary
timed Petri net.
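One update of the bounded fuzzy state equation (18) can be sketched as follows. The two-place net, the variation values and the function name are hypothetical, picked only to show the mechanics; `delta_b` and `delta_a` stand for the variations of the firing possibilities on their rising and falling parts.

```python
def fuzzy_marking_step(m_prev, c_plus, c_minus, delta_b, delta_a):
    """One step of (18): min(M + C+ * delta_b - C- * delta_a, 1), entry by entry."""
    m_new = []
    for i, m_i in enumerate(m_prev):
        gain = sum(c_plus[i][j] * d for j, d in enumerate(delta_b))
        loss = sum(c_minus[i][j] * d for j, d in enumerate(delta_a))
        m_new.append(min(m_i + gain - loss, 1.0))  # safe net: clip at 1
    return m_new


# Hypothetical net with a single transition t0 moving a token from p0 to p1.
C_PLUS = [[0], [1]]
C_MINUS = [[1], [0]]

m = [1.0, 0.0]
m = fuzzy_marking_step(m, C_PLUS, C_MINUS, delta_b=[0.5], delta_a=[0.0])
print(m)  # p1 starts to be possibly marked while p0 is still marked
m = fuzzy_marking_step(m, C_PLUS, C_MINUS, delta_b=[0.75], delta_a=[0.0])
print(m)  # the sum for p1 exceeds 1 and is clipped
m = fuzzy_marking_step(m, C_PLUS, C_MINUS, delta_b=[0.0], delta_a=[1.0])
print(m)  # the possibility of p0 being marked vanishes
```

When every variation is either 0 or 1 the update reduces to the crisp state equation (1), in line with the remark above.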



Example.

For the system shown in Fig. 4, we obtained the marking
at some instants. The initial marking is
$M(0) = [1\ 0\ 0\ 0\ 0\ 0]^T$. The transition $t_1$ fires at $\tau = 0^+$, therefore
$M(0^+) = [0\ 1\ 1\ 0\ 0\ 0]^T$. For $\tau \in (0^+, 0.8)$ the marking
does not change. For $\tau = 1$ we obtain:

$\Delta M(1) = $ [matrix expression of $C^+ \Delta_{\tilde{o}(\tau_b)} - C^- \Delta_{\tilde{o}(\tau_a)}$ at $\tau = 1$]

$M(1) = M(0.8) + \Delta M(1) = [0\ 1\ 1\ 1\ 0]^T$

$\widehat{M}(1) = \min(M(1), \mathbf{1}) = [0\ 1\ 1\ 1\ 0]^T$



The marking evolution at some relevant instants is 
shown below: 

t 0.8 1 2 2.8 

M p (t) 1 

M Pz (t) 1 10 

M ft (t) 1 10 

M p4 (t) 111 

M ps (t) 11 1 

M pg (t) 0.66 

Notice that during $\tau \in (0, 5)$, $M(\tau)$ coincides with the
fuzzy timestamp; it is shown in Fig. 5(a).

STATE APPROXIMATION OF THE FTPN

Marking Estimation

Definition. The marking estimation S at the instant $\tau$
is described by the function $\psi_{Y_i}(\tau) \in [0,1]$, which
recognizes the possible marked place $p_u \in \|Y_i\|$, $i \in \{1, \dots, |Y|\}$,
among other possible places $p_v \in \|Y_i\|$, $v \neq u$. The function
$\psi_{Y_i}(\tau)$ evaluates the minimal difference that exists
between the largest possibility $M_{p_u}(\tau)$ that the token is
in a place $p_u$ and the possibility $M_{p_v}(\tau)$ that the token
is in any other place. The function $\psi_{Y_i}(\tau)$ is then
calculated as

$\psi_{Y_i}(\tau) = \min(M_{p_u}(\tau) - M_{p_v}(\tau))$ such that $\forall \{p_u, p_v\} \in \|Y_i\|;\ v \neq u;\ Y_i \in Y$   (19)

Example.

The FTPN in Fig. 4 has two p-invariants with supports
$\|Y_1\| = \{p_1, p_2, p_4, p_5\}$ and $\|Y_2\| = \{p_1, p_3, p_5\}$. Figure 6 shows
the fuzzy sets $\tilde{C}$ obtained from the evolution of the marking
in the p-component induced by $Y_1$, Fig. 5(b).

Definition. The state estimation S at the instant $\tau$ is
described by the function $s(\tau) \in [0,1]$, which determines
the possible state of the system among other possible
states; it is calculated by:

$s(\tau) = \min(\psi_{Y_i}(\tau)) \mid i = 1, \dots, |Y|;\ Y_i \in Y$   (20)

Discrete State From the FTPN

In order to obtain a possible discrete marking $\bar{M}(\tau)$ of
the FTPN it is necessary to perform a "defuzzification"
of $M(\tau)$. This can be accomplished taking into account
the possible discrete marking $\bar{M}_i(\tau)$ of every p-component
induced by $Y_i$. Before describing the procedure
to obtain $\bar{M}(\tau)$, we define $\bar{M}(\tau)$ as:

$\bar{M}(\tau) = [m_{p_1}(\tau) \cdots m_{p_n}(\tau)]^T \mid n = |P|$   (21)

where $m_{p_k}(\tau) \mid k = 1, \dots, n$ is the estimated marking
of the place $p_k \in P$. Now, the discrete marking
can be obtained with the following procedure.



Algorithm: Defuzzification 

See Algorithm A. 

Example. 

Following the previous example, the marking $M(\tau)$
during $\tau \in (0, 0.8]$ does not change, that is

$M(\tau) = M(0^+) = [0\ 1\ 1\ 0\ 0\ 0]$.

For $\tau = 0.95$ the new fuzzy marking is

$M(0.95) = [0\ 1\ 1\ 0.5\ 0.25\ 0]^T$,
therefore

$\bar{M}_1(0.95) = [0\ 1\ 0\ 0\ 0\ 0]^T$, $\bar{M}_2(0.95) = [0\ 0\ 1\ 0\ 0\ 0]^T$
$\bar{M}(0.95) = [0\ 1\ 0\ 0\ 0\ 0]^T + [0\ 0\ 1\ 0\ 0\ 0]^T = [0\ 1\ 1\ 0\ 0\ 0]^T$

Figure 5(c) shows the marking obtained at different
instants.

Figure 5. (a) Fuzzy marking evolution, (b) Marking estimation, (c) Discrete state
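The defuzzification procedure of Algorithm A (for every p-invariant, mark the place of its support with the highest possibility) can be sketched in Python as below. The function name is an assumption, and the two supports are the ones read from the example, written with zero-based place indices; treat them as illustrative.

```python
def defuzzify(fuzzy_marking, supports):
    """For each p-invariant support, set to 1 the place with the largest possibility."""
    discrete = [0] * len(fuzzy_marking)           # Step 1: start from the zero vector
    for support in supports:                      # Step 2: one pass per p-invariant
        q = max(support, key=lambda k: fuzzy_marking[k])  # Step 2.1: most possible place
        discrete[q] = 1                           # Step 2.2: mark it
    return discrete


# Fuzzy marking at tau = 0.95 from the example (places p1..p6 -> indices 0..5).
m_095 = [0, 1, 1, 0.5, 0.25, 0]
supports = [[0, 1, 3, 4], [0, 2, 4]]  # assumed supports of Y1 and Y2
print(defuzzify(m_095, supports))
```

With these inputs the sketch marks $p_2$ and $p_3$, matching the discrete marking given in the example for $\tau = 0.95$.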

FUTURE TRENDS 

Previous results on the estimation of Fuzzy Timed State
Machines and those included in this article are going to
be integrated for addressing a larger class of PN.

Another issue currently addressed is the study of 
FTPN including measurable places for dealing with 
sensors or detectable activities within the system; this 
will allow establishing a bound on the uncertainty of 
the estimated state. The optimal placement of sensors 
is an interesting matter of research. 




The aim of this research has been the use of the
methodology for estimating the state of a discrete
event system for monitoring its behavior and diagnosing
faults. A FTPN is going to be used as a reference
model and its outputs (measurable marking) have to
be compared with the outputs of the monitored system;
the analysis of residuals should provide an early detection
of system malfunctioning and a plausible location
of the faulty behavior.

Algorithm A.

Input: $M(\tau)$, $Y$
Output: $\bar{M}(\tau)$

Step 1: $\bar{M}(\tau) \leftarrow 0$
Step 2: $\forall Y_i \mid i = 1, \dots, |Y|$
Step 2.1: $\forall p_k \in \|Y_i\|$ : $m_q = \max(M(p_k))$
Step 2.2: $\bar{M}(p_q) = 1$



CONCLUSION 

This article addressed the state estimation problem
of DES in which the duration of activities is ill known;
fuzzy sets represent the uncertainty of the ending of
activities. Several novel notions have been introduced
in the FTPN definition, and a new matrix formulation
for computing the fuzzy marking of Marked Graphs
has been proposed. The extreme situation in which no
activity of a system can be detected by sensors has
been dealt with to illustrate the degradation of the marking
estimation when a cyclic execution is performed.
Current research addresses the topics mentioned in the
above section.

KEY TERMS

Discrete Events Systems: It is the class of systems 
whose behavior is characterized by successions of 
states delimited by asynchronous events. Most of these 
systems have been man made. 

Fuzzy Logic: It is a knowledge representation
technique and computing framework whose approach
is based on degrees of truth rather than the usual "true" 
or "false" of classical logic. 

Fuzzy Petri Nets: It is a family of formalisms 
extending Petri nets by the inclusion of fuzzy sets 
representing usually uncertainty of time elapses. 

Imprecise Marking: The imprecise localization 
of tokens within places of a FTPN; it is computed as 
a possibility distribution. 

Petri Nets: It is a family of formalisms for model- 
ing and analysis of concurrent DES allowing intuitive 
graphical descriptions and providing a simple but sound 
mathematical support. A timed Petri net includes information
about the duration of the modeled activities.



Marked Graph: It is a Petri Net subclass in which 
every place has only one input transition and one output 
transition. 

State Estimation: It is the inference process that 
determines the current state of a system from the 
knowledge of sequences of inputs and outputs. 

State Machine: It is a Petri Net subclass in which 
every transition has only one input place and one 
output place. 

System Monitoring: It is a surveillance process on
measurable events and/or outputs of a system; a reference
model that specifies a reasonably good behavior is often
used. Deviations from the reference are analyzed
to determine whether a fault exists. This process is
included as a part of a fault diagnosis process.

INTRODUCTION

Fuzzy control systems are developed based on fuzzy 
set theory, attributed to Lotfi A. Zadeh (Zadeh, 1965, 
1973), which extends the classical set theory with 
memberships of its elements described by the clas- 
sical characteristic function (either "is" or "is not" a 
member of the set), to allow for partial membership 
described by a membership function (both "is" and 
"is not" a member of the set at the same time, with a 
certain degree of belonging to the set). Thus, fuzzy set 
theory has great capabilities and flexibilities in solving 
many real-world problems which classical set theory 
does not intend or fails to handle. 

Fuzzy set theory was applied to control systems 
theory and engineering almost immediately after its 
birth. Advances in modern computer technology continuously
back up the fuzzy framework for coping
with engineering systems of a broad spectrum, includ- 
ing many control systems that are too complex or too 
imprecise to tackle by conventional control theories 
and techniques. 



BACKGROUND: FUZZY CONTROL SYSTEMS



The main signature of fuzzy logic technology is its 
ability of suggesting an approximate solution to an 
imprecisely formulated problem. From this point of 
view, fuzzy logic is closer to human reasoning than the 
classical logic, where the latter attempts to precisely 
formulate and exactly solve a mathematical or technical 
problem if ever possible. 



Motivations for Fuzzy Control Systems 
Theory 

Conventional control systems theory, developed based 
on classical mathematics and the two-valued logic, is 
relatively mature and complete. This theory has its solid 
foundation built on classical mathematics, electrical 
engineering, and computer technology. It can provide 
rigorous analysis and often perfect solutions when a 
system is precisely defined mathematically. Within 
this framework, some relatively advanced control 
techniques such as adaptive, robust and nonlinear 
control theories have gained rapid development in the 
last three decades. 

However, conventional control theory is quite lim- 
ited in modeling and controlling complex dynamical 
systems, particularly ill-formulated and partially-de- 
scribed physical systems. Fuzzy logic control theory, 
on the contrary, has shown potential in these kinds of 
non-traditional applications. Fuzzy logic technology 
allows the designers to build controllers even when 
their understanding of the system is still in a vague, 
incomplete, and developing phase, and such situations 
are quite common in industrial control practice. 

General Structure of Fuzzy Control 
Systems 

Just like other mathematical tools, fuzzy logic, fuzzy
set theory, fuzzy modeling, fuzzy control methods, etc.,
have been developed for solving practical problems.
In control systems theory, if the fuzzy interpretation
of a real-world problem is correct and if fuzzy theory
is developed appropriately, then fuzzy controllers can
be suitably designed and they work quite well to their
advantage. The entire process is then returned to the
original real-world setting, to accomplish the desired
system automation. This is the so-called "fuzzification,
fuzzy operation, defuzzification" routine in
fuzzy control design. The key step, fuzzy operation,
is executed by a logical rule base consisting of
some IF-THEN rules established by using fuzzy logic
and human knowledge (Chen & Pham, 1999, 2006;
Driankov, Hellendoorn & Reinfrank, 1993; Passino &
Yurkovich, 1998; Tanaka, 1996; Tanaka & Wang, 1999;
Wang, 1994; Ying, 2000).

Fuzzification 

Fuzzy set theory allows partial membership of an element
with respect to a set: an element can partially
belong to a set and meanwhile partially not belong to
the same set. For example, an element, $x$, belonging to
the set, $X$, is specified by a (normalized) membership
function, $\mu_X : X \to [0,1]$. There are two extreme cases:
$\mu_X(x) = 0$ means $x \notin X$ and $\mu_X(x) = 1$ means $x \in X$ in
the classical sense. But $\mu_X(x) = 0.2$ means $x$ belongs
to $X$ only with grade 0.2, or equivalently, $x$ does not
belong to $X$ with grade 0.8. Moreover, an element can
have more than one membership value at the same time,
such as $\mu_X(x) = 0.2$ and $\mu_Y(x) = 0.6$, and they need not
sum to one. The entire setting depends on
how large the set $X$ is (or the sets $X$ and $Y$ are) for the
associated members, and what kind of shape a membership
function should have in order to make sense of the
real problem at hand. A set, $X$, along with a membership
function defined on it, $\mu_X(\cdot)$, is called a fuzzy set and is
denoted $(X, \mu_X)$. More examples of fuzzy sets can be
seen below, as the discussion continues. This process
of transforming a crisp value of an element (say $x = 0.3$)
to a fuzzy set (say $x = 0.3 \in X = [0,1]$ with $\mu_X(x) = 0.2$)
is called fuzzification.

Given a set of real numbers, $X = [-1,1]$, a point $x \in X$
assumes a real value, say $x = 0.3$. This is a crisp
number without fuzziness. However, if a membership
function $\mu_X(\cdot)$ is introduced to associate with the set $X$,
then $(X, \mu_X)$ becomes a fuzzy set, and the (same) point
$x = 0.3$ has a membership grade quantified by $\mu_X(\cdot)$ (for
instance, $\mu_X(x) = 0.9$). As a result, $x$ has not one but two
values associated with the point: $x = 0.3$ and $\mu_X(x) = 0.9$.
In this sense, $x$ is said to have been fuzzified. For
convenience, instead of saying that "$x$ is in the set $X$
with a membership value $\mu_X(x)$," in common practice
it is usually said "$x$ is $X$," while one should keep in mind
that there is always a well-defined membership function
associated with the set $X$. If a member, $x$, belongs to
two fuzzy sets, one says "$x$ is $X_1$ AND $x$ is $X_2$," and so
on. Here, the relation AND needs a logical operation
to perform. As a result, this statement eventually yields
only one membership value for the element $x$, denoted
by $\mu_{X_1 \times X_2}(x)$. There are several logical operations to
implement the logical AND; they are quite different but
all valid within their individual logical system. A commonly
used one is $\mu_{X_1 \times X_2}(x) = \min\{\mu_{X_1}(x), \mu_{X_2}(x)\}$.
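Fuzzification and the min-based AND can be illustrated with a short Python sketch. The triangular shape, its breakpoints and the function names are assumptions chosen for the illustration; the text does not prescribe a particular membership function.

```python
def triangular(left, peak, right):
    """Return a triangular membership function rising on [left, peak] and falling on [peak, right]."""
    def mu(x):
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return mu


def fuzzy_and(mu_1, mu_2):
    """Logical AND of two fuzzy sets implemented with the min operation."""
    return lambda x: min(mu_1(x), mu_2(x))


# Two hypothetical fuzzy sets X1 and X2 defined on [-1, 1].
mu_x1 = triangular(-1.0, 0.0, 1.0)
mu_x2 = triangular(0.0, 0.5, 1.0)

x = 0.25  # crisp value to be fuzzified
print(mu_x1(x), mu_x2(x), fuzzy_and(mu_x1, mu_x2)(x))
```

The crisp point 0.25 ends up with one grade per fuzzy set, and the AND statement collapses the two grades into a single membership value, the smaller of the two.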

Fuzzy Logic Rule Base 

The majority of fuzzy logic control systems are knowl- 
edge-based systems. This means that either their fuzzy 
models or their fuzzy logic controllers are described 
by fuzzy logic IF-THEN rules. These rules have to 
be established based on human expert's knowledge 
about the system, the controller, and the performance 
specifications, etc., and they must be implemented by 
performing rigorous logical operations. 

For example, a car driver knows that if the car moves
straight ahead then he does not need to do anything; if
the car turns to the right then he needs to steer the car
to the left; if the car turns to the right by too much then
he needs to take a stronger action to steer the car to the
left much more, and so on. Here, "much" and "more"
etc. are fuzzy terms that cannot be described by classical
mathematics but can be quantified by membership
functions (see Fig. 1, where part (a) is an example of
the description "to the left"). The collection of all such
"if ... then ..." principles constitutes a fuzzy logic rule
base for the problem under investigation. To this end,
it is helpful to briefly summarize the experience of the
driver in the following simplified rule base: Let $X =
[-180°, 180°]$, $x$ be the position of the car, $\mu_{left}(\cdot)$ be the
membership function for the moving car turning "to
the left," $\mu_{right}(\cdot)$ the membership function for the car
turning "to the right," and $\mu_0(\cdot)$ the membership function
for the car "moving straight ahead." Here, simplified
statements are used, for instance, "$x$ is $X_{left}$" means "$x$
belongs to $X$ with a membership value $\mu_{left}(x)$" etc. Also,
similar notation for the control action $u$ of the driver
is employed. Then, a simple typical rule base for this
car-driving task is




$R^{(1)}$: IF $x$ is $X_{left}$ THEN $u$ is $U_{right}$
$R^{(2)}$: IF $x$ is $X_{right}$ THEN $u$ is $U_{left}$
$R^{(3)}$: IF $x$ is $X_0$ THEN $u$ is $U_0$

where $X_0$ means moving straight ahead (not left nor
right), as described by the membership function shown
in Fig. 1(c), and "$u$ is $U_0$" means $u = 0$ (no control action)
with a certain grade (if this grade is 1, it means
absolutely no control action). Of course, this description
only illustrates the basic idea, which is by no means
a complete and effective design for a real car-driving
application.

In general, a rule base of $r$ rules has the form

$R^{(k)}$: IF $x_1$ is $X_{k1}$ AND ... AND $x_m$ is $X_{km}$ THEN $u$ is $U_k$   (1)

where $m \ge 1$ and $k = 1, \dots, r$.

Defuzzification

An element of a fuzzy set may have more than one
membership value. In Fig. 1, for instance, if $x = 5°$ then
it has two membership values: $\mu_{right}(x) = 5/180 \approx 0.028$
and $\mu_0(x) = 0.5$. This means that the car is moving to the
right by a little. According to the above-specified rule
base, the driver will take two control actions simultaneously,
which is unnecessary and physically impossible.

Thus, what should the control action be? To simplify
this discussion, suppose that the control action is simply
$u = -x$ with the same membership functions $\mu_X = \mu_U$ for
all cases. Then, a natural and realistic control action
for the driver to take is a compromise between the two
required actions. Among several possible compromise
(or, average) formulas for this purpose, the most commonly
adopted one that works well in most cases is the
following weighted average formula:

$u = \dfrac{\mu_{right}(u) \cdot u + \mu_0(u) \cdot u}{\mu_{right}(u) + \mu_0(u)} = \dfrac{0.028 \times (-5°) + 0.5 \times (-5°)}{0.028 + 0.5} = -5°$

Here, the result is interpreted as "the driver should
turn the car to the left by 5°." This averaging of the outputs
is called defuzzification; it yields a single crisp
value for the control, and other averaging formulas may
actually yield similar results in general.

The result of defuzzification usually is a physical
quantity acceptable by the original real system. Whether
or not this defuzzification result works well depends
on the correctness and effectiveness of the rule base,
while the latter depends on the designer's knowledge
and experience about the physical system or process for
control. Just like any of the classical design problems,
there is generally no unique solution for a problem; an
experienced designer usually comes up with a better
design.

Figure 1. Membership functions for directions of a moving car

Figure 2. A typical fuzzy logic controller

A general weighted average formula for defuzzification
is the following convex combination of the
individual outputs:

$output = \sum_{i=1}^{r} \alpha_i u_i := \sum_{i=1}^{r} \dfrac{w_i}{\sum_{i=1}^{r} w_i}\, u_i$   (2)

with notation referred to the rule base (1), where

$w_i := \mu_{U_i}(u_i), \quad \alpha_i := \dfrac{w_i}{\sum_{i=1}^{r} w_i} \ge 0, \quad i = 1, \dots, r, \quad \sum_{i=1}^{r} \alpha_i = 1$



Sometimes, depending on the design or application, the weights are

$w_i = \prod_{j=1}^{m} \mu_{X_{ij}}(x_j), \quad i = 1, \ldots, r.$



The overall structure of a fuzzy logic controller is 
shown in Fig. 2. 
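
The weighted average formula (2) can be illustrated with a minimal Python sketch. The function name `weighted_average` is an assumption made for illustration only; the numbers are the membership values and control actions of the car example above.

```python
def weighted_average(weights, outputs):
    """Convex combination of the individual rule outputs, as in formula (2)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero, so no rule contributes")
    # alpha_i = w_i / sum_j w_j, so the alphas are non-negative and sum to 1
    alphas = [w / total for w in weights]
    return sum(a * u for a, u in zip(alphas, outputs))


# Car example: memberships 0.28 and 0.5, and both rules request u = -x = -5 degrees
steer = weighted_average([0.28, 0.5], [-5.0, -5.0])
print(steer)
```

Because the result is a convex combination, it always lies between the smallest and the largest individual output, which is why it can be read as a compromise between the competing control actions.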



MAIN FOCUS OF THE CHAPTER: 
SOME BASIC FUZZY CONTROL 
APPROACHES 

A Model-Free Approach 

This general approach of fuzzy logic control works for trajectory tracking for a conventional dynamical system that does not have a precise mathematical model.

The basic setup is shown in Fig. 3, where the plant is a conventional system without a mathematical description and all the signals (the reference set-point sp, output y(t), control u(t), and error e(t) = sp - y(t)) are crisp. The objective is to design a controller to achieve the goal e(t) → 0 as t → ∞, assuming that the system inputs and outputs are measurable by sensors on line.

If the mathematical formulation of the plant is unknown, how can one develop a controller to control this plant? The fuzzy logic approach turns out to be advantageous in this situation: it uses only the plant inputs and outputs, and neither the state variables nor any other information. After the design is completed, the entire dashed block in Fig. 2 is used to replace the "controller" block in Fig. 3.

As an example, suppose that the physical reference set-point is the degree of temperature, say 40°F, and that the designer knows the range of the error signal, e(t) = 40° - y(t), is within X = [-25°, 45°], and assume that the scale of control is required to be in the unit of 1°. Then, the membership functions for the error signal to be "negative large" (NL), "positive large" (PL), and "zero" (ZO) may be chosen as shown in Fig. 4. Using these membership functions, the controller is expected to drive the output temperature to be within the allowable range: 40° ± 1°. With these membership functions, when the error signal e(t) = 5°, for instance, it is considered to be "positive large" with membership value one, meaning that the set-point (40°) is higher than y(t) by too much.

Figure 3. A typical reference set-point tracking control system

Figure 4. Membership function for the error temperature signal

The output from the fuzzification module is a fuzzy set consisting of the interval X and three membership functions, $\mu_{NL}$, $\mu_{PL}$ and $\mu_{ZO}$, in this example. The output from fuzzification will be the input to the next module (the fuzzy logic rule base), which only takes fuzzy set inputs to be compatible with the logical IF-THEN rules.

Figure 5 is helpful for establishing the rule base. If e > 0 at a moment, then the set-point is higher than the output y (since e = 40° - y), which corresponds to two possible situations, marked by a and d respectively. To further distinguish these two situations, one may use the rate of change of the error, $\dot{e} = -\dot{y}$. Here, since the set-point is a constant, its derivative is zero. Using information from both $e$ and $\dot{e}$, one can completely characterize the changing situation of the output temperature at all times. If, for example, $e > 0$ and $\dot{e} > 0$, then the temperature is currently at situation d rather than situation a, since $\dot{e} > 0$ means $\dot{y} < 0$ which, in turn, signifies that the curve is moving downward.

Figure 5. Temperature set-point tracking example

Based on the above observation from the physical situations of the current temperature against the set-point, a simple rule base can be established as follows:

$R^1$: IF $e > 0$ AND $\dot{e} > 0$ THEN $u(t+) = -C \cdot u(t)$;

$R^2$: IF $e > 0$ AND $\dot{e} < 0$ THEN $u(t+) = C \cdot u(t)$;

$R^3$: IF $e < 0$ AND $\dot{e} > 0$ THEN $u(t+) = C \cdot u(t)$;

$R^4$: IF $e < 0$ AND $\dot{e} < 0$ THEN $u(t+) = -C \cdot u(t)$;

otherwise (e.g., $e = 0$ or $\dot{e} = 0$), $u(t+) = u(t)$, till next step, where $C > 0$ is a constant control gain and $t+$ can be just $t + 1$ in discrete time.

In the above, the first two rules are understood as follows (other rules can be similarly interpreted):

1. $R^1$: $e > 0$ and $\dot{e} > 0$. As analyzed above, the temperature curve is currently at situation d, so the controller has to change its moving direction to the opposite by changing the current control action to the opposite (since the current control action is driving the output curve downward).

2. $R^2$: $e > 0$ and $\dot{e} < 0$. The current temperature curve is at situation a, so the controller does not need to do anything (since the current control action is driving the output curve up toward the set-point).

The switching control actions may take different forms, depending on the design. One example is $u(t + 1) = u(t) + \Delta u(t)$, among others (Chen & Pham, 1999, 2006).
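
The four sign-based rules and the "otherwise" case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `switch_control` and the default gain value are hypothetical, and `gain` plays the role of the constant C > 0.

```python
def switch_control(u, e, e_dot, gain=1.0):
    """Return u(t+) from u(t) according to the sign rules R1 to R4.

    u     : current control action u(t)
    e     : error, set-point minus output
    e_dot : rate of change of the error
    gain  : the constant control gain C > 0 (hypothetical default)
    """
    if e == 0 or e_dot == 0:
        return u                 # "otherwise" case: hold the control action
    if (e > 0) == (e_dot > 0):
        return -gain * u         # R1 and R4: error is growing, reverse the action
    return gain * u              # R2 and R3: error is shrinking, keep the direction


print(switch_control(2.0, e=3.0, e_dot=1.0))   # R1 fires
print(switch_control(2.0, e=3.0, e_dot=-1.0))  # R2 fires
```

Grouping the rules by whether $e$ and $\dot{e}$ share a sign makes the intent visible: the action is reversed exactly when the error is moving away from zero.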

Furthermore, to distinguish "positive large" from just "positive" for e > 0, one may use those membership functions shown in Fig. 4. Since the error signal $e(t)$ is fuzzified in the fuzzification module, one can similarly fuzzify the auxiliary signal $\dot{e}(t)$ in the fuzzification module. Thus, there are two fuzzified inputs, $e$ and $\dot{e}$, for the controller, and they both have corresponding membership functions describing their properties as "positive large" ($\mu_{PL}$), "negative large" ($\mu_{NL}$), or "zero" ($\mu_{ZO}$), as shown in Fig. 5. Thus, for the rule base, one may replace it by a set of more detailed rules as follows:



$R^1$: IF $e$ = PL AND $\dot{e} > 0$ THEN $u(t+1) = -\mu_{PL}(e) \cdot u(t)$;

$R^2$: IF $e$ = PS AND $\dot{e} > 0$ THEN $u(t+1) = -(1 - \mu_{PS}(e)) \cdot u(t)$;

$R^3$: IF $e$ = PL AND $\dot{e} < 0$ THEN $u(t+1) = \mu_{PL}(e) \cdot u(t)$;

$R^4$: IF $e$ = PS AND $\dot{e} < 0$ THEN $u(t+1) = (1 - \mu_{PS}(e)) \cdot u(t)$;

$R^5$: IF $e$ = NL AND $\dot{e} > 0$ THEN $u(t+1) = \mu_{NL}(e) \cdot u(t)$;

$R^6$: IF $e$ = NS AND $\dot{e} > 0$ THEN $u(t+1) = (1 - \mu_{NS}(e)) \cdot u(t)$;

$R^7$: IF $e$ = NL AND $\dot{e} < 0$ THEN $u(t+1) = -\mu_{NL}(e) \cdot u(t)$;

$R^8$: IF $e$ = NS AND $\dot{e} < 0$ THEN $u(t+1) = -(1 - \mu_{NS}(e)) \cdot u(t)$;

otherwise, $u(t+1) = u(t)$. Here and below, "= PL" means "is PL," etc. In this way, the initial rule base is enhanced and extended.

In the defuzzification module, new membership functions are needed for the change of the control action, $u(t + 1)$ or $\Delta u(t)$, if the enhanced rule base described above is used. This is because both the error and the rate of change of the error signals have been fuzzified to be "positive large" or "positive small," so the control actions have to be fuzzified accordingly (to be "large" or "small").

Now, suppose that a complete, enhanced fuzzy logic rule base has been established. Then, in the defuzzification module, the weighted average formula can be used to obtain a single crisp value as the control action output from the controller (see Fig. 2):

$u(t+1) = \dfrac{\sum_{i=1}^{N} \mu_i\, u_i(t+1)}{\sum_{i=1}^{N} \mu_i}$

This is an average value of the multiple (N = 8 in the above rule base) control signals at step t + 1, and is physically meaningful to the given plant.

A Model-Based Approach 

If a mathematical model of the system, or a fairly good approximation of it, is available, one may be able to design a fuzzy logic controller with better results such as performance specifications and guaranteed stability. This constitutes a model-based fuzzy control approach (Chen & Zhang, 1997; Malki, Li & Chen, 1994; Malki, Feigenspan, Misir & Chen, 1997; Sooraksa & Chen, 1998; Ying, Siler & Buckley, 1990).

For instance, a locally linear fuzzy system model is described by a rule base of the following form:

$R^{(k)}$: IF $x_1$ is $X_{k1}$ AND $\cdots$ AND $x_m$ is $X_{km}$ THEN $\dot{x} = A_k x + B_k u$    (3)

where $\{A_k\}$ and $\{B_k\}$ are given constant matrices, $x = [x_1, \ldots, x_m]^T$ is the state vector, and $u = [u_1, \ldots, u_n]^T$ is a controller to be designed, with $m \ge n \ge 1$, and $k = 1, \ldots, r$. The fuzzy system model (3) may be rewritten in a more compact form as follows:

$\dot{x} = \sum_{k=1}^{r} \alpha_k (A_k x + B_k u) = A(\mu(x))\, x + B(\mu(x))\, u$    (4)

where $\mu(x)$ denotes the collection of membership functions appearing in (3).

Based on this fuzzy model, (3) or (4), a fuzzy controller u(t) can be designed by using some conventional techniques. For example, if a negative state-feedback controller is preferred, then one may design a controller described by the following rule base:

$R^{(k)}$: IF $x_1$ is $X_{k1}$ AND $\cdots$ AND $x_m$ is $X_{km}$ THEN $u = -K_k x$    (5)

where $\{K_k\}_{k=1}^{r}$ are constant control gain matrices to be determined, $k = 1, \ldots, r$. Thus, the closed-loop controlled system (4) together with (5) becomes

$R^{(k)}$: IF $x_1$ is $X_{k1}$ AND $\cdots$ AND $x_m$ is $X_{km}$ THEN $\dot{x} = [A_k - B_k K_k]\, x$    (6)

For this feedback controlled system, the following is a typical stability condition [1,2,10]:

If there exists a common positive definite and symmetric constant matrix $P$ such that $A_k^T P + P A_k = -Q$ for some $Q > 0$ for all $k = 1, \ldots, r$, then the fuzzy controlled system (6) is asymptotically stable about zero.

This theorem provides a basic (sufficient) condition for the global asymptotic stability of the fuzzy control system, which can also be viewed as a criterion for tracking control of the system trajectory to the zero set-point. Clearly, stable control gain matrices $\{K_k\}_{k=1}^{r}$ may be determined according to this criterion in a design.



FUTURE TRENDS

This topic will be discussed elsewhere in the near future.



CONCLUSION 

The essence of systems control is to achieve automa- 
tion. For this purpose, a combination of fuzzy control 
technology and advanced computer facility available 
in the industry provides a promising approach that can 
mimic human thinking and linguistic control ability, so 
as to equip the control systems with a certain degree 
of artificial intelligence. It has now been realized that 
fuzzy control systems theory offers a simple, realistic 
and successful addition, or sometimes an alternative, 
for controlling various complex, imperfectly modeled, and highly uncertain engineering systems, with a great potential in many real-world applications.
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KEY TERMS 

Defuzzification: A process that converts fuzzy 
terms to conventional expressions quantified by real- 
valued functions. 

Fuzzification: A process that converts conventional expressions to fuzzy terms quantified by fuzzy membership functions.

Fuzzy Control: A control method based on fuzzy set and fuzzy logic theories.

Fuzzy Logic: A logic that takes on continuous values between 0 and 1.

Fuzzy Membership Function: A function that is defined on a fuzzy set and assumes continuous values between 0 and 1.

Fuzzy Set: A set of elements with a real-valued membership function describing their grades.

Fuzzy System: A system formulated and described by fuzzy set-based real-valued functions.
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INTRODUCTION 

The inductive learning methodology known as decision trees concerns the ability to classify objects based on their attribute values, using a tree-like structure from which decision rules can be accrued. In this article, a description of decision trees is given, with the main emphasis on their operation in a fuzzy environment.

A first reference to decision trees is made in Hunt et al. (1966), who proposed the Concept Learning System to construct a decision tree that attempts to minimize the score of classifying chess endgames. The example problem concerning chess offers early evidence supporting the view that decision trees are closely associated with artificial intelligence (AI). It was over ten years later that Quinlan (1979) developed the early work on decision trees to introduce the Iterative Dichotomiser 3 (ID3). The important feature of this development was the use of an entropy measure to aid the decision tree construction process (using again the chess game as the considered problem).

It is ID3, and techniques like it, that define the hierarchical structure commonly associated with decision trees; see for example the recent theoretical and application studies of Pal and Chakraborty (2001), Bhatt and Gopal (2005) and Armand et al. (2007). Starting from an identified root node, paths are constructed down to leaf nodes, where the attributes associated with the intermediate nodes are identified through the use of an entropy measure to preferentially gauge the classification certainty down that path. Each path down to a leaf node forms an 'if .. then ..' decision rule used to classify the objects.

The introduction of fuzzy set theory in Zadeh (1965) offered a general methodology that allows notions of vagueness and imprecision to be considered. Moreover, Zadeh's work allowed previously defined techniques to be considered within a fuzzy environment. It was over ten years later that the area of decision trees benefited from this fuzzy environment opportunity (see Chang and Pavlidis, 1977). Since then there has been a steady stream of research studies that have developed or applied fuzzy decision trees (FDTs) (see recently for example Li et al., 2006 and Wang et al., 2007).

The expectations that come with the utilisation 
of FDTs are succinctly stated by Li et al. (2006, p. 
655); 

"Decision trees based on fuzzy set theory combines 
the advantages of good comprehensibility of decision 
trees and the ability of fuzzy representation to deal with 
inexact and uncertain information. " 

Chiang and Hsu (2002) highlight that decision trees have been successfully applied to problems in artificial intelligence, pattern recognition and statistics. They go on to outline a positive development that FDTs offer, namely that they are better placed to give an estimate of the degree to which an object is associated with each class, which is often desirable in areas like medical diagnosis (see Quinlan (1987) for the alternative view with respect to crisp decision trees).

The remainder of this article looks in more detail at FDTs, including a tutorial example showing the rudiments of how an FDT can be constructed.



BACKGROUND 

The background section of this article concentrates on a 
brief description of fuzzy set theory pertinent to FDTs, 
followed by a presentation of one FDT technique. 

In fuzzy set theory (Zadeh, 1965), the grade of membership of a value x to a set S is defined through a membership function $\mu_S(x)$ that can take a value in the range [0, 1]. The accompanying numerical attribute domain can be described by a finite series of MFs that each offers a grade of membership to describe x, which collectively form its concomitant fuzzy number. In this article, MFs are used to formulate linguistic variables for the considered attributes. These linguistic variables are made up of sets of linguistic terms which are defined by the MFs (see later).
Figure 1. Example membership function and their use in a linguistic variable


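Surrounding the notion of MFs is the issue of their structure (Dombi and Gera, 2005). Here, piecewise linear MFs are used to define the linguistic terms presented, see Figure 1.

In Figure 1 (top), a single piecewise linear MF is shown along with the values that define it, namely $\alpha_{j,1}$, $\alpha_{j,2}$, $\alpha_{j,3}$, $\alpha_{j,4}$ and $\alpha_{j,5}$. The associated mathematical structure of this specific form of MF is given below:

$$
\mu(x) =
\begin{cases}
0 & \text{if } x \le \alpha_{j,1} \\
0.5\,\dfrac{x - \alpha_{j,1}}{\alpha_{j,2} - \alpha_{j,1}} & \text{if } \alpha_{j,1} < x \le \alpha_{j,2} \\
0.5 + 0.5\,\dfrac{x - \alpha_{j,2}}{\alpha_{j,3} - \alpha_{j,2}} & \text{if } \alpha_{j,2} < x < \alpha_{j,3} \\
1 & \text{if } x = \alpha_{j,3} \\
1 - 0.5\,\dfrac{x - \alpha_{j,3}}{\alpha_{j,4} - \alpha_{j,3}} & \text{if } \alpha_{j,3} < x \le \alpha_{j,4} \\
0.5 - 0.5\,\dfrac{x - \alpha_{j,4}}{\alpha_{j,5} - \alpha_{j,4}} & \text{if } \alpha_{j,4} < x \le \alpha_{j,5} \\
0 & \text{if } \alpha_{j,5} < x
\end{cases}
$$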



As mentioned earlier, MFs of this type are used to define the linguistic terms which make up linguistic variables. An example of a linguistic variable X based on two linguistic terms, X1 and X2, is shown in Figure 1 (bottom), where the overlap of the defining values for each linguistic term is evident. Moreover, using left and right limits of the X domain as $-\infty$ and $\infty$, respectively, the sets of defining values are (in list form): X1: $[-\infty, -\infty, \alpha_{1,3}, \alpha_{1,4}, \alpha_{1,5}]$ and X2: $[\alpha_{2,1}, \alpha_{2,2}, \alpha_{2,3}, \infty, \infty]$, where $\alpha_{1,3} = \alpha_{2,1}$, $\alpha_{1,4} = \alpha_{2,2}$ and $\alpha_{1,5} = \alpha_{2,3}$.
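
The piecewise linear MF can be sketched in Python as follows. The names `membership` and `ramp` are hypothetical, and the treatment of infinite defining values (a term that is open to the left or right) is an assumption chosen so that the open side saturates at 1. The defining values used in the example are those quoted later in this article for the decision attribute C.

```python
import math


def ramp(x, lo, hi):
    """Fraction of the way from lo to hi, with infinite end points handled."""
    if math.isinf(lo):   # lo = -inf: treat x as already past the whole ramp
        return 1.0
    if math.isinf(hi):   # hi = +inf: treat x as not yet started on the ramp
        return 0.0
    return (x - lo) / (hi - lo)


def membership(x, a):
    """Piecewise linear MF defined by five values a = [a1, a2, a3, a4, a5]."""
    a1, a2, a3, a4, a5 = a
    if x <= a1:
        return 0.0
    if x <= a2:
        return 0.5 * ramp(x, a1, a2)
    if x < a3:
        return 0.5 + 0.5 * ramp(x, a2, a3)
    if x == a3:
        return 1.0
    if x <= a4:
        return 1.0 - 0.5 * ramp(x, a3, a4)
    if x <= a5:
        return 0.5 - 0.5 * ramp(x, a4, a5)
    return 0.0


INF = math.inf
C_LOW = [-INF, -INF, 9, 25, 32]
C_HIGH = [9, 25, 32, INF, INF]
print(membership(17, C_LOW), membership(17, C_HIGH))
```

With two terms that share defining values in this way, the two grades of membership of any value sum to one.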

This section now goes on to outline the technical details of the fuzzy decision tree approach introduced in Yuan and Shaw (1995). With an inductive fuzzy decision tree, the underlying knowledge related to a decision outcome can be represented as a set of fuzzy 'if .. then ..' decision rules, each of the form:

If ($A_1$ is $T^1_{i_1}$) and ($A_2$ is $T^2_{i_2}$) ... and ($A_k$ is $T^k_{i_k}$) then C is $C_j$,

where $A_1, A_2, .., A_k$ and C are linguistic variables for the multiple antecedents ($A_i$'s) and consequent (C) statements used to describe the considered objects, and $T(A_k) = \{T^k_1, T^k_2, .., T^k_{S_k}\}$ and $\{C_1, C_2, ..., C_L\}$ are their respective linguistic terms, defined by the MFs $\mu_{T^k_j}(x)$ etc. The MFs, $\mu_{T^k_j}(x)$ and $\mu_{C_j}(y)$, represent the grade of membership of an object's antecedent $A_k$ being $T^k_j$ and consequent C being $C_j$, respectively.

A MF $\mu(x)$ from the set describing a fuzzy linguistic variable Y defined on X can be viewed as a possibility distribution of Y on X, that is $\pi(x) = \mu(x)$, for all $x \in X$ the values taken by the objects in U (also normalized so $\max_{x \in X} \pi(x) = 1$). The possibility measure $E_\alpha(Y)$ of ambiguity is defined by

$E_\alpha(Y) = g(\pi) = \sum_{i=1}^{n} (\pi^*_i - \pi^*_{i+1}) \ln[i]$,

where $\pi^* = \{\pi^*_1, \pi^*_2, \ldots, \pi^*_n\}$ is the permutation of the normalized possibility distribution $\pi = \{\pi(x_1), \pi(x_2), \ldots, \pi(x_n)\}$, sorted so that $\pi^*_i \ge \pi^*_{i+1}$ for $i = 1, .., n$, and $\pi^*_{n+1} = 0$.
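
As a minimal sketch, the ambiguity measure $g(\pi)$ can be computed as below. The function name `ambiguity` is hypothetical; the input values 0.574 and 0.426 are the subsethood values used in the worked example later in this article.

```python
import math


def ambiguity(pi):
    """g(pi): ambiguity of a possibility distribution given as a list of values."""
    top = max(pi)
    if top <= 0:
        raise ValueError("the distribution needs at least one positive value")
    # Normalize so the largest value is 1, sort in descending order,
    # and append the conventional trailing zero pi*_{n+1} = 0.
    ordered = sorted((p / top for p in pi), reverse=True) + [0.0]
    # Position i in the formula is 1-based, hence log(i + 1) for index i.
    return sum((ordered[i] - ordered[i + 1]) * math.log(i + 1)
               for i in range(len(pi)))


print(round(ambiguity([0.574, 0.426]), 3))
```

The measure is zero when a single value dominates completely, for instance `ambiguity([1.0, 0.0])`, and reaches $\ln n$ when all n values are equal.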

The ambiguity of attribute A (over the objects $u_1, .., u_m$) is given as:

$E_\alpha(A) = \dfrac{1}{m} \sum_{i=1}^{m} E_\alpha(A(u_i))$,

where

$E_\alpha(A(u_i)) = g\big(\mu_{T_s}(u_i) / \max_{1 \le j \le s} (\mu_{T_j}(u_i))\big)$,

with $T_1, ..., T_s$ the linguistic terms of an attribute (antecedent) with m objects.

The fuzzy subsethood S(A, B) measures the degree to which A is a subset of B, and is given by

$S(A, B) = \dfrac{\sum_{u \in U} \min(\mu_A(u), \mu_B(u))}{\sum_{u \in U} \mu_A(u)}$.

Given fuzzy evidence E, the possibility of classifying an object to the consequent $C_i$ can be defined as

$\pi(C_i | E) = S(E, C_i) / \max_j S(E, C_j)$,

where the fuzzy subsethood $S(E, C_i)$ represents the degree of truth for the classification rule ('if E then $C_i$'). With a single piece of evidence (a fuzzy number for an attribute), the classification ambiguity based on this fuzzy evidence is defined as $G(E) = g(\pi(C | E))$, which is measured using the possibility distribution $\pi(C | E) = (\pi(C_1 | E), ..., \pi(C_L | E))$.

The classification ambiguity with fuzzy partitioning $P = \{E_1, ..., E_k\}$ on the fuzzy evidence F, denoted as $G(P | F)$, is the weighted average of classification ambiguity with each subset of partition:

$G(P | F) = \sum_{i=1}^{k} w(E_i | F)\, G(E_i \cap F)$,

where $G(E_i \cap F)$ is the classification ambiguity with fuzzy evidence $E_i \cap F$, and where $w(E_i | F)$ is the weight which represents the relative size of subset $E_i \cap F$ in F:

$w(E_i | F) = \dfrac{\sum_{u \in U} \min(\mu_{E_i}(u), \mu_F(u))}{\sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{E_j}(u), \mu_F(u)) \right)}$.

In summary, attributes are assigned to nodes based on the lowest level of classification ambiguity. A node becomes a leaf node if the level of subsethood is higher than some truth value β assigned to the whole of the fuzzy decision tree. The classification from the leaf node is to the decision group with the largest subsethood value. The truth level threshold β controls the growth of the tree; lower β may lead to a smaller tree (with lower classification accuracy), higher β may lead to a larger tree (with higher classification accuracy).



MAIN THRUST 

The main thrust of this article is a detailed example 
of the construction of a fuzzy decision tree. The de- 
scription includes the transformation of a small data 
set into a fuzzy data set where the original values are 
described by their degrees of membership to certain 
linguistic terms. 

The small data set considered consists of five objects, described by three condition attributes T1, T2 and T3, and classified by a single decision attribute C, see Table 1.

Table 1. Example data set

Object   T1    T2    T3    C
u1       112   45    205   7
u2       85    42    192   17
u3       130   58    188   22
u4       93    54    203   29
u5       132   39    189   39

If these values are considered imprecise, fuzzy, there is the option to transform the data values into fuzzy values. Here, each attribute is transformed into a linguistic variable described by two linguistic terms, see Figure 2.

Figure 2. Membership functions defining the linguistic terms, C_L and C_H, for the decision attribute C

In Figure 2, the decision attribute C is shown to be described by the linguistic terms $C_L$ and $C_H$ (possibly denoting the terms low and high). These linguistic terms are themselves defined by MFs ($\mu_{C_L}(\cdot)$ and $\mu_{C_H}(\cdot)$). The hypothetical MFs shown have the respective defining terms of $\mu_{C_L}(\cdot)$: $[-\infty, -\infty, 9, 25, 32]$ and $\mu_{C_H}(\cdot)$: $[9, 25, 32, \infty, \infty]$. To demonstrate their utilisation, for the object $u_2$, with a value C = 17, its fuzzification creates the two values $\mu_{C_L}(17) = 0.750$ and $\mu_{C_H}(17) = 0.250$, the larger of which is associated with the low linguistic term.

A similar series of membership functions can be constructed for the three condition attributes, T1, T2 and T3, see Figure 3.
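
As a worked check of the two membership values quoted for C = 17, and assuming the piecewise linear form of MF described in the background section, the value 17 lies between the defining values 9 and 25, so that:

```latex
\mu_{C_L}(17) = 1 - 0.5\,\frac{17 - 9}{25 - 9} = 1 - 0.25 = 0.750,
\qquad
\mu_{C_H}(17) = 0.5\,\frac{17 - 9}{25 - 9} = 0.250 .
```

The two values sum to one because the defining values 9, 25 and 32 are shared between the two linguistic terms.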

In Figure 3, the linguistic variable version of each condition attribute is described by two linguistic terms (possibly termed as low and high), themselves defined by MFs. These series of MFs make it possible to fuzzify the example data set, see Table 2.

Figure 3. Membership functions defining the linguistic terms for the condition attributes, T1, T2 and T3

Table 2. Fuzzified example data set

Object   T1 = [T1_L, T1_H]     T2 = [T2_L, T2_H]     T3 = [T3_L, T3_H]     C = [C_L, C_H]
u1       [0.433, 0.567] - H    [1.000, 0.000] - L    [0.000, 1.000] - H    [1.000, 0.000] - L
u2       [1.000, 0.000] - L    [1.000, 0.000] - L    [0.750, 0.250] - L    [0.750, 0.250] - L
u3       [0.000, 1.000] - H    [0.227, 0.773] - H    [0.917, 0.083] - L    [0.594, 0.406] - L
u4       [1.000, 0.000] - L    [0.409, 0.591] - H    [0.000, 1.000] - H    [0.214, 0.786] - H
u5       [0.000, 1.000] - H    [1.000, 0.000] - L    [0.875, 0.125] - L    [0.000, 1.000] - H

In Table 2, each object is described by a series of fuzzy values, two fuzzy values for each attribute. Also shown in Table 2 is the linguistic term (L or H) associated with the larger of the values in each pair of fuzzy values. Beyond the fuzzification of the data set, attention turns to the construction of the concomitant fuzzy decision tree for this data. For this construction process, a threshold value of β = 0.75 for the minimum required truth level was used throughout.

The construction process starts with the condition attribute that is the root node. For this, it is necessary to calculate the classification ambiguity G(E) of each condition attribute. The evaluation of a G(E) value is shown for the first attribute T1 (i.e. $g(\pi(C | T1))$), where it is broken down to the fuzzy labels L and H. For L:

$\pi(C | T1_L) = S(T1_L, C_i) / \max_j S(T1_L, C_j)$,

considering $C_L$ and $C_H$ with the information in Table 2:

$S(T1_L, C_L) = \dfrac{\sum_{u \in U} \min(\mu_{T1_L}(u), \mu_{C_L}(u))}{\sum_{u \in U} \mu_{T1_L}(u)} = 1.398/2.433 = 0.574$,

whereas $S(T1_L, C_H) = 0.426$. Hence $\pi = \{0.574, 0.426\}$, giving the ordered normalized form of $\pi^* = \{1.000, 0.741\}$, with $\pi^*_3 = 0$, then

$G(T1_L) = g(\pi(C | T1_L)) = \sum_{i=1}^{2} (\pi^*_i - \pi^*_{i+1}) \ln[i] = 0.514$,

along with $G(T1_H) = 0.572$, then G(T1) = (0.514 + 0.572)/2 = 0.543. Compared with G(T2) = 0.579 and G(T3) = 0.583, the condition attribute T1, with the least classification ambiguity, forms the root node for the desired fuzzy decision tree.

The subsethood values in this case are, for T1: $S(T1_L, C_L) = 0.574$ and $S(T1_L, C_H) = 0.426$, and $S(T1_H, C_L) = 0.452$ and $S(T1_H, C_H) = 0.548$. For $T1_L$ and $T1_H$, the larger subsethood value defines the possible classification for that path. In both cases these values are less than the threshold truth value 0.75 employed, so neither of these paths can be terminated to a leaf node; instead further augmentation of them is considered.

With three condition attributes included in the example data set, the possible augmentation to $T1_L$ is with either T2 or T3. Concentrating on T2, where with $G(T1_L) = 0.514$, the ambiguity with partition evaluated for T2 ($G(T1_L \text{ and } T2 | C)$) has to be less than this value, where

$G(T1_L \text{ and } T2 | C) = \sum_{i=1}^{k} w(T2_i | T1_L)\, G(T1_L \cap T2_i)$.

Starting with the weight values, in the case of $T1_L$ and $T2_L$, it follows:

$w(T2_L | T1_L) = \dfrac{\sum_{u \in U} \min(\mu_{T2_L}(u), \mu_{T1_L}(u))}{\sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{T2_j}(u), \mu_{T1_L}(u)) \right)} = 1.842/2.433 = 0.757$.

Similarly $w(T2_H | T1_L) = 0.243$, hence

$G(T1_L \text{ and } T2 | C) = 0.757 \times G(T1_L \cap T2_L) + 0.243 \times G(T1_L \cap T2_H) = 0.757 \times 0.327 + 0.243 \times 0.251 = 0.309$.



A concomitant value is $G(T1_L \text{ and } T3 | C) = 0.487$; the lower of these ($G(T1_L \text{ and } T2 | C)$) is lower than the concomitant $G(T1_L) = 0.514$, so less ambiguity would be found if the T2 attribute was augmented to the path T1 = L. The subsequent subsethood values in this case for each new path are: $T2_L$: $S(T1_L \cap T2_L, C_L) = 0.759$ and $S(T1_L \cap T2_L, C_H) = 0.358$; $T2_H$: $S(T1_L \cap T2_H, C_L) = 0.363$ and $S(T1_L \cap T2_H, C_H) = 1.000$. With each suggested classification path, the largest subsethood value is above the truth level threshold, therefore they are both leaf nodes leading from the T1 = L path. The construction process continues in a similar vein for the path T1 = H, with the resultant fuzzy decision tree in this case presented in Figure 4.
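
The root-node and augmentation calculations above can be reproduced with a short Python sketch. All function and variable names here are hypothetical, and the membership values are the rounded entries of the fuzzified data set, so the results agree with the quoted figures only to about three decimal places.

```python
import math

# [low, high] membership values for objects u1..u5 (rounded fuzzified data)
T1 = [(0.433, 0.567), (1.000, 0.000), (0.000, 1.000), (1.000, 0.000), (0.000, 1.000)]
T2 = [(1.000, 0.000), (1.000, 0.000), (0.227, 0.773), (0.409, 0.591), (1.000, 0.000)]
C = [(1.000, 0.000), (0.750, 0.250), (0.594, 0.406), (0.214, 0.786), (0.000, 1.000)]


def term(attribute, k):
    """Membership values of one linguistic term (k = 0 low, k = 1 high)."""
    return [pair[k] for pair in attribute]


def intersect(a, b):
    """Fuzzy intersection taken as the object-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def subsethood(a, b):
    """S(A, B): degree to which fuzzy set A is a subset of fuzzy set B."""
    return sum(intersect(a, b)) / sum(a)


def g(pi):
    """Ambiguity of a possibility distribution (normalized inside)."""
    top = max(pi)
    ordered = sorted((p / top for p in pi), reverse=True) + [0.0]
    return sum((ordered[i] - ordered[i + 1]) * math.log(i + 1)
               for i in range(len(pi)))


def G(evidence, classes):
    """Classification ambiguity G(E); g normalizes the subsethood values."""
    return g([subsethood(evidence, c) for c in classes])


def partition_G(parts, evidence, classes):
    """G(P|F): weighted average of G over the partition P of the evidence F."""
    sizes = [sum(intersect(p, evidence)) for p in parts]
    total = sum(sizes)
    return sum(s / total * G(intersect(p, evidence), classes)
               for s, p in zip(sizes, parts) if s > 0)


classes = [term(C, 0), term(C, 1)]
t1_low, t1_high = term(T1, 0), term(T1, 1)
G_T1 = (G(t1_low, classes) + G(t1_high, classes)) / 2
G_T1L_T2 = partition_G([term(T2, 0), term(T2, 1)], t1_low, classes)
print(round(G_T1, 3), round(G_T1L_T2, 3))
```

Since the ambiguity of the augmented path is lower than that of the path T1 = L alone, the sketch leads to the same choice as in the text: T2 is added below T1 = L.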

The fuzzy decision tree in Figure 4 shows that five rules (leaf nodes), R1, R2, ..., R5, have been constructed. There are a maximum of four levels to the tree shown, indicating that a maximum of three condition attributes are used in the rules constructed. In each non-root node shown, the subsethood levels to the decision attribute terms C = L and C = H are given. On the occasions when the larger of the subsethood values is above the defined threshold value of 0.75, it is shown in bold and the node becomes a leaf node.

The interpretative power of FDTs is shown by consideration of the rules constructed. The rule R5 can be written down as:

'If T1 = H and T3 = H then C = L with truth level 0.839.'

The rules can be considered in a more linguistic form, namely:

'If T1 is high and T3 is high then C is low with truth level 0.839.'

It is rules like this one that allow the clearest interpretability of results in classification problems when using FDTs.
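
As a worked check of the truth level quoted for rule R5, the subsethood of the path evidence $T1_H \cap T3_H$ in $C_L$ can be evaluated from the rounded membership values of the five objects. This is a sketch under the assumption that the intersection is taken as the object-wise minimum, as in the subsethood definition given earlier:

```latex
S(T1_H \cap T3_H,\, C_L)
  = \frac{\sum_{u \in U} \min\big(\mu_{T1_H}(u),\, \mu_{T3_H}(u),\, \mu_{C_L}(u)\big)}
         {\sum_{u \in U} \min\big(\mu_{T1_H}(u),\, \mu_{T3_H}(u)\big)}
  = \frac{0.567 + 0 + 0.083 + 0 + 0}{0.567 + 0 + 0.083 + 0 + 0.125}
  = \frac{0.650}{0.775} \approx 0.839 .
```

The value exceeds the threshold of 0.75, which is why the path T1 = H, T3 = H terminates in a leaf node classified as C = L.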



FUTURE TRENDS 

Fuzzy decision trees (FDTs) benefit from the inductive learning approach that underpins their construction, to aid in the classification of objects based on their values over different attributes. Their construction in a fuzzy environment allows the potentially critical effects of imprecision to be mitigated, and brings a beneficial level of interpretability to the results found, through the decision rules defined.

Figure 4. Fuzzy decision tree for example data set

As with the more traditional 'crisp' decision tree approaches, there are issues such as the complexity of the results, in this case the tree defined. Future trends will surely include how FDTs can work on re-grading the complexity of the tree constructed, commonly known as pruning. Further, the applicability of the rules constructed should see the use of FDTs extended across the range of applications they can work with.



CONCLUSION 

The interest in FDTs, with respect to AI, is due to the inductive learning and linguistic rule construction processes that are inherent in them. The induction undertaken truly lets the analysis create intelligent results from the data available. Many of the application areas in which FDTs have been used, such as medicine, have benefited greatly from the interpretative power of the readable rules. The accessibility of the results from FDTs should secure them a positive future.
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KEY TERMS 

Condition Attribute: An attribute that describes an object. Within a decision tree it is part of a non-leaf node, so performs as an antecedent in the decision rules used for the final classification of an object.

Decision Attribute: An attribute that characterises an object. Within a decision tree it is part of a leaf node, so performs as a consequent in the decision rules, from the paths down the tree to the leaf node.

Decision Tree: A tree-like structure for representing a collection of hierarchical decision rules that lead to a class or value, starting from a root node and ending in a series of leaf nodes.

Induction: A technique that infers generalizations from the information in the data.

Leaf Node: A node not further split, the terminal grouping, in a classification or decision tree.

Linguistic Term: One of a set of linguistic terms, which are subjective categories for a linguistic variable, each described by a membership function.

Linguistic Variable: A variable made up of a number of words (linguistic terms) with associated degrees of membership.

Path: A path down the tree from root node to leaf node, also termed a branch.

Membership Function: A function that quantifies the grade of membership of a variable to a linguistic term.

Node: A junction point down a path in a decision tree that describes a condition in an if-then decision rule. From a node, the current path may separate into two or more paths.

Root Node: The node at the top of a decision tree, from which all paths originate and lead to a leaf node.
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INTRODUCTION 

Graph theory has numerous applications to problems in systems analysis, operations research, economics, and transportation. However, in many cases, some aspects of a graph-theoretic problem may be uncertain. For example, the vehicle travel time or vehicle capacity on a road network may not be known exactly. In such cases, it is natural to deal with the uncertainty using the methods of fuzzy sets and fuzzy logic.

Hypergraphs (Berge, 1989) generalize graphs to the case of sets of multiary relations. They extend graph models for the modeling of complex systems. When modelling systems with fuzzy binary and multiary relations between objects, the transition to fuzzy hypergraphs, which combine the advantages of both fuzzy and graph models, is more natural. It makes it possible to carry out formal optimisation and logical procedures.

However, using fuzzy graphs and hypergraphs as models of various systems (social and economic systems, communication networks and others) leads to difficulties. Isomorphic transformations of a graph reduce to a redefinition of its vertices and edges. This redefinition does not change the properties of the graph that are determined by the adjacency and incidence of its vertices and edges.

The fuzzy independent set, the domination fuzzy set and the fuzzy chromatic set are invariants under isomorphic transformations of fuzzy graphs and fuzzy hypergraphs, and they allow their structural analysis.



BACKGROUND 

Fuzzy graphs were introduced by Rosenfeld in a paper
published in (Zadeh, 1975) and were also discussed in
(Kaufmann, 1977).

The use of fuzzy graphs for cluster analysis was
considered in (Matula, 1970; Matula, 1972), and their use
in database theory was discussed in (Kiss, 1991). The
problem of locating centers on fuzzy graphs was considered
in (Moreno, Moreno & Verdegay, 2001; Kutangila-Mayoya &
Verdegay, 2005; Rozenberg & Starostina, 2005). The
analysis of flows and vitality in transportation networks
was considered in (Bozhenyuk, Rozenberg & Starostina,
2006). Applications of fuzzy hypergraphs to portfolio
management, managerial decision making, and neural cell
assemblies were considered in (Monderson & Nair, 2000),
and their use for decision making in CAD systems in
(Malyshev, Bershtein & Bozhenyuk, 1991).



MAIN DEFINITIONS OF FUZZY GRAPHS 
AND HYPERGRAPHS 

This article presents the main notions of fuzzy graphs
and fuzzy hypergraphs, together with their invariants.

Fuzzy Graph 

Let a fuzzy directed graph $\tilde{G} = (X, \tilde{U})$ be given, where
$X$ is a set of vertices and
$\tilde{U} = \{\langle \mu_U(x_i, x_j)/(x_i, x_j) \rangle \mid (x_i, x_j) \in X^2\}$
is a fuzzy set of edges with the membership function
$\mu_U : X^2 \to [0,1]$ (Kaufmann, 1977).

Example 1. Let the fuzzy graph $\tilde{G}$ have $X = \{x_1, x_2, x_3, x_4\}$
and $\tilde{U} = \{\langle 0.5/(x_1,x_2) \rangle, \langle 0.6/(x_2,x_3) \rangle,
\langle 0.3/(x_3,x_4) \rangle, \langle 0.2/(x_4,x_1) \rangle, \langle 1/(x_4,x_2) \rangle\}$.
It is presented in Figure 1.

The fuzzy graph $\tilde{G}$ may represent a fuzzy dependence
relation between the objects $x_1$, $x_2$, $x_3$, and $x_4$. If the object
$x_i$ fuzzily depends on the object $x_j$, then there is a directed
edge $(x_i, x_j)$ with the membership degree $\mu_U(x_i, x_j)$.

If the fuzzy relation represented by the fuzzy graph $\tilde{G}$ is
symmetric, we have a fuzzy undirected graph.
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Figure 1. 




A fuzzy graph $\tilde{G} = (X, \tilde{U})$ is conveniently represented
by a fuzzy adjacency matrix $R_X = \|r_{ij}\|_{n \times n}$, where
$r_{ij} = \mu_U(x_i, x_j)$. The fuzzy graph presented in Figure 1
has the adjacency matrix

$$
R_X =
\begin{array}{c|cccc}
    & x_1 & x_2 & x_3 & x_4 \\ \hline
x_1 & 0   & 0.5 & 0   & 0   \\
x_2 & 0   & 0   & 0.6 & 0   \\
x_3 & 0   & 0   & 0   & 0.3 \\
x_4 & 0.2 & 1   & 0   & 0
\end{array}
$$

The fuzzy graph $\tilde{H} = (X', \tilde{U}')$ is called a fuzzy
subgraph (Monderson & Nair, 2000) of $\tilde{G} = (X, \tilde{U})$
if $X' \subseteq X$ and $\tilde{U}' \subseteq \tilde{U}$.

A fuzzy directed path (Bershtein & Bozhenyuk, 2005)
$\tilde{L}(x_i, x_m)$ of the graph $\tilde{G} = (X, \tilde{U})$ is a sequence
of fuzzy directed edges from vertex $x_i$ to vertex $x_m$:

$$\tilde{L}(x_i, x_m) = \langle \mu_U(x_i,x_j)/(x_i,x_j) \rangle, \langle \mu_U(x_j,x_k)/(x_j,x_k) \rangle, \ldots, \langle \mu_U(x_l,x_m)/(x_l,x_m) \rangle.$$

The conjunctive strength of the path $\mu(\tilde{L}(x_i, x_m))$ is
defined as

$$\mu(\tilde{L}(x_i, x_m)) = \mathop{\&}_{\langle x_\alpha, x_\beta \rangle \in \tilde{L}(x_i, x_m)} \mu_U(x_\alpha, x_\beta),$$

where the conjunction & is taken as the minimum.

A fuzzy directed path $\tilde{L}(x_i, x_m)$ is called a simple
path between the vertices $x_i$ and $x_m$ if no part of it is
itself a path between the same vertices.

If the number of vertices in the path is $n \ge 3$ and $x_i = x_m$,
the path is called a cycle.

Clearly, these definitions coincide with the corresponding
definitions for nonfuzzy graphs.

Vertex $y$ is called fuzzy accessible from vertex $x$ in
the graph $\tilde{G} = (X, \tilde{U})$ if there exists a fuzzy directed
path from vertex $x$ to vertex $y$.

The accessibility degree of vertex $y$ from vertex $x$
($x \ne y$) is defined by the expression

$$\gamma(x, y) = \max_{\alpha = 1, 2, \ldots, p} \mu(\tilde{L}_\alpha(x, y)),$$

where $p$ is the number of distinct simple directed paths from
vertex $x$ to vertex $y$.
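
The definitions of conjunctive strength and accessibility degree can be checked on a small graph. The following Python sketch is only illustrative: the edge list is our reading of Example 1 (the subscripts are inferred from the adjacency matrix and from the degrees quoted later in the text), the conjunction & is taken as the minimum, and all function names are hypothetical.

```python
# Fuzzy directed graph, assumed reading of Example 1:
# (tail, head) -> membership degree of the edge.
EDGES = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}


def simple_paths(edges, source, target, visited=None):
    """Yield every simple directed path from source to target as a list of edges."""
    visited = (visited or set()) | {source}
    for (tail, head) in edges:
        if tail != source or head in visited:
            continue
        if head == target:
            yield [(tail, head)]
        else:
            for rest in simple_paths(edges, head, target, visited):
                yield [(tail, head)] + rest


def path_strength(edges, path):
    """Conjunctive strength: the smallest membership degree along the path."""
    return min(edges[edge] for edge in path)


def accessibility_degree(edges, source, target):
    """Largest strength over all simple paths; 0.0 if no path exists."""
    strengths = [path_strength(edges, p) for p in simple_paths(edges, source, target)]
    return max(strengths, default=0.0)
```

Under these assumptions, `accessibility_degree(EDGES, 4, 3)` compares the path x4, x2, x3 (strength 0.6) with the path x4, x1, x2, x3 (strength 0.2) and returns the larger value.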

A subset of vertices $X'$ is called a fuzzy independent
vertex set (Bershtein & Bozhenuk, 2001) with the
degree of independence

$$\alpha(X') = 1 - \max_{x_i, x_j \in X'} \{\mu_U(x_i, x_j)\}.$$

A subset of vertices $X' \subseteq X$ of the graph $\tilde{G}$ is called a
maximal fuzzy independent vertex set with the degree
$\alpha(X')$ if the condition $\alpha(X'') < \alpha(X')$ holds for any
$X'' \supset X'$.

Let a set $\tau_k = \{X_{k1}, X_{k2}, \ldots, X_{kl}\}$ be given, where $X_{ki}$ is
a fuzzy independent $k$-vertex set with the degree of
independence $\alpha_{ki}$. We define

$$\alpha_k^{\max} = \max\{\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kl}\}.$$

The value $\alpha_k^{\max}$ means that the fuzzy graph $\tilde{G}$ includes a
$k$-vertex subgraph with the degree of independence $\alpha_k^{\max}$
and does not include a $k$-vertex subgraph with a degree
of independence greater than $\alpha_k^{\max}$.

The fuzzy set

$$\tilde{\Psi}_X = \{\langle \alpha_1^{\max}/1 \rangle, \langle \alpha_2^{\max}/2 \rangle, \ldots, \langle \alpha_n^{\max}/n \rangle\}$$

is called the fuzzy independent set of the fuzzy graph $\tilde{G}$.

The fuzzy graph $\tilde{G}$ presented in Figure 1 has seven
maximal fuzzy independent vertex sets:
$\Psi_1 = \{x_2\}$, $\Psi_2 = \{x_4\}$, $\Psi_3 = \{x_1, x_3\}$
with the degree of independence 1; $\Psi_4 = \{x_1, x_4\}$ with
the degree of independence 0.8; $\Psi_5 = \{x_1, x_3, x_4\}$ with
the degree of independence 0.7; $\Psi_6 = \{x_1, x_2\}$ with the
degree of independence 0.5; and $\Psi_7 = \{x_1, x_2, x_3\}$ with
the degree of independence 0.4. So its fuzzy independent
set is

$$\tilde{\Psi}_X = \{\langle 1/1 \rangle, \langle 1/2 \rangle, \langle 0.7/3 \rangle, \langle 0/4 \rangle\}.$$
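
The fuzzy independent set of a small graph can be obtained by exhaustive search. The sketch below is a hedged illustration, not the authors' procedure: it assumes the same reading of the Example 1 edges as inferred from the degrees quoted above, and the helper names are hypothetical.

```python
from itertools import combinations

# Assumed reading of Example 1: (tail, head) -> membership degree.
EDGES = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}
VERTICES = [1, 2, 3, 4]


def independence_degree(edges, subset):
    """1 minus the largest membership of an edge joining two vertices of subset."""
    worst = max(
        (edges.get((x, y), 0.0) for x in subset for y in subset if x != y),
        default=0.0,
    )
    return 1.0 - worst


def fuzzy_independent_set(edges, vertices):
    """Map each k to the best independence degree over all k-vertex subsets."""
    return {
        k: max(independence_degree(edges, s) for s in combinations(vertices, k))
        for k in range(1, len(vertices) + 1)
    }
```

The search is exponential in the number of vertices, so it is only meant for checking small examples by hand.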



Let $X'$ be an arbitrary subset of the vertex set $X$. For
each vertex $y \in X \setminus X'$ we define the value

$$\gamma(y) = \max_{x \in X'} \{\mu_U(y, x)\}.$$

The set $X'$ is called a fuzzy dominating vertex set
for vertex $y$ with the degree of domination $\gamma(y)$.

The set $X'$ is called a fuzzy dominating vertex set
for the graph $\tilde{G}$ with the degree of domination

$$\beta(X') = \min_{y \in X \setminus X'} \max_{x \in X'} \{\mu_U(y, x)\}.$$

A subset $X' \subseteq X$ of the graph $\tilde{G}$ is called a minimal
fuzzy dominating vertex set with the degree $\beta(X')$ if
the condition $\beta(X'') < \beta(X')$ holds for any subset
$X'' \subset X'$.

Let a set $\tau_k = \{X_{k1}, X_{k2}, \ldots, X_{kl}\}$ be given, where $X_{ki}$ is a
fuzzy dominating $k$-vertex set with the degree of
domination $\beta_{ki}$. We define
$\beta_k^{\min} = \max\{\beta_{k1}, \beta_{k2}, \ldots, \beta_{kl}\}$.
In the case $\tau_k = \emptyset$ we define $\beta_k^{\min} = \beta_{k-1}^{\min}$.
The value $\beta_k^{\min}$ means that the fuzzy graph $\tilde{G}$ includes a
$k$-vertex subgraph with the degree of domination $\beta_k^{\min}$ and
does not include a $k$-vertex subgraph with a degree of
domination greater than $\beta_k^{\min}$.

The fuzzy set
$\tilde{B}_X = \{\langle \beta_1^{\min}/1 \rangle, \langle \beta_2^{\min}/2 \rangle, \ldots, \langle \beta_n^{\min}/n \rangle\}$
is called the domination fuzzy set of the fuzzy graph $\tilde{G}$
(Bershtein & Bozhenuk, 2001a).

The fuzzy graph $\tilde{G}$ (Figure 1) has five minimal fuzzy
dominating vertex sets: $P_1 = \{x_1, x_2, x_3\}$ with the degree
of domination 1; $P_2 = \{x_1, x_3, x_4\}$ with the degree of
domination 0.6; $P_3 = \{x_2, x_3\}$ with the degree of domination
0.5; $P_4 = \{x_1, x_3\}$ with the degree of domination 0.2; and
$P_5 = \{x_2, x_4\}$ with the degree of domination 0.3. The
domination fuzzy set of the fuzzy graph $\tilde{G}$ is
$\tilde{B}_X = \{\langle 0/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle, \langle 1/4 \rangle\}$.

Let each vertex of the graph be assigned one of $k$ colors,
and let $X_i$ denote the subset of vertices that receive
color $i$. The value

$$L = \mathop{\&}_{i = \overline{1,k}} \alpha_i = \mathop{\&}_{i = \overline{1,k}} \Bigl(1 - \bigvee_{x, y \in X_i} \mu_G(x, y)\Bigr)$$

is called the separation degree of the fuzzy graph $\tilde{G}$ with
$k$ colors.

The fuzzy graph $\tilde{G}$ may be colored with any number of
colors from 1 to $n$, and the separation degree $L$
depends on the number of colors. With the fuzzy graph
$\tilde{G}$ we associate a family of fuzzy sets

$$\mathfrak{A} = \{\tilde{A}_G\}, \qquad \tilde{A}_G = \{\langle L_{\tilde{A}}(k)/k \rangle \mid k = \overline{1,n}\},$$

where $L_{\tilde{A}}(k)$ is the separation degree of the fuzzy
graph $\tilde{G}$ for a particular coloring with $k$ colors.

A fuzzy set $\tilde{\gamma} = \{\langle L_{\tilde{\gamma}}(k)/k \rangle \mid k = \overline{1,n}\}$ is called
the fuzzy chromatic set of the graph $\tilde{G}$ if the condition
$\tilde{A}_G \subseteq \tilde{\gamma}$ holds for any set $\tilde{A}_G \in \mathfrak{A}$, that is,
$(\forall \tilde{A}_G \in \mathfrak{A})(\forall k = \overline{1,n})[L_{\tilde{A}}(k) \le L_{\tilde{\gamma}}(k)]$ (Bershtein &
Bozhenuk, 2001b).

In other words, the fuzzy chromatic set gives the maximal
separation degree of the fuzzy graph $\tilde{G}$ with $k = 1,
2, \ldots, n$ colors.

For the fuzzy graph $\tilde{G}$ (Figure 1) the fuzzy chromatic
set is $\tilde{\gamma}(\tilde{G}) = \{\langle 0/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle\}$.

So the fuzzy graph $\tilde{G}$ may be colored:

- with one color, with the degree of separation 0. In
  other words, there is at least one pair of vertices $x_i$ and
  $x_j$ for which the membership function $\mu_U(x_i, x_j) = 1$.
  In our graph, these vertices are $x_4$ and $x_2$;
- with 2 colors, with the degree of separation 0.5
  (vertices $x_1$, $x_2$: first color; vertices $x_3$, $x_4$: second
  color). In other words, between vertices of the
  same color there are no edges with a membership
  degree greater than 0.5;
- with 3 colors, with the degree of separation 1 (vertices
  $x_1$, $x_3$: first color; vertex $x_2$: second color; vertex
  $x_4$: third color). In other words, between vertices
  of the same color there are no edges.
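
The domination degrees and separation degrees quoted for Figure 1 can be recomputed from the definitions. The following sketch is a hedged illustration only: the edge list is our reading of Example 1, the conjunction & is taken as the minimum and the disjunction as the maximum, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed reading of Example 1: (tail, head) -> membership degree.
EDGES = {(1, 2): 0.5, (2, 3): 0.6, (3, 4): 0.3, (4, 1): 0.2, (4, 2): 1.0}
VERTICES = [1, 2, 3, 4]


def mu(x, y):
    """Membership degree of the directed edge (x, y); 0.0 if the edge is absent."""
    return EDGES.get((x, y), 0.0)


def domination_degree(subset):
    """min over outside vertices y of max over x in subset of mu(y, x)."""
    outside = [y for y in VERTICES if y not in subset]
    return min((max(mu(y, x) for x in subset) for y in outside), default=1.0)


def domination_fuzzy_set():
    """Map each k to the best domination degree over all k-vertex subsets."""
    return {
        k: max(domination_degree(s) for s in combinations(VERTICES, k))
        for k in range(1, len(VERTICES) + 1)
    }


def separation_degree(coloring):
    """coloring is a list of vertex classes, one class per color."""
    degrees = []
    for cls in coloring:
        worst = max((mu(x, y) for x in cls for y in cls if x != y), default=0.0)
        degrees.append(1.0 - worst)
    return min(degrees)
```

With this edge list the three colorings listed in the text give 0, 0.5, and 1. The coloring `[[2], [1, 3, 4]]` gives min(1, 0.7) = 0.7, which is higher than the 0.5 quoted for two colors; this may point to a discrepancy in the printed example or in our reading of the edges, so the value for two colors is worth rechecking against the original figure.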

Fuzzy Hypergraph 

Let a fuzzy hypergraph $\tilde{H} = (X, \tilde{E})$ be given, where
$X = \{x_i\}$, $i \in I = \{1, 2, \ldots, n\}$, is a finite set and
$\tilde{E} = \{\tilde{e}_k\}$, $\tilde{e}_k = \{\langle \mu_{e_k}(x)/x \rangle\}$, $k \in K = \{1, 2, \ldots, m\}$,
is a family of fuzzy subsets of $X$ (Monderson & Nair, 2000;
Bershtein & Bozhenyuk, 2005). Thus the elements of the set $X$
are the vertices of the hypergraph, and the family $\tilde{E}$ is the
family of its fuzzy edges. The value $\mu_{e_k}(x) \in [0,1]$ is
the incidence degree of a vertex $x$ to an edge $e_k$.

A fuzzy hypergraph reduces to a fuzzy graph when
$1 \le |e_k| \le 2$ for all $k \in K$.

Vertices $x$ and $y$ are called fuzzy adjacent vertices
if there is an edge that includes both of them.
In this case the value

$$\mu(x, y) = \bigvee_{e_k \in E} \mu_{e_k}(x) \,\&\, \mu_{e_k}(y)$$

is called the adjacency degree of the two vertices $x$ and $y$ of
the fuzzy hypergraph $\tilde{H}$.

Two edges $e_i$ and $e_j$ are called fuzzy adjacent edges
if $e_i \cap e_j \ne \emptyset$. In this case the value

$$\mu(e_i, e_j) = \bigvee_{x \in e_i \cap e_j} \mu_{e_i \cap e_j}(x)$$

is called the adjacency degree of the edges $e_i$ and $e_j$.

A fuzzy hypergraph $\tilde{H} = (X, \tilde{E})$ is conveniently
represented by a fuzzy incidence matrix $\|r_{ij}\|_{n \times m}$, where
$r_{ij} = \mu_{e_j}(x_i)$. Consequently, any matrix whose elements
lie in the interval [0,1] may be regarded as the fuzzy
incidence matrix of some fuzzy hypergraph.

A fuzzy simple path $\tilde{C}(x_1, x_{q+1})$ of length $q$
is defined as the sequence

$$\tilde{C}(x_1, x_{q+1}) = (x_1, \mu_{e_1}(x_1), e_1, \mu_{e_1}(x_2), x_2, \mu_{e_2}(x_2), e_2, \ldots, e_q, \mu_{e_q}(x_{q+1}), x_{q+1}),$$

where all vertices $x_1, \ldots, x_{q+1} \in X$ and all edges
$e_1, \ldots, e_q \in E$ are different.

The strength of a fuzzy simple path is the weakest
of the adjacency degrees included in the path
$\tilde{C}(x_1, x_{q+1})$. If two vertices $x_1$ and $x_{q+1}$ are connected
by paths $C_1, C_2, \ldots, C_t$ with strengths $\mu_1, \mu_2, \ldots, \mu_t$, the
vertices $x_1$ and $x_{q+1}$ are said to be fuzzy connected with the
strength $\mu(x_1, x_{q+1}) = \mu_1 \vee \mu_2 \vee \ldots \vee \mu_t$.

The internal stability degree of a vertex subset $X'$ of the
fuzzy hypergraph $\tilde{H}$ is determined as

$$\alpha_{X'} = 1 - \max_{x, y \in X'} \mu(x, y).$$

A subset $X' \subseteq X$ is called a maximal fuzzy internally
stable set with the degree of internal stability $\alpha_{X'}$ if the
statement $(\forall X'' \supset X')(\alpha_{X''} < \alpha_{X'})$ is true.

Let us colour each vertex $x \in X$ of the hypergraph $\tilde{H}$ in
one of $k$ colours ($1 \le k \le n$) and consider the subset
$X_i$ of vertices that share colour $i$.

The value

$$L = \mathop{\&}_{i = \overline{1,k}} \alpha_i = \mathop{\&}_{i = \overline{1,k}} \Bigl(1 - \bigvee_{x, y \in X_i} \mu(x, y)\Bigr)$$

is called the separation degree of the fuzzy hypergraph $\tilde{H}$
for this $k$-colouring (Bershtein, Bozhenuk & Rozenberg,
2005).

The fuzzy hypergraph $\tilde{H}$ can be coloured with any number
$k$ of colours, and the separation degree $L$ depends on
that number. With the fuzzy hypergraph $\tilde{H}$ we associate the
family of fuzzy sets

$$\mathfrak{A} = \{\tilde{A}_H\}, \qquad \tilde{A}_H = \{\langle L(k)/k \rangle \mid k = \overline{1,n}\},$$

where $L(k)$ is the separation degree of the fuzzy
hypergraph $\tilde{H}$ for a particular $k$-colouring.

The fuzzy set $\tilde{\gamma} = \{\langle L_{\tilde{\gamma}}(k)/k \rangle \mid k = \overline{1,n}\}$ is called
the fuzzy chromatic set of the hypergraph $\tilde{H}$ if, for any
other set $\tilde{A}_H \in \mathfrak{A}$, it is true that $\tilde{A}_H \subseteq \tilde{\gamma}$; in other words,
$(\forall \tilde{A}_H \in \mathfrak{A})(\forall k = \overline{1,n})[L(k) \le L_{\tilde{\gamma}}(k)]$. Thus the
fuzzy chromatic set of the hypergraph $\tilde{H}$ gives the
greatest separation degrees attainable when its vertices
are coloured with 1, 2, ..., $n$ colours.

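Let $\tilde{H}$ be a fuzzy hypergraph on the vertices $x_1, \ldots, x_6$
whose incidence matrix is given.

[Incidence matrix: the row and column layout is not recoverable
from the source. The surviving entries, in reading order, are:
1,8 (probably 0,8); 0,4; 0,5; 1; 0,3; 0,7; 0,6; 0,4; 0,2; 1; 0,7; 1; 0,4.]

The fuzzy chromatic set of this fuzzy hypergraph is

$$\tilde{\gamma} = \{\langle 0.2/1 \rangle, \langle 0.5/2 \rangle, \langle 1/3 \rangle\}.$$

In other words, the fuzzy hypergraph may be coloured
with one colour with the degree of separation 0.2; with 2
colours with the degree of separation 0.5 (vertices $x_2$, $x_3$,
and $x_6$: first colour; vertices $x_1$, $x_4$, and $x_5$: second colour);
and with 3 colours with the degree of separation 1 (vertices $x_2$
and $x_4$: first colour; vertex $x_1$ and two further vertices:
second colour; the remaining vertex: third colour; the
subscripts of these last three vertices are illegible in the
source).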
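
The adjacency degree, internal stability degree, and separation degree of a fuzzy hypergraph follow directly from its incidence matrix. The sketch below is a minimal illustration under stated assumptions: the 3 x 2 incidence matrix is hypothetical (it is not the matrix of the example above), & is taken as the minimum, the disjunction as the maximum, and the function names are our own.

```python
# Hypothetical incidence matrix: INCIDENCE[i][k] is the degree to which
# vertex i belongs to edge k (3 vertices, 2 fuzzy edges).
INCIDENCE = [
    [0.8, 0.0],
    [1.0, 0.3],
    [0.0, 0.6],
]


def adjacency_degree(inc, x, y):
    """max over edges of min(incidence of x, incidence of y)."""
    return max(min(a, b) for a, b in zip(inc[x], inc[y]))


def internal_stability(inc, subset):
    """1 minus the largest adjacency degree between two vertices of subset."""
    worst = max(
        (adjacency_degree(inc, x, y) for x in subset for y in subset if x < y),
        default=0.0,
    )
    return 1.0 - worst


def separation_degree(inc, colouring):
    """Smallest internal stability degree over the colour classes."""
    return min(internal_stability(inc, cls) for cls in colouring)
```

For this hypothetical matrix a single colour gives a separation degree of about 0.2, because vertices 0 and 1 share the first edge with degree 0.8, while the colouring `[[0, 2], [1]]` separates every adjacent pair and gives 1.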
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FUTURE TRENDS 

In accordance with L. Zadeh's principle of generalization,
the theory of fuzzy graphs and fuzzy hypergraphs will
develop alongside the theories of nonfuzzy graphs,
hypergraphs, and fuzzy sets.



CONCLUSION 

When we consider fuzzy graphs and fuzzy hypergraphs,
it is possible to relate any set of vertices and
edges to a family of partial graphs and hypergraphs
with a given property: for example, a sequence of edges
to a family of graph paths, or a sequence of vertices and
edges to a family of bipartite graphs. This makes it
possible to define new properties of fuzzy graphs and
hypergraphs and to use them in the analysis and synthesis
of fuzzy systems.
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KEY TERMS 

Binary Relation: A binary relation R from a set A
to a set B is a subset of A × B.

Binary Symmetric Relation: A relation R on a set
A is symmetric if, for all x, y ∈ A, xRy implies yRx.

Fuzzy Set: A generalization of the definition of
the classical set. A fuzzy set is characterized by a
membership function, which maps the members of the
universe into the unit interval, thus assigning to
elements of the universe degrees of belongingness with
respect to a set.
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Graph: A graph G = (V, E) is a mathematical 
structure consisting of two finite sets V and E. The 
elements of V are called vertices (or nodes), and the 
elements of E are called edges. Each edge has a set of 
one or two vertices associated to it, which are called 
its endpoints. 

Graph Invariant: A property of a graph that is 
preserved by isomorphisms. 

Isomorphic Graphs: Two graphs that have a struc- 
ture-preserving vertex bijection between them. 

Hypergraph: A hypergraph on a finite set
$X = \{x_1, x_2, \ldots, x_n\}$ is a family $H = \{E_1, E_2, \ldots, E_m\}$ of subsets
of $X$ such that $E_i \ne \emptyset$ ($i = 1, \ldots, m$) and $\bigcup_{i=1}^{m} E_i = X$.




Membership Function: The membership function 
of a fuzzy set is a generalization of the characteristic 
function of crisp sets. 

Multiarity Relation: A multiarity relation R
between elements of sets A, B, ..., C is a subset of
A × B × ... × C.
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INTRODUCTION 

Ever since Zadeh established the basis of fuzzy logic
in his famous article Fuzzy Sets (Zadeh, 1965), a
growing number of research areas have used his
technique to model and solve problems, above all in
control systems. This proliferation is largely
due to its good results in classifying the ambiguous
information that is typical of complex systems. The
success has been such that fuzzy logic can be found in
many industrial developments of the last decade:
control of the Sendai train (Yasunobu & Miyamoto, 1985),
control of air-conditioning systems, washing machines,
auto-focus in cameras, industrial robots, etc. (Shaw, 1998).

Fuzzy logic has also been applied to computerized
image analysis (Bezdek & Keller & Krishnapuram &
Pal, 1999) because of its particular virtues: high
insensitivity to noise and the ability to handle
multidimensional information easily (Sutton & Bezdek &
Cahoon, 1999), features that are present in most digital
image analyses. Within fuzzy logic, the techniques most
often applied to image analysis have been fuzzy clustering
algorithms, ever since Bezdek proposed them in the
seventies (Bezdek, 1973). These techniques have evolved
continuously to correct the problems of the initial
algorithms and to obtain a better classification:
techniques for a better initialization of the algorithms,
and algorithms that allow the solution to be evaluated
by means of validity functions. The classification
mechanism was also improved by modifying the membership
function of the algorithm so that it behaves adaptively;
more recently, kernel functions were applied to the
calculation of memberships (Zhong & Wei & Jian, 2003).

At present, applications of fuzzy logic are found in
nearly all fields of computer science, and it constitutes
one of the most promising branches of Artificial
Intelligence from both a theoretical and a commercial
point of view. Proof of this evolution is the development
of intelligent systems based on fuzzy logic.

This article presents several fuzzy clustering
algorithms applied to medical image analysis. We also
include the results of a study that uses biomedical images
to illustrate the concepts and techniques discussed.



BACKGROUND 

Fuzzy logic is an extension of traditional binary
logic to a multi-valued logic, which describes domains
in much greater detail and classifies better by searching
a more extensive area. Fuzzy logic makes it possible to
model the real world more faithfully: whereas binary
logic merely allows us to state that a coffee is
hot or cold, fuzzy logic allows us to distinguish
between a whole range of temperatures: very
hot, lukewarm, cold, very cold, etc.
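
The coffee example can be made concrete with membership functions. The following sketch is illustrative only: the triangular shape, the three linguistic terms, and their breakpoints in degrees Celsius are assumptions chosen for the example, not values taken from the study.

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms for coffee temperature (degrees Celsius).
TERMS = {
    "cold": (0.0, 15.0, 35.0),
    "lukewarm": (25.0, 40.0, 55.0),
    "hot": (45.0, 70.0, 100.0),
}


def fuzzify(temperature):
    """Degree of membership of a temperature in each linguistic term."""
    return {term: triangular(temperature, *abc) for term, abc in TERMS.items()}
```

With these assumed breakpoints, a coffee at 30 degrees is partly "cold" (0.25) and partly "lukewarm" (about 0.33) at the same time, which a binary hot/cold classification cannot express.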

Techniques based on fuzzy logic have proven to be
very useful for dealing with the ambiguity and vagueness
normally associated with digital image analysis.
At what grey level do we set the threshold? Where
do we locate the edge of a blurred object? When is a
grey level high, low, or average?

The fuzzy processing of digital images can be
considered a completely different approach from
traditional computer vision techniques. It was not
developed to solve a specific problem; rather, it
describes a new class of image processing techniques and
a new methodology for developing them: fuzzy edge
detectors, fuzzy geometric operators, fuzzy morphological
operators, etc.

These features make fuzzy logic especially useful
for developing algorithms that improve medical image
analysis, because it provides a framework for the
representation of knowledge that can be used in any
phase of the analysis (Wu & Agam & Roy & Armato, 2004;
Vermandel & Betrouni & Taschner & Vasseu & Rosseau, 2007).



FUZZY CLUSTERING ALGORITHMS 
APPLIED TO BIOMEDICAL IMAGE 
ANALYSIS 

Medical imaging systems use a series of sensors that
detect the features of the tissues and the structure of
the organs. Depending on the technique used, this yields
a great amount of information and images of the area
from different angles. These virtues have made them one
of the most popular support techniques in diagnosis, and
have given rise to the current spread and variety of
medical imaging modalities (X-rays, PET, ...) and to new
modalities that are being developed (fMRI).

The complexity of segmenting biomedical images stems
from their characteristics: the large amount of data to
be analyzed, the loss of information in the transition
from a 3D body to a 2D representation, the great
variability and complexity of the shapes to be analyzed,
and so on. One of the most frequently applied approaches
to segmenting medical images is the use of pattern
recognition techniques, since the purpose of analyzing a
medical digital image is normally the detection of a
particular element or object: tumors, organs, etc.

Of all these techniques, fuzzy clustering has proven
to be among the most powerful, because it allows us to
use several features of the dataset, each with its own
dimensionality, and to partition the data; it also works
automatically and usually has low computational
requirements. If the segmentation problem is defined as
the partition of the image into regions that share a
common feature, fuzzy clustering algorithms carry out this
partition with a set of exemplary elements, called
centroids, and produce a matrix that has the size of the
original image and one layer for each cluster into which
the image was divided. This matrix indicates the
membership of each pixel to each cluster and serves as
the basis for detecting each element.

In the next section we present a series of fuzzy
clustering algorithms that can be considered to reflect
the evolution in this field and its various viewpoints.
Finally, these algorithms will be used in a study that
shows the use and possibilities of fuzzy logic in the
analysis of biomedical images.

Fuzzy C-Means (FCM) 

The FCM algorithm was developed by Bezdek (Bezdek,
1973) and is one of the earliest fuzzy clustering
algorithms; it initially needs the number of clusters
into which the image will be divided and a sample of each
cluster. The steps of this algorithm are the following:

1. Calculation of the membership of each element
to each cluster:

$$u_k(i,j) = \left[ \sum_{l=1}^{C} \left( \frac{\| y(i,j) - v_k \|}{\| y(i,j) - v_l \|} \right)^{\frac{2}{m-1}} \right]^{-1} \qquad (1)$$

2. Calculation of the new centroids of the image:

$$v_k = \frac{\sum_{i,j} u_k(i,j)^m \, y(i,j)}{\sum_{i,j} u_k(i,j)^m} \qquad (2)$$

3. If the error stays below a given threshold,
stop. Otherwise, return to step 1.

Here $y(i,j)$ is the feature vector of the pixel at position
$(i,j)$, $v_k$ is the centroid of cluster $k$, $C$ is the number of
clusters, and $m > 1$ is the fuzziness exponent.

The parameters varied in the analysis of the algorithm
were the samples provided and the value of m.
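
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: it assumes the standard FCM update rules, treats each pixel as a tuple of features, uses the largest centroid shift as the "error" of step 3, and the function and parameter names are hypothetical.

```python
import math


def fcm(pixels, centroids, m=2.0, tol=1e-4, max_iter=100):
    """Fuzzy c-means on a list of feature tuples.

    centroids holds one initial sample per cluster. Returns the final
    centroids and the memberships computed in the last pass.
    """
    centroids = [tuple(c) for c in centroids]
    memberships = []
    for _ in range(max_iter):
        # Step 1: membership of each pixel to each cluster.
        memberships = []
        for y in pixels:
            dists = [math.dist(y, v) for v in centroids]
            if 0.0 in dists:
                # The pixel coincides with a centroid: share full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [
                    1.0 / sum((dk / dl) ** (2.0 / (m - 1.0)) for dl in dists)
                    for dk in dists
                ]
            memberships.append(row)
        # Step 2: new centroids as membership-weighted means.
        new_centroids = []
        for k in range(len(centroids)):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_centroids.append(tuple(
                sum(w * y[d] for w, y in zip(weights, pixels)) / total
                for d in range(len(pixels[0]))
            ))
        # Step 3: stop when no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, memberships
```

For grey-level images each pixel is a one-element tuple; for color images it would be an (R, G, B) tuple. The sketch does not guard against a cluster whose weights all vanish, which a production version would need to handle.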

Fuzzy K-Nearest Neighbour (FKNN) 

The Fuzzy K-Nearest Neighbour algorithm (Givens Jr. &
Gray & Keller, 1992) is, as its name indicates, a fuzzy
variant of a hard segmentation algorithm. It needs to know
the number of classes into which the set to be
classified will be divided.

The element to be classified is assigned to the class
of the nearest sample among the K most similar ones.
These K most similar samples are known as "neighbours":
if the neighbours are ordered from most to least similar,
the destination class of the element is the class of the
neighbour that comes first in the list.

We use the expression in Equation 3 to calculate
the membership factors of the pixel to the considered
clusters:

$$u_i(x) = \frac{\sum_{j=1}^{K} u_{ij} \, \| x - x_j \|^{-2/(m-1)}}{\sum_{j=1}^{K} \| x - x_j \|^{-2/(m-1)}} \qquad (3)$$

where $u_{ij}$ represents the membership factor of the j-th
sample to the i-th class; $x_j$ represents one of the K samples
that are most similar to the treated pixel; $x$ represents
the pixel itself; $m$ is a weight factor for the distance
between the pixel and the samples; and $u_i(x)$ represents
the level of membership of the pixel $x$ to class $i$.

During the analysis of this algorithm, the parameters
varied were the samples provided as initial centroids
and the number of neighbours considered.
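
The membership calculation can be sketched as follows. This is a hedged illustration, assuming the usual inverse-distance weighting of the fuzzy K-nearest-neighbour rule; the function name, the argument layout, and the Euclidean distance are assumptions made for the example.

```python
import math


def fknn_memberships(x, samples, sample_memberships, k=3, m=2.0):
    """Membership of pixel x in each class from its k nearest samples.

    samples: labelled feature tuples.
    sample_memberships[j][i]: membership of sample j in class i.
    """
    # The k samples most similar to x (smallest Euclidean distance).
    nearest = sorted(range(len(samples)), key=lambda j: math.dist(x, samples[j]))[:k]
    weights = []
    for j in nearest:
        d = math.dist(x, samples[j])
        if d == 0.0:
            # x coincides with a sample: reuse that sample's memberships.
            return list(sample_memberships[j])
        weights.append(1.0 / d ** (2.0 / (m - 1.0)))
    total = sum(weights)
    n_classes = len(sample_memberships[0])
    return [
        sum(w * sample_memberships[j][i] for w, j in zip(weights, nearest)) / total
        for i in range(n_classes)
    ]
```

When the samples are crisply labelled, `sample_memberships` contains only zeros and ones, and the result is the share of inverse-distance weight that each class receives among the K neighbours.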

Modified Fuzzy C-Means 

This algorithm is based on the work of Young Won Lim
and Sang Uk Lee (Lee & Lim, 1990), who describe an
algorithm for the segmentation of color images through
the study of the histograms of each color band. It
also relies on the fuzzy c-means classification
algorithm.

The MFCM consists of two parts:

1. A hard part that studies the histograms of an
image in order to obtain the number of classes,
and carries out a first global classification of the
image; and

2. A fuzzy part that classifies the pixels whose class
is more difficult to determine. The area formed by
these pixels is called the "fuzzy zone".

Once the initial clusters and their centroids have been
obtained, the algorithm uses the FCM membership function
(Eq. 1) to classify the pixels. The fuzzy points are the
pixels that lie between the initial clusters and the
pixels of clusters that are too small to be considered.

Since we do not have labeled samples of each
class, we use the centers of gravity of the clusters to
calculate the membership factors of a pixel.

During the analysis of this algorithm, we varied the
value of the sigma used to smooth the histogram, the
area that the initial clusters need in order to survive,
and the safety areas around the clusters.



Kernelized Fuzzy C-Means (KFCM) 

This algorithm was proposed by Wu Zhong-Dong et
al. (Zhong & Wei & Jian, 2003) and is based on FCM,
integrated with a kernel function that maps the data
into a higher-dimensional space, which makes it easier
to separate the clusters.

The most commonly used kernel functions are polynomial
functions (Eq. 4) and radial basis functions (Eq. 5).

$$K(X, Y) = \phi(X) \cdot \phi(Y) = (X \cdot Y + b)^d \qquad (4)$$

$$K(X, Y) = \phi(X) \cdot \phi(Y) = \exp\left( -\frac{(X - Y)^2}{2\sigma^2} \right) \qquad (5)$$

The algorithm consists of the following steps:

1. Calculation of the membership function:

$$u_{jk} = \frac{\left( 1 / d^2(x_j, v_k) \right)^{1/(q-1)}}{\sum_{l=1}^{C} \left( 1 / d^2(x_j, v_l) \right)^{1/(q-1)}} \qquad (6)$$

where

$$d^2(x_j, v_k) = K(x_j, x_j) - 2K(x_j, v_k) + K(v_k, v_k)$$

2. Calculation of the new kernel matrix $K(x_j, \hat{v}_k)$
and $K(\hat{v}_k, \hat{v}_k)$:

$$K(x_j, \hat{v}_k) = \phi(x_j) \cdot \phi(\hat{v}_k) = \frac{\sum_{i=1}^{N} u_{ik}^q \, K(x_i, x_j)}{\sum_{i=1}^{N} u_{ik}^q} \qquad (7)$$

where

$$\phi(\hat{v}_k) = \frac{\sum_{i=1}^{N} u_{ik}^q \, \phi(x_i)}{\sum_{i=1}^{N} u_{ik}^q}$$

3. Update the memberships $u_{jk}$ to $\hat{u}_{jk}$ by means of
Equation 6.

4. If the error stays below a given threshold,
stop. Otherwise, return to step 1.

The only parameter varied in the analysis of this
algorithm was the set of initial samples.
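
One iteration of the kernelized update can be sketched as follows. This is an illustrative reading of Equations 6 and 7, not the authors' code: it assumes a radial basis kernel, expands K(v_k, v_k) from the same weighted sum that defines the centroid in feature space, and the names `rbf_kernel` and `kfcm_step` are hypothetical.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel (Equation 5)."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def kfcm_step(points, memberships, q=2.0, kernel=rbf_kernel):
    """One kernel fuzzy c-means iteration; memberships[j][k] is u_jk."""
    n, c = len(points), len(memberships[0])
    gram = [[kernel(a, b) for b in points] for a in points]
    dist2 = [[0.0] * c for _ in range(n)]
    for k in range(c):
        w = [memberships[i][k] ** q for i in range(n)]
        total = sum(w)
        # K(v_k, v_k): the centroid is a weighted mean in feature space.
        kvv = sum(w[i] * w[j] * gram[i][j] for i in range(n) for j in range(n)) / total ** 2
        for j in range(n):
            # K(x_j, v_k) as in Equation 7.
            kxv = sum(w[i] * gram[i][j] for i in range(n)) / total
            # Squared feature-space distance, clamped against rounding noise.
            dist2[j][k] = max(gram[j][j] - 2.0 * kxv + kvv, 0.0)
    updated = []
    for j in range(n):
        if any(d == 0.0 for d in dist2[j]):
            hits = [1.0 if d == 0.0 else 0.0 for d in dist2[j]]
            updated.append([h / sum(hits) for h in hits])
        else:
            inv = [(1.0 / d) ** (1.0 / (q - 1.0)) for d in dist2[j]]
            updated.append([v / sum(inv) for v in inv])
    return updated
```

The centroids are never computed explicitly: only kernel values between data points are needed, which is what allows the clustering to take place in the higher-dimensional space. Repeating `kfcm_step` until the memberships stop changing corresponds to steps 1 to 4.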
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Images Used in the Study 

For the selection of the images used in the study
(Gonzalez & Woods, 1996), we applied traditional image
processing techniques and used the histogram as the
basic tool. See Figure 1.

We observed that the pictures presented a high level
of variation, because it was not possible to standardize
the different elements that have a determining effect on
them: position of the patient, luminosity, etc. We selected
the pictures on the basis of a characteristic trait (bad
lighting, presence of foreign objects, etc.) or of their
"normality" (correct lighting, good contrast, etc.). The
images were digitized to a size of 500x500 pixels and
24 bits of color per pixel, using a standard scanner.

The histograms of Figure 1 show some of the
characteristics that were present in most photographs.
The bands with the largest number of pixels are red
and green, because of the color of the skin and the fact
that green is normally used for medical fabrics.

The histogram is continuous and presents values at
most levels, which leads us to suppose that the value
of most points is determined by the combination of
the three bands rather than by a single band, as was to
be expected. This complicates the analysis of the image
with algorithms.




Figure 1. Photograph that was used in the study, and histogram for each color band 
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Results 

The test images were divided into 3 clusters:
background, healthy tissue, and burned tissue. These areas
are clearly distinguished by the specialist, which allows
us to build better masks for evaluating the success rate
of pixel detection applied to burn wounds.

The success rate of the fuzzy clustering algorithms
was first measured with Zhang's RUMA (Relative
Ultimate Measurement Accuracy) (Zhang, 1996). The
purpose of RUMA is to measure the quality of the
segmentation in terms of the similarity between the
measurements carried out on the segmented image and on
the real image (Eq. 8).

$$\mathrm{RUMA}_f = \frac{|R_f - S_f|}{R_f} \times 100\% \qquad (8)$$

Here $R_f$ is the value of the feature $f$ measured on the real
(reference) image and $S_f$ is the value measured on the
segmented image.

In our study, we measured the success rate by
counting the pixels of the burned area in the result
image that coincided with pixels of the burned area in
the mask.
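
Both measurements are easy to express in code. The sketch below is an assumption-laden illustration: `ruma` follows Equation 8 with a measured feature value (for example, an area in pixels), while `burned_pixel_hit_rate` is only one plausible reading of the pixel comparison described above; the function names and the flat label lists are hypothetical.

```python
def ruma(reference_value, segmented_value):
    """Relative ultimate measurement accuracy in percent (lower is better)."""
    return abs(reference_value - segmented_value) / reference_value * 100.0


def burned_pixel_hit_rate(mask, result, burn_label):
    """Percentage of the mask's burn pixels that the result also labels as burn.

    mask and result are flat sequences of class labels of equal length.
    """
    burn = [(m, r) for m, r in zip(mask, result) if m == burn_label]
    hits = sum(1 for _, r in burn if r == burn_label)
    return 100.0 * hits / len(burn)
```

For instance, if the burned area covers 200 pixels in the mask and 180 pixels in the segmented image, `ruma(200, 180)` yields 10 percent.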



Figure 2. Best results for the RUMA and global measurements for the FKNN algorithm (a) and the MFCM
algorithm (b)

We also applied a second success rate measurement,
because although RUMA provides a value for the area of
interest, it may not detect certain classification errors
that can affect the resulting image. We use a measure
that was developed by our research team and that
evaluates the clustering algorithm's performance in
classifying all the pixels of the image (Eq. 9). In
developing the measure, we supposed that the overall
error would be smaller if the classification error of
each cluster were smaller, so we measured the error in
the pixel classification of each cluster and weighed it
against the number of pixels of that cluster.

$$\mathrm{error} = \sum_{j=1}^{n} \frac{\sum_{i \ne j} F_{ij}}{MASC_j} \qquad (9)$$

$F_{ij}$ is the number of pixels that belong to cluster
$j$ and were assigned to cluster $i$, $MASC_j$ is the total
number of pixels that belong to class $j$, and $n$ is the
number of clusters into which the image was divided.
The value of this measurement lies between 0 and $n$; in
order to simplify its interpretation, it was normalized
to lie between 0 and 1.
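
A minimal sketch of this global measurement follows. It assumes that the error sums, for every class of the mask, the fraction of its pixels that were assigned to another cluster, and that the normalization is a division by the number of clusters; both points are our reading of the description, and the function name is hypothetical.

```python
def global_error(mask, result, n_clusters):
    """Normalized classification error over all clusters, in [0, 1].

    mask and result are flat sequences of class labels 0 .. n_clusters - 1.
    """
    total = 0.0
    for j in range(n_clusters):
        assigned = [r for m, r in zip(mask, result) if m == j]
        if not assigned:
            continue  # a class absent from the mask contributes no error
        misclassified = sum(1 for r in assigned if r != j)
        total += misclassified / len(assigned)
    return total / n_clusters
```

Because each class is weighted by its own size, a small burned area that is entirely misclassified raises the error as much as a large background region would, which a plain pixel accuracy would hide.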

The charts are simplified by inverting the
discrepancy values: the higher the value, the better the
result.



Figure 2(a) shows the best results for the FKNN
algorithm for both measurements, varying the number of
samples and neighbours from 1 to 8 samples per cluster.

Figure 2(b) shows the results for the MFCM algorithm
for both measurements, varying the sigma and the
threshold required for each area in the histogram.

The FCM and KFCM algorithms (the latter labeled FKCM
in the figures) are not detailed, because the parameters
varied were the provided samples and the stop threshold,
and the results were rather stable for both measurements.
Figure 3 shows the image labeled Q1 and one of the
results obtained for it with the FCM algorithm.

Figure 4(a) shows the results of the RUMA measurement
for the various images of the test set and all the
algorithms; Figure 4(b) shows the results of the global
measurement.

The tests reveal great variation in the values that
each measurement provides for the different algorithms;
this is due to the lack of homogeneous conditions in the
acquisition of the images and the ensuing differences
in photographic quality.

We can also observe that the results obtained with
KFCM are considerably better than those obtained with
FCM, because the former uses a better function to
calculate the pixel membership. Nevertheless, for most
pictures the good results of KFCM are surpassed
by the FKNN and MFCM algorithms. In the case of
FKNN, this is due to its capacity to use several samples
for each cluster, which allows a more exact calculation
of the memberships and a lower probability of error.
MFCM, on the other hand, carries out a prior analysis of
the histogram, which enables it in most cases to find good
centroids and make good classifications.

Figure 3. Image labeled Q1 (left) and one of the results obtained for the FCM algorithm (right)

Figure 4. Best results for the burned area using: RUMA measurement (a) and global success rate measurement
(b). [Bar charts for images Q1 to Q6 on a scale from 0 to 100; series: FCM, FKCM, FKNN, MFCM, SRFCM.]

Even though the FKNN algorithm obtains better
results, in most cases it requires a high number of
samples (more than 4), which may burden the medical
expert and complicate its deployment in real clinical
environments. This problem does not apply to the
MFCM algorithm, which calculates the samples itself;
however, its success values vary greatly, and for many
images we had to fine-tune the parameters in order to
obtain good results.



FUTURE TRENDS 

Fuzzy logic is a field that evolves continuously
and is increasingly applied to industrial products.

Medical image analysis is among the most active
fields in computer vision and represents an
important challenge to researchers in search of new
technological developments.

Fuzzy clustering algorithms constitute one of the
most useful and interesting branches of fuzzy logic.
Their use is expected to increase, and new algorithms
will appear that provide ever better results. These
algorithms will be applied more and more often to
medical images, where they allow us to handle
new multidimensional modalities and improvements.



CONCLUSION 

This article presents the results obtained by various 
fuzzy clustering algorithms in analyzing a set of burn 
wound pictures. The studied techniques obtain a high 
level of detection in the burned area and as such show 
their capacity to analyse this type of medical image. 
Testing, however, reveals a high degree of variation 
in the values provided by each algorithm, due to the 
absence of homogeneous conditions during image 
acquisition and the ensuing differences in the quality 
of the pictures. 

This study shows that the FKCM algorithm provides 
the best results with the fewest parameters. 
However, if we could control the context in which the 
photographs are taken, the best algorithm would be 
MFCM, which provides better results and operates 
automatically. 

We also review the state of the art in the field of 
fuzzy logic and clustering algorithms, in order to 
show the characteristics of these techniques and their 
possibilities. 
KEY TERMS 

Fuzzification: The process of decomposing a sys- 
tem input and/or output into one or more fuzzy sets. 
Many types of curves can be used, but triangular or 
trapezoidal shaped membership functions are the most 
common. 

Fuzzy Algorithm: An ordered sequence of instruc- 
tions which may contain fuzzy assignments, condi- 
tional statements, repetitive statements, and traditional 
operations. 



Fuzzy Inference Systems: A sequence of fuzzy 
conditional statements which may contain fuzzy as- 
signment and conditional statements. The execution of 
such instructions is governed by the compositional rule 
of inference and the rule of preponderant alternative. 

Fuzzy Operator: Operations that enable us to com- 
bine fuzzy sets. A fuzzy operator combines two fuzzy 
sets to give a new fuzzy set. The most frequently used 
fuzzy operators are the following: equality, contain- 
ment, complement, intersection and union. 

Medical Image: A medical specialty that uses x- 
rays, gamma rays, high-frequency sound waves, and 
magnetic fields to produce images of organs and other 
internal structures of the body. In diagnostic radiology 
the purpose is to detect and diagnose disease, whereas 
in interventional radiology, imaging procedures are 
combined with other techniques to treat certain diseases 
and abnormalities. 

Membership Function: Gives the grade, or degree, 
of membership within the fuzzy set, of any element of 
the universe of discourse. The membership function 
maps the elements of the universe onto numerical 
values in the interval [0, 1]. 

Segmentation: A process that partitions a digital 
image into disjoint (non-overlapping) regions, using 
a set of features or characteristics. The output of the 
segmentation step is usually a set of classified elements, 
such as tissue regions or tissue edges. 
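
The membership function and fuzzy operator entries above can be illustrated with a minimal Python sketch. The function names, the grey-level sets "dark" and "bright" and their breakpoints are assumptions made only for illustration; the complement, intersection and union are implemented with the usual 1 - mu, min and max definitions, which is one common choice among several.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], is 1 on [b, c], falls on [c, d].

    Requires a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def fuzzy_complement(mu):
    """Standard complement of a membership degree."""
    return 1.0 - mu


def fuzzy_intersection(mu_a, mu_b):
    """Intersection of two membership degrees using the minimum."""
    return min(mu_a, mu_b)


def fuzzy_union(mu_a, mu_b):
    """Union of two membership degrees using the maximum."""
    return max(mu_a, mu_b)


# Hypothetical fuzzy sets over grey levels 0..255 (illustrative breakpoints).
grey = 100
mu_dark = trapezoidal(grey, -1, 0, 80, 140)    # 40 / 60
mu_bright = triangular(grey, 80, 160, 240)     # 20 / 80
print(mu_dark, mu_bright, fuzzy_union(mu_dark, mu_bright))
```

Fuzzification of a crisp input then amounts to evaluating every membership function of the variable at that input, which yields one degree in [0, 1] per fuzzy set.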






Fuzzy Logic Estimator for Variant SNR 
Environments 




Rosa Maria Alsina Pages 

Universitat Ramon Llull, Spain 

Claudia Mateo Segura 

Universitat Ramon Llull, Spain 

Joan-Claudi Socoro Carrie 

Universitat Ramon Llull, Spain 



INTRODUCTION 

The acquisition system is one of the most sensitive 
stages in a Direct Sequence Spread Spectrum (DS-SS) 
receiver (Peterson, Ziemer & Borth, 1995), because 
correct demodulation of the received information 
depends on it. There are several schemes to deal with 
this problem, such as serial search and parallel 
algorithms (Proakis, 1995). Serial search algorithms 
converge slowly but their computational load 
is very low; parallel systems, on the other hand, 
converge very quickly but their computational load is very 
high. In our system, the acquisition scheme used is the 
multiresolutive structure presented in (Moran, Socoro, 
Jove, Pijoan & Tarres, 2001), which combines quick 
convergence and low computational load. 

The decisional system that evaluates the acquisition 
stage is a key process in the overall system 
performance, and it is a weak point of the structure. This 
becomes more important when dealing with time-varying 
channels, where the signal to noise ratio (SNR) is not a 
constant parameter. Several factors contribute to the 
performance of the acquisition system (Glisic & Vucetic, 
1997): channel distortion and variations, noise and 
interference, uncertainty about the code phase, and data 
randomness. The existence of all these variables led 
us to consider using fuzzy logic 
to solve this complex acquisition estimation (Zadeh, 
1973). A fuzzy logic acquisition estimator had already 
been tested and used in our research group to control 
a serial search algorithm (Alsina, Moran & Socoro, 
2005) with encouraging results, and afterwards in the 
multiresolutive scheme (Alsina, Mateo & Socoro, 
2007). Other applications to this field can be found in 
the literature, such as (Bas, Perez & Lagunas, 2001) or 
(Jang, Ha, Seo, Lee & Lee, 1998). Several previous works 
have focused on the development of acquisition 
systems for non-frequency-selective channels with fast 
SNR variations (Moran, Socoro, Jove, Pijoan & Tarres, 
2001) (Mateo & Alsina, 2004). 



BACKGROUND 

In 1964, Dr. Lotfi Zadeh came up with the term fuzzy 
logic (Zadeh, 1965). The reason was that traditional 
logic could not answer some questions with a simple 
yes or no, so fuzzy logic handles the concept of partial 
truth. Fuzzy logic is one of the possible ways to imitate 
the working of a human brain, and so to try to turn 
artificial intelligence into real intelligence. Zadeh 
devised the technique as a method to solve problems for 
soft sciences, in particular those that involve human 
interaction. 

Fuzzy logic has proved to be a good option 
for control in very complex processes, when it is not 
possible to produce a mathematical model. Fuzzy 
logic is also advisable for highly non-linear processes 
and, above all, when it is desirable to incorporate expert 
knowledge. It is not a good idea to apply it, however, if 
traditional control or estimators give satisfactory 
results, or for problems that can be modelled 
mathematically. 

The most recent works in control and estimation 
using fuzzy logic applied to direct sequence spread 
spectrum communication systems can be classified into 
three types. The first group uses fuzzy logic to improve 
the detection stage of the DS-CDMA (see endnote 1) 
receiver, with works presented by Bas et al and Jang et al 
(Bas, Perez, & Lagunas, 2001) (Jang, Ha, Seo, Lee, & Lee, 
1998). The second group uses fuzzy logic to improve 
interference rejection, with works presented by Bas et al 
and by Chia-Chang et al (Bas, & Neira, 2003) (Chia-Chang, 
Hsuan-Yu, Yu-Fan, & Jyh-Horng, 2005). Finally, 
fuzzy logic techniques are also improving estimation 
and control in the acquisition stage of the DS-CDMA 
receiver, in works by Alsina et al (Alsina, Moran, & 
Socoro, 2005) (Alsina, Mateo, & Socoro, 2007). 



ACQUISITION ESTIMATION IN 
DS-CDMA ENVIRONMENTS 

One of the most important problems to be solved in 
direct sequence spread spectrum systems is to achieve 
a robust and precise acquisition of the pseudonoise 
sequence, that is, to obtain an accurate estimation of 
its exact phase or timing position (Proakis, 1995). In 
time-varying environments this becomes even more 
important, because poor acquisition and tracking 
performance can heavily degrade the reliability of the 
demodulation. In this work a new multiresolutive 
acquisition system with a fuzzy logic estimator is proposed 
(Alsina, Mateo, & Socoro, 2007). The fuzzy logic 
estimation improves the accuracy of the acquisition 
stage compared to the results of the stability 
controller, by estimating both the probability of being 
acquired and the signal to noise ratio in the channel. It 
also improves the results obtained with the first fuzzy 
logic estimator for the multiresolutive structure in 
(Alsina, Mateo & Socoro, 2007). 



Figure 1. Multiresolutive adaptive structure for acquisition and tracking 
Multiresolutive Acquisition Structure 

The aim of the multiresolutive scheme presented in 
(Moran, Socoro, Jove, Pijoan & Tarres, 2001) is to 
find the correct acquisition point in a reasonable 
convergence time. It gives a good trade-off between the 
convergence speed of the parallel systems and the low 
computational load of the serial search algorithms. An 
M-order decimation is first applied to the input signal 
$x[n]$ (see endnote 2), since the acquisition stage can 
accept uncertainties below the chip period; this decreases 
the computational load of the acquisition stage. Once the 
signal $x[n]$ is decimated, the resulting signal $r[n]$ is 
fed into the filters of a multiresolutive structure (see 
figure 1). Note that there are H different branches that 
work with decimated versions of the input signal, separated 
into H disjoint subspaces. Each branch has an adaptive 
FIR LMS filter of length 

$$N = \left\lceil \frac{PG}{H} \right\rceil$$ 

(see endnote 3), trained with a decimated version of the PN 
sequence (PN-DEC). 

Under ideal conditions, in a non-frequency-selective 
channel with white Gaussian noise, just one of the filters 
should locally converge to an impulse like 
$\lambda\, b_i[k]\, \delta[n - \tau]$, where $b_i[k]$ is the 
information bit, $\tau$ represents the delay between the 
input signal PN sequence and the reference one, and 
$\lambda$ is the fading coefficient for channel distortion. 
The algorithm is reset at every new data symbol, and 
a modulus smoothing average algorithm is applied to 
each of the LMS solutions ($w_i[n]$) to remove the 
dependency on the data randomness component $b_i[k]$, 
obtaining nonnegative and averaged impulse responses 
($\mathit{Wav}_i[n]$). The decisional system uses a peak 
detection algorithm to find which of these filters has 
detected the signal ($W_{con}[n]$), and the position of the 
maximum ($\tau$) in this filter gives the coarse estimation 
of the acquisition phase. 

When the acquisition point is restored by the decisional 
system, tracking is solved with another adaptive LMS 
filter ($w_{tr}[n]$), which expands the search window around 
the acquisition point, using the full time resolution 
input signal $x[n]$. Thus, the estimation of the acquisition 
point (now called $\xi$) is refined by the tracking stage 
and the signal can be correctly demodulated. 

The Fuzzy Logic Acquisition Estimation 

The fuzzy logic acquisition estimator has been designed 
using data from the impulse responses of all the LMS 
filters of the structure. The variations of their values 
give information about the probability of being correctly 
acquired, and also about SNR variations in the 
channel. In the conducted experiments, the signal space 
has been divided into four subspaces (H = 4), so four 
LMS filters compose the acquisition stage. The length 
of the PN sequences is PG = 127, so each filter has 
$N = \lceil 127/4 \rceil = 32$ taps to converge. The input 
and output variables were already defined in (Alsina, 
Mateo & Socoro, 2007), but the rules to be evaluated have 
been designed in a more precise way. 

Input Variables 

Four different parameters have been defined as inputs 
to the fuzzy estimator; three of them refer to the 
values of the four modulus averaged acquisition LMS 
filters ($\mathit{Wav}_i[n]$), especially the LMS filter 
adapted to the decimated sequence PN-DEC (called 
$W_{con}[n]$), and one refers to the tracking filter 
($w_{tr}[n]$) that refines the search: 

$\mathit{Ratio}_1$: it is computed as the quotient of the peak 
value of the LMS filter $W_{con}[\tau]$ divided by the 
mean value of this filter excluding the maximum, as 
follows: 

$$\mathit{Ratio}_1 = \frac{W_{con}[\tau]}{\dfrac{1}{N}\sum\limits_{n=1,\; n \neq \tau}^{N} W_{con}[n]}$$ 

$\mathit{Ratio}_2$: it is evaluated as the quotient of the peak 
value of the LMS filter $W_{con}[\tau]$ divided by the 
average of the values at the same position in the 
other three filters $\mathit{Wav}_i[n]$: 

$$\mathit{Ratio}_2 = \frac{W_{con}[\tau]}{\dfrac{1}{H-1}\sum\limits_{i:\; \mathit{Wav}_i \neq W_{con}} \mathit{Wav}_i[\tau]}$$ 

$\mathit{Ratio}_3$: it is obtained as the quotient of the peak 
value of the LMS filter $W_{con}[\tau]$ divided by the 
mean value of the three other filters $\mathit{Wav}_i[n]$: 

$$\mathit{Ratio}_3 = \frac{W_{con}[\tau]}{\dfrac{1}{H-1}\sum\limits_{i:\; \mathit{Wav}_i \neq W_{con}} \dfrac{1}{N}\sum\limits_{n=1}^{N} \mathit{Wav}_i[n]}$$ 

$\mathit{Ratio}_{1\_track}$: it is computed as the quotient of the 
peak value of the LMS tracking filter $w_{tr}[\xi]$, 
$\xi$ being the most precise estimation of the correct 
acquisition point, divided by the mean value of 
the same filter excluding the maximum: 

$$\mathit{Ratio}_{1\_track} = \frac{w_{tr}[\xi]}{\dfrac{1}{N}\sum\limits_{n=1,\; n \neq \xi}^{N} w_{tr}[n]}$$ 


These parameters have been chosen because of the 
information they contain about the probability of 
being acquired, and also about the SNR level in the 
channel and its variations. With an appropriate definition 
of the IF-THEN rules, the variations of their values give 
good estimations of the acquisition quality and a good 
measure of the SNR. 
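
As a rough illustration of how the four input ratios could be computed from the averaged filter responses, the following Python sketch may help. It rests on several assumptions that are not stated in the text: the averaged acquisition filters are stored as an H x N NumPy array, the peak detection simply takes the global maximum over all branches, the tracking ratio is normalised by the tracking filter's own length, and the function name `acquisition_ratios` is hypothetical.

```python
import numpy as np


def acquisition_ratios(w_av, w_tr):
    """Compute the four fuzzy-estimator inputs from filter responses.

    w_av: (H, N) array of nonnegative, modulus-averaged acquisition filters.
    w_tr: 1-D array with the tracking filter response (its modulus is used).
    """
    w_av = np.asarray(w_av, dtype=float)
    w_tr = np.abs(np.asarray(w_tr, dtype=float))
    n_taps = w_av.shape[1]

    # Assumed peak detection: branch and position of the global maximum.
    con, tau = np.unravel_index(np.argmax(w_av), w_av.shape)
    peak = w_av[con, tau]
    w_con = w_av[con]
    others = np.delete(w_av, con, axis=0)  # the H-1 remaining filters

    # Ratio 1: peak over the mean of the same filter, maximum excluded
    # from the sum (the sum is divided by N, as in the formula).
    ratio_1 = peak / ((w_con.sum() - peak) / n_taps)
    # Ratio 2: peak over the average of the other filters at position tau.
    ratio_2 = peak / others[:, tau].mean()
    # Ratio 3: peak over the mean value of the other filters.
    ratio_3 = peak / others.mean()

    # Tracking ratio: same idea as ratio 1, applied to the tracking filter.
    xi = int(np.argmax(w_tr))
    peak_tr = w_tr[xi]
    ratio_1_track = peak_tr / ((w_tr.sum() - peak_tr) / w_tr.size)

    return {
        "branch": int(con),
        "tau": int(tau),
        "xi": xi,
        "ratio_1": ratio_1,
        "ratio_2": ratio_2,
        "ratio_3": ratio_3,
        "ratio_1_track": ratio_1_track,
    }


# Toy data: three branches of four taps, with a clear peak in branch 1.
toy_w_av = [[0.1, 0.1, 0.1, 0.1],
            [0.1, 2.0, 0.1, 0.1],
            [0.2, 0.2, 0.2, 0.2]]
toy_w_tr = [0.0, 0.5, 4.0, 0.5]
toy_ratios = acquisition_ratios(toy_w_av, toy_w_tr)
print(toy_ratios)
```

A sharp, isolated peak yields large ratios, whereas in a noisy channel the peak barely stands out and all four ratios approach small values; this is the behaviour the IF-THEN rules exploit. The sketch does not guard against zero denominators, which would need handling in practice.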

Output Variables 



The results will be obtained using a defuzzyfication 
method based on the centroid (Leekwijck & Kerre, 
1999). Two output variables will be computed. The first is 
Acquisition, which gives a value in the range [0, 1], being 
zero when the system is Not Acquired and one when it is 
Acquired. Three more fuzzy sets have been defined between 
the extreme values: Probably Not Acquired, Not Determined 
and Probably Acquired. Acquisition gives a reliability 
value for the correct demodulation of the detector. 
The multiresolutive scheme only gives an estimation of 
the acquisition point, whereas the Acquisition value 
evaluates the probability of being acquired and, therefore, 
the consistency of the bit demodulation done by the 
receiver. 

Figure 2. Variable acquisition for all input variables combinations 

The second variable is SNR Estimation, which gives 
the estimated SNR value in the channel (in the range 
[-30, 0] dB in our experiment). SNR Estimation 
gives information about channel conditions; 
this helps not only in acquisition and tracking, but 
also in detection, as in (Verdu, 1998) or (Alsina, Moran 
& Socoro, 2005). 
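
The centroid defuzzyfication of the Acquisition output can be sketched as follows. The five fuzzy set names come from the text, but their shapes are an assumption: evenly spaced triangular sets of half-width 0.25 are used here only for illustration, together with the common Mamdani-style scheme of clipping each set at its rule strength and aggregating with the maximum. The names `ACQ_CENTRES`, `tri` and `centroid_defuzzify` are hypothetical.

```python
import numpy as np

# Assumed centres of the five Acquisition fuzzy sets on [0, 1].
ACQ_CENTRES = {
    "Not Acquired": 0.0,
    "Probably Not Acquired": 0.25,
    "Not Determined": 0.5,
    "Probably Acquired": 0.75,
    "Acquired": 1.0,
}


def tri(x, centre, half_width=0.25):
    """Symmetric triangular membership evaluated on an array x."""
    return np.clip(1.0 - np.abs(x - centre) / half_width, 0.0, 1.0)


def centroid_defuzzify(activations, n_points=1001):
    """Return the centroid of the aggregated output fuzzy set.

    activations maps a fuzzy set name to the strength (0..1) with which
    the rules fired it.
    """
    x = np.linspace(0.0, 1.0, n_points)
    aggregated = np.zeros_like(x)
    for label, strength in activations.items():
        clipped = np.minimum(strength, tri(x, ACQ_CENTRES[label]))
        aggregated = np.maximum(aggregated, clipped)
    area = aggregated.sum()
    if area == 0.0:
        raise ValueError("no output set was activated")
    return float((x * aggregated).sum() / area)


acq_undetermined = centroid_defuzzify({"Not Determined": 1.0})
acq_leaning = centroid_defuzzify({"Not Determined": 0.3,
                                  "Probably Acquired": 0.8})
print(acq_undetermined, acq_leaning)
```

With these assumed shapes the crisp output moves smoothly between the set centres as the rule strengths change. Note that a centroid over sets truncated at the ends of the range never reaches exactly 0 or 1, so a real design would have to choose the extreme sets accordingly.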

If-Then Rules 

A total of sixty rules have been used to define the two 
outputs as a function of the input values, evolving the set 
of rules used in (Alsina, Mateo & Socoro). Figure 2 shows 
the surface for Acquisition for all input variables, and 
figure 3 shows the surface for SNR Estimation for all 
inputs. The rules have been defined so that each input 
parameter is exploited in the part of its range where it 
performs best when designing the two outputs of the fuzzy 
estimator. This means that each input is only considered 
over the range of values where its estimations are more 
reliable for both outputs. 




Figure 3. Variable SNR Estimation for all input variables combinations 
The most improved estimation for the output 
Acquisition is the one corresponding to Not Determined; 
this means that the input parameters by themselves show no 
coherent values of Acquisition or Not Acquisition. 
To obtain a precise output value, the fuzzy 
estimator evaluates the degree of implication of each input 
parameter in the membership functions, and projects 
this implication onto the fuzzy sets of the output 
variable Acquisition, in order to obtain its value through 
defuzzyfication. $\mathit{Ratio}_1$ and $\mathit{Ratio}_{1\_track}$ 
are the best input parameters to estimate Acquisition when 
channel conditions are good; these two parameters are 
supported by $\mathit{Ratio}_2$ and $\mathit{Ratio}_3$ when the 
SNR worsens. The precision of the critical estimations has 
been improved in the design of the new rules for the fuzzy 
estimator. 

On the other hand, the most robust evaluations of SNR 
Estimation are made by $\mathit{Ratio}_2$ and $\mathit{Ratio}_3$; 
they are improved by $\mathit{Ratio}_{1\_track}$ when the SNR is 
high, and by $\mathit{Ratio}_1$ when the SNR is very low. As can 
be observed in figure 3, these variables correlate highly 
with the SNR Estimation value. 



Results 

This section summarizes the results obtained with the new 
acquisition and SNR fuzzy logic estimator. Several 
simulations using an Additive White Gaussian Noise (AWGN) 
channel, some of them with very fast SNR changes, have been 
run to show the performance of the fuzzy estimator in terms 
of reliability and stability. 

Fuzzy Estimator Acquisition Reliability vs. 
Stability Control 

Figure 4. % of correct estimation of acquisition using the new fuzzy estimator against the stability control 

For evaluation and comparison purposes, a previous 
acquisition estimation was obtained using a stability 
control (Moran, Socoro, Jove, Pijoan & Tarres, 2001), which 
took into account the preservation of the acquisition 
point. It considered the system to be acquired 
only after continuous repetitions of the acquisition 
point given by the multiresolutive scheme. This stability 
control gave a binary response about the performance 
of the system. Despite its good performance, which can be 
observed in figure 4, the new fuzzy approach improves 
the results over a wider SNR range. The quality of the 
fuzzy acquisition estimation is much better for very low 
SNR compared to the stability control, and its global 
performance over the whole range of SNR in our tests is 
improved. The stability control is not a good estimator 
for critical SNR (considered to be around -15 dB), and its 
reliability decreases as the SNR decreases. The fuzzy logic 
estimation of Acquisition shows similar performance around 
the critical SNR, but improves on the stability control for 
worse SNR values, remaining over 90% of correct estimation 
throughout all the simulations. 
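
For comparison, the binary stability control can be sketched in a few lines of Python. The text only states that the system is considered acquired after continuous repetitions of the acquisition point; the number of repetitions required (`min_repeats`) and the function name are assumptions made for this illustration.

```python
def stability_control(acq_points, min_repeats=3):
    """Binary acquisition decision per symbol.

    acq_points: sequence of acquisition points reported symbol by symbol.
    Returns a list of booleans that are True once the same point has been
    reported min_repeats times in a row (min_repeats is an assumed value).
    """
    decisions = []
    last_point = None
    run_length = 0
    for point in acq_points:
        if point == last_point:
            run_length += 1
        else:
            last_point = point
            run_length = 1
        decisions.append(run_length >= min_repeats)
    return decisions


# A single outlier (7) resets the counter and the decision drops to False.
stability_demo = stability_control([5, 5, 5, 7, 5, 5, 5, 5], min_repeats=3)
print(stability_demo)
```

The sketch makes the limitation visible: one wrong acquisition point, which becomes frequent at low SNR, forces the control to start counting again, whereas the fuzzy estimator returns a graded reliability value instead of a hard yes or no.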

Fuzzy SNR Estimation in Time Varying 
Channels 

In figure 5.a the acquisition system has been simulated 
in an AWGN channel, forcing severe and very fast SNR 
changes in order to evaluate the convergence speed of 
the SNR estimator. Since SNR Estimation is a highly 
variable value, its mean is obtained through an exponential 
smoothing average filter and compared to the SNR in 
the AWGN channel. The fuzzy block estimates the SNR in the 
channel quite precisely down to very low SNR (near 
-20 dB); for lower values the input parameters are not 
stable enough to make a good prediction, which 
is similar to what happens for the Acquisition 
estimation. To observe the recovery of the fuzzy estimator 
in the case of fast SNR changes in the channel, a detail of 
SNR Estimation is shown in figure 5.b. This 
information shows the channel state to the receiver, and 
allows further work to improve the reliability of the 
demodulation by means of different approaches (Verdu, 1998). 
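
The exponential smoothing average applied to the SNR Estimation output can be written as a one-line recursion. The smoothing factor `alpha` below is an assumed value, since the text does not give the one used in the simulations, and the function name is hypothetical.

```python
def exponential_smoothing(samples, alpha=0.1):
    """Exponentially smoothed average of a sequence.

    Each output is alpha * new_sample + (1 - alpha) * previous_output;
    the first output is initialised with the first sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    state = None
    for sample in samples:
        if state is None:
            state = float(sample)
        else:
            state = alpha * sample + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


# Raw SNR estimates (dB) with an abrupt change from -10 dB to -20 dB.
snr_smoothed = exponential_smoothing([-10, -10, -20, -20], alpha=0.5)
print(snr_smoothed)  # [-10.0, -10.0, -15.0, -17.5]
```

A small `alpha` gives a steadier estimate but a slower reaction to sudden SNR changes, so its value trades stability against the convergence speed discussed above.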

FUTURE TRENDS 

Future work will focus on improving the SNR 
estimation in the fuzzy system. Another goal 
is to increase the stability against channel 
changes using previously detected symbols, obtaining 
a system with feedback. The fuzzy estimator outputs 
will be used to design a controller for the acquisition 
and tracking structure. Its aim will be to improve the 
stability of the estimation of the correct acquisition 
point ($\xi$) through an effective and robust control of its 
variations under sudden channel changes, so memory will be 
added to the fuzzy logic estimator. In this way the 
estimator is converted into a controller, and the overall 
performance of the receiver is improved. 

Figure 5. a) SNR estimation in a varying SNR channel; b) Detail of SNR Estimation when adapting to an instantaneous SNR variation 

Further research will also take into account 
multipath channel conditions and possible variations, 
including rake-based receiver detection, in order to 
reach good acquisition and tracking performance in 
ionospheric channels. Furthermore, the reliability of the 
results encourages us to use the acquisition estimation 
to minimize the computational load of the acquisition 
system under proper channel conditions, by 
decreasing the number of iterations needed to converge in 
the LMS adaptive filters. A more efficient fuzzy logic 
control can be designed in order to achieve a better 
trade-off between computational load (referred to the 
LMS filters adaptation) and acquisition point estimation 
accuracy ($\xi$). 



CONCLUSION 

The newly proposed acquisition estimator has 
been presented, and some results have been 
compared against a stability control strategy within 
the multiresolutive acquisition system in a variant SNR 
environment. The main advantages of a multiresolutive 
fuzzy estimator are its reliability when evaluating the 
probability of acquisition, its stability, and its 
quick convergence when there are fast channel SNR 
changes. The computational load of a fuzzy estimator 
is higher than that of a stability control: the 
mean number of FLOPS needed in a DSP to carry out the 
whole process is greater than for the conventional 
stability control. This has to be taken into account 
because the multiresolutive structure should keep its 
computational cost to a minimum in order to work on-line 
with the received data. Further work will compare 
the computational load added to the structure with the 
global improvements of the multiresolutive receiver, 
to decide whether or not this cost increase is affordable 
for the acquisition system. 
KEY TERMS 

Defuzzyfication: After computing the fuzzy rules, 
and evaluating the fuzzy variables, this is the process 
the system follows to obtain a new membership func- 
tion for each output variable. 

Degree of Truth: It denotes the extent to which a 
proposition is true. It is important not to confuse it 
with the concept of probability. 

Fuzzy Logic: Fuzzy logic was derived from Fuzzy 
Set theory. It works with reasoning that is approximate 
rather than precise, in contrast to the reasoning deduced 
from typical predicate logic. 



Fuzzy Sets: Fuzzy sets are sets whose members 
have a degree of membership. They were introduced to 
be an extension of the classical sets, whose elements' 
membership was assessed by binary numbers. 

Fuzzyfication: It is the process of defining the degree 
of membership of a crisp value for each fuzzy set. 

IF-THEN Rules: They are the typical rules used 
by expert fuzzy systems. The IF part is the anteced- 
ent, also named premise, and the THEN part is the 
conclusion. 

Linguistic Variables: They take on linguistic 
values, which are words, with associated degrees of 
membership in each set. 

Linguistic Term: It is a subjective category for a 
linguistic variable. Each linguistic term is associated 
with a fuzzy set. 

Membership Function: It is the function that gives 
the subjective measures for the linguistic terms. 



ENDNOTES 

1 DS-CDMA stands for Direct Sequence Code 
Division Multiple Access. 

2 The received signal x[n] is sampled at M sam- 
ples per chip in order to give the necessary time 
resolution for the tracking stage. 

3 where PG is the length of the pseudonoise 
sequences, also called PN sequences, and 'ceil(x)' 
(expressed as $N = \lceil x \rceil$) is the smallest integer 
greater than or equal to x. 
INTRODUCTION 

The "fuzzy dot" (or fuzzy relation) representation of 
fuzzy rules in fuzzy rule based systems, in the case of 
classical fuzzy reasoning methods (e.g. the Zadeh-Mamdani- 
Larsen Compositional Rule of Inference (CRI) (Zadeh, 
1973) (Mamdani, 1975) (Larsen, 1980) or the 
Takagi-Sugeno fuzzy inference (Sugeno, 1985) (Takagi & 
Sugeno, 1985)), assumes the completeness of the 
fuzzy rule base. If some rules are missing, i.e. the 
rule base is "sparse", observations may exist which hit 
no rule in the rule base, and therefore no conclusion 
can be obtained. One way of handling the "fuzzy dot" 
knowledge representation in the case of sparse fuzzy rule 
bases is the application of Fuzzy Rule Interpolation 
(FRI) methods, where the derivable rules are 
deliberately missing, since FRI methods can provide 
reasonable (interpolated) conclusions even if none of 
the existing rules fires under the current observation. 
Since the beginning of the 1990s numerous FRI methods 
have been proposed. The main goal of this article is 
to give a brief but comprehensive introduction to the 
existing FRI methods. 



BACKGROUND 

Since the classical fuzzy reasoning methods (e.g. the 
Zadeh-Mamdani-Larsen CRI) demand complete 
rule bases, classical rule base construction requires 
special care to fill in all the possible rules. If 
the rule base is "sparse" (some rules are missing), 
observations may exist which hit no rule and hence 
no conclusion can be obtained. In many application 
areas of fuzzy control structures, the accidental lack 
of a conclusion is hard to explain, or meaningless (e.g. 
in the steering control of a vehicle). In this case one 
obvious solution could be to keep the last real conclusion 
instead of the missing one, but applying historical data 
automatically to fill unintentionally missing rules could 
cause unpredictable side effects. Another solution for 
the same problem is the application of fuzzy rule 
interpolation (FRI) methods, where the derivable rules 
are deliberately missing. The rule base of an FRI 
controller is not necessarily complete, since FRI methods 
can provide reasonable (interpolated) conclusions 
even if none of the existing rules fires under the 
current observation. It could contain only the most 
significant fuzzy rules, without the risk of having 
no conclusion for some of the observations. 

On the other hand, most of the FRI methods share the 
burden of high computational demand, e.g. the task of 
searching for the two closest surrounding rules to the 
observation, and calculating the conclusion at least in 
some characteristic α-cuts. Moreover, in some methods 
the interpretability of the fuzzy conclusion gained is 
also not straightforward (Koczy & Kovacs, 1993). There 
have been many efforts to rectify the interpretability 
of the interpolated fuzzy conclusion (Tikk & Baranyi, 
2000). In (Baranyi, Koczy & Gedeon, 2004) Baranyi 
et al. give a comprehensive overview of the recent 
FRI methods. Beyond these problems, some 
of the FRI methods were originally defined for a 
one-dimensional input space, and need a special extension 
for the multidimensional case (e.g. (Jenei, 2001), (Jenei, 
Klement & Konzel, 2002)). In (Wong, Tikk, Gedeon & 
Koczy, 2005) Wong et al. gave a comparative overview 
of the recent FRI methods capable of handling 
multidimensional input spaces. In (Jenei, 2001) Jenei 
introduced a way for the axiomatic treatment of the FRI 
methods. In (Perfilieva, 2004) Perfilieva studies the 
solvability of fuzzy relation equations as the solvability 
of interpolating and approximating fuzzy functions with 
respect to a given set of fuzzy rules (e.g. fuzzy data as 
ordered pairs of fuzzy sets). 

The high computational demand, mainly 
the search for the two closest surrounding rules to an 
arbitrary observation in the multidimensional antecedent 
space, makes many of these methods hardly suitable 
for real-time applications. Some FRI methods, e.g. the 
method introduced by Jenei et al in (Jenei, Klement 
& Konzel, 2002), eliminate the search for the two 
closest surrounding rules by taking all the rules into 
consideration, and therefore speed up the reasoning 
process. On the other hand, since their goal remains 
constructing a fuzzy conclusion, and not simply speeding 
up the reasoning, they still require some additional 
(or repeated) computational steps for the elements of 
the level set (or at least for some relevant α levels). 
An application oriented aspect of FRI emerges 
in (Kovacs, 2006), where for the sake of reasoning 
speed and direct real-time applicability, the fuzziness 
of fuzzy partitions is replaced by the concept of Vague 
Environment (Klawonn, 1994). 

In the following, the structure of several FRI 
methods will be briefly introduced in more detail. 



FUZZY RULE INTERPOLATION 
METHODS 

One of the first FRI techniques was published by 
Koczy and Hirota (Koczy & Hirota, 1991). It is 
usually referred to as the KH method. It is applicable to 
convex and normal fuzzy (CNF) sets in single input and 
single output (SISO) systems. The KH method takes 
into consideration only the two closest surrounding 
(flanking) rules to the observation. It determines the 
conclusion by its α-cuts in such a way that the ratio of 
distances between the conclusion and the consequents 
should be identical to the ratio of distances between 
the observation and the antecedents for all important 
α-cuts. The applied formula: 

$$d(A^*, A_1) : d(A^*, A_2) = d(B^*, B_1) : d(B^*, B_2)$$ 

can be solved for the required conclusion $B^*$ for the 
relevant α-cuts after decomposition, where $A_1 \rightarrow B_1$ 
and $A_2 \rightarrow B_2$ are the two flanking rules of the 
observation $A^*$ and $d: F(X) \times F(X) \rightarrow R$ is a 
distance function of fuzzy sets (in the case of the KH 
method it was calculated as the distance of the lower and 
upper end points of the α-cuts) (see e.g. Fig. 1). 
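
The KH idea of preserving the distance ratios for each end point of each α-cut reduces, in the SISO case, to a linear interpolation between the two flanking rules. The following Python sketch illustrates this under stated assumptions: all fuzzy sets are triangular and given as (left, peak, right) triples, the observation lies between the two antecedents, and the function names `alpha_cut` and `kh_interpolate` are hypothetical.

```python
def alpha_cut(tri, alpha):
    """Return the (lower, upper) alpha-cut of a triangular set (a, b, c)."""
    a, b, c = tri
    return (a + alpha * (b - a), c - alpha * (c - b))


def kh_interpolate(a1, b1, a2, b2, obs, alphas=(0.0, 0.5, 1.0)):
    """KH-style interpolation between rules a1 -> b1 and a2 -> b2.

    For every alpha level, the lower and upper end points of the
    conclusion are placed so that their distances to the consequents
    keep the same ratio as the distances of the observation's end
    points to the antecedents.
    """
    conclusion = {}
    for alpha in alphas:
        cuts = [alpha_cut(s, alpha) for s in (a1, a2, obs, b1, b2)]
        ends = []
        for k in (0, 1):  # 0: lower end point, 1: upper end point
            x1, x2, x, y1, y2 = (cut[k] for cut in cuts)
            t = (x - x1) / (x2 - x1)        # relative position of the observation
            ends.append(y1 + t * (y2 - y1))  # same relative position in the output
        conclusion[alpha] = tuple(ends)
    return conclusion


# A well-behaved case: the conclusion is a valid triangular set (5, 7, 9).
kh_normal = kh_interpolate((0, 1, 2), (0, 2, 4), (8, 9, 10), (10, 12, 14), (4, 5, 6))
# A narrow observation next to a wide antecedent: lower end point > upper one.
kh_abnormal = kh_interpolate((0, 2, 4), (0, 1, 2), (10, 10.5, 11), (10, 11, 12),
                             (3.9, 4, 4.1), alphas=(0.0,))
print(kh_normal)
print(kh_abnormal)
```

In the second call the computed lower end point of the support exceeds the upper one, so the result cannot be read directly as a fuzzy set. Because the end points are interpolated independently, nothing in the method prevents this outcome.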

It is shown, e.g. in (Koczy & Kovacs, 1993) and 
(Koczy & Kovacs, 1994), that the conclusion of the 
KH method is not always directly interpretable as a 
fuzzy set (see e.g. Fig. 1). This drawback motivated 
many alternative solutions. The first modification was 
proposed by Vass, Kalmar and Koczy (Vass, Kalmar 
& Koczy, 1992) (referred to as the VKK method), where the 
conclusion is computed based on the distance of the 
centre points and the widths of the α-cuts, instead of 
their lower and upper end point distances. The VKK 
method extends the applicability of the KH method, but 
it still depends strongly on the membership shape 
of the fuzzy sets (e.g. it is unable to handle singleton 
antecedent sets, as the width of the antecedent's support 
must not be zero). 

Figure 1. KH method for two SISO rules: $A_1 \rightarrow B_1$ and $A_2 \rightarrow B_2$, conclusion y of the observation x 

In spite of the known restrictions, the KH method 
is still popular because of its simplicity. Subsequently 
it was generalized in several ways. Among them, the 
stabilized KH interpolator emerged, which was proved 
to hold the universal approximation property in (Tikk, 
Joo, Koczy, Varlaki, Moser & Gedeon, 2002) and (Tikk, 
2003). This method takes into account all the rules of 
the rule base in the calculation of the conclusion. The 
method adapts a modification of the Shepard operator 
based interpolation (Shepard, 1968). The rules are 
weighted in proportion to the inverse of the distance 
between their antecedents and the observation. The 
universal approximation property holds if the distance 
function is raised to a power of at least the number 
of antecedent dimensions. 
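
The inverse-distance weighting behind the Shepard operator can be sketched as follows for a single end point of a single α-cut in the SISO case. This is an illustration of the weighting idea only, not the full stabilized KH interpolator: the function name, the scalar representation of each rule by one antecedent and one consequent end point, and the default power of 1 (one antecedent dimension) are assumptions.

```python
def shepard_interpolate(antecedent_points, consequent_points, x, power=1):
    """Inverse-distance weighted average over all rules.

    antecedent_points[i] and consequent_points[i] are the matching end
    points (for one alpha level) of rule i; x is the corresponding end
    point of the observation. Each rule is weighted by 1 / distance**power.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(antecedent_points, consequent_points):
        distance = abs(x - a)
        if distance == 0.0:
            # The observation coincides with an antecedent: return its consequent.
            return float(b)
        weight = 1.0 / distance ** power
        numerator += weight * b
        denominator += weight
    return numerator / denominator


# Three rules; every rule contributes, the closer ones more strongly.
shepard_value = shepard_interpolate([0.0, 4.0, 10.0], [0.0, 8.0, 20.0], 2.0)
print(shepard_value)
```

Because every rule contributes to the result, no search for the two closest flanking rules is needed; the price is one pass over the whole rule base for each end point and each α level considered.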

Another modification of the KH method is the 
modified alpha-cut based interpolation method (referred to 
as MACI) (fully extended in (Tikk & Baranyi, 2000)), 
which completely eliminates the abnormality problem. 
MACI's main idea is the following: it transforms the fuzzy 
sets of the input and output universes to a space 
where abnormality is excluded, then computes the 
conclusion there, which is finally transformed back to 
the original space. MACI uses a vector representation 
of fuzzy sets. The original method was introduced 
in (Yam & Koczy, 1997) and was applicable to 
CNF sets only. This restriction was later relaxed in 
(Tikk, Baranyi, Gedeon & Muresan 2001) at the 
expense of a higher computational demand than the 
original method. MACI is one of the most applied FRI 
methods (Wong, Tikk, Gedeon & Koczy, 2005), since it 
preserves the advantageous computational and approximation 
properties of the KH method, while excluding the 
possibility of an abnormal conclusion. 

Another FRI method was proposed in (Koczy, Hirota &
Gedeon, 1997). It takes into consideration only the two
closest rules surrounding the observation, and its main
idea is the conservation of the "relative fuzziness"
(referred to as the CRF method). This notion means that
the left (and right) fuzziness of the approximated
conclusion, in proportion to the flanking fuzziness of the
neighbouring consequent, should be the same as the left
(and right) fuzziness of the observation in proportion to
the flanking fuzziness of the neighbouring antecedent.
The original method is restricted to CNF sets only.

An improved fuzzy interpolation technique for
multidimensional input spaces (referred to as IMUL) was
originally proposed in (Wong, Gedeon & Tikk, 2000)
and described in more detail in (Wong, Tikk, Gedeon &
Koczy, 2005). IMUL combines the CRF and MACI methods
and unites the advantages of both. The core of the
conclusion is determined by the MACI method, while its
flanks are determined by CRF (the method is restricted to
trapezoidal membership functions). The main advantages of
this method are its applicability to multidimensional
problems and its relative simplicity.

Conceptually different approaches were proposed
in (Baranyi, Koczy & Gedeon, 2004), based on the
relation, semantic and inter-relational features of the
fuzzy sets. This family of methods applies a two-step
"General Methodology" (referred to as GM). The name
also reflects the fact that methods based on GM can
handle arbitrarily shaped fuzzy sets. The basic concept
is to divide the task of the FRI into two main steps.
The first step is to determine the reference point
of the conclusion based on the ratio of the distances
between the reference points of the observation and the
antecedents. Once this is done, a new, interpolated rule
is generated from the existing rules for the reference
point of the observation and the reference point of the
conclusion. In the second step of the method, a single
rule reasoning method (revision function) is applied to
determine the final fuzzy conclusion based on the
similarity of the fuzzy observation and the antecedent of
the new "interpolated" rule. Numerous solutions exist for
both main steps of GM; GM therefore stands for an FRI
concept, or a family of FRI methods.

A rather different, application-oriented aspect of
FRI emerges in the concept of the Fuzzy Interpolation
based on Vague Environment FRI method (referred to as
FIVE), originally introduced in (Kovacs, 1996),
(Kovacs & Koczy, 1997a), (Kovacs & Koczy, 1997b) and
extended with the ability to handle fuzzy observations
in (Kovacs, 2006). It was developed to fit the speed
requirements of direct fuzzy control, where the
conclusions of the fuzzy controller are applied directly
as control actions in a real-time system. The main idea
of the FIVE method is based on the fact that most control
applications provide crisp observations and require crisp
conclusions from the controller. Adopting the idea of the
vague environment (Klawonn, 1994), FIVE can describe the
antecedent and consequent fuzzy partitions of the fuzzy
rule base by scaling functions (Klawonn, 1994) and
therefore turn the fuzzy interpolation into crisp
interpolation. In FIVE, any crisp interpolation,
extrapolation, or regression method can be adapted very
simply for FRI. Because of its simple multidimensional
applicability, the Shepard operator based interpolation
(Shepard, 1968) was originally adapted in FIVE.
FUTURE TRENDS

Future trends of the FRI methods include the appearance
of numerous hybrid FRI methods, e.g. neuro-FRI and
genetic-FRI, for gradient based or gradient free
parameter optimisation of the FRI model (depending on the
application area). Future trends are also directed
towards a larger number of practical applications of FRI.
Recently a freely available comprehensive FRI toolbox
(Johanyak, Tikk, Kovacs & Wong, 2006) and an FRI
oriented web site (http://fri.gamf.hu) have appeared to
aid and guide future FRI applications.



CONCLUSION 

Relatively few Fuzzy Rule Interpolation (FRI)
techniques can be found among practical fuzzy rule
based applications. On the one hand, the FRI methods are
not widely known, and some of them have limitations
from the point of view of practical application, e.g.
they can be applied only in the one-dimensional case, or
they are defined on the two closest rules surrounding the
actual observation. On the other hand, by enabling the
application of sparse rule bases, the FRI methods can
dramatically simplify fuzzy rule base creation, since FRI
methods can provide reasonable (interpolated) conclusions
even if none of the existing rules fires under
the current observation. These methods can therefore
save the expert from dealing with derivable rules and
help to concentrate on the cardinal actions only. Thus,
compared to the classical fuzzy CRI, the number of fuzzy
rules that need to be handled during the design process
can be dramatically reduced (see e.g. (Kovacs, 2005)).
Moreover, in the case of parameter optimisation of the
sparse FRI model (hybrid FRI methods), the reduced
FRI rule base size also means a reduction in the
size of the optimisation search space, and hence it can
lead to quicker optimisation algorithms too.
KEY TERMS

α-Cut of a Fuzzy Set: A crisp set that holds
the elements of a fuzzy set (on the same universe of
discourse) whose membership grade is greater than or
equal to α. (In the case of a "strong" α-cut it must be
greater than α.)

ε-Covering Fuzzy Partition: A fuzzy partition
(a set of linguistic terms (fuzzy sets)) ε-covers the
universe of discourse if, for every element of the
universe of discourse, a linguistic term exists that
has a membership value greater than or equal to ε.

Complete (or Dense) Fuzzy Rule Base: A fuzzy
rule base is complete, or dense, if all the input universes
are ε-covered by rule antecedents, where ε > 0. In a
complete fuzzy rule base, for every possible
multidimensional observation a rule antecedent must
exist that has a nonzero activation degree. Note that
completeness of the fuzzy rule base is not equivalent
to having covering fuzzy partitions on each antecedent
universe (this is required but not sufficient in the
multidimensional case). Usually the number of rules of a
complete rule base is O(M^I), where M is the average
number of linguistic terms in the fuzzy partitions and I
is the number of input universes.

Convex and Normal Fuzzy (CNF) Set: A fuzzy set
defined on a totally ordered universe of discourse
that has a height (maximal membership value) equal
to one (i.e. a normal fuzzy set) and in which the
membership grade of any element between two arbitrary
elements is greater than or equal to the smaller of the
membership grades of these two boundary elements (i.e. a
convex fuzzy set).

Fuzzy Compositional Rule of Inference (CRI):
The most common fuzzy inference method. The fuzzy
conclusion is calculated as the fuzzy composition (Klir
& Folger, 1988) of the fuzzy observation and the fuzzy
rule base relation (see "Fuzzy dot" representation of
fuzzy rules). In the case of the Zadeh-Mamdani-Larsen
max-min compositional rule of inference (Zadeh, 1973)
(Mamdani, 1975) (Larsen, 1980) the applied fuzzy
composition is the max-min composition of fuzzy
relations ("max" stands for the applied s-norm and "min"
for the applied t-norm fuzzy operations).

"Fuzzy Dot" Representation of Fuzzy Rules: The
most common understanding of the If-Then fuzzy rules.
The fuzzy rules are represented as a fuzzy relation of
the rule antecedent and the rule consequent linguistic
terms. In the case of the Zadeh-Mamdani-Larsen
compositional rule of inference (Zadeh, 1973) (Mamdani, 1975)
(Larsen, 1980) the fuzzy rule relations are calculated
as the fuzzy cylindric closures (t-norm of the cylindric
extensions) (Klir & Folger, 1988) of the antecedent and
the rule consequent linguistic terms.

Fuzzy Rule Interpolation: A way for fuzzy infer- 
ence by interpolation of the existing fuzzy rules based 
on various distance and similarity measures of fuzzy 
sets. A suitable method for handling sparse fuzzy rule 
bases, since FRI methods can provide reasonable (in- 
terpolated/extrapolated) conclusions even if none of the 
existing rules fires under the current observation. 

Sparse Fuzzy Rule Base: A fuzzy rule base is 
sparse, if an observation may exist, which hits no rule 
antecedent. (The rule base is not complete.) 

Vague Environment (VE): The idea of a VE is based
on the similarity (or in this case the indistinguishability)
of the considered elements. In a VE the fuzzy membership
function μ_A(x) indicates the level of similarity of x to a
specific element a that is a representative or prototypical
element of the fuzzy set μ_A(x), or, equivalently, the
degree to which x ∈ X is indistinguishable from a ∈ X
(Klawonn, 1994). Therefore the α-cuts of the fuzzy set
μ_A(x) are the sets that contain the elements that are
(1−α)-indistinguishable from a. Two values in a VE
are ε-distinguishable if their distance is greater than
ε. The distances in a VE are weighted distances. The
weighting factor or function is called the scaling function
(factor) (Klawonn, 1994). If the VE of a fuzzy partition
(the scaling function, or at least the approximate
scaling function (Kovacs, 1996), (Kovacs & Koczy,
1997b)) exists, the member sets of the fuzzy partition
can be characterized by points in that VE.
INTRODUCTION

The basic objective of system modeling is to estab- 
lish an input-output representative mapping that can 
satisfactorily describe the system behaviors, by using 
the available input-output data based upon physical 
or empirical knowledge about the structure of the 
unknown system. 



BACKGROUND 

Conventional system modeling techniques suggest
constructing a model described by a set of differential
or difference equations. This approach is effective
only when the underlying system is mathematically
well-defined and precisely expressible. Such techniques
often fail to handle uncertain, vague or ill-defined
physical systems, and yet most real-world problems do not
obey such precise, idealized, and subjective mathematical
rules. According to the incompatibility principle
(Zadeh, 1973), as the complexity of a system increases,
the human ability to make precise and significant
statements about its behavior decreases, until a threshold
is reached beyond which precision and significance
become impossible. Under this principle, Zadeh (1973)
proposed a modeling method of human thinking with
fuzzy numbers rather than crisp numbers, which
eventually led to the development of various fuzzy
modeling techniques.



MAIN FOCUS OF THE CHAPTER

Structure Identification

In structure identification of a fuzzy model, the first
step is to select some appropriate input variables from
the collection of possible system inputs; the second
step is to determine the number of membership functions
for each input variable. This process is closely
related to the partitioning of the input space. Input
space partitioning methods are useful for determining
such structures (Wang & Mendel, 1996).

Grid Partitioning

Figure 1 (a) shows a typical grid partition in a
two-dimensional input space. Fuzzy grids can be used to
generate fuzzy rules based on system input-output
training data. A one-pass build-up procedure
can avoid the time-consuming learning process, but
its performance depends heavily on the definition of
the grid. In general, the finer the grid is, the better the
performance will be. Adaptive fuzzy grid partitioning
can be used to refine and even optimize this process.
In the adaptive approach, a uniformly partitioned grid
may be used for initialization. As the process goes on,
the parameters in the antecedent membership functions
are adjusted, and consequently the fuzzy grid
evolves. The gradient descent method may then be
used to optimize the size and location of the fuzzy grid
regions and the degree of overlap among them. The
major drawback of the grid partition method is that
it suffers from an exponential explosion in the number
of rules as the number of input variables or membership
functions increases, known as the "curse of
dimensionality," which is a common issue for most
partitioning methods.



Tree Partitioning 

Figure 1 (b) visualizes a tree partition. The tree
partitioning results from a series of guillotine cuts. Each
region is generated by a guillotine cut, which is made
entirely across the subspace to be partitioned. At the
(k − 1)st iteration step, the input space is partitioned
into k regions. Then a guillotine cut is applied to one of
these regions to further partition the entire space into
k + 1 regions. There are several strategies for
determining which dimension to cut, where to cut at each
step, and when to stop. This flexible tree partitioning
algorithm alleviates the curse of dimensionality.
However, more membership functions are needed
for each input variable, and they usually do not have
clear linguistic meanings; consequently, the resulting
fuzzy model is less descriptive.

Scatter Partitioning 

Figure 1 (c) illustrates a scatter partition. This method
extracts fuzzy rules directly from numerical data (Abe
& Lan, 1995). Suppose that a one-dimensional output,
y, and an m-dimensional input vector, x, are available.
First, the output space is divided into n intervals,
[y_0, y_1], (y_1, y_2], ..., (y_{n-1}, y_n], where the ith
interval is called "output interval i." Then, activation
hyperboxes are determined, which define the input region
corresponding to the output interval i, by calculating the
minimum and maximum values of the input data for
each output interval. If the activation hyperbox for the
output interval i overlaps with the activation hyperbox
for the output interval j, then the overlapped region
is defined as an inhibition hyperbox. If the input data
for output intervals i and/or j exist in the inhibition
hyperbox, then within this inhibition hyperbox one or
two additional activation hyperboxes will be defined.
Moreover, if two activation hyperboxes are defined and
they overlap, then an additional inhibition hyperbox
is further defined. This procedure is repeated until
the overlapping is resolved.
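The first level of this procedure can be sketched as follows: one activation hyperbox per output interval, obtained from the minimum and maximum of the input data, plus a test for the overlap of two hyperboxes (the candidate inhibition hyperbox). This is an illustrative sketch only; the function names and data layout are assumptions, and the recursive refinement inside inhibition hyperboxes is not shown.

```python
def activation_hyperboxes(inputs, outputs, edges):
    """Return {interval index i: (mins, maxs)} for the output intervals
    [y0, y1], (y1, y2], ..., (y_{n-1}, y_n] given by edges = [y0, ..., yn]."""
    boxes = {}
    for x, y in zip(inputs, outputs):
        for i in range(1, len(edges)):
            lower_ok = y >= edges[0] if i == 1 else y > edges[i - 1]
            if lower_ok and y <= edges[i]:
                mins, maxs = boxes.get(i, (list(x), list(x)))
                boxes[i] = ([min(a, b) for a, b in zip(mins, x)],
                            [max(a, b) for a, b in zip(maxs, x)])
                break
    return boxes


def overlap(box_a, box_b):
    """Overlapping region of two hyperboxes, or None if they are disjoint."""
    lo = [max(a, b) for a, b in zip(box_a[0], box_b[0])]
    hi = [min(a, b) for a, b in zip(box_a[1], box_b[1])]
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None


# Hypothetical data: two-dimensional inputs, two output intervals.
inputs = [(0.0, 0.0), (1.0, 2.0), (0.5, 1.0), (3.0, 3.0)]
outputs = [0.1, 0.4, 0.8, 0.9]
boxes = activation_hyperboxes(inputs, outputs, [0.0, 0.5, 1.0])
inhibition = overlap(boxes[1], boxes[2])
```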

Parameters Identification 

After the system structure has been determined, pa- 
rameters identification is in order. In this process, the 
optimal parameters of a fuzzy model that can best 
describe the input-output behavior of the underlying 
system are searched by optimization techniques. 

Sometimes, structure and parameters are identified
under the same framework through fuzzy modeling.
There are many different approaches
to modeling a system using fuzzy set and fuzzy
system theories (Chen & Pham, 1999, 2006), but the
classical least-squares optimization and the general
Genetic Algorithm (GA) optimization techniques are
the most popular. They are quite generic, effective, and
competitive with other successful non-fuzzy types of
optimization-based modeling methods such as neural
networks and statistical Monte Carlo methods.

An Approach Using Least-Squares 
Optimization 

A fuzzy system can be described by the following 
generic form: 




$$ f(\mathbf{x}) = \sum_{k=1}^{m} \alpha_k g_k(\mathbf{x}) = \boldsymbol{\alpha}^T \mathbf{g}(\mathbf{x}) \tag{1} $$

where $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_m]^T$ are constant coefficients and

$$ g_k(\mathbf{x}) = \cdots, \qquad k = 1, \dots, m \tag{2} $$

are the basis functions, in which $\mu_{x}(\cdot)$ are the chosen
membership functions. Suppose that the real system
output is

$$ y(t) = \sum_{k=1}^{m} \alpha_k g_k(\mathbf{x}) + e(t) \tag{3} $$

where $y(t)$ is the system output and $e(t)$ represents the
modeling error, which is assumed to be uncorrelated
with the fuzzy basis functions $\{g_k(\cdot)\}_{k=1}^{m}$ in this
discussion.

Figure 1. Three typical MISO partitioning methods: (a) fuzzy grid; (b) tree partition; (c) scatter partition

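Suppose that n pairs of system input-output data
are given: $(\mathbf{x}_d(t_i), y_d(t_i))$, $i = 1, \dots, n$. The goal is to find
the best possible fuzzy basis functions, such that the
total least-squares error between the data set and the
system outputs $\{y(t_i)\}_{i=1}^{n}$ is minimized. To do so, the
linear model (3) is first written in matrix form over
the time domain $t_1 < \cdots < t_n$, namely,

$$ \mathbf{y} = G\boldsymbol{\alpha} + \mathbf{e} $$

where $\mathbf{y} = [y(t_1), \dots, y(t_n)]^T$, $\mathbf{e} = [e(t_1), \dots, e(t_n)]^T$, and

$$ G = [\mathbf{g}_1, \dots, \mathbf{g}_m] = \begin{bmatrix} g_1(t_1) & \cdots & g_m(t_1) \\ \vdots & & \vdots \\ g_1(t_n) & \cdots & g_m(t_n) \end{bmatrix} $$

with $\mathbf{g}_j = [g_j(t_1), \dots, g_j(t_n)]^T$, $j = 1, \dots, m$.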

The first step is to transform the set of numbers
$g_i(t_j)$, $i = 1, \dots, m$, $j = 1, \dots, n$, into a set of orthogonal
basis vectors; only the significant basis vectors are then
used to form the final least-squares optimization. Here, the
Gaussian membership functions

$$ \mu_{x_{kj}}(x_k) = c_{kj} \exp\left\{ -\frac{1}{2} \left( \frac{x_k - x_{kj}}{\sigma_{kj}} \right)^2 \right\} $$

are used as an example to illustrate the computational
algorithm.

One approach to initializing the fuzzy basis functions
is to choose n initial basis functions, $g_k(\mathbf{x})$, in the
form of (2), with m = n in this discussion, and initially
with $c_{kj} = 1$, $x_{kj} = x_k(t_j)$, and

$$ \sigma_{kj} = \frac{1}{m_1} \left[ \max\{x_k(t_j),\ j = 1, \dots, n\} - \min\{x_k(t_j),\ j = 1, \dots, n\} \right], \qquad k = 1, \dots, n $$

where $m_1$ is the number of basis functions in
the final expression, which is determined by the
designer based on experience (usually, $m_1 < n$).

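After choosing the initial fuzzy basis functions,
the next step is to select the most significant ones
among them. This process is based on the classical
Gram-Schmidt orthogonalization, while $c_{kj}$, $x_{kj}$, and
$\sigma_{kj}$ are all fixed:

Step 1. For j = 1, compute

$$ \mathbf{w}_1^{(i)} = \mathbf{g}_i, \qquad h_1^{(i)} = \frac{(\mathbf{w}_1^{(i)})^T \mathbf{y}_d}{(\mathbf{w}_1^{(i)})^T \mathbf{w}_1^{(i)}}, \qquad e_1^{(i)} = \frac{(h_1^{(i)})^2 \, (\mathbf{w}_1^{(i)})^T \mathbf{w}_1^{(i)}}{\mathbf{y}_d^T \mathbf{y}_d} \qquad (1 \le i \le n) $$

where $\mathbf{x}_d = [\mathbf{x}_d(t_1), \dots, \mathbf{x}_d(t_n)]^T$ and $\mathbf{y}_d = [y_d(t_1), \dots, y_d(t_n)]^T$
are the input-output data set. Then, compute

$$ e_1^{(i_1)} = \max\{ e_1^{(i)} : 1 \le i \le n \} $$

and let $\mathbf{w}_1 = \mathbf{w}_1^{(i_1)} = \mathbf{g}_{i_1}$ and $h_1 = h_1^{(i_1)}$.

Step 2. For each j, $2 \le j \le m_1$, compute

$$ c_{kj}^{(i)} = \frac{\mathbf{w}_k^T \mathbf{g}_i}{\mathbf{w}_k^T \mathbf{w}_k} \quad (1 \le k < j), \qquad \mathbf{w}_j^{(i)} = \mathbf{g}_i - \sum_{k=1}^{j-1} c_{kj}^{(i)} \mathbf{w}_k, $$

$$ h_j^{(i)} = \frac{(\mathbf{w}_j^{(i)})^T \mathbf{y}_d}{(\mathbf{w}_j^{(i)})^T \mathbf{w}_j^{(i)}}, \qquad e_j^{(i)} = \frac{(h_j^{(i)})^2 \, (\mathbf{w}_j^{(i)})^T \mathbf{w}_j^{(i)}}{\mathbf{y}_d^T \mathbf{y}_d}, $$

$$ e_j^{(i_j)} = \max\{ e_j^{(i)} : 1 \le i \le n;\ i \ne i_1, \dots, i \ne i_{j-1} \} $$

where $e_j^{(i)}$ represents the error-reduction ratio due
to $\mathbf{w}_j^{(i)}$. Pick $\mathbf{w}_j = \mathbf{w}_j^{(i_j)}$ and $h_j = h_j^{(i_j)}$.

Step 3. Solve the equation

$$ A^{(m_1)} \boldsymbol{\alpha}^{(m_1)} = \mathbf{h}^{(m_1)} $$

for a solution $\boldsymbol{\alpha}^{(m_1)} = [\alpha_1^{(m_1)}, \dots, \alpha_{m_1}^{(m_1)}]^T$,
where $\mathbf{h}^{(m_1)} = [h_1, \dots, h_{m_1}]^T$ and

$$ A^{(m_1)} = \begin{bmatrix} 1 & c_{12}^{(i_2)} & c_{13}^{(i_3)} & \cdots & c_{1 m_1}^{(i_{m_1})} \\ 0 & 1 & c_{23}^{(i_3)} & \cdots & c_{2 m_1}^{(i_{m_1})} \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & c_{m_1 - 1,\, m_1}^{(i_{m_1})} \\ 0 & \cdots & \cdots & 0 & 1 \end{bmatrix} $$

The final result is obtained as

$$ f(\mathbf{x}) = \sum_{k=1}^{m_1} \alpha_k^{(m_1)} g_{i_k}(\mathbf{x}) $$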
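Steps 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration of forward selection by error-reduction ratio with Gram-Schmidt orthogonalization, not the authors' implementation. The function names are assumptions, each candidate basis function is assumed to be given as a column of its values at the n data points, and a small tolerance (not part of the description above) skips numerically dependent candidates.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ols_select(columns, y, m1):
    """Select m1 of the candidate columns g_i and return (indices, alpha)."""
    yy = dot(y, y)
    chosen, W, h, C = [], [], [], []  # C[j][k] holds c_kj for k < j
    for _ in range(m1):
        best = None
        for i, g in enumerate(columns):
            if i in chosen:
                continue
            # Orthogonalize candidate g_i against the vectors already picked.
            c = [dot(wk, g) / dot(wk, wk) for wk in W]
            w = [gt - sum(ck * wk[t] for ck, wk in zip(c, W))
                 for t, gt in enumerate(g)]
            ww = dot(w, w)
            if ww < 1e-12:
                continue  # candidate is (numerically) dependent; skip it
            hj = dot(w, y) / ww
            err = hj * hj * ww / yy  # error-reduction ratio
            if best is None or err > best[0]:
                best = (err, i, w, hj, c)
        if best is None:
            break
        _, i, w, hj, c = best
        chosen.append(i)
        W.append(w)
        h.append(hj)
        C.append(c)
    # Step 3: back-substitution for the unit upper-triangular system A alpha = h.
    m = len(chosen)
    alpha = [0.0] * m
    for j in reversed(range(m)):
        alpha[j] = h[j] - sum(C[k][j] * alpha[k] for k in range(j + 1, m))
    return chosen, alpha


# Hypothetical data: y = 2 * g_0 + 1 * g_1 with non-orthogonal columns.
chosen, alpha = ols_select([[1.0, 0.0], [1.0, 1.0]], [3.0, 1.0], m1=2)
```

The returned coefficients refer to the selected columns in the order in which they were picked, matching the final expression above.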



An Approach Using Genetic Algorithms 

The parameter identification procedure is generally very 
tedious for a large-scale complex system, for which the 
GA approach has some attractive features such as its 
great flexibility and robust optimization ability (Man, 
Tang, Kwong & Halang, 1997). 

GA is attributed to Holland (1975) and was applied
to fuzzy modeling and fuzzy control in the 1980s.
GA can be used to find an optimal or suboptimal fuzzy
model to describe a given system without manual design
(Joo, Hwang, Kim & Woo, 1997; Liska & Melsheimer,
1994; Soucek & Group, 1992). In addition, the GA fuzzy
modeling method can be integrated with other components
of a fuzzy system, so as to achieve overall superior
performance in control and automation.

Genetic Algorithm Preliminaries 

GA provides an optimization method with a stochastic
search algorithm, based on some common biological
principles of selection, crossover and mutation. A GA
encodes each point in a solution space into
a string composed of binary or real values, called a
chromosome. Each point is assigned a fitness value from
zero to one, which is usually taken to be the same as
the objective function to be maximized. A GA scheme
keeps a set of points as a population, which is evolved
repeatedly toward a better and possibly the best
fitness value. In each generation, GA generates a new
population using genetic operators such as crossover
and mutation. Through these operations, individuals
with higher fitness values are more likely to survive
and to participate in the next genetic operations. After
a number of generations, individuals with higher
fitness values are kept in the population while the others
are eliminated. GA, therefore, can ensure a gradual
improvement of the solutions, until a desired optimal
or suboptimal solution is obtained.

Basic GA Elements 

A simple genetic algorithm (SGA) was first described 
by Goldberg (1989) and is used here for illustration, 
with a pseudo-code shown below, where the population 
at time t is a time function, P = P(t), with a random 
initial population P(0). 

Procedure GA
Begin
    t = 0
    Initialize P(t)
    Evaluate P(t)
    While not finished do
    Begin
        t = t + 1
        Reproduce P(t) from P(t - 1)
        Crossover individuals in P(t)
        Mutate individuals in P(t)
        Evaluate P(t)
    End
End
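The pseudo-code can be turned into a short runnable sketch. The version below is only one possible reading of it: it assumes binary chromosomes, a non-negative fitness function, an even population size, fitness-proportional reproduction, single-point crossover and bit-flip mutation. The function and parameter names are hypothetical.

```python
import random


def simple_ga(fitness, n_bits=16, pop_size=30, p_c=0.9, p_m=0.01,
              generations=50, seed=0):
    """Return the best binary chromosome found (fitness must be >= 0)."""
    rng = random.Random(seed)
    # Initialize P(0) with uniformly random bits.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Evaluate P(t); reproduce with probability proportional to fitness
        # (uniformly if every score is zero).
        scores = [fitness(ind) for ind in pop]
        weights = scores if sum(scores) > 0 else None
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Crossover: swap the tails of consecutive parents at a random cut.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_c:
                cut = rng.randint(1, n_bits - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Mutate: flip each bit with probability p_m.
        pop = [[1 - g if rng.random() < p_m else g for g in ind]
               for ind in children]
        best = max(pop + [best], key=fitness)
    return best


# Toy objective: maximize the number of ones in the chromosome.
best = simple_ga(sum)
```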

Population Representation and Initialization 

Individuals are encoded as strings (i.e., chromosomes)
composed of symbols from some alphabet, so that the
genotypes (chromosome values) are uniquely mapped onto the
decision variable (phenotype) domain. The most commonly
used representation in GA is the binary alphabet,
{0,1}; others are ternary, integer, real-valued, etc.
(Takagi & Sugeno, 1985).

The search process, described below, operates
on these encoded decision variables rather than on the
decision variables themselves, except when real-valued
genes are used. After a representation method has been
chosen, the first step in the SGA is to create an
initial population, by generating the required number
of individuals via a random number generator which
uniformly distributes initial numbers in the desired
range.

Objective and Fitness Functions 

The objective function is used to measure the performance
of the individuals over the problem domain.
The fitness function is used to transform the objective
function value into a measure of relative fitness;
mathematically, F(x) = g(f(x)), where f is the objective
function, g is the transform that maps the value of f to
a nonnegative number, and F is the resulting relative
fitness. In general, the fitness function value corresponds
to the number of offspring that an individual
can expect to produce in the next generation.
A commonly used transform is the proportional fitness
assignment, defined by

$$ F(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)} $$

where N is the population size and $x_i$ is the phenotypic
value of individual i, i = 1, ..., N.

Although the above fitness assignment ensures that
each individual has a certain probability of reproduction
according to its relative fitness, it does not account for
negative objective function values. A linear transform,
which offsets the objective function, is often used prior
to the fitness assignment. It takes the form

F(x) = a f(x) + b,

where a is a positive scaling factor if the optimization
is to maximize the objective function but is negative if
it is a minimization, and the offset b is used to ensure
that the resulting fitness values are all non-negative.

Then, the selection algorithm selects individuals for 
reproduction on the basis of their relative fitness. 

Reproduction 

Once each individual has been assigned a fitness value, 
they can be chosen from the population with a prob- 
ability according to their relative fitness. They can then 
be recombined to produce the next generation. 

The most widely used genetic operators in GA are the
selection, crossover, and mutation operators. They are
often used together in a GA program.

Selection 



Selection is the process of determining the number
of trials in which a particular individual is chosen
for reproduction, and thus the number of offspring
that the individual will produce in the mating pool, a
temporary population where crossover and mutation
operations are applied to each individual. The selection
of individuals involves two separate processes:

a. determination of the number of trials an individual
can expect to receive;

b. conversion of the expected number of trials into
a discrete number of offspring.
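The two processes can be illustrated with a short sketch. The conversion step below uses stochastic universal sampling, which is one common choice and is not prescribed by the text above; the function names are assumptions.

```python
import random


def expected_trials(fitness_values):
    """Process (a): expected number of trials = fitness / mean fitness."""
    mean = sum(fitness_values) / len(fitness_values)
    return [f / mean for f in fitness_values]


def stochastic_universal_sampling(expected, rng=random):
    """Process (b): convert expected trials into integer offspring counts.

    A single random offset places equally spaced pointers (spacing 1) on the
    cumulative expected values; each individual receives one offspring per
    pointer that falls inside its segment.
    """
    pointer = rng.random()
    counts, cumulative = [], 0.0
    for e in expected:
        cumulative += e
        c = 0
        while pointer < cumulative:
            c += 1
            pointer += 1.0
        counts.append(c)
    return counts


expected = expected_trials([4.0, 2.0, 2.0, 0.0])
offspring = stochastic_universal_sampling(expected, random.Random(1))
```

With this scheme each individual receives either the integer part of its expected number of trials or one more.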



Crossover (Recombination) 

The crossover operator defines the procedure for
generating children from two parents. Analogous to
biological crossover, it exchanges genes at a randomly
selected crossover point between randomly selected
parents in the mating pool to generate children.

A common method is the following: parent chromosomes
are cut at randomly selected points, of which
there can be more than one, and exchange their genes at
these crossover points with a user-specified crossover
probability. This crossover method is categorized
into single-point crossover and multi-point crossover
according to the number of crossover points. Uniform
crossover often works well with small populations of
chromosomes and for simpler problems (Soucek &
Group, 1992).
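A minimal sketch of these crossover variants is given below. The function names, the list representation of chromosomes and the default probabilities are assumptions made for illustration.

```python
import random


def multipoint_crossover(parent_a, parent_b, n_points=1, p_c=0.9, rng=random):
    """Single-point (n_points=1) or multi-point crossover of two chromosomes."""
    a, b = list(parent_a), list(parent_b)
    if rng.random() >= p_c:
        return a, b  # no crossover: the children are copies of the parents
    points = sorted(rng.sample(range(1, len(a)), n_points))
    for cut in points:
        # Swapping the tails at each successive cut yields alternating segments.
        a[cut:], b[cut:] = b[cut:], a[cut:]
    return a, b


def uniform_crossover(parent_a, parent_b, rng=random):
    """Exchange each gene independently with probability 0.5."""
    a, b = list(parent_a), list(parent_b)
    for i in range(len(a)):
        if rng.random() < 0.5:
            a[i], b[i] = b[i], a[i]
    return a, b


rng = random.Random(0)
child_1, child_2 = multipoint_crossover([0] * 8, [1] * 8, n_points=1,
                                        p_c=1.0, rng=rng)
```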

Mutation 

The mutation operation is randomly applied to individuals,
so as to change their gene values with a mutation
probability, P_m, which is very low in general.

GA Parameters 



The choice of the mutation probability P_m and the
crossover probability P_c as two control parameters can
be a complex nonlinear optimization problem. Their
settings are critically dependent upon the nature of the
objective function. This selection issue still remains
open. One suggestion is that for a
large population size (say 100), the crossover rate is 0.6
and the mutation rate is 0.001, while for a small population
size (such as 30), the crossover rate is 0.9 and the mutation
rate is 0.01 (Zalzala & Fleming, 1997).

Figure 2. A chromosome structure for fuzzy modeling

GA-Based Fuzzy System Modeling 

In GA, the parameters of a given problem are represented
by the chromosome. This chromosome may contain
one or more substrings. Each chromosome contains
a possible solution to the problem. A fitness function is
used to evaluate how well a chromosome solves the
problem. In the GA-based approach for fuzzy modeling,
each chromosome represents a specific fuzzy model,
and the ultimate goal is to find a good
(ideally optimal) chromosome that represents the desired
fuzzy model.

Chromosome Structure 

As an example, consider a simple fuzzy model with 
only one rule, along with the scatter partition to be 
encoded to a chromosome. 

Suppose that both real number coding and integer
number coding are used. The structure and the
parameters of the fuzzy model are encoded into one or
more substrings in the chromosome. A chromosome is
composed of two substrings (candidate substring and
decision substring) and these substrings are divided
into two parts (IF part and THEN part), as shown in
Fig. 2.

The candidate substring is encoded by real numbers, 
as shown in Fig. 3 (a). It contains the candidates for 
the parameters of a membership function in the IF part, 
and the fuzzy singleton membership function in the 
THEN part. Figure 3 describes the coding format of a 
candidate substring in a chromosome, where n is the 
number of input variables, r the number of candidates 
for parameters in the IF part, and s the number of can- 
didates for the real numbers in the THEN part. 

Decision substrings are encoded by integers, which 
determine the structure and the number of rules, by 
choosing one of the parameters in the candidate sub- 
strings, as illustrated by Fig. 3 (b). 

The decision substrings for the IF part determine the
premise structure of the fuzzy rule base. Each is composed
of n genes that take integer values (alleles) between
0 and r. According to this value, an appropriate
parameter in the candidate substring is selected. A zero value
means that the related input is not included in the rule.
A decision substring for the THEN part is composed
of c (the maximum number of rules) genes that take
integer values between 0 and s, which choose
appropriate values from the candidate substring for
the THEN part. In this substring, a gene taking the
zero value deletes the related rule. Therefore, these
substrings determine the structure of the THEN part
and the number of rules. Figure 4 illustrates an example
of decoding the chromosome, with the resulting fuzzy
rule shown in Fig. 5.
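The decoding of a single rule can be sketched as follows. The data layout (one list of candidate parameter tuples per input variable, one list of candidate THEN values) and the function name are hypothetical; they follow the description above but are not taken from the figures.

```python
def decode_rule(if_candidates, then_candidates, if_genes, then_gene):
    """Decode one rule from the decision genes.

    if_candidates[v] holds the r candidate membership-function parameters of
    input variable v; then_candidates holds the s candidate singleton values.
    A zero IF gene drops the input from the rule; a zero THEN gene deletes
    the rule altogether.
    """
    if then_gene == 0:
        return None
    premise = {}
    for var, gene in enumerate(if_genes):
        if gene != 0:
            premise[var] = if_candidates[var][gene - 1]
    return premise, then_candidates[then_gene - 1]


# Hypothetical candidates: (center, width) pairs for two inputs, three outputs.
if_candidates = [[(0.0, 1.0), (2.0, 0.5)], [(1.0, 1.0), (3.0, 2.0)]]
then_candidates = [0.5, 1.5, 2.5]
rule = decode_rule(if_candidates, then_candidates, [2, 0], 3)
```

In this example the second input is excluded from the rule because its decision gene is zero.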
Figure 3. Two basic functions in a chromosome: (a) the candidate substrings; (b) the decision substrings



Fitness Function 

To measure the performance of the GA-based fuzzy
modeling, an objective function is defined for
optimization, which is chosen by the designer and usually is
a least-squares matching measure of the form

$$ J = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_i^d \right)^2 $$

where $\{y_i\}$ and $\{y_i^d\}$ are the fuzzy model outputs and
desired outputs, respectively, and n is the number of
the data used.



Figure 4. An example of genetic decoding process

Figure 5. The first fuzzy rule obtained by the decoding processes

Since GA is guided by the fitness values and places
virtually no restriction on the formulation of its performance
measure, one can incorporate more information about
a fuzzy model into the fitness function:
$f = g(J_{\text{accuracy}}, J_{\text{structure}}, \dots)$. One example of a fitness function is

$$ f(J) = \lambda \, \frac{1}{J} + (1 - \lambda) \, \frac{1}{1 + c} $$

where $\lambda \in [0, 1]$ is the weighting factor (a large $\lambda$ gives
a highly accurate model but requires a large number of
rules), and c is the maximum number of rules. When
the fitness function is evaluated over an empty set, it is
undefined; in this case one may introduce a penalty
factor, 0 < p < 1, and compute p · f(J) instead of f(J).
If an individual with a very high fitness value appears
at an early stage, this fitness function may cause
early convergence of the solution, thereby stopping
the algorithm before optimality is reached. To avoid
this situation, the individuals may be sorted according
to their raw fitness values, and the new fitness values
are determined recursively by

$$ f_1 = 1, \quad f_2 = a f_1 = a, \quad \dots, \quad f_m = a f_{m-1} = a^{m-1} $$

for a fitness scaling factor $a \in (0, 1)$.
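The weighted fitness function and the rank-based rescaling can be sketched as follows. The function names and sample values are assumptions; the ranking assigns 1 to the best individual and multiplies by the scaling factor a for each further rank, as in the recursion above.

```python
def weighted_fitness(J, c, lam=0.5, penalty=1.0):
    """f(J) = lam / J + (1 - lam) / (1 + c), optionally scaled by a penalty p."""
    return penalty * (lam / J + (1.0 - lam) / (1.0 + c))


def rank_scaled_fitness(raw, a=0.9):
    """Replace raw fitness values by 1, a, a^2, ... in descending raw order."""
    order = sorted(range(len(raw)), key=lambda i: raw[i], reverse=True)
    scaled = [0.0] * len(raw)
    for rank, i in enumerate(order):
        scaled[i] = a ** rank
    return scaled


f_value = weighted_fitness(J=2.0, c=3, lam=0.5)
scaled = rank_scaled_fitness([0.2, 5.0, 1.0], a=0.5)
```

After rescaling, an individual with an exceptionally large raw fitness is only a fixed factor ahead of the runner-up, which limits its dominance in the early generations.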

GA-Based Fuzzy Modeling with Fine Tuning 

GA generally does not guarantee convergence to a
global optimum. To improve on this, the gradient
descent method can be used to fine tune the parameters
identified by GA. Since GA can usually find a near
global optimum, fine tuning of the membership
function parameters in both the IF and THEN parts,
e.g., by a gradient descent method, can generally lead
to a global optimization (Chang, Joo, Park & Chen,
2002; Goldberg, 1989).



FUTURE TRENDS 

This will be further discussed elsewhere in the future.



CONCLUSION 

Fuzzy systems identification is an important yet
challenging research subject, which calls for more
effort from the control theory and intelligent systems
communities to reach a higher level of efficiency
and success.



KEY TERMS

Fuzzy Rule: A logical rule established based on 
fuzzy logic. 

Fuzzy System: A system formulated and described
by fuzzy set-based real-valued functions.

Genetic Algorithm: An optimization scheme based 
on biological genetic evolutionary principles. 

Least-Squares Algorithm: An optimization
scheme that minimizes the sum of the squares of the
approximation errors.

Parameter Identification: Finding appropriate
parameter values in a mathematical model.

Structure Identification: Finding a mathematical
representation of the unknown system's structure.

System Modeling: A mathematical formulation of
an unknown physical system or process.



INTRODUCTION

Every organism, from the unicellular to the more complex
multicellular ones, needs to process the signals from its
environment to survive. Computer science has already
taken note of this fact, as artificial neural networks
(ANNs) demonstrate. This computational
tool is based on the nervous system of animals, but
nerve cells are not the only cells that process information in an
organism. Every cell has to process the development
and functioning plan encoded in its DNA, and every
one of these cells executes this program in parallel
with the others. Another interesting characteristic of
natural cells is that they form systems that are tolerant
of partial failures: small errors do not induce a global
collapse of the system.

The present work proposes a model that is based on
DNA information processing, adapted to general
information processing. This model can be placed within
the set of techniques called Artificial Embryogeny
(Stanley K. & Miikkulainen R. 2003), which adapt
characteristics of biological cells to solve different
problems.



BACKGROUND 

The Evolutionary Computation (EC) field has given rise
to a set of models that are grouped under the name of
Artificial Embryogeny (AE), a term first introduced by Stanley
and Miikkulainen (Stanley K. & Miikkulainen R.
2003). This group comprises all the models that try to apply
certain characteristics of biological embryonic cells to
computer problem solving, namely self-organisation, failure
tolerance, and parallel information processing.

Work on AE follows two points of view. On the one
hand, there are the grammatical models based on
L-systems (Lindenmayer A. 1968), which take a top-down
approach to the problem. On the other hand, there are
the chemical models based on Turing's ideas
(Turing A. 1952), which take a bottom-up approach.

The grammatical approach has sometimes been used
to study the evolution of ANNs, which is
known as neuroevolution. The first neuroevolution
system was developed by Kitano (Kitano, H. 1990).
In this work Kitano showed that it was possible to evolve
the connectivity matrix of an ANN through a set of rewrite
rules. Another remarkable work is the application of
L-systems by Hornby and Pollack (Hornby, G. S.
& Pollack J. B. 2002), who simultaneously
evolved the body morphologies and the neural
networks of artificial creatures in a simulated 3D physical
environment. Finally, it is worth mentioning the work carried out by
Gruau (Gruau F. 1994), in which the author uses grammar
trees to encode the steps in the development of a neural
network from a single ancestor cell.

On the chemical side, the starting point of this
field can be found in the modelling of gene regulatory
networks performed by Kauffman in 1969 (Kauffman
S.A. 1969). After that, several works were carried out
on subjects such as the complex behaviour generated
by the fact that the differential expression of certain
genes has a cascade influence on the expression of
others (Mjolsness E., Sharp D.H., & Reinitz J. 1995).
Among the works on gene regulatory networks, the
most relevant models are the following: the Kumar and
Bentley model (Kumar S. & Bentley P.J. 2003), which
uses the theory of fractal proteins (Bentley P.J. & Kumar
S. 1999) for the calculation of protein concentration;
the Eggenberger model (Eggenberger P. 1996), which
uses the concepts of cellular differentiation and cellular
movement to determine cell connections; and the work
of Dellaert and Beer (Dellaert F. & Beer R.D. 1996),
who propose a model that incorporates the idea of
biological operons to control the model expression,
where the function assumes the mathematical meaning
of a Boolean function.



GENETIC REGULATORY NETWORK
MODEL

The cells of a biological system are mainly determined
by the DNA strand, the genes, and the proteins contained
in the cytoplasm. The DNA is the structure that holds
the gene-encoded information that is needed for the
development of the system. The genes are activated, or
transcribed, thanks to the protein-shaped information
that exists in the cytoplasm. Each gene consists of two main
parts: the sequence, which identifies the protein that
will be generated if the gene is transcribed, and the
promoter, which identifies the proteins that are needed
for gene transcription.

Another remarkable aspect of biological genes is the
difference between constitutive genes and regulating
genes. The latter are transcribed only when the proteins
identified in the promoter part are present. The constitutive
genes are always transcribed, unless inhibited by
the presence of the proteins identified in the promoter
part, which then act as gene repressors.

The present work has tried to partially model this
structure with the aim of fitting some of its abilities into
a computational model; the resulting system
has a structure that is similar to the above and
is detailed in the next section.

Proposed Model 

Various model variants were developed on the basis
of biological concepts. The proposed artificial cellular
system is based on the interaction of artificial cells
by means of messages that are called proteins. These
cells can divide themselves, die, or generate proteins
that will act as messages for themselves as well as for
neighbour cells.

Figure 1. Structure of a system gene (diagram labels: DNA, GENE; example field values: TRUE, 1001, 0010, 1000)

The system is expected to exhibit a global
information-processing behaviour. Such behaviour
would emerge from the information encoded in a set of
variables of the cell that, by analogy with biological
cells, are named genes.

The central element of our model is the artificial cell.
Every cell holds information, encoded as a binary string, for
the regulation of its functioning. Following the biological
analogy, this string is called DNA. The cell also
has a structure for the storage and management of the
proteins generated by the cell itself and those received
from neighbouring cells; following the biological
model, this structure is called cytoplasm.

The DNA of the artificial cell consists of functional
units that are called genes. Each gene encodes a protein
or message (produced by the gene). The structure of a
gene has four parts (see Figure 1):

• Sequence: the binary string that corresponds to
the protein that the gene encodes.

• Promoters: the gene area that indicates the
proteins that are needed for the gene's transcription.

• Constituent: a bit that identifies whether the gene is
constituent or regulating.

• Activation percentage (binary value): the minimal
concentration of promoter proteins inside the cell,
expressed as a percentage, that causes the transcription
of the gene.

The transcription of the encoded protein occurs
when the promoter proteins of a non-constituent gene
appear at a certain rate in the cellular cytoplasm. The
constituent genes, on the other hand, are expressed until
such expression is inhibited by the presence, at that rate,
of the promoter proteins.
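
A minimal Python sketch of the gene structure and the transcription rule described above follows. The class name, the field names, and the choice of comparing each promoter protein separately against the activation percentage are assumptions made for illustration; the text does not specify whether the threshold applies per protein or to the promoters as a whole.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Gene:
    sequence: str               # protein (binary string) that the gene encodes
    promoters: Tuple[str, ...]  # proteins that regulate the transcription
    constituent: bool           # True: expressed unless inhibited
    activation: float           # minimal promoter concentration, in percent


def is_transcribed(gene: Gene, concentration: Dict[str, float]) -> bool:
    """Apply the transcription rule to one gene.

    `concentration` maps each protein to its percentage in the cytoplasm
    (a hypothetical representation of the cytoplasm contents).
    """
    # Assumption: every promoter must reach the activation percentage.
    promoters_present = bool(gene.promoters) and all(
        concentration.get(p, 0.0) >= gene.activation for p in gene.promoters
    )
    if gene.constituent:
        # Constituent genes are expressed until the promoters inhibit them.
        return not promoters_present
    # Regulating genes are expressed only when their promoters are present.
    return promoters_present
```

For instance, a regulating gene with promoters "0010" and "1000" and an activation percentage of 20 is transcribed under this sketch only when both proteins reach a concentration of 20 percent.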

The other fundamental element, which keeps and
manages the proteins that are received or produced by
the artificial cell, is the cytoplasm. The stored proteins
have a certain lifetime before they are erased. The
cytoplasm checks which and how many proteins are
needed for the cell to activate the DNA genes, and as
such responds to all the cellular requests for the
concentration of a given type of protein. The cytoplasm
also extracts the proteins from the structure when they
are needed for a gene transcription.

Figure 2. Logical operators match

The Information Processing Capacities

Biological cells, besides generating structures, work
as small processors that handle information in parallel
with the remaining cells. The information that they
process comes from what they generate themselves and from
their environment. On the basis of this fact, and although
the work so far has explored the structure-generation
capabilities of the model, the gene and protein
structure can also be used to define a set of operations
with a Boolean algebra-like structure.

The domain of these operations is
the presence or absence of certain proteins in the
system, whereas the result of an operation is the
protein encoded by the gene. The AND operation
(see Figure 2) is modelled with a gene
that needs all the proteins of
its promoters for its expression. The OR operation is modelled
with two genes that, despite their different promoters,
result in the same protein. Finally, the NOT operation
is modelled with the constituent part, which
changes the behaviour of the gene: the presence of
the proteins belonging to the promoters implies the
absence of the gene's resulting protein from the system. This
behaviour is similar to that of gene regulatory networks
(Kauffman S.A. 1969).
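
The three operations can be sketched in Python as follows. This is an illustration under simplifying assumptions: only the presence or absence of proteins is considered (the activation percentage is ignored), genes are written as plain tuples, and the protein codes and names are hypothetical.

```python
def express(genes, present):
    """Return the set of proteins produced by `genes` in one step.

    genes   : iterable of (sequence, promoters, constituent) tuples
    present : set of proteins currently in the cytoplasm
    """
    produced = set()
    for sequence, promoters, constituent in genes:
        promoters_present = all(p in present for p in promoters)
        if constituent:
            transcribed = not promoters_present  # expressed unless inhibited
        else:
            transcribed = promoters_present      # needs every promoter
        if transcribed:
            produced.add(sequence)
    return produced


A, B, OUT = "0001", "0010", "1000"  # hypothetical protein codes

AND_GENES = [(OUT, (A, B), False)]                  # one gene, two promoters
OR_GENES = [(OUT, (A,), False), (OUT, (B,), False)]  # two genes, same protein
NOT_GENES = [(OUT, (A,), True)]                     # constituent gene
```

With these definitions, `OUT in express(AND_GENES, {A, B})` holds, whereas `OUT in express(NOT_GENES, {A})` does not.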

Artificial Neural Networks (ANNs) can be
configured to carry out these processing tasks.



FUTURE TRENDS 

The final objective of this group is to develop an
artificial model, based on the biological model,
with an information-processing capacity similar to that of
ANNs. In order to achieve this objective, some simple
tests have been developed to check the functioning of
the model. The results of these tests show that it is
possible to process information using the gene regulatory
network as the underlying system.

From this point, the next steps
must be directed towards more complex
tasks and towards studying the functioning of the model. Another
objective for future work is the combination of
the information-processing capacities of the model with
the structure-generating capacities presented in
(Fernandez-Blanco E., Dorado J., Rabunal J.R., Gestal M. &
Pedreira N. 2007).



CONCLUSION 

In this work some properties of biological cells have
been adapted to an artificial model. In particular, the
gene regulatory network idea was adapted to information
processing. This adaptation is based on using
the transcription rule to define a Boolean algebra-like
structure. As a result, the model can
now be used to develop information-processing
tests.

Finally, this new way of generating
information-processing networks still requires many tests and
studies before it is established as a consolidated
technique for information processing.



KEY TERMS

Artificial Cell: Each of the elements that process
the orders encoded in the DNA.

Artificial Embryogeny: The term covers all the
processing models that use ideas from biological
development as inspiration.

Cytoplasm: Part of an artificial cell that is responsible
for managing the protein-shaped messages.

DNA: Set of rules that are responsible for the
cell behaviour.

Gene: Each of the rules that encode one action
of the cell.

Gene Regulatory Network: Term that names the
connections between the different genes of a DNA. The
connections identify the genes that are necessary for
the transcription of other ones.

Protein: This term identifies each kind of
message that an artificial cell receives.



INTRODUCTION

Genetic algorithms (GAs) are stochastic search
techniques based on the concepts of natural population
genetics for exploring a huge solution space in
identifying optimal or near-optimal solutions (Davis,
1991)(Holland, 1992)(Reeves & Rowe, 2003), and are
more likely to avoid the local optima problem than
traditional gradient-based hill-climbing optimization
techniques when solving complex problems.

In essence, GAs are a type of reinforcement learning
technique (Grefenstette, 1993), which is able to
improve solutions gradually on the basis of the previous
solutions. GAs are characterized by their ability
to combine candidate solutions to exploit efficiently a
promising area in the solution space while stochastically
exploring new search regions with expected improved
performance. Many successful applications of this technique
are frequently reported across various kinds of
industries and businesses, including function optimization
(Ballester & Carter, 2004)(Richter & Paxton, 2005),
financial risk and portfolio management (Shin & Han,
1999), market trading (Kean, 1995), machine vision and
pattern recognition (Vafaie & De Jong, 1998), document
retrieval (Gordon, 1988), network topological design
(Pierre & Legault, 1998)(Arabas & Kozdrowski, 2001),
job shop scheduling (Ozdamar, 1999), and optimization
of an operating system's dynamic memory configuration
(Del Rosso, 2006), among others.

In this research we introduce the concept and com- 
ponents of GAs, and then apply the GA technique to 
the modeling of the batch selection problem of flexible 
manufacturing systems (FMSs). The model developed 
in this paper serves as the basis for the experiment in 
Deng (2007). 



GENETIC ALGORITHMS 

GAs are simulation techniques proposed by John
Holland in the 1960s (Holland, 1992). Basically, GAs
solve problems by maintaining and modifying a population
of candidate solutions through the application
of genetic operators. During this process, beneficial
changes to parent solutions are combined into their
offspring in developing optimal or near-optimal solutions
for the given task.

Intrinsically, GAs explore multiple potentially promising
regions in the solution space at the same time,
and switch stochastically from one region to another
for performance improvement. According to Holland
(1992), regions in the solution space can be defined
by syntactic patterns of solutions, and each pattern is
called a schema. A schema represents the pattern of
common attributes or features of the solutions in the
same region. Let $\Sigma$ be an alphabet of symbols. A string
over an alphabet is a finite sequence of symbols from
the alphabet. An n-ary schema is defined as a string in
$(\Sigma \cup \{\#\})^n$, where $\# \notin \Sigma$ is used as a wildcard
denoting any symbol in $\Sigma$.

Conceptually, n-ary schemata can be regarded as
defining hypersurfaces of an n-dimensional hypercube
that represents the space of all n-attribute solutions.
Individual solutions in the same region can be regarded
as instances of the representing schema, and
an individual solution can belong to multiple schemata
at the same time. In fact, an n-attribute solution is a
member of $2^n$ different schemata. Therefore, evaluating
a solution has an effect similar to sampling $2^n$ regions
(i.e., schemata) at the same time, and this is the famous
implicit parallelism of genetic search. A population of
$M$ solutions will contain at least $2^n$ and at most $M \cdot 2^n$
schemata. Even for modest values of $n$ and $M$, there will
be a large number of schemata available for processing
in the population. GAs perform an implicit parallel
search through the space of possible schemata in the
form of performing an explicit parallel search through
the space of individual solutions.
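
The counting argument above can be checked on small strings with a short Python sketch. The function names are hypothetical; the wildcard symbol "#" follows the text.

```python
from itertools import product


def schemata_of(solution, wildcard="#"):
    """Return every schema that `solution` is an instance of.

    Each position either keeps its symbol or becomes the wildcard,
    which gives 2**n schemata for a string of length n.
    """
    choices = [(symbol, wildcard) for symbol in solution]
    return {"".join(s) for s in product(*choices)}


def matches(schema, solution, wildcard="#"):
    """True if `solution` is an instance of `schema`."""
    return len(schema) == len(solution) and all(
        s == wildcard or s == x for s, x in zip(schema, solution)
    )


def schemata_in_population(population):
    """Return the set of distinct schemata sampled by a population."""
    found = set()
    for solution in population:
        found |= schemata_of(solution)
    return found
```

For the 3-bit string "101" the sketch returns 8 schemata, among them "1#1" and "###"; a population of identical strings reaches the lower bound of $2^n$ schemata.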

The problem solving process of GAs follows a
five-phase operational cycle: generation, evaluation,
selection, recombination (or crossover), and mutation.
At first a population of candidate solutions is generated.
A fitness function or objective function is then defined,
and each candidate solution in the population is evaluated
to determine its performance or fitness. Based on
the relative fitness value, two candidate solutions are
selected probabilistically as parents. Recombination
is then applied probabilistically to the two parents to
form two offspring, and each of the offspring solutions
contains some characteristics from its parent solutions.
After this, mutation is applied sparingly to components
of each offspring solution. The newly generated offspring
are then used to replace the low-fitness members
in the population. This process is repeated until a new
population is formed. Through the above iterative cycles
of operations, GAs are able to develop better solutions
through progressive generations.

In order to prepare for the investigation of the effects
of genetic operations in the sequel to the current research,
we apply the GA technique to the optimization modeling
of manufacturing systems in the next section.



A GA-BASED BATCH SELECTION 
SYSTEM 

Batch selection is one of the most critical tasks in the
development of a master production plan for flexible
manufacturing systems (FMSs). In the manufacturing
process, each product requires processing by different
sets of tools on different machines, with different
operations performed in a certain sequence. Each machine
has its own limited space capacity for mounting
tools and a limited amount of available processing time.
Under these resource constraints, choosing
an optimal batch of products to be manufactured in a
continuous operational process, with the purpose of
maximizing machine utilization or profits, is
a very hard problem. While this
problem is usually manageable for a small
number of products, it quickly becomes intractable as
the number of products grows. The
time required to solve the problem exhaustively
grows exponentially with
the number of products to be manufactured.

Batch selection affects all the subsequent decisions
in job shop scheduling for satisfying the master
production plan, and holds the key to the efficient
utilization of resources in generating production plans
for fulfilling production orders. In our formulation,
we use the following notation:

$M$: the cardinality of the set of machines available

$T$: the cardinality of the set of tools available

$P$: the cardinality of the set of products to be manufactured

$MachineUtilization$: the function of total machine utilization

$processing\_time_{product,tool,machine}$: the time needed to manufacture product $product$ using tool $tool$ on machine $machine$

$available\_time_{machine}$: the total available processing time on machine $machine$

$capacity_{machine}$: the total number of slots available on machine $machine$

$machine$, $tool$, $product$: indices for machines, tools, and products to be manufactured, respectively

$slot_{tool}$: the number of slots required by machine tool $tool$

$quantity_{product}$: the quantity of product $product$ to be manufactured in a shift

$Q_{product}$: the quantity of product $product$ ordered by customers, as specified in the production table

Fitness (or Objective) Function 

The objective is to identify a batch of products to be
manufactured so that the total machine utilization rate
is maximized. See Exhibit A.

The above objective function is to be maximized
subject to the following resource constraints:

1. Machine capacity constraint (see Exhibit B)

The function $f(\cdot)$ in Exhibit B is used to determine whether tool
$tool$ needs to be mounted on machine $machine$ for the
processing of the current batch of products.

2. Machine time constraint (see Exhibit C)

3. Non-negativity and integer constraints
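
The objective and the three constraints can be sketched in Python as shown below. The sketch reads the utilization rate in Exhibit A as total processing time divided by total available time, which is an interpretation of the exhibit, and it uses zero-based nested lists indexed as `processing_time[product][tool][machine]`. The function names and the data layout are hypothetical.

```python
def machine_utilization(quantity, processing_time, available_time):
    """Total processing time of the batch divided by total available time."""
    used = sum(
        processing_time[p][t][m] * quantity[p]
        for p in range(len(quantity))
        for t in range(len(processing_time[p]))
        for m in range(len(available_time))
    )
    return used / sum(available_time)


def is_feasible(quantity, processing_time, available_time, capacity, slot, Q):
    """Check the capacity, time, non-negativity, and integer constraints."""
    P, T, M = len(quantity), len(slot), len(capacity)
    # Constraint 3: integer quantities between 0 and the ordered quantity.
    for p, q in enumerate(quantity):
        if not isinstance(q, int) or q < 0 or q > Q[p]:
            return False
    for m in range(M):
        # Processing time that each tool spends on machine m for this batch.
        load = [
            sum(processing_time[p][t][m] * quantity[p] for p in range(P))
            for t in range(T)
        ]
        # Constraint 1: f(y) = 1 if y > 0, so only used tools occupy slots.
        if sum(slot[t] for t in range(T) if load[t] > 0) > capacity[m]:
            return False
        # Constraint 2: total processing time within the available time.
        if sum(load) > available_time[m]:
            return False
    return True
```

In a GA, a check of this kind is typically applied to each decoded candidate solution before its fitness is used; how infeasible candidates are treated (rejected, repaired, or penalized) is a separate design choice that the sketch leaves open.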

Exhibit A.

$$\text{Maximize } MachineUtilization(quantity_1, quantity_2, \ldots, quantity_P) = \frac{\sum_{machine=1}^{M} \sum_{tool=1}^{T} \sum_{product=1}^{P} processing\_time_{product,tool,machine} \cdot quantity_{product}}{\sum_{machine=1}^{M} available\_time_{machine}}$$

Exhibit B.

$$\sum_{tool=1}^{T} slot_{tool} \cdot f\Big(\sum_{product=1}^{P} processing\_time_{product,tool,1} \cdot quantity_{product}\Big) \le capacity_1$$

$$\vdots$$

$$\sum_{tool=1}^{T} slot_{tool} \cdot f\Big(\sum_{product=1}^{P} processing\_time_{product,tool,M} \cdot quantity_{product}\Big) \le capacity_M$$

where

$$f(y) = \begin{cases} 1, & \text{if } y > 0 \\ 0, & \text{if } y = 0 \end{cases}$$

Exhibit C.

$$\sum_{product=1}^{P} \sum_{tool=1}^{T} processing\_time_{product,tool,1} \cdot quantity_{product} \le available\_time_1$$

$$\vdots$$

$$\sum_{product=1}^{P} \sum_{tool=1}^{T} processing\_time_{product,tool,M} \cdot quantity_{product} \le available\_time_M$$

$$quantity_{product} \ge 0, \quad quantity_{product} \le Q_{product}, \quad \text{and } quantity_{product} \text{ is an integer, for } product = 1, 2, \ldots, P$$

Encoder/Decoder

The Encoder/Decoder is a representation scheme used
to determine how the problem is structured in the GA
system. The way in which candidate solutions are encoded
is a central factor in the success of GAs
(Mitchell, 1996). Generally, the solution encoding can
be defined over an alphabet $\Sigma$ which might consist of
binary digits, continuous numbers, integers, or symbols.
However, choosing the best encoding scheme is almost
tantamount to solving the problem itself (Mitchell,
1996). In this research, our GA system is mainly based
on Holland's canonical model (Holland, 1992), which
uses binary encoding, one of the most commonly used
encoding schemes in practice.

A candidate solution for the batch selection task is a
vector of quantities to be manufactured for $P$ products.
Let the entire solution space be denoted as $solution$
(see Exhibit D).

Genetic Algorithm Applications to Optimization Modeling

Exhibit D.

$$solution = \prod_{product=1}^{P} [0, 1, \ldots, Q_{product}] = \{(quantity_1, \ldots, quantity_P) \in (\{0\} \cup \mathbb{N})^P \mid 0 \le quantity_{product} \le Q_{product},\ quantity_{product} \text{ is an integer, and } product = 1, 2, \ldots, P\}$$

The encoding function encodes the quantity to be
produced for each product as an $l$-bit binary string, and
then forms a concatenation of the strings for the $P$ products
which are to be included in a production batch. Each
candidate solution $(quantity_1, \ldots, quantity_P)$ is a string
of length $lP$ over the binary alphabet $\Sigma = \{0, 1\}$. Such
an encoded $l$-bit string has a value equal to

$$\begin{cases} quantity_{product}, & \text{if } \max\limits_{product=1,2,\ldots,P} \{Q_{product}\} \le 2^l - 1 \\ \left\lceil \dfrac{quantity_{product} \cdot (2^l - 1)}{\max\limits_{product=1,2,\ldots,P} \{Q_{product}\}} - 0.5 \right\rceil, & \text{otherwise.} \end{cases}$$

In the above formula, $2^l - 1$ is the value of an $l$-bit
string of all ones, and $\lceil \cdot \rceil$ is the ceiling function. For
example, assume there are only two products to be
selected in a production batch with 200 units as the
largest possible quantity to be manufactured for each
product. A candidate solution consisting of quantities
100 and 51 for products 1 and 2 respectively will be
represented by a 16-bit string as 0110010000110011
with the first 8 bits representing product 1 and the
second 8 bits representing product 2.

After a new solution string is generated, it is then
decoded back to the format needed for the computation of the
objective function and for the check of solution feasibility.
Let each $l$-bit segment of a solution string be
denoted as $string$, with $string[i]$ as the value of the $i$-th
bit in the $l$-bit segment. The decoding function converts
each $l$-bit string according to the following formula:

$$quantity_{product} = \begin{cases} \sum\limits_{i=1}^{l} string[i] \cdot 2^{l-i}, & \text{if } \max\limits_{product=1,2,\ldots,P} \{Q_{product}\} \le 2^l - 1 \\ \left\lceil \Big(\sum\limits_{i=1}^{l} string[i] \cdot 2^{l-i}\Big) \cdot \dfrac{\max\limits_{product=1,2,\ldots,P} \{Q_{product}\}}{2^l - 1} - 0.5 \right\rceil, & \text{otherwise.} \end{cases}$$
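
The encoding and decoding functions can be sketched in Python as follows. The sketch assumes that the first bit of each segment is the most significant one, as in the 16-bit example for quantities 100 and 51, and that the scaled branch rounds with the ceiling of the value minus 0.5. The function names are hypothetical.

```python
import math


def encode(quantities, Q, l):
    """Concatenate one l-bit segment per product, most significant bit first."""
    q_max, top = max(Q), 2 ** l - 1
    segments = []
    for q in quantities:
        if q_max <= top:
            value = q  # the quantity fits into l bits directly
        else:
            value = math.ceil(q * top / q_max - 0.5)  # scale into 0 .. 2**l - 1
        segments.append(format(value, f"0{l}b"))
    return "".join(segments)


def decode(string, Q, l):
    """Split the string into l-bit segments and recover the quantities."""
    q_max, top = max(Q), 2 ** l - 1
    quantities = []
    for start in range(0, len(string), l):
        value = int(string[start:start + l], 2)
        if q_max > top:
            value = math.ceil(value * q_max / top - 0.5)  # scale back
        quantities.append(value)
    return quantities
```

When the largest ordered quantity exceeds $2^l - 1$, several quantities share one bit pattern, so decoding an encoded solution recovers the original quantity only approximately.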



Five-Phase Genetic Operations 

Our system follows the generation-evaluation-selection-crossover-mutation
cycles in searching for appropriate
solution strings for the batch selection task.
It starts with generating an initial population, $Pop$, of
$pop\_size$ candidate solution strings at random. In each
iteration of the operational cycle, each candidate solution
string, $s_i$, in the current population is evaluated by
the fitness function.

Candidate solution strings in the current population
are selected probabilistically on the basis of their fitness
values as seeds for generating the next generation. The
purpose of selection is to generate offspring of high
fitness value on the basis of the fitter members in the
current population. Actually, selection is the mechanism
that helps our GA system to exploit a promising region
in the solution space. There are several fitness-based
schemes for the selection process: roulette-wheel
selection, rank-based selection, tournament selection,
and elitist selection (Goldberg, 1989)(Michalewicz,
1994). The first three methods randomly select candidate
solution strings for reproduction on the basis of either
the fitness value or the rank of individual strings. The best
members of the current population might be lost if they
are not selected to reproduce or if they are altered by
crossover (i.e., recombination) or mutation. The elitist
selection strategy serves the purpose of retaining some
of the fittest individuals from the current population.

Elitist selection retains a limited number of "elite"
solution strings, i.e., strings with the best fitness
values, for passing to the next generation without
any modification. A fraction called the "generation
gap" is used to specify the proportion of the population
to be replaced by offspring strings after each
iteration. Our GA system retains copies of the first
$(1 - generation\_gap) \cdot pop\_size$ "elitist" members
of $Pop$ for the formation of the next population,
$Pop_{new}$.

For generating the rest of the members for $Pop_{new}$,
the GA module will probabilistically select

$$generation\_gap \cdot pop\_size$$

pairs of solution strings from $Pop$ for generating offspring
strings. The probability of selecting a solution
string, $s_i$, from $Pop$ is given by

$$Pr(s_i) = \frac{Fitness(s_i)}{\sum_{j=1}^{pop\_size} Fitness(s_j)}.$$

Let the cumulative probability of individual solution
strings in the population be called $C_i$, and

$$C_i = \sum_{j=1}^{i} Pr(s_j),$$

for $i = 1, 2, \ldots, pop\_size$. The solution string $s_i$ will be
selected for reproduction if $C_{i-1} < rand(0,1) \le C_i$, taking $C_0 = 0$.
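
The elitist step and the roulette-wheel rule can be sketched in Python as follows. The sketch assumes non-negative fitness values with a positive sum, picks the first string whose cumulative probability reaches the random draw, and rounds the number of elite members to the nearest integer; these choices and the function names are assumptions for illustration.

```python
import bisect
import itertools
import random


def roulette_pick(population, fitness, rng):
    """Pick one string with probability Fitness(s_i) / sum_j Fitness(s_j)."""
    total = sum(fitness)
    cumulative = list(itertools.accumulate(f / total for f in fitness))
    r = rng.random()
    # Smallest index i whose cumulative probability C_i is >= r; the clamp
    # guards against a last C_i that falls slightly below 1 through rounding.
    i = min(bisect.bisect_left(cumulative, r), len(population) - 1)
    return population[i]


def elite_members(population, fitness, generation_gap):
    """Copy the best (1 - generation_gap) * pop_size strings unchanged."""
    n_elite = round((1.0 - generation_gap) * len(population))
    ranked = sorted(zip(fitness, population), key=lambda fp: fp[0], reverse=True)
    return [s for _, s in ranked[:n_elite]]
```

A typical call would create `rng = random.Random(seed)` once and reuse it, so that a run can be reproduced from its seed.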

In addition to exploiting a promising solution region
via the selection process, we also need to explore other
promising regions for possibly better solutions.
Exploitation without exploration will cause degeneration
of a population of solution strings, and might cause
the local optima problem for the system. Actually, the
capability of maintaining a balance between exploitation and
exploration is a major strength of the GA approach over
traditional optimization techniques. The exploration
function is achieved by the crossover and mutation
operators. These two operators generate offspring
solutions which belong to new schemata, and thus allow
our system to explore other promising regions in
the solution space. This process also allows our system
to improve its performance stochastically.

Crossover recombines good solution strings in
the current population and proliferates the population
gradually with schemata of high fitness values. Crossover
is commonly regarded as the most distinguishing
operator of GAs, and it usually interacts in a highly
intractable manner with the fitness function, the encoding, and
other details of a GA (Mitchell, 1996). Though various
crossover operators have been proposed, there are
no general conclusions on when to use which type of
crossover (Michalewicz, 1994)(Mitchell, 1996).

In this paper, we adopt the standard one-point crossover
for our GA system. For each pair of solution strings
selected for reproduction, the value of $crossover\_rate$
determines the probability of their recombination. A
position in both candidate solution strings is randomly
selected as the crossover point. The parts of the two parent
strings after the crossover position are exchanged to
form two offspring. Let $k$ be the crossover point randomly
generated from a uniform distribution ranging
from 1 to $lP$, where $lP$ is the length of a solution string.
Let $s_i = (x_1, x_2, \ldots, x_{k-1}, x_k, \ldots, x_{lP})$ and $s_j = (y_1, y_2, \ldots,
y_{k-1}, y_k, \ldots, y_{lP})$ represent a pair of candidate solution
strings selected for reproduction. Based on these two
strings, the crossover operator generates two offspring
$s'_i = (x'_1, x'_2, \ldots, x'_{lP})$ and $s'_j = (y'_1, y'_2, \ldots, y'_{lP})$, where

$$x'_i = \begin{cases} x_i, & \text{if } i < k \\ y_i, & \text{otherwise} \end{cases} \qquad y'_i = \begin{cases} y_i, & \text{if } i < k \\ x_i, & \text{otherwise.} \end{cases}$$

In other words, $s'_i = (x_1, x_2, \ldots, x_{k-1}, y_k, \ldots, y_{lP})$ and $s'_j =
(y_1, y_2, \ldots, y_{k-1}, x_k, \ldots, x_{lP})$. These two offspring are then
added to $Pop_{new}$. This offspring-generating process is
repeated until there are $generation\_gap \cdot pop\_size$
offspring generated for $Pop_{new}$.
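
A minimal Python sketch of the one-point crossover follows. It assumes that a pair that is not recombined passes to the offspring unchanged, which the text does not state explicitly; the function name is hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, crossover_rate, rng):
    """Return two offspring; positions k..lP (1-based) are exchanged."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if rng.random() >= crossover_rate:
        # Assumption: without recombination the offspring copy the parents.
        return parent_a, parent_b
    k = rng.randint(1, len(parent_a))  # crossover point, uniform on 1..lP
    cut = k - 1                        # 0-based index of position k
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])
```

Because the two offspring only exchange symbols position by position, each position still holds one symbol from each parent after the operation.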

With selection and crossover alone, our system
might occasionally develop a uniform population
which consists of the same solution strings. This will
blind our system to other possible solutions. Mutation,
which is the other operator applied to the reproduction
process, is used to help our system avoid the formation
of a uniform population by introducing diversity into a
population. It is generally believed that mutation alone
does not advance the search for a solution, and it is usually
considered to play a secondary role in the operation of
GAs (Goldberg, 1989). Usually, mutation is applied
to alter the bit value of a string in a population only
occasionally. Let $mutation\_rate$ be the probability of
mutation for each bit in a candidate solution string.

For each offspring string, $s' = (x'_1, x'_2, \ldots, x'_{lP})$, generated
by the crossover operator for the new population
$Pop_{new}$, the mutation operator will invert each bit
probabilistically:

$$x'_i \leftarrow \begin{cases} 1 - x'_i, & \text{if } rand(0,1) < mutation\_rate \\ x'_i, & \text{otherwise.} \end{cases}$$

The probability of mutation for a candidate solution
string is $1 - (1 - mutation\_rate)^{lP}$.
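
The bit-wise mutation and the resulting per-string probability can be sketched in Python as follows; the function names are hypothetical.

```python
import random


def mutate(string, mutation_rate, rng):
    """Flip each bit independently with probability mutation_rate."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
        for bit in string
    )


def string_mutation_probability(mutation_rate, length):
    """Probability that at least one of `length` bits is flipped."""
    return 1.0 - (1.0 - mutation_rate) ** length
```

Even a small per-bit rate adds up over a long string, which is one reason why the mutation rate is usually kept low.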

The above processes constitute an operational cycle
of our system. These operations are repeated until the
termination criterion is reached, and the result is passed
to the Decoder for decoding. The decoded result is then
presented to the decision maker for further consideration
in the final decision. If the current solution is not satisfactory
to the decision maker, it can be
modified by the decision maker and then entered into
the GA system to initiate another run of the search process
for satisfactory solutions.



FUTURE TRENDS AND CONCLUSION 

In this paper we designed a GA-based system for the 
batch selection problem of flexible manufacturing 
systems. In our design we adopted a binary encoding 
scheme, the elitist selection strategy, a single-point 
crossover strategy, and a uniform random mutation 
for the batch selection problem. 

The performance of GAs is usually influenced by 
various parameters and the complicated interactions 
among them, and there are several issues worth further 
investigation. With the availability of a larger pool of 
diverse schemata in a larger population, our GA system 
will have a broader view of the "landscape" (Holland, 
1992) of the solution space, and is thus more likely to 
contain representative solutions from a large number of 
hyperplanes. This advantage gives GAs more chances 
of discovering better solutions in the solution space. 
However, Davis (1991) argues that the most effective
population size is dependent upon the nature of the
problem, the representation formalism, and the GA
operators. In contrast, Schaffer et al. (1991) asserted
that the best setting for population size is independent
of the problem. In a sequel to this paper, we will conduct
a sequence of experiments to systematically analyze the
influence of the population size on GA performance,
using the batch-selection model proposed in this
paper, so that we can be more conclusive on the issue
of the effective population size.

KEY TERMS

Batch Selection: Selecting the optimal set of prod- 
ucts to produce, with each product requiring a set of 
resources, under the system capacity constraints 

Fitness Functions: The objective function of the 
GA for evaluating a population of solutions 

Flexible Manufacturing Systems: A manufactur- 
ing system which maintains the flexibility of order 
of operations and machine assignment in reacting 
to planned or unplanned changes in the production 
process 

Genetic Algorithms: A stochastic search method 
which applies genetic operators to a population of 
solutions for progressively generating optimal or near- 
optimal solutions 

Genetic Operators: Selection, crossover, and 
mutation, for combining and refining solutions in a 
population 

Implicit Parallelism: A property of the GA which 
allows a schema to be matched by multiple candidate 
solutions simultaneously without even trying 

Landscape: A function plot showing the state as 
the "location" and the objective function value as the 
"elevation" 

Reinforcement Learning: A learning method 
which interprets feedback from an environment to 
learn optimal sets of condition/response relationships 
for problem solving within that environment 

Schemata: General patterns of bit strings made up of
1, 0, and #, used as building blocks for solutions of
the GA


INTRODUCTION

Wireless sensor networks (WSNs) consist of a large 
number of low-cost and low-power sensor nodes. 
Some of the applications of sensor networks are en- 
vironmental observation, monitoring disaster areas 
and so on. Distributed evolutionary computing is a
powerful tool that can be applied to WSNs, because
these networks require algorithms that are capable of
learning independently of the operation of other nodes
and also capable of using local information (Johnson,
Teredesai & Saltarelli, 2005). Evolutionary algorithms
must be designed for the resource constraints present in
WSNs. This article describes how genetic algorithms
can be used in WSN design in order to satisfy energy
conservation and connectivity constraints.



BACKGROUND 

Recent advances in wireless communications and
digital electronics have led to the implementation of
low-power and low-cost wireless sensors. A sensor node
must have components for sensing, data processing
and communication. These devices can be grouped to
form a sensor network (Akyildiz, Sankarasubramaniam
& Cayirci, 2002) (Callaway, 2003). The network
protocols, such as formation algorithms, routing and
management, must have self-organizing capabilities. In
general, sensor networks differ from traditional wireless
networks in several aspects: the number of sensor nodes
can be very high; sensor nodes are prone to failures;
sensor nodes are densely deployed; the topology of the
network can change frequently; and sensor nodes are
limited in computational capacity, memory and energy.

The major challenge in the design of WSNs is the 
fact that energy resources are significantly more limited 
than in wired networks and other types of wireless 
networks. The battery of the sensors in the network 
may be difficult to recharge or replace, causing severe 
limitations in the communication and processing time 
between all sensors in the network. Thus, the main 
parameter to optimize for is the network lifetime, or 
the time until a group of sensors runs out of energy. 
Another issue in WSN design is the connectivity of 
the network according to the selected communication 
protocol. Usually, the protocol follows the cluster-based 
architecture, where single hop communication occurs 
between sensors of a cluster and a selected cluster head 
sensor that collects all information obtained by the 
other sensors in its cluster. This architecture is shown
in Figure 1. Since the purpose of the sensor network
is the collection and management of measured data for
some particular application, this collection must meet
specific requirements depending on the type of data.
These requirements are turned into application-specific
parameters of the network.



Figure 1. Cluster-based sensor network (legend: sensor node; labels: Cluster 2, Cluster 3)



GENETIC ALGORITHMS FOR WIRELESS SENSOR NETWORKS

A WSN designer who takes into account all the design
issues deals with more than one non-linear objective
function or design criterion, all of which should be
optimized simultaneously. Therefore, the focus of the
problem is how to find many near-optimal non-dominated
solutions in a practically acceptable computational time
(Jourdan & de Weck, 2004) (Weise, 2006) (Ferentinos
& Tsiligiridis, 2007). There are several interesting
approaches to tackling such problems, but one of the
most powerful heuristics, which is also appropriate
for multi-objective optimization problems, is based
on genetic algorithms (GA) (Ferentinos & Tsiligiridis,
2007).

Genetic algorithms have been used in many fields
of science to derive solutions for a wide variety of
problems (Goldberg, 1989) (Weise, 2006). They are
particularly useful in applications involving design and
optimization, where there are large numbers of variables
and where procedural algorithms are either non-existent
or extremely complicated (Khana, Liu & Chen, 2006)
(Khana, Liu & Chen, 2007). In nature, a species adapts
to an environment because the individuals that are the
fittest with respect to that environment have the best
chance to reproduce, possibly creating even fitter
offspring. This is the basic idea of genetic evolution.
Genetic algorithms start with an initial population of
random solution candidates, called individuals or
chromosomes. In the case of sensor networks, the
individuals are small programs that can be executed on
sensor nodes (Wazed, Bari, Jaekel & Bandyopadhyay, 2007).

Each individual may be represented as a simple string
or array of genes, each of which contains a part of the
solution. The values of genes are called alleles. As in
nature, the population is refined step by step in a cycle
of computing the fitness of its individuals, selecting the
best individuals and creating a new generation derived
from these. A fitness function is provided to assign a
fitness value to each individual, based on how close
the individual is to the optimal solution. Two randomly
selected individuals, the parents, can exchange genetic
information in a process called crossover to produce
two new chromosomes known as children. A process called
mutation may also be applied after crossover to help
obtain a good solution. This process helps to restore
lost genetic values when the population converges
too fast. After the crossover and mutation processes,
the individuals of the next generation are selected.
Some of the poorest individuals of the generation can
be replaced by the best individuals from the previous
generation. This is called elitism, and it ensures that
the best individual of the new generation is at least as
fit as that of the previous generation. The algorithm
stops when a predetermined stopping criterion is met
(Hussain, Matin & Islam, 2007).
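The cycle described above (fitness evaluation, parent selection, crossover, mutation, elitism and a stopping criterion) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: individuals are bit strings, parents are drawn uniformly from the fitter half of the population, crossover is one-point, and the stopping criterion is a fixed number of generations. The function name `evolve` and all default parameter values are hypothetical.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.02, n_elite=2, seed=0):
    """Generational GA over bit strings; returns the best individual found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):  # stopping criterion: fixed generation count
        ranked = sorted(population, key=fitness, reverse=True)
        children = []
        while len(children) < pop_size:
            # Assumption: parents are picked at random among the fitter half.
            p1, p2 = rng.sample(ranked[:pop_size // 2], 2)
            c1, c2 = p1[:], p2[:]
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)  # one-point crossover
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                # Mutation: flip each gene with a small probability.
                children.append([1 - g if rng.random() < mutation_rate else g
                                 for g in child])
        # Elitism: the poorest children give way to the best previous individuals.
        children = sorted(children[:pop_size], key=fitness, reverse=True)
        population = ranked[:n_elite] + children[:pop_size - n_elite]
    return max(population, key=fitness)


# Toy fitness (number of ones) used only to exercise the loop
best = evolve(sum, n_genes=16)
print(best, sum(best))
```

Because the elite individuals are copied unchanged, the best fitness in the population cannot decrease from one generation to the next.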

Fitness Function and Specific Parameters for WSNs

The fitness function executed in a sensor node is a 
weighted function that measures the quality or per- 
formance of a solution, in this case a specific sensor 
network design. This function is maximized by the GA 
system in the process of evolutionary optimization. A 
fitness function must include and correctly represent 
all or at least the most important factors that affect the 
performance of the system. The major issue in develop- 
ing a fitness function is the decision on which factors 
are the most important ones (Ferentinos & Tsiligiridis, 
2007) (Gnanapandithan & Natarajan, 2006). 

A genetic algorithm must be designed for WSN 
topologies by optimizing energy-related parameters 
that affect the battery consumption of the sensors and 
thus, the lifetime of the network. At the same time, the 
algorithm has to meet some connectivity constraints 
and optimize some physical parameters of the WSN 
implemented by the specific application. The multiple 
objectives of the optimization problem are blended into 
a single objective function, the parameters of which are 
combined to formulate a fitness function that gives a 
quality measure to each WSN topology. Three sets of 
parameters dominate the design and the performance 
of a WSN: the application specific parameters, con- 
nectivity parameters and the energy related parameters. 
Some possible parameters are discussed in (Ferentinos 
& Tsiligiridis, 2007): 

• Operation energy: the energy that a sensor consumes
  during some specific time of operation. It depends on
  whether the sensor operates as a cluster head or as a
  regular sensor.
• Communication energy: the energy consumption due to
  communication between sensors. It depends on the
  distance between transmitter and receiver.
• Battery life: battery capacity of each sensor.
• Sensors-per-cluster head: parameter to ensure that
  each cluster head does not have more than a maximum
  predefined number of sensors in its cluster. It depends
  on the physical communication capabilities and the
  amount of data that can be processed by a cluster head.
• Sensors out of range error: parameter to ensure that
  each sensor can communicate with its cluster head. It
  depends on the signal strength of the sensors.
• Spatial density: minimal number of measurement points
  that adequately monitor the variables of a given area.
• Uniformity of measurement: the measurements of an
  area of interest must give a uniform view of the area
  conditions. The total area can be divided into several
  sub-areas for a uniform measurement.
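The blending of several design parameters into a single weighted objective, as described before the list above, might be sketched as follows. The parameter names, their normalized values and the weights are hypothetical and chosen only for illustration; in practice they would come from the network model and from experimentation.

```python
def weighted_objective(parameters, weights):
    """Blend several parameter objectives into one value to be minimized."""
    if parameters.keys() != weights.keys():
        raise ValueError("each parameter needs exactly one weight")
    return sum(weights[name] * value for name, value in parameters.items())


# Hypothetical, already normalized values for one candidate topology
candidate = {
    "operation_energy": 0.40,
    "communication_energy": 0.25,
    "sensors_out_of_range": 0.10,
}
# Hypothetical weights expressing the relative importance of each parameter
weights = {
    "operation_energy": 0.5,
    "communication_energy": 0.3,
    "sensors_out_of_range": 0.2,
}
score = weighted_objective(candidate, weights)
print(round(score, 3))
```

Normalizing the parameters to a common scale before weighting keeps a single large-valued parameter from dominating the sum.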

Other parameters can be defined, especially those
related to application-specific requirements, such as
sensor-to-sink delay, routing information, localization,
network coverage, etc. The optimization problem is
defined by the minimization of the WSN parameters.
If n optimization parameters are defined, they may
be combined into a single objective function:

$$
\min \sum_{i=1}^{n} w_i P_i
$$

where $P_i$ is the i-th parameter objective and $w_i$ is its
weighting coefficient, which defines the importance of
that parameter in the network design. The importance of
each parameter for the performance of the network
has to be set carefully. The weights are first
determined based on experience of the importance of
each parameter. Then, some experimentation is carried
out to determine the final values. An individual is
selected as a parent of the next generation according to
its fitness value: the probability that an individual is
chosen is proportional to that value. After this process,
the type of crossover and mutation has to be defined, as
well as the population size and the probabilities of
crossover and mutation. Some experiments must be carried
out to determine the most appropriate values for WSNs.



FUTURE TRENDS

Some of the recent research areas in wireless sensor
networks include the design of MAC protocols, efficient
routing, data aggregation, collaborative processing,
sensor fusion, security, localization, data reliability,
network management, etc. All these topics may benefit
from the use of genetic algorithms. Some research
has been done using genetic algorithms to solve some
WSN problems (Hussain, Matin & Islam, 2007) (Jin,
Liu, Hsu & Kao, 2005) (Ferentinos & Tsiligiridis, 2007)
(Wazed, Bari, Jaekel & Bandyopadhyay, 2007) (Rahmani,
Fakhraie, & Kamarei, 2006) (Qiu, Wu, Burns,
& Holzhauer, 2006). However, most WSN research
topics that could use genetic algorithms remain little
explored or completely unexplored.



CONCLUSION

This article discussed the application of genetic
algorithms in wireless sensor networks. The basic idea of
GAs was discussed and some specific considerations
for WSNs were made, including crossover, mutation
and the definition of the fitness function. The main
performance parameters may be divided into three groups:
energy, connectivity and application-specific. Since
WSNs have many objectives to be optimised, the GA is a
promising candidate for use in WSN design.



KEY TERMS

Cluster-Based Architecture: Sensor network
architecture in which communication occurs between
the sensors of a cluster and a selected cluster head that
collects the information obtained by the sensors in its
cluster.

Cluster Head: Sensor node responsible for gather- 
ing data of a sensor cluster and transmitting them to 
the sink node. 

Crossover: Genetic operator used to vary the pro- 
gramming of a chromosome or chromosomes from one 
generation to the next. 

Energy Parameters: Parameters that affect the 
battery consumption of the sensors, including the 
energy consumed due to sensing, communication and 
computational tasks. 

Fitness Function: A particular type of objective 
function that quantifies the optimality of a solution in 
a genetic algorithm. 

Genetic Algorithms: Search technique used in 
computing to find true or approximate solutions to 
optimization and search problems. 

Mutation: The occasional (low probability) altera- 
tion of a bit position. 

Network Lifetime: Time until the first sensor node 
or group of sensor nodes in the network runs out of 
energy. 

Sensor Node: Network node with components for 
sensing, data processing and communication. 

Wireless Sensor Networks: A network of spatially
distributed devices using sensors to monitor conditions
at different locations, such as temperature, sound,
pressure, etc.


INTRODUCTION

Fuzzy Logic (FL) and fuzzy sets, in a wide interpretation
of FL (in which fuzzy logic is coextensive with the
theory of fuzzy sets, that is, classes of objects in which
the transition from membership to non-membership is
gradual rather than abrupt), have placed modelling into
a new and broader perspective by providing innovative
tools to cope with complex and ill-defined systems. The
area of fuzzy sets emerged following the pioneering
works of Zadeh (Zadeh, 1965 and 1973), where the first
fundamentals of fuzzy systems were established.

Rule-based systems have been successfully used to
model human problem-solving activity and adaptive
behaviour. The conventional approaches to knowledge
representation are based on bivalent logic. A serious
shortcoming of such approaches is their inability to come
to grips with the issue of uncertainty and imprecision.
As a consequence, the conventional approaches do not
provide an adequate model for modes of reasoning that
are approximate rather than exact. Unfortunately, all
commonsense reasoning falls into this category.



The application of FL to rule-based systems leads
us to fuzzy systems. The main role of fuzzy sets is
to represent knowledge about the problem or to
model the interactions and relationships among the
system variables. There are two essential advantages
to designing rule-based systems with fuzzy sets
and logic:

• The key features of knowledge captured by fuzzy
  sets involve handling uncertainty.
• Inference methods become more robust and flexible
  with the approximate reasoning methods of fuzzy logic.

Genetic Algorithms (GAs) are a stochastic optimization
technique that mimics natural selection (Holland,
1975). GAs are intrinsically robust and capable of
determining a near-global optimal solution. The use
of GAs is usually recommended for optimization in
high-dimensional, multimodal complex search spaces
where deterministic methods normally fail. GAs explore
a population of solutions in parallel. The GA is a
searching process based on the laws of natural selection
and genetics. Generally, a simple GA contains three basic
operations: selection, genetic operations and replacement.
A typical GA cycle is shown in Fig. 1.

Figure 1. A typical GA cycle

In this paper it is shown how a genetic algorithm
can be used to optimize a fuzzy system which
is used in wave reflection analysis at submerged
breakwaters.



BACKGROUND

Much work has been done in the area of artificial
intelligence applied to Coastal Engineering, and
Artificial Intelligence methods have gained wide
acceptance among coastal and port engineers. Artificial
Neural Networks have been applied for years with very
good results. Their big drawback is their inability to
explain their results or how they reached them, because
they work as a black box and it cannot be known
what happens inside them. Over the last few years,
many works on fuzzy systems with engineering
applications have been developed (Mercan, Yagci &
Kabdasli, 2003; Dingerson, 2005; Gezer, 2004; Ross,
2004; Oliveira, Souza & Mandorino, 2006; Ergin,
Williams & Micallef, 2006; Yagci, Mercan, Cigizoglu
& Kabdasli, 2005). These systems have the advantage
that their solutions are easy to understand and that
they can handle uncertainty. However, most of them
run into a problem with knowledge extraction when their
Rule Base (RB) and Data Base (DB) have to be defined,
in many cases because of the difficulty of the problem
and more often because of the difficulty of representing
all the expert knowledge in a few rules and membership
functions.

To overcome these problems, Genetic Fuzzy Systems
(GFS) emerged, in which expert advice is not as
important as in a Fuzzy System (FS), since it may be
needed only to define the variables involved and
their working domains. GFS (Cordon, et al., 2001) allow
us to be less dependent on expert knowledge and, in
addition, make it easier to reach better accuracy, since
they can carry out a tuning process for the membership
functions and refine the rule set in order to optimize
it. A specific application of GFS to wave reflection
analysis at submerged breakwaters is presented in the
following sections.

While other kinds of techniques have been applied
to that problem (Taveira, 2005; Kobayasi & Wurjanto,
1989; Abul-Azm, 1993; Losada, Silva & Losada, 1999),
this is a novel approach to estimating the reflection
coefficient, since a GA determines the membership
functions for each variable involved in the fuzzy system.



ANALYSIS OF WAVE REFLECTION AT SUBMERGED BREAKWATERS
WITH A GENETIC FUZZY SYSTEM



Fuzzy rule-based systems can be used as a tool for
modelling non-linear systems, especially complex
physical systems. It is a well-known fact that the
breakwater damage ratio estimation process is dynamic
and non-linear, so classical methods are unable to capture
this behaviour, resulting in unsatisfactory solutions.

The Knowledge Base (KB) is the FS component that
comprises the expert knowledge about the problem. It is
therefore the only component of the FS that depends
on the concrete application, and the accuracy of the FS
depends directly on its composition. The KB comprises
two components: a Data Base (DB), containing the
definitions of the linguistic labels used in the fuzzy
rules, that is, the membership functions of the fuzzy
sets, and a Rule Base (RB), constituted by the collection
of fuzzy rules representing the expert knowledge.

Many tasks have to be performed in order to design
a concrete FS. As shown previously, the derivation of
the KB is the only one that depends directly on the
problem to be solved. The most widely used method for
this task is based on directly extracting the expert
experience from the human process operator. The problem
arises when experts are not able to express their
knowledge in terms of fuzzy rules. In order to avoid this
drawback, researchers have been investigating automatic
learning methods for designing FSs by automatically
deriving an appropriate KB for the FS without the need
for a human expert.

Genetic algorithms (GA) have proven to be a powerful
tool for automating the definition of the KB, since
adaptive control, learning and self-organization can in
many cases be considered as optimization or search
processes. Fuzzy systems that make use of GAs in their
design process are generically called GFSs.

These advantages have extended the use of GAs
to the development of a wide range of approaches
for designing FSs in recent years. It is possible to
distinguish three different groups of genetic FS design
processes according to the KB components included in
the learning process:

• Genetic definition of the Fuzzy System Data Base
  (Bolata and Nowe, 1995; Fathi-Torbaghan and
  Hildebrand, 1994; Herrera and Verdegay, 1995b;
  Karr, 1991b).
• Genetic derivation of the Fuzzy System Rule Base
  (Bonarini, 1993; Karr, 1991a; Thrift, 1991).
• Genetic learning of the Fuzzy System Knowledge
  Base (Cooper and Vidal, 1993; Herrera, Lozano
  and Verdegay, 1995a; Leitch and Probert, 1994;
  Lee and Takagi, 1993; Ng and Lee, 1994).

In this paper, we create a Fuzzy System which
predicts the reflection coefficient for different models
of submerged breakwaters. To do this, one part of the
Fuzzy System, the Data Base, is defined and tuned by
a Genetic Algorithm.



SUBMERGED BREAKWATER DOMAIN

Submerged breakwaters are effective shore protection
structures against wave action with a reduced visual
impact (see fig. 2).

To predict the reflection coefficient, several parameters
have to be taken into account:

• Rc: water level above crest,
• Hs: significant wave height,
• d: water depth,
• Tp: peak period, or
• Lp: peak wavelength.

These are the parameters that connect the submerged
breakwater model and the wave. The parameters that
identify the submerged breakwater model (see fig. 3)
are the height (h), the crest width (B), n (cotangent α),
the breakwater slope (α) and the slope nature (smooth or
rough). To predict the reflection coefficient, the former
were used, although in many cases dimensionless
parameters were used instead of the separate parameters.

Many tests were carried out with different numbers of
input variables and different numbers of fuzzy sets for
each variable. Depending on the number of variables
and membership functions, a set of rules was
established for each case.



PHYSICAL TEST

A large number of tests have been carried out
(Taveira-Pinto, 2001) with different water depths and wave
conditions for each model (figure 3 shows the general
layout of the tested models). Eight impermeable physical
models have been tested with different geometries
(crest width, slope), different slope natures (smooth,
rough), and different values of tan α (from 0.20 to 1.00)
and n (from 1 to 5) in the old unidirectional wave tank
of the Hydraulics Laboratory of the Faculty of Engineering
of the University of Porto.



Figure 2. Outline of a submerged breakwater and its action (label: wave direction)

Figure 3. Diagram of interesting variables taken into account in a submerged breakwater




GENETIC FUZZY SYSTEM 

The target of the GA is to find the best distribution of
the membership functions (an optimization task) inside
the domain of each variable, so that it minimizes the
error of the resulting fuzzy system when it is applied to
the training set.

Genome Encoding 

Each individual of the GA represents the Data Base
of the fuzzy system, that is, all the membership
functions. Each gene contains the position of one point
of one membership function. As can be seen in fig. 4,
one variable X with all its fuzzy sets is coded as a
chain of real numbers.

The codification used allows different kinds of
membership functions (triangular, trapezoidal, Gaussian,
etc.) by encoding their representative points in the
chromosome, so the resulting chromosome is of variable
size.
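The encoding described above can be illustrated with a short Python sketch that splits a flat chain of real-valued genes back into membership functions. The `layout` argument, the function names and the example values are assumptions made for illustration; the article does not specify how the chromosome layout is stored.

```python
def decode(chromosome, layout):
    """Split a flat chain of real genes into membership functions per variable.

    layout maps a variable name to (number_of_fuzzy_sets, points_per_set),
    e.g. 3 points for a triangle or 4 for a trapezoid (hypothetical convention).
    """
    database, pos = {}, 0
    for name, (n_sets, n_points) in layout.items():
        sets = []
        for _ in range(n_sets):
            sets.append(tuple(chromosome[pos:pos + n_points]))
            pos += n_points
        database[name] = sets
    if pos != len(chromosome):
        raise ValueError("chromosome length does not match layout")
    return database


def trapezoid(x, a, b, c, d):
    """Membership degree of x for a trapezoid with feet a, d and shoulders b, c."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


# One variable X with two trapezoidal fuzzy sets (hypothetical gene values)
genes = [0.0, 0.0, 1.0, 2.0, 1.0, 2.0, 3.0, 3.0]
db = decode(genes, {"X": (2, 4)})
print(db["X"], trapezoid(1.5, *db["X"][0]))
```

Because the number of points per set depends on the shape chosen, two databases with different shapes yield chromosomes of different lengths, which matches the variable-size remark above.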

Genetic Operators 

Genetic operators were restricted in order to generate
meaningful fuzzy systems.

• Crossover: The classical crossover operator, with
  one-point, n-point or uniform crossover, has to
  be limited in its possible cross points. To avoid
  meaningless membership functions, only the genetic
  material corresponding to whole variables may be
  exchanged.
• Mutation: When a mutation happens, the new
  value of the gene will lie between a lower and
  an upper limit, both worked out from the
  neighbouring points of the corresponding membership
  function and its neighbouring membership functions.
• Selection: The selection method is tournament
  selection with elitism (Blickle, 1997).
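The restricted operators above might look roughly like the following Python sketch. It makes simplifying assumptions that are not stated in the article: both parents share the same variable layout (given as `var_starts`, the gene index at which each variable begins), crossover is one-point, and the mutation limits are taken only from the adjacent points of the same membership function and the variable's domain, ignoring neighbouring membership functions.

```python
import random


def crossover_whole_variables(parent_a, parent_b, var_starts, rng):
    """One-point crossover whose cut may fall only where a variable starts."""
    cut = rng.choice(var_starts[1:])  # never cut inside a variable's genes
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate_point(points, i, domain, rng):
    """Resample point i of one membership function between its neighbours."""
    lower = points[i - 1] if i > 0 else domain[0]
    upper = points[i + 1] if i < len(points) - 1 else domain[1]
    mutated = list(points)
    mutated[i] = rng.uniform(lower, upper)
    return mutated


rng = random.Random(0)
# Three variables of eight genes each (hypothetical layout)
var_starts = [0, 8, 16]
child_1, child_2 = crossover_whole_variables([0.0] * 24, [1.0] * 24, var_starts, rng)
new_points = mutate_point([0.2, 0.4, 0.6, 0.8], 1, (0.0, 1.0), rng)
print(child_1.index(1.0), new_points)
```

Bounding the mutated value by its neighbours keeps the points of a membership function in increasing order, so the shape stays meaningful after mutation.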

Fitness 

The fitness function determines which individual is
better than another. In this case, one individual
represents one part of a fuzzy system (the DB), and
together with the rest of the fuzzy system (a static RB)
the fitness of that individual can be calculated. To that
end, the set of physical tests is split into two sets: one
is used as a training set and the other as a test set. For
each physical test of the training set, the corresponding
values of the input variables are fed into the fuzzy
system (an individual in the genetic population). Once
the output has been calculated with a Mamdani (Mamdani,
1977) strategy and a centroid defuzzification method,
the result is compared to the output of the physical test.
The difference is accumulated over all tests in the
training set, and once all tests have been fed into the
fuzzy system (one individual from the GA) and their
errors calculated, the sum of the errors is the fitness
function value for the individual. The smaller the total
error, the better the individual.

Figure 4. Piece of a chromosome. X contains the position of one point (i) of one membership function (j)
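The error accumulation described above can be sketched as a small Python function. It is a sketch under assumptions: the absolute difference is used as the per-test error (the article only says that the difference is accumulated), `predict` stands for any decoded fuzzy system, and the training data and the two candidate predictors are invented for illustration.

```python
def accumulated_error(predict, training_set):
    """Sum of absolute differences between predicted and measured outputs."""
    return sum(abs(predict(inputs) - measured) for inputs, measured in training_set)


# Hypothetical training data: ((Rc/Hs, d/Lp), measured reflection coefficient)
training_set = [((0.1, 0.2), 0.60), ((0.5, 0.3), 0.40), ((0.9, 0.1), 0.30)]


# Two hypothetical stand-ins for decoded individuals
def candidate_a(inputs):
    return 0.5


def candidate_b(inputs):
    return 0.65 - 0.4 * inputs[0]


errors = {fn.__name__: accumulated_error(fn, training_set)
          for fn in (candidate_a, candidate_b)}
best_name = min(errors, key=errors.get)  # smaller total error is better
print(errors, best_name)
```

Since a smaller value is better here, a GA that maximizes fitness would need to negate or invert this quantity before selection.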

Results 



Good results were obtained (from 85% to 95% success) in
the different tests carried out. The tests differ from one
another in the number of input variables and the number
of rules, as well as in the genetic algorithm parameters.
One easily understood test is explained below:

• Selected dimensionless parameters: Rc/Hs and d/Lp.
• Both input variables were split into two (Low and
  High) trapezoidal membership functions.
• The output variable Cr (reflection coefficient) was
  split into three (Low, Medium and High) trapezoidal
  membership functions.
• The rule set was made up of three rules:
  o If (Rc/Hs = Low) and (d/Lp = Low) then (Cr = High)
  o If (Rc/Hs = Low) and (d/Lp = High) then (Cr = Medium)
  o If (Rc/Hs = High) and (d/Lp = Low) then (Cr = Medium)

The training set was made up of 24 physical tests and the
mean square error in that step was 0.84. The resultant
membership functions can be seen in fig. 5. The test set
was made up of 11 physical tests and the mean square
error in that step was 0.89.
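The three-rule example described above can be sketched as a small Mamdani system with centroid defuzzification. The sketch assumes min for the "and" connective and for implication, max for aggregation, and a sampled output universe. All trapezoid breakpoints are hypothetical (in the article they are what the GA tunes), and the inputs are assumed to be rescaled to the interval [0, 1].

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x for a trapezoid with feet a, d and shoulders b, c."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


# Hypothetical breakpoints; inputs are assumed rescaled to [0, 1]
INPUT_SETS = {"Low": (0.0, 0.0, 0.3, 0.7), "High": (0.3, 0.7, 1.0, 1.0)}
CR_SETS = {"Low": (0.0, 0.0, 0.2, 0.4),
           "Medium": (0.2, 0.4, 0.6, 0.8),
           "High": (0.6, 0.8, 1.0, 1.0)}
# (Rc/Hs label, d/Lp label, Cr label), as in the rule set above
RULES = [("Low", "Low", "High"), ("Low", "High", "Medium"), ("High", "Low", "Medium")]


def predict_cr(rc_hs, d_lp, steps=200):
    """Mamdani inference (min/max) followed by centroid defuzzification."""
    strengths = [(min(trapezoid(rc_hs, *INPUT_SETS[r]),
                      trapezoid(d_lp, *INPUT_SETS[d])), out)
                 for r, d, out in RULES]
    num = den = 0.0
    for k in range(steps + 1):
        y = k / steps
        # Clip each consequent by its rule strength, then aggregate with max.
        mu = max(min(w, trapezoid(y, *CR_SETS[out])) for w, out in strengths)
        num += y * mu
        den += mu
    return num / den if den > 0 else None  # None when no rule fires


print(predict_cr(0.0, 0.0), predict_cr(0.0, 1.0), predict_cr(1.0, 1.0))
```

With only these three rules, an input that is High in both variables fires no rule, so the sketch returns `None` in that case rather than inventing an output.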



FUTURE TRENDS 

A future line of work is to give the GA the capacity to
optimize rules, so that the system definition becomes
easier and better results can be reached. The GA must be
able to generate individuals with different numbers of
rules and different kinds of rules, while these
individuals also represent different membership functions.



CONCLUSION 

• A Genetic Fuzzy System was developed to
  estimate the wave reflection coefficient at
  submerged breakwaters.
• Good results were obtained (close to 90% accuracy),
  but the systems giving better results (close to 97%
  accuracy) are difficult to interpret within fuzzy theory.
• Choosing the rule set is a hard task and, furthermore,
  the system's accuracy depends heavily on this set.
• The more inputs the problem has, the more
  difficult it becomes to define the rule set.

Figure 5. Resultant membership functions from tuning process of a DB by GA



KEY TERMS

Fuzzification: Establishes a mapping from crisp 
input values to fuzzy set defined in the universe of 
discourse of that input. 

Fuzzy System (FS): Any FL-based system, which 
either uses FL as the basis for the representation of dif- 
ferent forms of knowledge, or to model the interactions 
and relationships among the system variables. 

Genetic Algorithm: General-purpose search algorithms
that use principles inspired by natural population
genetics to evolve solutions to problems.

Genetic Fuzzy System: A fuzzy system that is 
augmented with an evolutionary learning process. 



Mamdani Fuzzy Rule-Based System: A rule-based
system where fuzzy logic (FL) is used as a tool for
representing different forms of knowledge about the
problem at hand, as well as for modelling the interactions
and relationships that exist between its variables.

Mamdani Inference System: Derives the fuzzy
outputs from the input fuzzy sets according to the
relation defined through fuzzy rules. Establishes a
mapping between fuzzy sets $U = U_1 \times U_2 \times \ldots \times U_n$
in the input domain of $X_1, \ldots, X_n$ and fuzzy sets $V$ in
the output domain of $Y$. The fuzzy inference scheme
employs the generalized modus ponens, an extension
of the classical modus ponens (Zadeh, 1973).

Takagi-Sugeno-Kang Fuzzy Rule-Based System:
A rule-based system whose antecedent is composed of
linguistic variables and whose consequent is represented
by a function of the input variables.


INTRODUCTION

Evolutionary computation (EC) is the study of
computational systems that borrow ideas from and are
inspired by natural evolution and adaptation (Yao &
Xu, 2006, pp. 1-18). EC covers a number of techniques
based on evolutionary processes and natural selection:
evolutionary strategies, genetic algorithms and genetic
programming (Keedwell & Narayanan, 2005).

Evolutionary strategies are an approach for effi- 
ciently solving certain continuous problems, yielding 
good results for some parametric problems in real 
domains. Compared with genetic algorithms, evolu- 
tionary strategies run more exploratory searches and 
are a good option when applied to relatively unknown 
parametric problems. 

Genetic algorithms emulate the evolutionary process
that takes place in nature. Individuals compete for
survival by adapting as best they can to the environmental
conditions. Crossovers between individuals, mutations
and deaths are all part of this process of adaptation. By
replacing the natural environment with the problem
to be solved, we get a computationally cheap method
that is capable of dealing with any problem, provided
we know how to determine individuals' fitness
(Manrique, 2001).

Genetic programming is an extension of genetic
algorithms (Couchet, Manrique, Rios & Rodriguez-Paton,
2006). Its aim is to build computer programs
that are not expressly designed and programmed by a
human being. It can be said to be an optimization
technique whose search space is composed of all possible
computer programs for solving a particular problem.
Genetic programming's key advantage over genetic
algorithms is that it can handle individuals (computer
programs) of different lengths.

Grammar-guided genetic programming (GGGP) 
is an extension of traditional GP systems (Whigham, 
1995, pp. 33-41). The difference lies in the fact that 
they employ context-free grammars (CFG) that gen- 
erate all the possible solutions to a given problem as 
sentences, establishing this way the formal definition of 
the syntactic problem constraints, and use the deriva- 
tion trees for each sentence to encode these solutions 
(Dounias, Tsakonas, Jantzen, Axer, Bjerregard & von 
Keyserlingk, 2002, pp. 494-500). The use of this
type of syntactic formalism helps to solve the so-called
closure problem (Whigham, 1996). To achieve closure,
valid individuals (points that belong to the search
space) should always be generated. As the generation 
of invalid individuals slows down convergence speed a 
great deal, solving this problem will very much improve 
the GP search capability. The basic operator directly 
affecting the closure problem is crossover: crossing 
two (or any) valid individuals should generate a valid 
offspring. Similarly, this is the operator that has the 
biggest impact on the process of convergence towards 
the optimum solution. Therefore, this article reviews 
the most important crossover operators employed in 
GP and GGGP, highlighting the weaknesses existing 
nowadays in this area of research. We also propose a 
GGGP system. This system incorporates the original 
idea of employing ambiguous CFG to overcome these 
weaknesses, thereby increasing convergence speed and 
reducing the likelihood of trapping in local optima. 
Comparative results are shown to empirically
corroborate our claims.

BACKGROUND

Koza defined one of the first major crossover operators 
(KX) (1992). This approach randomly swaps subtrees 
in both parents to generate offspring. Therefore, it 
tends to disaggregate the so-called building blocks 
across the trees (that represent the individuals). The 
building blocks are those subtrees that improve fitness. 
This over-expansion has a negative effect on the fit- 
ness of the individuals. Also, this operator's excessive 
exploration capability leads to another weakness: an 
increase in the size of individuals, which affects system 
performance, and results in a lower convergence speed 
(Terrio & Heywood, 2002). This effect is known as
bloat or code bloat. 

There is another important drawback: many of 
the generated offspring are syntactically invalid as 
the crossovers are done completely at random. These 
individuals should not be part of the new population 
because they do not provide a valid solution. This 
seriously undermines the convergence process. Figure 
1 shows a situation where one of the two individuals 
generated after Koza's crossover breaches the con- 
straints established by a hypothetical grammar whose 
sentences represent arithmetic equalities. 
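The unrestricted subtree swap described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list tree encoding and the helper names (`paths`, `get`, `put`, `koza_crossover`) are hypothetical and are not taken from Koza's work. Because the crossover nodes are drawn without looking at what they contain, nothing prevents an offspring such as an equality nested inside an operand.

```python
import copy
import random

# Assumed encoding: a tree is [label, child_1, ..., child_k]; a leaf is [label].

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    for i, child in enumerate(tree[1:], start=1):
        yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following the child indices in path."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    """Return a copy of tree in which the node at path is replaced by subtree."""
    if not path:
        return copy.deepcopy(subtree)
    new_tree = copy.deepcopy(tree)
    parent = get(new_tree, path[:-1])
    parent[path[-1]] = copy.deepcopy(subtree)
    return new_tree

def koza_crossover(parent1, parent2, rng=random):
    """Swap two subtrees chosen completely at random, one from each parent."""
    node1 = rng.choice(list(paths(parent1)))
    node2 = rng.choice(list(paths(parent2)))
    child1 = put(parent1, node1, get(parent2, node2))
    child2 = put(parent2, node2, get(parent1, node1))
    return child1, child2

# Two parents encoding the arithmetic equalities 2 + 3 = 5 and 4 = 4.
p1 = ["=", ["+", ["2"], ["3"]], ["5"]]
p2 = ["=", ["4"], ["4"]]
offspring = koza_crossover(p1, p2, random.Random(1))
```

The swap preserves the total number of nodes across the two offspring, but it gives no guarantee that either offspring is still a sentence of the grammar.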

The strong context preservative crossover operator
(SCPC) avoids the disaggregation of building blocks
(also called context) across the trees by setting
severe (strong) constraints on the tree nodes that may
be selected as crossover nodes (D'haesler, 1994,
pp. 379-407). A system of coordinates is defined to
univocally identify each node in a derivation tree.
The position of each node within the tree is specified
along the path that must be followed to reach it from
the root. To do this, the position of a node is
described by means of a tuple of n coordinates
$T = (b_1, b_2, \dots, b_n)$, where n is the node's depth in the
tree, and $b_i$ indicates which branch is selected at depth i
(counting from left to right). Figure 2 shows an example 
representing this system of coordinates. 

Only nodes with the same coordinates from both 
parents can be swapped. For this reason, a subtree may 
possibly never migrate to another place in the tree. This 
limitation can cause serious search space exploration 
problems, as the whole search space cannot be covered 
unless each function and terminal appears at every pos- 
sible coordinate at least once in any one individual in 
the population. This failure to migrate building blocks
causes them to evolve separately in each region, which
gives the operator too great an exploitation capability and
increases the likelihood of trapping in local optima (Barrios,
Carrascal, Manrique & Rios, 2003, pp. 275-293). 

Figure 1. Incorrect operation of Koza's crossover operator

Figure 2. The system of coordinates defined in SCPC

As time moves on, the code bloat phenomenon
becomes a serious problem and takes an ever more
prominent role. To avoid this, Crawford-Marks &
Spector (2002) developed the Fair crossover (pp.
733-739). This is a modified version of the approach
proposed by Langdon (1999, pp. 1092-1097). Tree size
is controlled as follows. First, a crossover node in the
first parent is selected at random and the length, $l$, of
the subtree extending from the node to the leaves is
calculated. Then, a node is also selected at random in
the second parent, and the length, $l_2$, of this second
subtree is calculated. If $l_2$ is within the range
$[l - l/4,\ l + l/4]$, then the crossover node for the second
parent is accepted, and the two subtrees are swapped. If not,
another crossover node is selected at random for the
second parent and the check is run again. This way, the
size of the subtree in the second parent to be swapped is
controlled and limited, so the code bloat phenomenon
is avoided. The range in which $l_2$ must fall can be
modified to suit specific problems better, but the range
originally proposed works well for most of them.
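A minimal sketch of this size check is given below. Two points are assumptions of the sketch rather than part of the original description: subtree length is measured as a node count, and a `max_tries` limit is added so that the retry loop always terminates. The function names are hypothetical.

```python
import random

# Assumed encoding: a tree is [label, child_1, ..., child_k]; a leaf is [label].

def size(tree):
    """Length of a subtree, measured here as its number of nodes (assumption)."""
    return 1 + sum(size(child) for child in tree[1:])

def subtrees(tree):
    """Yield every subtree of the tree, including the tree itself."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def in_fair_range(l, l2):
    """True when l2 lies in the interval [l - l/4, l + l/4]."""
    return l - l / 4 <= l2 <= l + l / 4

def pick_second_subtree(parent2, l, rng=random, max_tries=100):
    """Draw random nodes from the second parent until one passes the size check.

    Returns None if no acceptable node is found within max_tries draws
    (the limit is an addition of this sketch).
    """
    candidates = list(subtrees(parent2))
    for _ in range(max_tries):
        candidate = rng.choice(candidates)
        if in_fair_range(l, size(candidate)):
            return candidate
    return None
```

With l = 8, for instance, subtrees of 6 to 10 nodes are accepted and all others are rejected.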

Whigham proposed one of the most commonly used
operators (WX) in GGGP (1995, pp. 33-41). Because of
its sound performance in such systems, it has become the
de facto standard and is still in use today (Rodrigues &
Pozo, 2002, pp. 324-333), (Hussain, 2003), (Grosman &
Lewin, 2004, pp. 2779-2790). The algorithm works as
follows. First, as all the terminal symbols have at least
one non-terminal symbol above them, the crossover nodes
can, without loss of generality, be confined exclusively
to nodes containing non-terminal symbols. A non-terminal
node belonging to the first parent is selected at random.
Then a non-terminal node labeled with the same
non-terminal symbol as the first-chosen crossover node is
selected from the second parent. This assures that
generated individuals belong to the grammar-generated
language, as the crossed nodes share the same symbol.
This operator's main flaw is that there are other possible
choices of node in the second parent that are not explored
and that could lead to the target solution (Manrique,
Marquez, Rios & Rodriguez-Paton, 2005, pp. 252-261).
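The node selection just described can be sketched as follows. The encoding is an assumption of this sketch: a derivation tree is a nested list whose internal nodes carry non-terminal symbols and whose leaves carry terminals. The function names are hypothetical, and returning None when the second parent has no matching node is a choice made here for simplicity.

```python
import random

# Assumed encoding: [symbol, child_1, ..., child_k]; leaves hold terminals.

def nonterminal_nodes(tree, path=()):
    """Yield (path, symbol) for every internal (non-terminal) node."""
    if len(tree) > 1:
        yield path, tree[0]
        for i, child in enumerate(tree[1:], start=1):
            yield from nonterminal_nodes(child, path + (i,))

def wx_choose_nodes(parent1, parent2, rng=random):
    """Pick a non-terminal node in parent1, then a node with the same symbol in parent2."""
    path1, symbol = rng.choice(list(nonterminal_nodes(parent1)))
    matches = [p for p, s in nonterminal_nodes(parent2) if s == symbol]
    if not matches:
        return None  # no node in parent2 carries the chosen symbol
    return path1, rng.choice(matches), symbol

# Derivation trees for the toy grammar E ::= E + T | T, T ::= digit.
p1 = ["E", ["E", ["T", ["3"]]], ["+"], ["T", ["4"]]]
p2 = ["E", ["T", ["7"]]]
choice = wx_choose_nodes(p1, p2, random.Random(0))
```

Because both crossover nodes carry the same non-terminal, swapping the subtrees rooted at them keeps both offspring inside the language of the grammar.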



THE PROPOSED CROSSOVER 
OPERATOR FOR GGGP SYSTEMS 

The proposed operator is a general-purpose operator de- 
signed to work in any GGGP system. It takes advantage 
of the key feature that defines a CFG as ambiguous: the 
same sentence can be obtained by several derivation 
trees. This implies that there are several individuals 
representing the solution to a problem. It is therefore 
easier to find. This operator consists of eight steps: 

1. Choose a node, except the axiom, with a non- 
terminal symbol randomly from the first parent. 
This node is called crossover node and is denoted 
CN1. 

2. Choose the parent of CN1. As we are working 
with a CFG, this will be a non-terminal symbol. 
The right-hand sides of all its production rules 
are stored in the array R. 

3. The derivation produced by the parent of CN1 is
called main derivation, and is denoted A ::= C.
Calculate the derivation length l as the number
of symbols in the right-hand side of the main
derivation. Having l, the position (p) of CN1 in
the main derivation and C, define the three-tuple
T(l, p, C).

4. Delete from R all the right-hand sides with dif- 
ferent lengths from the main derivation. 

5. Remove from R all those right-hand sides in 
which there exists any difference between the 
symbols (except the one located in position p) 
in each right-hand side and the symbols in C. 

6. The set X is formed by all the symbols in the right- 
hand sides of R that are in position p. X contains 
all the non-terminal symbols of the second parent 
that can be chosen as a crossover node (CN2). 

7. Choose CN2 randomly from X, discarding all 
the nodes that will generate offspring trees with 
a size greater than a previously established value 
D. 







8. Calculate the two new derivation trees produced 
as offspring by swapping the two subtrees whose 
roots are CN1 and CN2. 

The underlying idea of this algorithm consists in
calculating which non-terminal symbols can substitute
the symbol contained in CN1 while keeping the production
rule that contains CN1 valid. Since all non-terminal
symbols that can generate valid production rules are
taken into account in the crossover process, this
operator takes advantage of ambiguous grammars.
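Steps 2 to 6 of the operator can be sketched as a single filtering function. The grammar representation (a dictionary from each non-terminal to its list of right-hand sides), the toy grammar and the function name are assumptions of this sketch, and the position p is counted from zero here.

```python
def candidate_symbols(grammar, parent_symbol, main_rhs, p):
    """Return the set X of non-terminals allowed as the root of CN2.

    grammar       -- dict: non-terminal -> list of right-hand sides (tuples)
    parent_symbol -- symbol A of the parent of CN1
    main_rhs      -- right-hand side C of the main derivation A ::= C
    p             -- zero-based position of CN1 inside C
    """
    l = len(main_rhs)                                    # step 3: derivation length
    R = list(grammar[parent_symbol])                     # step 2: all right-hand sides of A
    R = [rhs for rhs in R if len(rhs) == l]              # step 4: keep same length only
    R = [rhs for rhs in R                                # step 5: must match C except at p
         if all(rhs[i] == main_rhs[i] for i in range(l) if i != p)]
    return {rhs[p] for rhs in R if rhs[p] in grammar}    # step 6: non-terminals at p

# A small ambiguous toy grammar (hypothetical): E ::= E + E | E + T | T, T ::= n
grammar = {
    "E": [("E", "+", "E"), ("E", "+", "T"), ("T",)],
    "T": [("n",)],
}

# CN1 is the T at position 2 of the main derivation E ::= E + T.
X = candidate_symbols(grammar, "E", ("E", "+", "T"), 2)   # contains both E and T
```

In this example a same-symbol operator would only accept nodes labeled T in the second parent, whereas the set X also admits nodes labeled E, because E ::= E + E is a valid production as well. Step 7 would then discard any candidate whose swap produces a tree larger than D.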

The proposed crossover operator has primarily three 
attractive features: a) step 7 states a code bloat control 
mechanism, b) the offspring produced are always com- 
posed of two valid trees and c) step 6 indicates that all 
the possible nodes of the second parent that can generate 
valid individuals are taken into account, not only those 
nodes with the same non-terminal symbol as the one 
chosen for the first parent. This third feature increases 
the GGGP system's exploration capability, which avoids 
trapping in local optima and takes advantage of there 
being more than one derivation tree (potential solution 
to the problem) for a single sentence. 



Results 

We present and discuss the results achieved by the 
crossover operators described in the background sec- 
tion and the operator that we propose. To do so, we 
have tackled a complex classification problem: the 
real-world task of providing breast cancer prognosis 
(benign or malignant) from the morphological char- 
acteristics of microcalcifications. Microcalcifications 
are small mineral deposits in breast tissue that could 
constitute cancer. This experiment involved searching 
a knowledge base of fuzzy rules that could give such 
a prognosis. 

The data employed for giving a disease prognosis
are: patient's age, lesion size, lesion location in the
breast, and particular features of the microcalcifications:
number, distribution and type. Number indicates the
quantity of existing clustered microcalcifications,
distribution shows how they are clustered and type reflects
the individual morphology of the microcalcifications.
To run the tests, 365 microcalcifications were selected at
random. Of these, 315 lesions were randomly selected
for use as genetic programming system training cases
with the different crossover operators described. After
training, the fittest individual was selected to form a
knowledge base with the fuzzy rules encoded by this
individual. Then, the knowledge base was tested with
the 50 remaining lesions not chosen during the training
phase to output the number of correctly classified
patterns in what we have called the testing phase.

Figure 3. Average convergence speed for each crossover operator

The CFG employed was formed by 19 non-terminal
symbols, 54 terminals and 51 production rules, some of
them included to obtain an ambiguous grammar. The
population size employed was 1000, and the upper bound
for the size of the derivation trees was set to 20. The
fitness function consisted of calculating the number
of well-classified patterns. Therefore, the greater the
fitness, the fitter the individual is, with the maximum
limit of 315 in the training phase and 50 in the test.

Figure 3 shows the average evolution process for 
each of the five crossover operators in the training 
phase after 100 executions. 

It is clear from Figure 3 that KX yields the worst 
results, because it maintains an over-diverse population 
and allows invalid individuals to be generated. This 
prevents it from focusing on one possible solution. 
The effect of Fair is just the opposite, leading very 
quickly to one of the optimal solutions (this is why it 
has a relatively high convergence speed initially), and 



slowing down if convergence is towards a local optimum 
(which happens in most cases). WX and SCPC produce 
good results, bettered only by the proposed crossover. 
Its high convergence speed evidences the benefits of 
taking into account all possible nodes of the second 
parent that can generate valid offspring. 

Table 1 shows examples of fuzzy rules output in one 
of the executions for the best two crossover operators 
— WX and the proposed operator — once the training 
phase was complete. 

Table 2 shows the average number (rounded up 
or down to the nearest integer) of correctly classified 
patterns after 100 executions, achieved by the best indi- 
vidual in the training and test phases, and the percentage 
of times that the system converged prematurely. 

KX again yields the worst results, correctly classifying
just 57.46% (181/315) of patterns in the training
phase and 54% (27/50) in the testing phase. SCPC and
Fair crossovers also return insufficient results: around
59% in the training phase and 54%-56% in the testing
phase, although, as shown in Figure 3, SCPC has a
higher convergence speed. Finally, note the similarity
between WX and the proposed operator. However, the
proposed operator has a higher speed of convergence
and is less likely to get trapped in local optima, as it
converged prematurely only twice in 100 executions.

Table 1. Some knowledge base fuzzy rules output by two GGGP systems

Crossover operator | Rule 1 | Rule 2
WX | IF NOT (type=branched) OR (number=few) THEN (prognosis=benign) |
Proposed | IF NOT (age=middle) AND NOT (location=subaerolar) AND NOT (type=oval) THEN (prognosis=malignant) | IF (type=heterogeneous) THEN (prognosis=malignant)

Table 2. Average number of correctly classified patterns and unsuccessful runs

Crossover operator | Training | Testing | Unsuccessful runs
KX | 181/315 (57.46%) | 27/50 (54%) | 36%
SCPC | 186/315 (59.04%) | 28/50 (56%) | 14%
Fair | 185/315 (58.73%) | 27/50 (54%) | 15%
WX | 191/315 (60.63%) | 30/50 (60%) | 8%
Proposed | 191/315 (60.63%) | 31/50 (62%) | 2%



can choose any node from the second parent to gener- 
ate the offspring, rather than just those nodes with the 
same non-terminal symbols as the one chosen in the 
first parent.

KEY TERMS

Ambiguous Grammar: Any grammar in which 
different derivation trees can generate the same sen- 
tence. 

Closure Problem: Phenomenon that involves al- 
ways generating syntactically valid individuals. 

Code Bloat: Phenomenon to be avoided in a genetic 
programming system convergence process involving the 
uncontrolled growth, in terms of size and complexity, 
of individuals in the population.

Convergence: Process by means of which an algorithm
(in this case an evolutionary system) gradually
approaches a solution. A genetic programming system 
is said to have converged when most of the individuals 
in the population are equal or when the system cannot 
evolve any further. 

Fitness: Measure associated with individuals in an 
evolutionary algorithm population to determine how 
good the solution they represent is for the problem. 

Genetic Programming: A variant of genetic al- 
gorithms that uses simulated evolution to discover 
functional programs to solve a task. 

Grammar-Guided Genetic Programming: The
application of analytical methods and tools to data
for the purpose of identifying patterns, relationships 
or obtaining systems that perform useful tasks such 
as classification, prediction, estimation, or affinity 
grouping. 

Intron: Segment of code within an individual 
(subtree) that does not modify the fitness, but is on the 
side of convergence process.

INTRODUCTION

It is well accepted that in many real-life situations
information is not certain and precise but rather uncertain
or imprecise. To describe uncertainty, probability theory
emerged in the 17th and 18th centuries. Bernoulli, Laplace
and Pascal are considered to be the fathers of probability
theory. Today probability can still be considered the
prevalent theory to describe uncertainty.

However, in the year 1965 Zadeh seemed to have 
challenged probability theory by introducing fuzzy sets 
as a theory dealing with uncertainty (Zadeh, 1965). 
Since then it has been discussed whether probability 
and fuzzy set theory are complementary or rather com- 
petitive (Zadeh, 1995). Sometimes fuzzy sets theory is 
even considered as a subset of probability theory and 
therefore dispensable. Although the discussion on the 
relationship of probability and fuzziness seems to have 
lost the intensity of its early years it is still continuing 
today. However, fuzzy set theory has established itself 
as a central approach to tackle uncertainty. For a discus- 
sion on the relationship of probability and fuzziness the 
reader is referred to e.g. Dubois, Prade (1993), Ross et 
al. (2002) or Zadeh (1995). 

In the meantime, further ideas on how to deal with
uncertainty have been suggested. For example, Pawlak
introduced rough sets at the beginning of the 1980s
(Pawlak, 1982), a theory that has attracted increasing
attention in recent years. For a comparison
of probability, fuzzy sets and rough sets the reader is
referred to Lin (2002).

Presently, research is being conducted to develop a
Generalized Theory of Uncertainty (GTU) as a framework
for any kind of uncertainty, whether it is based on
probability, fuzziness or other concepts (Zadeh, 2005).
Cornerstones of this theory are the concepts of
information granularity (Zadeh, 1979) and generalized
constraints (Zadeh, 1986).

In this context the term Granular Computing was
first suggested by Lin (1998a, 1998b); however, it still
lacks a unique and well-accepted definition. So,
for example, Zadeh (2006a) colorfully calls granular
computing "ballpark computing" or, more precisely, "a
mode of computation in which the objects of computation
are generalized constraints".



BACKGROUND 

Humans often speak and think in words rather than in 
numbers. For example, in summer we say that it is hot 
outside rather than that it is 35.32° Celsius. This means
that we often define our information as an imprecise 
perception-based linguistic variable rather than as a 
precise measure-based number. The impreciseness 
in our formulation basically has four reasons (Zadeh, 
2005): 

1. Bounded ability of human sensors and computational
limits of the brain. (1) Our human sensors
do not have the abilities of a laser-based speed
controller. So we cannot quantify the speed of a
racing car as 252.18 km/h in Albert Park, Melbourne.
However, on the linguistic level we can
define the car as fast. (2) Most people cannot
numerically calculate the exact race distance
given by 5.303 km * 58 laps = 307.574 km due
to computational limits of their brains. However,
they probably estimate that it will be around 300
km.

2. Lack of numerical information. Melbourne is 
considered as a shopping paradise in Australia 
since there are countless shops. Maybe only local 
government knows the exact number of shops. 

3. Qualitative, non-quantifiable information. Much
information is provided qualitatively rather than
quantitatively. If one describes the quality of a
pizza in an Italian restaurant in Lygon Street in
Melbourne's suburb Carlton, only a qualitative,
linguistic judgment like excellent or very good is
possible. The judgment can hardly be quantified
(besides a technical counting of the olives or the
weight of the salami etc.).

4. Tolerance for imprecision. Recall the example of
Melbourne as a shopping paradise given above.
To define Melbourne as a shopping paradise, its
exact number of shops is not needed. It is
sufficient to know that there are many shops. This
tolerance for imprecision often makes a statement
more robust and efficient in comparison to exact
numerical values.

So obviously humans often prefer not to deal with
precise information but favor the vague information
that is inherent in natural language.

Humans would rarely formulate a sentence like: 

With a probability of 97.34% I will see Ken, who has 
a height of 1.97m, at 12:05pm. 

Instead most humans would prefer to say: 

Around noon I will almost certainly meet tall Ken. 

While the first formulation is computer compatible
since it contains numbers (singletons), the second
formulation seems to be too imprecise to be used as
input for computers.



A central objective of the concept of granular com- 
puting is to bridge this gap and compute with words 
(Zadeh, 1996). This leads to the ideas of information 
granularity or granular computing which was introduced 
by Zadeh (1986, 1979). 

The concept of information granularity has its roots 
in fuzzy set theory (Zadeh, 1965, 1997). Zadeh (1986) 
advanced and generalized this idea so that granular 
computing subsumes any kind of uncertainty and im- 
precision like "set theory and interval analysis, fuzzy 
sets, rough sets, shadowed sets, probabilistic sets and 
probability [...], high level granular constructs" (Bar- 
giela, Pedrycz, 2002, p. 5). The term granular computing 
was first suggested by Lin (1998a, 1998b). 



FUNDAMENTALS OF GRANULAR 
COMPUTING 

Singular and Granular Values 

To more formally describe the difference between
natural language and precise information let us recall
the example sentences given in Section 2. The
information given in the two sentences can be mapped as
depicted in Figure 1.

Figure 1. Mapping of singletons and granular values

With a probability of 97.34% I will see Ken, who has a height of 1.97m, at 12:05pm.

Around noon I will almost certainly meet tall Ken.

Table 1. Singular and granular values

Variable | Singular Values | Granular Values
Probability | 97.34% | almost certainly
Height | 1.97m | tall
Time | 12:05pm | around noon

While the first sentence contains exact figures 
(singletons) the second sentence describes the same 
context using linguistic variables (granular values). 
A comparison of the singular and granular values is 
given in Table 1. 

For example, the variable height can be mapped to
the singleton 1.97m or the granule tall. The granule tall
covers not only the singleton 1.97m but also neighborhood
values. See Figure 2 for an interval granulation of
the singleton of the variable height; a fuzzy membership
function (linguistic variable) would be another
possibility for a granule of tall (see Figure 3).

The main difference in the representation of the
variable height is entailed by a different formulation of
the constraints. While the formulation as a singleton is
bivalent in nature (height = 1.97m), a fuzzy formulation
would contain memberships. This leads to the concept
of generalized constraints.
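The three representations of the variable height can be contrasted in a short sketch. All numeric bounds below (the interval from 1.85 m to 2.10 m and the ramp from 1.70 m to 1.90 m) are illustrative assumptions and are not taken from the figures; the function names are hypothetical.

```python
def singleton(height, value=1.97):
    """Bivalent constraint: satisfied only by the exact value."""
    return 1.0 if height == value else 0.0

def interval_granule(height, low=1.85, high=2.10):
    """Crisp granule: every height inside the interval counts fully as tall."""
    return 1.0 if low <= height <= high else 0.0

def fuzzy_granule(height, start=1.70, full=1.90):
    """Fuzzy granule: membership rises linearly from 0 at start to 1 at full."""
    if height <= start:
        return 0.0
    if height >= full:
        return 1.0
    return (height - start) / (full - start)
```

Under these assumed bounds a height of 1.80 m is excluded by both the singleton and the interval, while the fuzzy granule assigns it a membership of about 0.5.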



Figure 2. Presentation of variable height as singleton and granule

Figure 3. Fuzzy memberships

Generalized Constraints

Overview of Constraints 

The generalization of constraints is a central concept in
granular computing. The main purpose is to make classic
constraints like $\in$ (member), = (equal), < (smaller)
and > (greater) more flexible and therefore closer to
the way humans think. In the following subsections we
will discuss standard, primary and general constraints
in more detail.

Basic Concept of Generalized Constraints 

Standard Constraints. A standard constraint C is
characterized by its bivalent (possibilistic or veristic)
or probabilistic nature. Bivalent and probabilistic
constraints do not have membership degrees that
indicate the degree of satisfaction of the constraint: a
variable X does or does not fulfill the standard constraint.
Examples of bivalent constraints are $\in$ (member), =
(equal), < (smaller) and > (greater), among others.

Primary Constraints. Zadeh (2006a) suggested 
the following primary constraints: 

• Possibilistic (r=blank)
• Probabilistic (r=p)
• Veristic (r=v)

since they formulate the basic perceptions possibility,
likelihood and truth. In contrast to the standard
constraints, bivalency is no longer required for the
possibilistic and veristic constraints. Therefore standard
constraints are included in the primary constraints.

Applying the primary constraints to the second
"Ken sentence" of Section 2 we get:

• Possibilistic Constraint (X is R): Ken is tall →
Height(Ken) is tall (see Dubois, Prade (1998)
for semantics of fuzzy sets including possibility
(Zadeh, 1978)).
• Probabilistic Constraint (X isp R): Actual arrival
time (X) at meeting point → X isp $N(\mu, \sigma^2)$; X is, e.g.,
normally distributed around the agreed meeting
time $\mu$.
• Veristic Constraint (X isv R): Ken is at the meeting
point at 12:05pm → Present(Ken, meeting point)
isv 12:05pm.



Generalized Constraints. Further constraints include
(Zadeh, 2005) usuality (r=u), random set (r=rs),
fuzzy graph (r=fg), bimodal (r=bm) and group (r=g).
The set of general constraints consists of these and
the primary constraints. So, formally a generalized
constraint (GC) is given by (Zadeh, 2005):

GC(X): X isr R

with X the constrained variable and R the non-bivalent
relation. In the term isr the letter r defines the semantics
or the modality of the constraint as described above.

Generalized Constraint Language 

To formally describe generalized constraints Zadeh
(2006b) suggests a Generalized Constraint Language
(GCL). In Section 3.2.2 we already used the GCL in
the presented example, e.g. the mapping: Ken is tall
→ Height(Ken) is tall, which has the form

p → X isr R

with p an expression in natural language. In this
context Zadeh (2006b) defines the translation of natural
language into GCL as precisiation. The precisiation
can lead to v-precise and/or m-precise results:

• v-precisiation: a precise value is obtained.
v-precisiation has s-precisiation (singleton),
cg-precisiation (crisp granular) and g-precisiation
(granular) as its modalities. s-precisiation leads
to a singleton, while cg-precisiation leads to a
crisp interval. g-precisiation is the most general
form of precisiation and leads to fuzzy intervals,
fuzzy graphs and others.
• m-precisiation: a precise meaning is obtained.
m-precisiation can be further divided into the
modalities mm-precisiation (machine-oriented) and
mh-precisiation (human-oriented).

Examples: (1) Ken is between a and b meters tall
is m-precise and, since the variables a and b are not
specified, v-imprecise. (2) Ken is approximately c meters
tall → Ken is a meters tall is an s-precisiation. The term
approximately c can also be abbreviated as c*. The star
indicates that c is a granular value.

Figure 4. Rough sets

In contrast to precisiation, granulation leads to
an imprecisiation of the information. Obviously the
translation Ken is 1.97m → Ken is c meters tall is a
v-imprecisiation and Ken is c meters tall → Ken is tall
an m-imprecisiation.

So for example, rough sets can be interpreted
as cascading cg-imprecisiation. In rough set theory
(Pawlak, 1982) a set is described by a lower and an upper
approximation ($L_A$ and $U_A$ respectively). The lower
approximation is a subset of the upper approximation.
While the objects in the lower approximation surely
belong to the corresponding set, the objects in the upper
approximation might belong to the set.

Therefore rough set theory provides an example of
a cascading granulation: $X \subseteq L_A \subseteq U_A$ (see Figure 4).
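The two approximations can be computed directly once the universe is partitioned into classes of mutually indiscernible objects. The sketch below uses the standard rough set construction; the function name and the example data are assumptions made for illustration.

```python
def approximations(classes, target):
    """Return the (lower, upper) approximation of target.

    classes -- iterable of sets; each set groups objects that cannot be
               told apart from each other
    target  -- the set to be approximated
    """
    target = set(target)
    lower, upper = set(), set()
    for block in classes:
        block = set(block)
        if block <= target:      # whole class inside target: surely belongs
            lower |= block
        if block & target:       # class overlaps target: possibly belongs
            upper |= block
    return lower, upper

# Objects 3 and 4 are indiscernible, but only 3 is in the target set.
lower, upper = approximations([{1, 2}, {3, 4}, {5}], {1, 2, 3})
```

Here the lower approximation is {1, 2} and the upper approximation is {1, 2, 3, 4}: object 4 only might belong to the set because it cannot be distinguished from object 3.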

Deduction Rules 

Principal Deduction Rules 

In this section we take the term granular computing
literally: how to compute with granules. We focus on
the principal deduction rules (Zadeh, 2005, 2006b):

• Conjunction
• Projection
• Propagation

For more details on deduction rules the reader is
referred to Zadeh (2005, 2006b).



Generalized Extension Principle 

One of the most fundamental theorems in fuzzy logic is
the Extension Principle (Zadeh, 1975, Zimmermann,
2001). Basically the Extension Principle defines how
the memberships $\mu_Y(Y)$ of an endogenous variable

$Y = f(X)$

can be determined with X and Y singletons and $\mu_X(X)$
given. A simple transformation $\mu_Y(Y) = \mu_Y(f(X)) = \mu_X(X)$
does not generally provide a unique solution, because
several values of X may be mapped to the same Y. Therefore,
to obtain a unique solution, the supremum is taken:
$\mu_Y(Y) = \sup_{X:\, f(X) = Y} \mu_X(X)$.
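For a fuzzy set with finitely many elements the supremum reduces to a maximum, which makes the Extension Principle easy to sketch. The dictionary representation and the function name `extend` are assumptions of this illustration.

```python
def extend(mu_x, f):
    """Image of a discrete fuzzy set under f according to the Extension Principle.

    mu_x -- dict mapping each value x to its membership degree
    f    -- crisp function applied to the values
    """
    mu_y = {}
    for x, degree in mu_x.items():
        y = f(x)
        # Several x may map to the same y; keep the largest membership.
        mu_y[y] = max(mu_y.get(y, 0.0), degree)
    return mu_y

# Fuzzy set "about zero" and the mapping Y = X squared.
about_zero = {-1: 0.5, 0: 1.0, 1: 0.8, 2: 0.3}
squared = extend(about_zero, lambda x: x * x)
```

Both -1 and 1 are mapped to 1, so the membership of 1 in the result is the larger of 0.5 and 0.8.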

The Generalized Extension Principle (Zadeh, 2006a)
establishes a relationship between

Y* = f*(X*)

Gr(Y) isr Gr(X)

with Y*, X* and f*(·) granules. It can be considered a
primary deduction rule since many other deduction
rules can be derived from it (Zadeh, 2006b).

Example 

Let us consider an example (Zadeh, 2005, 2006a,
2006b):

The following linguistic statement is given:

Most Swedes are tall → (Height(Swedes) are tall) is
most.

First let us specify

Swedes are tall → $\int \lambda(h)\,\mu_{tall}(h)\,dh$

with $\lambda(h)$ the height density function and $\mu_{tall}(h)$ the
membership function for the linguistic variable tall.

Second we have to apply the linguistic variable most
to the expression Swedes are tall and obtain:

Most (Swedes are tall) → $\mu_{most}\left(\int \lambda(h)\,\mu_{tall}(h)\,dh\right)$

As a result we get a precise formulation of the given
linguistic statement.
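The precisiated statement can be evaluated numerically once concrete functions are chosen. Everything concrete in the sketch below is an assumption made for illustration: a normal height density with mean 1.85 m and standard deviation 0.07 m, a membership function for tall that rises linearly between 1.70 m and 1.90 m, a membership function for most that rises linearly between 50% and 90%, and a midpoint rule for the integral.

```python
import math

def ramp(x, start, full):
    """Piecewise linear membership: 0 up to start, 1 from full onwards."""
    if x <= start:
        return 0.0
    if x >= full:
        return 1.0
    return (x - start) / (full - start)

def height_density(h, mean=1.85, sd=0.07):
    """Assumed normal density of heights in meters."""
    return math.exp(-0.5 * ((h - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mu_tall(h):
    """Assumed membership function of the linguistic variable tall."""
    return ramp(h, 1.70, 1.90)

def mu_most(proportion):
    """Assumed membership function of the linguistic variable most."""
    return ramp(proportion, 0.50, 0.90)

def proportion_tall(lo=1.0, hi=2.6, steps=10000):
    """Midpoint-rule approximation of the integral of density times mu_tall."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        h = lo + (k + 0.5) * width
        total += height_density(h) * mu_tall(h) * width
    return total

def truth_of_most_are_tall():
    """Degree to which "most Swedes are tall" holds under the assumptions above."""
    return mu_most(proportion_tall())
```

The result is a single degree between 0 and 1, which changes with the chosen density and membership functions; this dependence is what makes the formulation precise in meaning but still context dependent.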
CONCLUSION AND FUTURE
RESEARCH

Granular Computing is a powerful framework to deal
with uncertainty. Information granules can include
probabilistic as well as possibilistic phenomena, among
others. Therefore granular computing functions
as an umbrella for them without competing with them.
One core advantage is that it helps to bridge the gap
between (imprecise) natural language and the precision
that is inherent in computers. Presently Zadeh is
promoting his idea of a Generalized Theory of
Uncertainty in many publications and presentations.
In future the Generalized Theory of Uncertainty will
probably be the dominant label for anything related to
this topic. Since the Generalized Theory of Uncertainty
is a young but rapidly emerging branch of science,
future research will go in the direction of the
generalization of uncertainty concepts, e.g. from probabilistic
and fuzzy clustering towards granular clustering.

KEY TERMS

Fuzzy Set Theory: Fuzzy set theory was intro- 
duced by Zahed in 1965. The central idea of fuzzy set 
theory is that an object belongs to more than one sets 
simultaneously, the closeness of the object to a set is 
indicated by membership degrees. 

Generalized Theory of Uncertainty (GTU): GTU
is a framework intended to subsume any kind of
uncertainty (Zadeh, 2006a). The core idea is to formulate
generalized constraints (possibilistic, probabilistic,
veristic, etc.). The objective of GTU is not to replace
existing theories like probability or fuzzy sets but to
provide an umbrella that allows any kind of uncertainty
to be formulated in a unified way.

Granular Computing: The idea of granular computing
goes back to Zadeh (1979). The basic idea of
granular computing is that an object is described by a
group of values in possible dimensions such as
indistinguishability, similarity and proximity. If a granule is
labeled by a linguistic expression it is called a linguistic
variable. Zadeh (2006a) defines granular computing
as "a mode of computation in which the objects of
computation are generalized constraints".



Hybridization: Combination of methods like 
probabilistic, fuzzy, rough concepts, or neural nets, 
e.g. fuzzy-rough, rough-fuzzy or probabilistic-rough, 
or fuzzy-neural approaches. 

Linguistic Variable: A linguistic variable is a
linguistic expression (one or more words) labeling an
information granule. For example, a membership function
is labeled by expressions like "hot temperature"
or "rich customer".

Membership Function: A membership function
gives the membership degrees of a variable to a certain
set. For example, a temperature t = 30°C belongs
to the set "hot temperature" with a membership degree
$X_{HT}(30°) = 0.8$. Membership functions are not
objective but context- and subject-dependent.

Rough Set Theory: Rough set theory was introduced
by Pawlak in 1982. The central idea of rough
sets is that some objects are distinguishable while others
are indiscernible from each other.

Soft Computing: In contrast to "hard computing",
soft computing is a collection of methods (fuzzy sets,
rough sets, neural nets, etc.) for dealing with ambiguous
situations such as imprecision and uncertainty, e.g. human
expressions like "high profit at reasonable risks". The
objective of applying soft computing is to obtain robust
solutions at reasonable costs.

INTRODUCTION

Currently, many research areas produce large
multivariable datasets that are difficult to visualize
in order to extract useful information. Kohonen
self-organizing maps have been used successfully in the
visualization and analysis of multidimensional data.
In this work, a projection technique that compresses 
multidimensional datasets into two dimensional space 
using growing self-organizing maps is described. With 
this embedding scheme, traditional Kohonen visualiza- 
tion methods have been implemented using growing 
cell structures networks. New graphical map displays 
have been compared with Kohonen graphs using two 
groups of simulated data and one group of real multi- 
dimensional data selected from a satellite scene. 



BACKGROUND 

The first stage of data mining usually consists of building
simplified global overviews of data sets, generally in
graphical form (Tukey, 1977). At present, the huge
amount of information and its multidimensional
nature make it difficult to employ direct
graphic representation techniques. Self-Organizing
Maps (Kohonen, 1982) fit well in exploratory data
analysis, since their principal purpose is the visualization
and analysis of nonlinear relations between
multidimensional data (Rossi, 2006). In this sense, a great
variety of visualization techniques for Kohonen's SOM
(Kohonen, 2001) (Ultsch & Siemon, 1990) (Kraaijveld,
Mao & Jain, 1995) (Merkl & Rauber, 1997) (Rubio &
Gimenez, 2003) (Vesanto, 1999), and some automatic
map analysis methods (Franzmeier, Witkowski & Rückert, 2005)
have been proposed.

In Kohonen's SOM the network structure has to
be specified in advance and remains static during the
training process. The choice of an inappropriate network
structure can degrade the performance of the network.
Some growing self-organizing maps have been
implemented to avoid this disadvantage. In (Fritzke, 1994),
Fritzke proposed the Growing Cell Structures (GCS)
model, with a fixed dimensionality associated with the
output map. In (Fritzke, 1995), the Growing Neural
Gas (GNG) is presented, a new SOM model that learns
topological relations. Even though GNG networks achieve
better topology preservation than GCS networks, the
multidimensional nature of their output map means they
cannot be used to generate graphical map displays in the plane.
However, using the GCS model it is possible to create
networks with a fixed dimensionality lower than or equal
to 3 that can be projected onto a plane (Fritzke, 1994).
The GCS model, without removal of cells, has been used to
compress biomedical multidimensional data sets to be
displayed as two-dimensional colour images (Walker,
Cross & Harrison, 1999).



GROWING CELL STRUCTURES 
VISUALIZATION 

This work studies GCS networks to obtain an embedding
method to project the bi-dimensional output
map, with the aim of generating several graphic map
displays for exploratory data analysis during and
after the self-organization process.

Growing Cell Structures 

The visualization methods presented in this work are
based on the self-organizing map architecture and learning
process of Fritzke's Growing Cell Structures (GCS)
network (Fritzke, 1994). The GCS network architecture
consists of connected units forming k-dimensional
hypertetrahedron structures that are linked to each other.
The interconnection scheme defines the neighbourhood
relationships. During the learning process, new
units are added and superfluous ones are removed, but
these modifications are performed in such a way that the
original architecture structure is maintained.

The training algorithm is an iterative process that
performs a non-linear projection of the input data over
the output map, trying to preserve the topology of the
original data distribution. The self-organization process
of GCS networks is similar to that of Kohonen's
model. For each input signal the best matching unit
(bmu) is determined, and the synaptic vectors of the bmu
and its direct neighbours are modified. In GCS networks each
neuron has an associated resource, which can represent
the number of input signals received by the neuron or
the summed quantization error caused by the neuron.
In every adaptation step the resource of the bmu is
updated accordingly. After a fixed number of adaptation
steps, a new neuron is inserted between the unit with the
highest resource, q, and its direct neighbour with the most
different reference vector, f. The synaptic vector of the new
unit is interpolated from the synaptic vectors of q and f,
and the resource values of q and f are
redistributed too. In addition, neighbouring connections
are modified in order to preserve the output architecture
structure. Once all the training vectors have been
processed a fixed number of times (epochs), the neurons
whose reference vectors fall into regions with a very
low probability density are removed. To guarantee the
architecture structure, some neighbouring connections
are modified too. The relative normalized probability
density estimation value proposed in (Delgado, 2004)
has been used in this work to determine the units to
be removed. This value provides a better interpretation
of some training parameters, improving the removal
of cells and the topology preservation of the network.
Several separate meshes may appear in the output
map when superfluous units are removed.
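The adaptation and insertion steps described above can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: the learning rates eps_b and eps_n, the use of the signal counter as the resource, and the midpoint interpolation of the new synaptic vector are assumptions, and the redistribution of resources, the update of connections and the removal of units are omitted.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adapt(weights, neighbours, resource, signal, eps_b=0.06, eps_n=0.002):
    """One adaptation step: find the bmu, update its resource, move units."""
    bmu = min(weights, key=lambda u: distance(weights[u], signal))
    resource[bmu] += 1.0  # resource taken as the number of signals received
    weights[bmu] = [w + eps_b * (s - w) for w, s in zip(weights[bmu], signal)]
    for n in neighbours[bmu]:  # direct neighbours move with a smaller rate
        weights[n] = [w + eps_n * (s - w) for w, s in zip(weights[n], signal)]
    return bmu


def insertion_pair(weights, neighbours, resource):
    """Return (q, f): highest-resource unit and its most different neighbour."""
    q = max(resource, key=resource.get)
    f = max(neighbours[q], key=lambda n: distance(weights[q], weights[n]))
    return q, f


def new_unit_vector(weights, q, f):
    """Synaptic vector of the inserted unit, here the midpoint of q and f."""
    return [(a + b) / 2.0 for a, b in zip(weights[q], weights[f])]


# Initial triangle (k = 2): three mutually connected units.
weights = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
resource = {0: 0.0, 1: 0.0, 2: 0.0}
bmu = adapt(weights, neighbours, resource, [0.9, 0.1], eps_b=0.5, eps_n=0.1)
q, f = insertion_pair(weights, neighbours, resource)
print(bmu, q, f, new_unit_vector(weights, q, f))
```

Dictionaries keyed by unit identifier keep the sketch independent of any particular mesh data structure; a full implementation would also have to keep the triangle structure consistent after each insertion.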

When the growing self-organization process fin- 
ishes, the synaptic vectors of the output units along with 
the neighbouring connections can be used to analyze 
different input space properties visually. 

Network Visualization: Constructing the 
Topographic Map 

The ability to project high-dimensional input data
onto a low-dimensional grid is an important property
of Kohonen feature maps. By drawing the output map
on a plane it is possible to visualize complex
data and discover properties or relations of the input
vector space that were not expected in advance. The output
layer of a Kohonen feature map can easily be drawn on a
plane as a rectangular grid, where each cell represents
an output neuron and neighbouring cells correspond to
neighbouring output units.

GCS networks have less regular output unit connections
than Kohonen ones. When the architecture factor k=2
is used, the GCS output layer is organized in groups
of interconnected triangles. In spite of the bi-dimensional
nature of these meshes, it is not obvious how to embed
this structure into the plane in order to visualize it. In
(Fritzke, 1994), Fritzke proposed a physical model to
construct the bi-dimensional embedding during the
self-organization process of the GCS network. Each
output neuron is modelled by a disc, with diameter d,
made of elastic material. Two discs with distance d
between centres touch each other, and two discs with
a distance smaller than d repel each other. Each
neighbourhood connection is modelled as an elastic string.
Two discs that are connected but not touching are pulled
towards each other. Finally, all discs are positively charged
and repel each other. Using this model, the bi-dimensional
topographic coordinates of each output neuron can be
obtained, and thus the bi-dimensional output meshes
can be drawn on a plane.

In order to obtain the bi-dimensional coordinates of the
output units in the topographic map (for k=2), a slightly
modified version of this physical model has been used
in this contribution. At the beginning of the training
process, the initial three output neurons are placed in
the plane forming a triangle. Each time a new neuron
is inserted, it is placed exactly halfway between the
positions of the two neighbouring neurons
between which it has been inserted. After this, attraction
and repulsion forces are calculated for every output
neuron and its position is moved accordingly. The
attraction force of a unit is calculated as the sum of the
individual attraction forces that all its neighbouring
connections exert on it. The attraction force between two
neighbouring neurons i and j, with coordinates p_i and p_j
in the plane and Euclidean distance e, is calculated as
(e-d)/2 if e>d, and 0 otherwise. The repelling force of
a unit is calculated as the sum of the individual repulsion
forces that all non-neighbouring output neurons exert
on it. The repelling force between two non-neighbouring
neurons i and j is calculated as d/5 if 2d<e<3d, d/2 if
d<e<2d, d if 0<e<d, and 0 otherwise. There are three
basic differences between the embedding model used
in this work and Fritzke's one. First, the repelling force
is only calculated between non-neighbouring units. Second,
the attracting force between two neurons i and j is
multiplied by the distance normalization ((p_j-p_i)/e) and by
the attraction factor 0.1 (instead of 1). Last, the repelling
force between two neurons i and j is multiplied by the
distance normalization ((p_i-p_j)/e) and by the repulsion
factor 0.05 (instead of 0.2).
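A minimal sketch of one relaxation step of this embedding follows. It relies on assumptions not stated in the text: the forces are added directly to the coordinates, all units are updated synchronously, and a distance exactly equal to d or 2d is assigned to the upper interval. The function and parameter names are illustrative only.

```python
import math


def attraction_magnitude(e, d):
    """Attraction between connected units at distance e: (e - d) / 2 if e > d."""
    return (e - d) / 2.0 if e > d else 0.0


def repulsion_magnitude(e, d):
    """Piecewise repulsion between non-neighbouring units at distance e."""
    if 0.0 < e < d:
        return d
    if d <= e < 2.0 * d:  # boundary handling at e = d is an assumption
        return d / 2.0
    if 2.0 * d <= e < 3.0 * d:  # boundary handling at e = 2d is an assumption
        return d / 5.0
    return 0.0


def embedding_step(pos, neighbours, d=1.0, k_att=0.1, k_rep=0.05):
    """Return new plane coordinates after applying all forces once."""
    new_pos = {}
    for i, (xi, yi) in pos.items():
        fx = fy = 0.0
        for j, (xj, yj) in pos.items():
            if i == j:
                continue
            e = math.hypot(xj - xi, yj - yi)
            if e == 0.0:
                continue  # coincident units: direction undefined, skip
            ux, uy = (xj - xi) / e, (yj - yi) / e  # unit vector from i to j
            if j in neighbours[i]:
                m = k_att * attraction_magnitude(e, d)
                fx, fy = fx + m * ux, fy + m * uy  # pulled towards j
            else:
                m = k_rep * repulsion_magnitude(e, d)
                fx, fy = fx - m * ux, fy - m * uy  # pushed away from j
        new_pos[i] = (xi + fx, yi + fy)
    return new_pos


pos = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 0.5)}
neighbours = {0: {1}, 1: {0}, 2: set()}
print(embedding_step(pos, neighbours))
```

In the example, units 0 and 1 are connected and too far apart, so they attract each other, while the unconnected unit 2 lies closer than d to unit 0 and is pushed away from it.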

The result of applying this projection method is
shown in Fig. 1. When removal of cells is performed,
the different meshes are shown unconnected. Without
any additional information, this projection method
makes cluster detection possible.

Visualization Methods 

Using the projection method described above, traditional
Kohonen visualization methods can be implemented using
GCS networks with k=2. Each output neuron is painted
as a circle in a colour determined by the value of the
displayed parameter. When greyscale is used, dark and light
tones are normally associated with high and low values
respectively. The grey scales are relative to the maximum
and minimum values taken by the parameter. The nature of the
data used to calculate the parameter determines three
general types of methods for performing visual analysis
of self-organizing maps: distances between synaptic
vectors, projection of training patterns over the neurons,
and individual information about synaptic vectors.

All the experiments have been performed using
two groups of simulated data and one group of real
multidimensional data (Fig. 2) selected from a scene
registered by the ETM+ sensor (Landsat 7). The input
signals are defined by the six ETM+ spectral bands with
the same spatial resolution: TM1 to TM5, and TM7.
The input data set has a total of 1800 pixels,
1500 carefully chosen from the original scene and 300
randomly selected. The input vectors are associated with
six land cover categories.

Figure 1. Output mesh projection during different self-organization process stages of a GCS network trained
with bi-dimensional vectors distributed on eleven separate regions.

Figure 2. (a) Eleven separate regions in the bi-dimensional plane. (b) Two three-dimensional chain-link rings. (c)
Projection of multidimensional data of satellite image.

Displaying Distances 

The adaptation process of GCS networks places the 
synaptic vectors in regions with high probability den- 
sity, removing units positioned into regions with a very 
low probability density. A graphical representation of 
distances between the synaptic vectors will be a useful 
tool to detect clusters over the input space. Distance 
map, unified distance map (U-map), and distance addi- 
tion map have been implemented to represent distance 
map information with GCS networks. 

In the distance map, the mean distance between the
synaptic vector of each neuron and the synaptic vectors
of all its direct neighbours is calculated. The U-map
represents the same information as the distance map
but, in addition, it includes the distance between all
neighbouring neurons (painted as a circle between
each pair of neighbouring units). Finally, when the distance
addition map is generated, the sum of the distances between
the synaptic vector of a neuron and the synaptic vectors
of the rest of the units is calculated. In the distance
map and the U-map, dark zones represent clusters and light
zones the boundaries between them. In the distance addition
map, neurons with nearby synaptic vectors appear
with a similar colour, and boundaries can be detected
by analyzing the regions where a considerable colour
variation exists. Using GCS networks, separated meshes
usually represent different input clusters. Fig. 3 shows
an example of these three graphs, compared with the
traditional Kohonen maps, when a data set distributed
over eleven separate regions is used. The GCS network
clearly represents the eleven clusters in the three graphs.
The distance map and the U-map of the Kohonen network show
the eleven clusters too, but in the distance addition map it
is not possible to distinguish them.
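The quantities behind the distance map and the distance addition map can be computed as in the sketch below. The dictionary-based network representation and the function names are assumptions made for illustration; the normalization only rescales each value relative to the minimum and maximum, leaving the mapping to grey tones to the drawing code.

```python
import math


def distance(a, b):
    """Euclidean distance between two synaptic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_map(weights, neighbours):
    """Mean distance from each unit to its direct neighbours."""
    return {
        u: sum(distance(weights[u], weights[n]) for n in neighbours[u])
        / len(neighbours[u])
        for u in weights
    }


def distance_addition_map(weights):
    """Sum of distances from each unit to every other unit."""
    return {
        u: sum(distance(weights[u], weights[v]) for v in weights if v != u)
        for u in weights
    }


def normalize(values):
    """Rescale values to [0, 1] relative to their minimum and maximum."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values coincide
    return {u: (v - lo) / span for u, v in values.items()}


weights = {0: [0.0, 0.0], 1: [3.0, 4.0], 2: [0.0, 8.0]}
neighbours = {0: [1], 1: [0, 2], 2: [1]}
print(distance_map(weights, neighbours))
print(normalize(distance_addition_map(weights)))
```

The U-map would additionally store one value per connection, namely the distance between the two units it joins.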

Displaying Projections 

This technique takes into account the input distribu- 
tion patterns to generate different values to assign to 
each neuron. For GCS networks, data histograms and 
quantization error maps have been implemented. 

To generate the histogram, the number of training
patterns associated with each neuron is obtained. When
the quantization error graph is produced, in contrast, the
sum of the distances between the synaptic vector of a
neuron and the input vectors that lie in its Voronoi
region is calculated. In both graphs, dark and light zones
correspond to high and low probability density areas,
respectively, so they can be used in cluster analysis. Fig. 4
shows an example of these two methods compared with
those obtained using Kohonen's model when the chain-link
distribution data set is used. Using Kohonen's model it is
difficult to distinguish the number of clusters present
in the input space. The GCS model, on the other hand, has
generated three output meshes, two of them representing
one ring.
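Both projection graphs rely on assigning every training pattern to its best matching unit. A minimal sketch, assuming the same dictionary-based representation of synaptic vectors (the names are illustrative, not taken from this work):

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project_patterns(weights, inputs):
    """Return (data histogram, quantization error) per output unit."""
    histogram = {u: 0 for u in weights}
    q_error = {u: 0.0 for u in weights}
    for x in inputs:
        # The pattern lies in the Voronoi region of its nearest synaptic vector.
        bmu = min(weights, key=lambda u: distance(weights[u], x))
        histogram[bmu] += 1
        q_error[bmu] += distance(weights[bmu], x)
    return histogram, q_error


weights = {0: [0.0, 0.0], 1: [10.0, 0.0]}
inputs = [[1.0, 0.0], [0.0, 2.0], [9.0, 0.0]]
print(project_patterns(weights, inputs))
```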



Figure 3. From left to right: distance map, U-map (unified distance map), and distance addition map when an
eleven separate regions distribution data set is used. (a) Kohonen feature map with 10x10 grid of neurons. (b)
GCS network with 100 output neurons. The right column shows the input data and the network projection using
the two component values of the synaptic vectors.

Displaying Components

The displaying components technique analyzes each
component of the synaptic (reference) vectors
individually. This kind of graph offers a visual
analysis of the topology preservation of the network, and
a possible detection of correlations and dependences
between training data components. Direct visualization
of synaptic vectors and component plane graphs have
been implemented for GCS networks.

The direct visualization map represents each neuron
as a circle with its synaptic vector drawn graphically
inside it. This graph can be complemented with
any of those described in the previous sections, enriching
its interpretation. A component plane map visualizes an
individual component of all the synaptic vectors.

When all the component planes are generated,
relations between weights can be appreciated if similar
structures appear in identical places of two different
component planes. Fig. 5 shows an example of these two
displaying methods when multi-band satellite image data
is used. The direct visualization map shows the
similarity between the synaptic vectors of neighbouring
units, and it is worth noting that all the
neurons in a cluster have similarly shaped synaptic vectors.
Furthermore, the integrated distance addition map
information shows that there is no significant colour
variation inside the same cluster. The six component
plane graphs exhibit possible dependences involving
the TM1, TM2 and TM3 input vector components, and
also the TM5 and TM7 components.

Figure 4. From left to right: unified distance map, data histograms and quantization error maps when chain-link
distribution data set is used. (a) Kohonen feature map with 10x10 grid of neurons. (b) GCS network with 100
output neurons. The right column shows the input data and the network projection using the three component
values of the synaptic vectors.

Figure 5. GCS network trained with multidimensional data of satellite image, 54 output neurons. Graphs from
(a) to (f) show the component planes for the six elements of the synaptic vectors. (g) Direct visualization map
using distance addition map additional information.



Results

Several Kohonen and GCS networks have been trained
in order to evaluate and compare the resulting visualization
graphs. For the sake of space only a few of these
maps have been included here. Fig. 3 and Fig. 4 compare
Kohonen and GCS visualizations using the distance map,
U-map, distance addition map, data histograms and
quantization error map. It can be observed that the GCS
model offers much better graphical results in cluster
analysis than Kohonen networks. Because units and
connections inside low probability density areas are
removed, a GCS network presents within a
particular cluster the same quality of information that a
Kohonen network presents for the entire map. As
already mentioned, the grey scale used in
all the maps is relative to the maximum and minimum
values taken by the studied parameter. In all cases
the range of values taken by the calculated factor using
GCS is smaller than when using Kohonen maps.

The visualization methods described, applied to the
visual analysis of multidimensional satellite data, have
given very satisfactory results (Fig. 5). All trained GCS
networks have been able to generate six submaps in
the output layer (in some cases up to
eight) that identify the six land cover classes present
in the data sample. The direct visualization map
and the component plane graphs have proved
to be a useful tool for extracting knowledge from
the multisensorial data.



FUTURE TRENDS

The proposed knowledge visualization method based
on GCS networks has proved to be a useful tool for
multidimensional data analysis. In order to evaluate the
quality of the trained networks we consider it necessary
to develop some measurement techniques (qualitative and
quantitative, in numerical and graphical format) to
analyze the topology preservation obtained. In this way
we will be able to validate the information visualized
by the methods presented in this paper.

It would also be interesting to validate these visualization
methods with new data sets of very high
dimensional nature. We need to study the viability of
cluster analysis with this projection technique when
this class of data samples is used.


CONCLUSION 

The embedding method described allows multidimensional
data to be displayed as two-dimensional grey
images. Human observers can use their visual-spatial
abilities to explore these graphical maps and extract
interrelations and characteristics in the dataset.

In the GCS model the network size does not have to
be specified in advance. During the training process,
the size of the network grows and decreases, adapting
its architecture to the particular characteristics of the
training dataset.

Although in GCS networks it is necessary to determine
a greater number of training factors than in the Kohonen
model, using the modified learning model the tuning of
the training factor values is simplified. In fact, several
experiments have been carried out on datasets of diverse
nature using the same values for all the training factors,
giving excellent results in all cases.

Especially notable is the cluster detection during the
self-organization process without any additional
information.

KEY TERMS

Artificial Neural Networks: An interconnected 
group of units or neurons that uses a mathematical 
model for information processing based on a connec- 
tionist approach to computation. 

Data Mining: The application of analytical methods 
and tools to data for the purpose of identifying patterns, 
relationships or obtaining systems that perform useful 
tasks such as classification, prediction, estimation, or 
affinity grouping. 

Exploratory Data Analysis: A philosophy about how
a data analysis should be carried out. Exploratory data
analysis employs a variety of techniques (mostly graphical)
to extract the knowledge inherent in the data.

Growing Cell Structures: A growing variant of the
self-organizing map model that dynamically adapts the
size and connections of the output
layer to the characteristics of the training patterns.

Knowledge Visualization: The creation and com- 
munication of knowledge through the use of computer 
and non-computer-based, complementary, graphic 
representation techniques. 

Self-Organizing Map: A subtype of artificial neural
network. It is trained using unsupervised learning to
produce a low-dimensional representation of the training
samples while preserving the topological properties of
the input space.

Unsupervised Learning: Method of machine
learning where a model is fit to observations. It is
distinguished from supervised learning by the fact that
there is no a priori output.

INTRODUCTION

Unit Selection Text-to-Speech Synthesis (US-TTS)
systems produce synthetic speech based on the retrieval
of previously recorded speech units from a speech
database (corpus), driven by a weighted cost function
(Black & Campbell, 1995). To obtain high quality
synthetic speech these weights must be optimized
efficiently. To that effect, in previous works, a technique
was introduced for weight tuning based on evolutionary
perceptual tests by means of Active Interactive
Genetic Algorithms (aiGAs) (Alias, Llora, Formiga,
Sastry & Goldberg, 2006). aiGAs mine models that
map subjective preferences from users by means of partial
ordering graphs, synthetic fitness and Evolutionary
Computation (EC) (Llora, Sastry, Goldberg, Gupta &
Lakshmi, 2005). Although aiGAs provide an effective
method to map single-user preferences, as far as we
know, the methodology to extract common solutions
among different individual preferences (hereafter
denoted as common knowledge) has not been tackled
yet. Furthermore, there is an ambiguity problem to be
solved when different users evolve to different weight
configurations. In this review, Generative Topographic
Mapping (GTM) is introduced as a method to extract
common knowledge from aiGA models obtained from
user preferences.



BACKGROUND 

Weight Tuning in Unit-Selection Text-to- 
Speech Synthesis 

The aim of US-TTS is to generate synthetic speech
by concatenating the sequence of units that best fits the
requirements derived from the input text. The speech
units are retrieved from a database (speech corpus)
which stores speech units previously recorded, typically
by a professional speaker.

The text-to-speech workflow is generally modelled
as two independent blocks that convert written text
into a speech signal. The first block is named Natural
Language Processing (NLP), which is followed by the
Digital Signal Processing (DSP) block. In the first stage,
the NLP block carries out text preprocessing (e.g.
conversion of digits or acronyms to words); then
it converts graphemes to phonemes. In the last stage,
the NLP block assigns quantified prosody parameters
to each phoneme, guiding the way each phoneme is
converted to signal. Generally, these quantified prosody
parameters involve duration, pitch and energy. Next, the
DSP block retrieves from a recorded database (speech
corpus) the sequence of units that best matches the
target requirements (the phonemes and their prosody).
Finally, the speech units are concatenated to obtain the
output speech signal.

The retrieval process is done by a dynamic
programming algorithm (e.g. Viterbi or A* (Formiga
& Alias, 2006)) driven by a cost function. The cost
function computes the cost of selecting a unit within
a sequence as the sum of two weighted subcosts (see
equation (1)): the target subcost ($C^t$) and the
concatenation subcost ($C^c$). In this work, $C^t$ is considered
as a weighted linear combination of the normalized
prosody distances between the target NLP-predicted
prosody vector and the candidate unit prosody vector
(see equation ). The $C^c$, in turn, is computed as a
weighted linear combination of the distances between
the feature vectors of the speech signal around its
concatenation point (see equation ),
where $t_1^n$ represents the target unit sequence
$\{t_1, t_2, \ldots, t_n\}$ and $u_1^n$ represents the candidate unit sequence
$\{u_1, u_2, \ldots, u_n\}$.
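
For orientation, a cost function of the kind described above is commonly written, following Hunt & Black (1996), in the form sketched below, with $C^t$ the target subcost and $C^c$ the concatenation subcost. The weight symbols $w_j^t$ and $w_j^c$ and the numbers of subcosts $p$ and $q$ are assumed notation rather than symbols taken from this text, and the exact normalization used in this work may differ.

```latex
\begin{align*}
C(t_1^n, u_1^n) &= \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i) \\
C^t(t_i, u_i) &= \sum_{j=1}^{p} w_j^t \, C_j^t(t_i, u_i) \\
C^c(u_{i-1}, u_i) &= \sum_{j=1}^{q} w_j^c \, C_j^c(u_{i-1}, u_i)
\end{align*}
```

The weights $w_j^t$ and $w_j^c$ are the quantities that the weight tuning techniques discussed in this review try to optimize.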
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Appropriate design of cost function by means of 
weight training is crucial to achieving high quality synthetic
speech (Black, 2002). Nevertheless, this problem has
been approached in several ways, with no single agreed
solution. Several techniques have been suggested for weight
tuning, which may be split into three families: i) manual
tuning, ii) computationally-driven purely objective methods
and iii) perceptually optimized techniques (Alias, Llora,
Formiga, Sastry & Goldberg, 2006). The present review
focuses on techniques that bring human feedback into
the training process, following previous work (Alias,
Llora, Formiga, Sastry & Goldberg, 2006), which is
outlined in the next section.

The Approach: Interactive Evolutionary 
Weight Tuning 

Computationally-driven purely objective methods are
mainly focused on an acoustic measure (obtained from
cepstral distances) between the resynthesized and the
natural signals. Hunt and Black adopted two approaches
in (Hunt & Black, 1996). The first approach was based
on adjusting the weights through an exhaustive search
of a prediscretized weight space (weight space search,
WSS). The second approach proposed by the authors
used a multilinear regression technique (MLR) across
the entire database to compute the desired weights.
Later, Meron and Hirose (Meron & Hirose, 1999)
presented a methodology that improved the efficiency of
the WSS and refined the MLR method. A previous
work (Alias & Llora, 2003) introduced evolutionary
computation to perform this tuning. More precisely,
Genetic Algorithms (GA) were applied to obtain the
most appropriate weights. The main added value of
using GA to find the optimal weight configuration
is its independence from linear search models (as in
MLR) and, in addition, it avoids the exhaustive search
(as in WSS).

However, all these methods share a drawback: they depend
on an acoustic measure to determine the actual quality
of the synthesized speech, which is largely relative
to human hearing. To obtain better speech quality,
it was suggested that the user should take part in the
process. In (Alias, Llora, Iriondo, Sevillano, Formiga
& Socoro, 2004), preference tests were conducted
by synthesizing the training text according to two
different weight configurations and comparing the
subjective quality of the resulting speech. Subsequently, Active
Interactive Genetic Algorithms were presented in (Llora, Sastry,
Goldberg, Gupta & Lakshmi, 2005) as an interactive
evolutionary computation method where the user
feedback evolves the solutions through a
survival-of-the-fittest mechanism. The inherent fitness of the
solutions is based on the partial order provided by the evaluator.
Active iGAs base their efficiency on evolving different
solutions by means of a surrogate fitness, which
generalizes the user preferences. This surrogate fitness and
the evolutionary process are based on the following
key elements: i) partial ordering, ii) induced complete
order, and iii) surrogate function via ε-Support Vector
Machines (ε-SVM). Preference decisions made by
the user are modelled as a directed graph which
is used to generate a partial ordering of solutions (e.g.
$x_1 > x_2$; $x_2 > x_3$: $x_1 \to x_2 \to x_3$) (see figure 1). Table 1
shows the approach of global rank based on the dominance
measure: given a vertex v, the number of dominated
vertices δ(v) and of dominating vertices φ(v) is computed.
Using these measures, the estimated fitness may be
computed as f(v) = δ(v) - φ(v). The estimated ranking
r(v) is obtained by sorting based on f(v) (Llora, Sastry,
Goldberg, Gupta & Lakshmi, 2005). The procedure of
aiGA is detailed in algorithm 1.
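The dominance-based estimate can be sketched as follows. The sketch assumes that "dominated" and "dominating" vertices are counted by reachability in the preference graph (all vertices on paths leaving or arriving at v), that the graph has already been cleared of cycles, and that ties in the ranking are broken by name; these choices and the function names are illustrative and not taken from the cited work.

```python
def reachable(graph, start):
    """Set of vertices reachable from start by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)
    return seen


def estimated_fitness(preferences):
    """preferences: iterable of (better, worse) pairs chosen by the user."""
    dominated_by, dominates_of = {}, {}
    for better, worse in preferences:
        dominated_by.setdefault(better, set()).add(worse)   # edge better -> worse
        dominates_of.setdefault(worse, set()).add(better)   # reversed edge
    vertices = set(dominated_by) | set(dominates_of)
    fitness = {
        v: len(reachable(dominated_by, v)) - len(reachable(dominates_of, v))
        for v in vertices
    }
    # Estimated ranking: sort by decreasing fitness, ties broken by name.
    ranking = sorted(vertices, key=lambda v: (-fitness[v], v))
    return fitness, ranking


fitness, ranking = estimated_fitness([("x1", "x2"), ("x2", "x3")])
print(fitness, ranking)
```

For the chain of the example, x1 dominates two solutions and is dominated by none, so it obtains the highest estimated fitness and the first position in the ranking.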

However, once the global weights were obtained
with aiGA, there was no single dominant weight solution
(Alias, Llora, Formiga, Sastry & Goldberg, 2006),
i.e. the tests performed by different users gave both similar
and different solutions. This fact implied that a second
group of users had to validate the obtained weights.
Thus, clustering the results of the different tests was a
suitable approach to the weight tuning problem, with the
goal of extracting consistent results from the user tests.

Figure 1. (Left) Partial evaluations allow building a directed graph among all solutions. (Right) Obtained graph
must be cleared to avoid cycles and draws.

Table 1. Estimation of the global ranking based on the dominance measure. Dur_T, Ene_T and Pit_T stand for the
weight values for target weights (duration, energy and pitch). In the same way, Pit_C, Ene_C and Mfc_C stand for
the weight values for concatenation weights (Pitch, Energy and Mel Frequency Cepstrum).



GENERATIVE TOPOGRAPHIC MAPPING 
BEING A PLUS 

GTM in a Nutshell 

Unsupervised learning allows grouping sparse data into 
clusters according to the similarity of data samples. Several 
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Algorithm 1 




Algorithm 1. Algorithm description of active iGA 

procedure aiGA() 

1. Create an empty directed graph G. 
2. Create 2^h random initial solutions (S set). 
3. Create the hierarchical tournament set T using the available solutions in S. 
4. Present the tournaments in T to the user and update the partial ordering in G. 
5. Estimate f(v) for each v ∈ S. 
6. Train the surrogate ε-SVM synthetic fitness based on S and r(v). 
7. Optimize the surrogate ε-SVM synthetic fitness with cGA. 
8. Create the S' set with 2^(h-1) different solutions where S ∩ S' = ∅, sampling out of 
   the probabilistic model evolved by cGA. 
9. Create the hierarchical tournament set T' with 2^(h-1) tournaments using 2^(h-1) 
   solutions in S and 2^(h-1) solutions in S'. 
10. S ← S ∪ S'. 
11. T ← T ∪ T'. 
12. Go to 4 while not converged. 



methods perform this grouping (Figueiredo & Jain, 
2002): Expectation Maximization (EM), k-means, 
Gaussian Mixture Models (GMM), Self Organizing 
Maps (SOM) and Generative Topographic Mapping (GTM), 
among others. 

According to (Figueiredo & Jain, 2002), these techniques 
may be grouped into two types of formulation: i) 
model-based methods (e.g. GMM, EM, GTM) and 
ii) heuristic methods (e.g. k-means or hierarchical 
agglomerative methods). The distinguishing property 
is the number of sources assumed to generate the data. 
Model-based methods suppose that each observation 
has been generated by one (arbitrarily chosen and 
unidentified) source out of a set of alternative 
sources. Therefore, inferring these sources and 
mapping each observation to its source leads to a 
clustering of the set of observations. Heuristic 
methods, in contrast, assume only one source for the 
observed data, considering similar heterogeneity for them. 

Self-Organizing Maps (or Kohonen maps) (Kohonen, 
1990) are a clustering technique based on neural 
networks. The main added value of SOM is the ease 
with which multidimensional data can be visualized. 
Generative Topographic Mapping, in turn, is a 
nonlinear latent variable model introduced 
in (Bishop, Svensen & Williams, 1998). GTM is intended 
as an alternative to SOM that overcomes its 
limitations, which are listed in (Kohonen, 
2006): i) the absence of a cost function, ii) the lack of 
a theoretical basis for choosing learning rate parameter 
schedules and neighbourhood parameters to ensure 
topographic ordering, iii) the absence of any general 
proofs of convergence and iv) the fact that the model 
does not define a probability density. 

GTM is based on a constrained Gaussian mixture model 
whose parameters can be tuned through the EM algorithm. 
The handicap of heuristic-based models is that there is 
no a priori distribution of the centroids for each cluster. 
In GTM, the set of latent points is modelled as a grid. 
Each point in the grid corresponds, through a weighted 
set of non-linear basis functions, to a circular Gaussian 
distribution in the multidimensional data space. Thus, 
the grid is shaped to wrap the data, thanks to the 
explicit order among the Gaussian distributions. 
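
The mapping from the latent grid to the data space can be sketched as follows. This is an illustrative sketch of the usual GTM formulation, y(x; W) = W φ(x), in which Gaussian basis functions φ are evaluated at each latent point and combined linearly by the weight matrix W. The helper names, the 2 x 2 layout of basis centres, the basis width and the fixed values in W are assumptions chosen for illustration; in practice W is fitted by EM rather than set by hand.

```python
import math

def latent_grid(k):
    """Return k*k latent points on a regular grid in [-1, 1]^2 (requires k >= 2)."""
    step = 2.0 / (k - 1)
    return [(-1.0 + i * step, -1.0 + j * step) for i in range(k) for j in range(k)]

def rbf_features(x, centres, sigma):
    """Evaluate Gaussian basis functions at latent point x and append a bias term."""
    feats = [
        math.exp(-((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) / (2 * sigma ** 2))
        for c in centres
    ]
    return feats + [1.0]

def map_to_data_space(x, centres, sigma, W):
    """Compute y = W * phi(x): the centre of the Gaussian tied to latent point x."""
    phi = rbf_features(x, centres, sigma)
    return [sum(w * p for w, p in zip(row, phi)) for row in W]

grid = latent_grid(4)       # 4 x 4 latent grid
centres = latent_grid(2)    # assumed 2 x 2 layout of basis-function centres
# Placeholder weights for a six-dimensional data space (one row per dimension).
W = [[0.1 * (d + 1)] * (len(centres) + 1) for d in range(6)]
gaussian_centres = [map_to_data_space(x, centres, 1.0, W) for x in grid]
```

Because neighbouring latent points produce similar basis-function values, their Gaussian centres stay close in the data space, which is what keeps the grid ordered.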
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Modelling User Preferences by GTM 

GTM is able to extract solutions from the different 
aiGA-evolved graphs due to the consistency of its 
theoretical basis. The key objective is to identify the 
important clusters inside the evolved data space and 
then to determine the fitness entropy of each cluster, 
in terms of fitness variance, in order to choose the 
global weight configuration set. 

GTM can map the best aiGA weights from the 
multidimensional weight space onto a two-dimensional space. 
Considering the cluster with the highest averaged 
fitness and the lowest standard deviation allows selecting 
the best weight configuration from the different users' aiGA 
models. To adjust this method, the geometry of the 
Gaussian distributions and the size of the latent space 
have to be set up manually. EM then fits the GTM 
centroids and the weights of the basis functions. Next, 
the average fitness and its standard deviation are 
extracted for each cluster. Both are computed over the 
set of weight combinations whose Bayesian a posteriori 
probability is highest for that cluster. 
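
The per-cluster statistics can be sketched as follows, assuming that the posterior probabilities of each weight combination over the clusters are already available from a fitted GTM. The function names and the tie-breaking rule (highest mean first, then lowest standard deviation) are assumptions; the text does not say how the two criteria are combined.

```python
from statistics import mean, pstdev

def cluster_stats(posteriors, fitness):
    """Assign each weight combination to its most probable cluster and
    return {cluster: (mean fitness, standard deviation)}."""
    groups = {}
    for post, f in zip(posteriors, fitness):
        k = max(range(len(post)), key=post.__getitem__)  # maximum a posteriori cluster
        groups.setdefault(k, []).append(f)
    return {k: (mean(fs), pstdev(fs)) for k, fs in groups.items()}

def best_cluster(stats):
    """Prefer the highest mean fitness, breaking ties by the lowest deviation."""
    return max(stats, key=lambda k: (stats[k][0], -stats[k][1]))

# Hypothetical posteriors for three weight combinations over two clusters.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
fitness = [0.8, 0.9, 0.4]
stats = cluster_stats(posteriors, fitness)
chosen = best_cluster(stats)
```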



Note that the fitness itself is not involved in the 
EM optimization, because it is 
relative to each user and is not known for weight 
combinations that a specific user has not evaluated 
(unless predicted by the ε-SVM). 

Experiments and Results 

In (Formiga & Alias, 2007), common knowledge was 
extracted from user-evolved weights from previous 
tests conducted on a Catalan speech corpus with 9863 
recorded units (1207 diphones and triphones) obtained 
from 1520 sentences and words (Alias, Llora, 
Formiga, Sastry & Goldberg, 2006). In that test, 
five phonetically balanced sentences were extracted 
from the corpus to perform the global weight tuning 
process through a web interface named SinEvo (Alias, 
Llora, Iriondo, Sevillano, Formiga & Socoro, 2004). 
The evolved weights were normalized through Max-Min 
normalization so that all weights range between 0 and 
1. That test was conducted by three users, obtaining 
fifteen different weight configurations. 
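
Max-Min normalization rescales a set of values linearly so that the minimum maps to 0 and the maximum to 1. A minimal sketch follows; the function name is hypothetical, and whether the normalization was applied per weight or per configuration is not stated in the text.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max_normalize([2, 4, 6])
```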



Figure 2. Performance of GTM: Different Pareto fronts were analyzed for each configuration 
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In (Formiga & Alias, 2007), different configurations 
of GTM were analyzed for mapping normalized 
weights (hexagonal or rectangular grid and different 
grid sizes: 3 x 3, 4 x 4 and 5 x 5). The purpose of this 
analysis was to find the optimal GTM configuration, 
i.e. the one which minimizes the averaged standard 
deviation (std) per cluster and the number of significant 
clusters per population (with averaged fitness over 
75%) while maximizing the averaged mean fitness per 
cluster. As may be noticed in figure 2, the 4 x 4 grid 
configuration with hexagonal latent grid was selected 
as it yielded the best Pareto front (although the rest of 
the 4 x 4 grids achieved similar performance). 
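
The selection criterion can be read as a Pareto comparison over three objectives. The sketch below assumes each configuration is summarized by a tuple (std per cluster, number of significant clusters, mean fitness per cluster); the configuration names and numbers are invented for illustration and are not the reported results.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.
    Tuples are (std, n_clusters, mean_fitness): minimize, minimize, maximize."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and better

def pareto_front(configs):
    """Return the names of the configurations that no other configuration dominates."""
    return [
        name for name, c in configs.items()
        if not any(dominates(o, c) for other, o in configs.items() if other != name)
    ]

# Hypothetical summary values for three grid sizes.
configs = {
    "3x3": (0.20, 3, 0.70),
    "4x4": (0.10, 3, 0.80),
    "5x5": (0.10, 5, 0.78),
}
front = pareto_front(configs)
```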

After GTM was set up, each evolved weight configuration 
was extracted and mapped onto the other users' GTMs within 
the same sentence, obtaining its corresponding fitness 
from the other users' preferences. Equation 6 allowed 
setting a global fitness (gF) from the overall averaged 
fitness (\bar{F}_{i,j}) for each evolved weight configuration. 

gF = \frac{1}{N} \sum_{i=1}^{U} \sum_{j=1}^{W} \bar{F}_{i,j}    (6) 

where U stands for the number of users, W for the 
number of weight configurations, and N for the total 
number of weights (U + W). 

In addition, to check whether a manual perceptual validation 
stage can be avoided, ten different users (not involved 
whatsoever in the tuning process) compared the best aiGA 
weights, allowing a comparison between GTM clustering 
and real human preference at the validation stage. 

Analyzing the results in figure 3, the weight configurations 
most voted by GTM agree with the manual user 
preferences for three sentences (De la Seva Selva, Del 
Seu Territori and Grans Extensions). However, the rest 
of the sentences behave quite differently. In I els 
han venut, the best weight combination selected by the 
users was the second best GTM weight configuration, 
while the best GTM weight combination 
was never voted. Cosine correlation is taken into 
account among the problematic weight configurations, as 
the important matter is the weight distribution rather 
than the values themselves. In this case, the two best 
GTM weights have a 0.7841 correlation, so the GTM 
results may be considered satisfactory, as the weights 
follow equivalent patterns. On the other hand, the correlation 




Figure 3. The results of the comparison between normalized user voted preferences and GTM mapping are 
presented for the five sentences. Two different solutions for the same user were considered if they obtained similar 
fitness in the aiGA model. 
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between the two best GTM weights configurations is 
0.8323 in Fusta de Birmània and, as in the previous 
case, the correlation again gives satisfactory results. 
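
The comparison of weight patterns can be sketched as below, reading the "cosine correlation" of the text as the cosine similarity between two weight vectors (an assumption about the exact measure used). The value depends only on the direction of the vectors, not on their scale, which matches the stated aim of comparing weight distributions rather than values. The function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_pattern = cosine_similarity([1, 2, 3], [2, 4, 6])   # same distribution, other scale
orthogonal = cosine_similarity([1, 0], [0, 1])
```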



FUTURE TRENDS 

Future work will be focused on conducting new ex- 
periments, e.g. by clustering similar units instead of 
tackling global weight tuning on preselected sentences 
or by including more users in the training process. In 
addition the expansion of the capabilities of GTM to 
map user preferences opens the possibility to focus 
on non-linear cost functions so as to overcome the 
linearity restrictions of the present function. 



CONCLUSIONS 

This article continues the work of including user 
preferences for tuning the weights of the cost function in 
unit-selection TTS systems. In previous works we 
presented a method for finding the optimal cost function 
weights based on the perceptual criteria of individual 
users. As a next step, this paper applies a heuristic 
method for choosing the best solution among all users, 
overcoming the need to conduct a second listening test 
to select the best weight configuration among the 
individual optimal solutions. This proof-of-principle study 
shows that GTM is capable of mapping the common 
knowledge among different users by working 
on the perceptually optimized weight space obtained 
through aiGA, yielding a final solution that can be 
used for a final adjustment of the TTS. 
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KEY TERMS 

Correlation: A statistical measurement of the 
interdependence or association between two quantitative 
or qualitative variables. A typical calculation would be 
performed by multiplying a signal by either another 
signal (cross-correlation) or by a delayed version of 
itself (autocorrelation). 

Digital Signal Processing (DSP): DSP, or Digital 
Signal Processing, as the term suggests, is the processing 
of signals by digital means. The processing of a digital 
signal is done by performing numerical calculations. 

Diphone: A sound consisting of two phonemes: one 
that leads into the sound and one that finishes the sound, 
e.g.: "hello" silence-h h-eh eh-l l-oe oe-silence. 

Evolutionary Algorithms: Collective term for all 
variants of (probabilistic) optimization and approxima- 
tion algorithms that are inspired by Darwinian evolution. 
Optimal states are approximated by successive improve- 
ments based on the variation-selection-paradigm. 

Generative Topographic Mapping (GTM): It is a 
technique for density modelling and data visualisation 
inspired by SOM (see SOM definition). 



Mel Frequency Cepstral Coefficients (MFCC): 

The MFCC are the coefficients of the Mel cepstrum. 
The Mel-cepstrum is the cepstrum computed on the 
Mel-bands (scaled to human ear) instead of the Fourier 
spectrum. 

Natural Language Processing (NLP): Computer 
understanding, analysis, manipulation, and/or genera- 
tion of natural language. 

Pitch: Intonation measure at a given time in the 
signal. 

Prosody: A collection of phonological features 
including pitch, duration, and stress, which define the 
rhythm of spoken language. 

Text Normalization: The process of converting 
abbreviations and non-word written symbols into 
words that a speaker would say when reading that 
symbol out loud. 

Unit Selection Synthesis: A synthesis technique 
where appropriate units are retrieved from large da- 
tabases of natural speech so as to generate synthetic 
speech. 

Unsupervised Learning: Learning techniques that 
group instances without a pre-specified dependent at- 
tribute. Clustering algorithms are usually unsupervised 
methods for grouping data sets. 

Self-Organizing Maps: Self-organizing maps 
(SOMs) are a data visualization technique which reduces 
the dimensions of data through the use of 
self-organizing neural networks. 

Surrogate Fitness: Synthetic fitness measure that 
tries to evaluate an evolutionary solution in the same 
terms as a perceptual user would. 
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INTRODUCTION 

Representing and consequently processing fuzzy data 
in standard and binary databases is problematic. The 
problem is further amplified in binary databases where 
continuous data is represented by means of discrete 
'1' and '0' bits. As regards classification, the problem 
becomes even more acute. In these cases, we may want 
to group objects based on some fuzzy attributes, but 
unfortunately, an appropriate fuzzy similarity measure 
is not always easy to find. The current paper proposes 
a novel model and measure for representing fuzzy 
data, which lends itself to both classification and data 
mining. 

Classification algorithms and data mining attempt 
to set up hypotheses regarding the assigning of differ- 
ent objects to groups and classes on the basis of the 
similarity/distance between them (Estivill-Castro & 
Yang, 2004) (Lim, Loh & Shih, 2000) (Zhang & Srihari, 
2004). Classification algorithms and data mining are 
widely used in numerous fields including: social sci- 
ences, where observations and questionnaires are used 
in learning mechanisms of social behavior; marketing, 
for segmentation and customer profiling; finance, for 
fraud detection; computer science, for image process- 
ing and expert systems applications; medicine, for 
diagnostics; and many other fields. 

Classification algorithms and data mining 
methodologies are based on a procedure that calculates a 
similarity matrix based on a similarity index between 
objects and on a grouping technique. Research has 
shown that a similarity measure based upon binary 
data representation yields better results than regular 
similarity indexes (Erlich, Gelbard & Spiegler, 2002) 
(Gelbard, Goldman & Spiegler, 2007). However, binary 
representation is currently limited to nominal discrete 
attributes, such as gender or marital 
status (Zhang & Srihari, 2003). This makes the 
binary approach to data representation unattractive 
for widespread data types. 

The current research describes a novel approach 
to binary representation, referred to as Fuzzy Binary 
Representation. This new approach is suitable for all 
data types - nominal, ordinal and continuous. We 
propose that there is meaning not only to the actual 
explicit attribute value, but also to its implicit similarity 
to other possible attribute values. These similarities can 
either be determined by a problem domain expert or au- 
tomatically by analyzing fuzzy functions that represent 
the problem domain. The added new fuzzy similarity 
yields improved classification and data mining results. 
More generally, Fuzzy Binary Representation and re- 
lated similarity measures exemplify that a refined and 
carefully designed handling of data, including eliciting 
of domain expertise regarding similarity, may add both 
value and knowledge to existing databases. 



BACKGROUND 
Binary Representation 

Binary representation creates a storage scheme, wherein 
data appear in binary form rather than the common 
numeric and alphanumeric formats. The database 
is viewed as a two-dimensional matrix that relates 
entities according to their attribute values. Having 
the rows represent entities and the columns represent 
possible values, entries in the matrix are either '1' or 
'0', indicating that a given entity (e.g., record, object) 
has or lacks a given value, respectively (Spiegler & 
Maayan, 1985). 

In this way, we can have a binary representation for 
discrete and continuous attributes. 
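
A minimal sketch of this storage scheme is given below: each column stands for one possible attribute value, and a row holds '1' where the entity has that value. The attribute domains are the ones used in the example of this article; the two records and the function name are hypothetical.

```python
def binary_representation(records, domains):
    """Return (columns, rows): one column per (attribute, value) pair and
    one 0/1 row per record."""
    columns = [(attr, val) for attr, vals in domains.items() for val in vals]
    rows = [
        [1 if rec[attr] == val else 0 for attr, val in columns]
        for rec in records
    ]
    return columns, rows

domains = {
    "marital_status": ["S", "M", "D", "W"],
    "height": [1.55, 1.56, 1.60, 1.84],
}
# Hypothetical records, not the rows of the original table.
records = [
    {"marital_status": "M", "height": 1.60},
    {"marital_status": "S", "height": 1.55},
]
columns, rows = binary_representation(records, domains)
```

Each attribute contributes exactly one '1' per row, which is the mutual exclusiveness property discussed later in the article.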
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Table 1. Standard binary representation table 
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Table 1 illustrates binary representation of a database 
consisting of five entities with the following two attributes: 
Marital Status (nominal) and Height (continuous). 

Marital Status, with four values: S (single), M 
(married), D (divorced), W (widowed). 
Height, with four values: 1.55, 1.56, 1.60 and 
1.84. 

However, practically, binary representation is cur- 
rently limited to nominal discrete attributes only. In the 
current study, we extend the binary model to include 
continuous data and fuzzy representation. 

Similarity Measures 

Similarity/distance measures are essential and at the 
heart of all classification algorithms. The most 
commonly-used method for calculating similarity is the 
Squared Euclidean measure. This measure calculates 
the distance between two samples as the sum of all 
squared differences between their properties; its 
square root is the ordinary Euclidean distance 
(Jain & Dubes, 1988) (Jain, Murty & Flynn, 
1999). 
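
The two related distances can be written down directly. The sketch below uses hypothetical function names and treats each sample as a list of numeric properties.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared differences between corresponding properties."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Ordinary Euclidean distance: the square root of the squared measure."""
    return math.sqrt(squared_euclidean(a, b))

d2 = squared_euclidean([0, 0], [3, 4])
d = euclidean([0, 0], [3, 4])
```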

However, these distance-based similarity measures are 
applicable only to ordinal attributes and cannot be used 
to classify nominal, discrete, or categorical attributes, 
since there is no meaning in placing such attribute values 
in a common Euclidean space. A similarity measure 
which is applicable to nominal attributes, and which is 
used in our research, is the Dice index (Dice 1945). 
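
For two binary vectors, the Dice index is twice the number of positions where both vectors hold a '1', divided by the total number of '1's in the two vectors. A minimal sketch with a hypothetical function name:

```python
def dice(a, b):
    """Dice index of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0

identical = dice([1, 0, 0, 0], [1, 0, 0, 0])
disjoint = dice([1, 0, 0, 0], [0, 1, 0, 0])
partial = dice([1, 1, 0], [1, 0, 0])
```

Two entities with different values of a mutually exclusive attribute share no '1' positions for that attribute, so their Dice index on it is 0.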

Additional binary similarity measures have been 
developed and presented (Illingworth, Glaser & Pyle, 
1983) (Zhang & Srihari, 2003). Similarity measures 
between the different attribute values, as proposed in 
Zadeh's (1971) model, are essential in the classification 
process. 

In the current study we use similarities between 
entities and between the entities' attribute values to get better 
classification. Following former research (Gelbard 
& Spiegler, 2000) (Erlich, Gelbard & Spiegler, 2002), 
the current study also uses the Dice measure. 

Fuzzy Logic 

The theory of Fuzzy Logic was first introduced by 
Lotfi Zadeh (Zadeh, 1965). In classical logic, the 
only possible truth-values are true and false. In Fuzzy 
Logic, however, more truth-values are possible beyond 
the simple true and false. Fuzzy logic, then, derived 
from fuzzy set theory, is designed for situations where 
information is inexact and traditional digital on/off 
decisions are not possible. 

Fuzzy sets are an extension of classical set theory 
and are used in fuzzy logic. In classical set theory, 
membership of elements in relation to a set is assessed 
according to a clear condition; an element either belongs 
or does not belong to the set. By contrast, fuzzy set theory 
permits the gradual assessment of the membership of 
elements in relation to a set; this is described with the 
aid of a membership function μ → [0, 1]. An element 
mapped to the value '0' is not 
included in the given set, '1' describes a fully included 
member, and all values between 0 and 1 characterize 
the fuzzy members. For example, the continuous 
variable "Height" may have three membership functions, 
standing for the "Short", "Medium" and "Tall" categories. 
An object may belong to several categories with different 
membership degrees; e.g., a height of 180 cm may belong 
to both the "Medium" and "Tall" categories, each with a different 
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membership degree expressed by the range [0,1]. The 
membership degrees are returned from the membership 
functions. We can say that a man whose height is 180 
cm is "slightly medium" and a man whose height is 
200 cm is of "perfect tall" height. 
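
The Height example can be sketched with piecewise-linear membership functions. The breakpoints below (in centimetres) are assumptions chosen only so that 180 cm is weakly "Medium" and partly "Tall" while 200 cm is fully "Tall"; the source does not give the actual functions.

```python
def falling(x, lo, hi):
    """Membership that is 1 up to lo, decreases linearly, and is 0 from hi on."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)

def rising(x, lo, hi):
    """Membership that is 0 up to lo, increases linearly, and is 1 from hi on."""
    return 1.0 - falling(x, lo, hi)

# Hypothetical breakpoints in centimetres.
def short(h):
    return falling(h, 155, 168)

def medium(h):
    return min(rising(h, 155, 168), falling(h, 170, 182))

def tall(h):
    return rising(h, 170, 190)

degrees_180 = {"Short": short(180), "Medium": medium(180), "Tall": tall(180)}
degrees_200 = {"Short": short(200), "Medium": medium(200), "Tall": tall(200)}
```

Changing the breakpoints changes the returned degrees, which is the arbitrariness of membership functions discussed in the text.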

Different membership functions might represent 
different membership degrees. Having several 
possibilities for membership functions is part of the theoretical 
and practical drawbacks of Zadeh's model. There is no 
"right way" to determine the right membership functions 
(Mitaim & Kosko, 2001). Thus, a membership function 
may be considered arbitrary and subjective. 

In the current work, we make use of membership 
functions to develop the enhanced similarity calculation 
for use in classification of fuzzy data. 

FUZZY SIMILARITY REPRESENTATION 

Standard Binary Representation exhibits data integrity 
in that it is precise and preserves data accuracy without 
either loss of information or rounding of any value. However, 
its mutual exclusiveness assumption causes the "isolation" 
of each value. This is acceptable for handling discrete data 
values. In dealing with a continuous attribute, 
e.g. Height, however, we want to assume that height 1.55 is 
closer to 1.56 than to 1.60. When converting such 
values into a mutually exclusive binary representation 
(Table 1), we lose these basic numerical relations. 
The similarity measure between any pair of entities with 
different attribute values is always 0, no matter how similar the 
attribute values are. This drawback makes the standard 
binary representation unattractive for representing and 
handling continuous data types. 

Similarity between attribute values is also needed for 
nominal and ordinal data. For example, the color "red" 
(nominal value) is more similar to the color "purple" 
than it is to the color "yellow". In ranking 
(questionnaires), a "1" satisfaction rank (ordinal variable) might 
be closer to the "2" rank than to the "5" rank. 

The absence of these similarity "intuitions" is of 
paramount importance in classification and may indeed 
cause inaccuracies in classification results. 

The following sections present a model that adds 
relative similarity values to the data representation. This 
serves to empower the binary representation to better 
handle both continuous and fuzzy data and improves 
classification results for all attribute types. 

Model for Fuzzy Similarity 
Representation 

In standard binary representation, each attribute (which 
may have several values, e.g., color: red, blue, green, 
etc.) is a vector of bits where only one bit is set to "1" 
and all others are set to "0". The "1" bit stands for 
the actual value of the attribute. In the Fuzzy Binary 
Representation, the zero bits are replaced by relative 
similarity values. 

The Fuzzy Binary Representation is viewed as a 
two-dimensional matrix that relates entities according to 
their attribute values. Having the rows represent entities 
and the columns represent possible values, entries in the 
matrix are fuzzy numbers in the range [0,1], indicating 
the similarity degree of a specific attribute value to the 
actual one, where '1' means full similarity to the actual 
value (this is the actual value), '0' means no similarity 
at all and all other values mean partial similarity. 

The following example illustrates the way for cre- 
ating the Fuzzy Binary Representation: Let's assume 
we have a database of five entities and two attributes 
represented in a binary representation as illustrated in 
Table 1. The fuzzy similarities between all attribute values 
are calculated (the next section describes the calculation 
process) and represented in a two-dimensional "Fuzzy 



Table 2. Fuzzy similarity matrixes of the marital status and height attributes 
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Similarity Matrix", wherein rows and columns stand 
for the different attributes' values, and the matrix cells 
contain the fuzzy similarity between the value pairs. 
The Fuzzy Similarity Matrix is symmetrical. Table 2 
illustrates fuzzy similarity matrixes for Marital Status 
and Height attributes. 

The Marital Status similarity matrix shows that 
the similarity between Single and Widowed is "high" 
(0.8), while there is no similarity between Single and 
Married (0). The Height similarity matrix shows that 
the similarity between 1.56 and 1.60 is 0.8 ("high" 
similarity), while the similarity between 1.55 and 
1.84 is 0 (not similar at all). These similarity matrixes 
can be calculated automatically, as is explained in the 
next section. 

Now, the zero values in the binary representation 
(Table 1) are replaced by the appropriate similarity 
values (Table 2). For example, in Table 1, we will replace 
the zero bit standing for Height 1.55 of the first entity 
with the fuzzy similarity between 1.55 and 1.60 (the 
actual attribute value), as indicated in the Height fuzzy 
similarity matrix (0.7). Table 3 illustrates the fuzzy 
representation obtained after such replacements. 
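
The replacement step can be sketched for the Height attribute as follows. The similarities 0.7 (1.55 vs 1.60), 0.8 (1.56 vs 1.60) and 0 (1.55 vs 1.84) are the ones quoted in the text; the remaining entries of the matrix are assumptions made only to complete the example, and the names are hypothetical.

```python
HEIGHTS = [1.55, 1.56, 1.60, 1.84]

# Symmetric fuzzy similarity matrix, stored once per unordered pair.
HEIGHT_SIMILARITY = {
    (1.55, 1.56): 0.9,  # assumed value
    (1.55, 1.60): 0.7,  # quoted in the text
    (1.55, 1.84): 0.0,  # quoted in the text
    (1.56, 1.60): 0.8,  # quoted in the text
    (1.56, 1.84): 0.0,  # assumed value
    (1.60, 1.84): 0.0,  # assumed value
}

def similarity(a, b):
    """Look up the fuzzy similarity of two height values."""
    if a == b:
        return 1.0
    return HEIGHT_SIMILARITY[(min(a, b), max(a, b))]

def fuzzy_row(actual):
    """Replace the one-hot row of `actual` by its similarity to every value."""
    return [similarity(actual, value) for value in HEIGHTS]

row_entity_1 = fuzzy_row(1.60)   # the first entity has height 1.60
```

The '1' stays at the position of the actual value, and every former zero now carries the similarity between that column's value and the actual one.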

It should be noted that the similarities indicated 
in the fuzzy similarity table relate to the similarity 
between the actual value of the attribute (e.g. 1.60 in 
entity 1) and the other attribute values (e.g. 1.55, 
1.56 and 1.84). 

Next, the fuzzy similarities, presented in decimal 
form, are converted into a binary format - the Fuzzy 
Binary Representation. The conversion should allow 
similarity indexes like Dice. 

To meet this requirement, each similarity value is 
represented by N binary bits, where N is determined 
by the required precision. For one-tenth precision, 10 
binary bits are needed; for one-hundredth precision, 
100 binary bits are needed. With ten-bit precision, the 
fuzzy similarity "0" will be represented by ten '0's, 
the fuzzy similarity "0.1" will be represented by nine 
'0's followed by one '1', the fuzzy similarity "0.2" will 
be represented by eight '0's followed by two '1's and 
so on up to the fuzzy similarity "1", which will be 
represented by ten '1's. Table 4 illustrates the conversion 
from fuzzy representation (Table 3) to fuzzy binary 
representation. 
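
The conversion rule described above can be sketched in a few lines. The function name is hypothetical; rounding to the nearest bit count is an assumption made to guard against floating-point error (for example, 0.7 * 10 is not exactly 7 in binary floating point).

```python
def to_fuzzy_bits(similarity, n_bits=10):
    """Encode a similarity in [0, 1] as n_bits characters:
    leading '0's followed by round(similarity * n_bits) '1's."""
    ones = round(similarity * n_bits)
    return "0" * (n_bits - ones) + "1" * ones

examples = {s: to_fuzzy_bits(s) for s in (0.0, 0.1, 0.2, 0.7, 1.0)}
```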

The Fuzzy Binary Representation illustrated in Table 
4 is suitable for all data types (discrete and continuous) 
and, with the new knowledge (fuzzy similarities values) 
it contains, a better classification is expected. 

The following section describes the process for 
similarity calculations necessary for this type of Fuzzy 
Binary Representation. 




Table 3. Fuzzy similarity table 




Table 4. Fuzzy binary representation table 
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Fuzzy Similarity Calculation 

Similarity calculation between the different attribute 
values is not a precise science, i.e., there is no one way 
to calculate it, just as there is no one way to develop 
membership functions in the Fuzzy Logic world. 

We suggest determining similarities according to 
the attribute type. A domain expert should evaluate 
similarity for nominal attributes like "Marital Status". 
For example, Single, Divorced and Widowed are 
considered "one person", while Married is considered as 
"two people". Therefore, Single may be more similar 
to Divorced and Widowed than it is to Married. On the 
other hand, a "Divorced" person was once married, 
so Divorced may be more similar to Married than to Single. 
In short, similarity is a relative rather than an absolute 
measure; there is hardly any known automatic way 
to calculate similarities for such attributes, and therefore 
a domain expert is needed. 

Similarity for ordinal data like a satisfaction rank can 
be calculated in the same way as for nominal or 
continuous attributes, depending on the nature of the attribute 
values. Similarity for continuous data like Height can 
be calculated automatically. Unlike nominal attributes, 
in continuous data there is an intuitive meaning to the 
"distance" between different values. For example, as 
regards the Height attribute, the distance between 
1.55 and 1.56 is smaller than the distance between 1.55 
and 1.70; therefore, the similarity is expected to be 
higher accordingly. For continuous data, an automatic 
method can be constructed, as shown below, to calculate the 
similarities. 

Depending on the problem domain, a continuous 
attribute can be divided into one or more fuzzy sets 
(categories), e.g., the Height attribute can be divided 
into three sets: Short, Medium and Tall. A membership 
function for each set can be developed. 

The calculated similarities depend on the specified 
membership functions; therefore, they are referred to 
here as fuzzy similarities. The following algorithm 
can be used for similarity calculations of continuous 
data: 

For each pair of attribute values (v1 and v2) 
    For each membership function F 
        Similarities (v1, v2) = 1 - distance between F(v1) and F(v2) 
    Similarity (v1, v2) = Maximum of the calculated Similarities 
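
A literal transcription of this algorithm is sketched below. It assumes that the "distance between F(v1) and F(v2)" is the absolute difference of the two membership degrees; the two membership functions and their breakpoints (in metres) are hypothetical.

```python
def fuzzy_similarity(v1, v2, membership_functions):
    """Maximum over all membership functions F of 1 - |F(v1) - F(v2)|."""
    return max(1.0 - abs(F(v1) - F(v2)) for F in membership_functions)

def clamp(x):
    """Keep a membership degree inside [0, 1]."""
    return max(0.0, min(1.0, x))

# Hypothetical membership functions for Height, in metres.
def short(h):
    return clamp((1.70 - h) / 0.20)

def tall(h):
    return clamp((h - 1.60) / 0.30)

FUNCTIONS = [short, tall]
s_same = fuzzy_similarity(1.60, 1.60, FUNCTIONS)
s_close = fuzzy_similarity(1.65, 1.68, FUNCTIONS)
s_far = fuzzy_similarity(1.55, 1.84, FUNCTIONS)
```

Read literally, the maximum rewards any set in which both values have the same membership degree, including a set to which neither value belongs (two degrees of 0 give 1 - 0 = 1). The text does not state how that case is treated, so the sketch leaves it as is.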



Now that we have discussed both a model for 
Fuzzy Binary Representation and a way to calculate 
similarities, we will show how the new knowledge (fuzzy 
similarities) added to the standard binary representation 
improves the similarity measures between different 
entities, as discussed in the next section. 



COMPARING STANDARD AND FUZZY 
SIMILARITIES 

In this section, we compare standard and fuzzy similari- 
ties. The similarities were calculated according to the 
Dice index for the example represented in Table 4. 

Table 5 combines similarities of the different entities 
related to (a) Marital Status (nominal), to (b) Height 
(continuous) and to (c) both the Marital Status and 
Height attributes. 
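
The difference between the two representations can be checked on the Height attribute for an entity of height 1.55 and an entity of height 1.56. The sketch below assumes ten-bit precision and the Dice index on the resulting bit vectors; the similarity 0.9 between 1.55 and 1.56 and the similarity 0 between 1.56 and 1.84 are assumptions, while 0.7, 0.8 and the 0 between 1.55 and 1.84 are quoted in the text.

```python
def encode(row, n_bits=10):
    """Turn a row of similarities into a flat list of bits,
    each value as leading 0s followed by round(s * n_bits) 1s."""
    bits = []
    for s in row:
        ones = round(s * n_bits)
        bits.extend([0] * (n_bits - ones) + [1] * ones)
    return bits

def dice(a, b):
    """Dice index of two equal-length 0/1 vectors."""
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0

# Columns: heights 1.55, 1.56, 1.60, 1.84.
standard_155 = [1.0, 0.0, 0.0, 0.0]
standard_156 = [0.0, 1.0, 0.0, 0.0]
fuzzy_155 = [1.0, 0.9, 0.7, 0.0]   # 0.9 is an assumed value
fuzzy_156 = [0.9, 1.0, 0.8, 0.0]   # 0.9 and the final 0.0 are assumed values

standard_dice = dice(encode(standard_155), encode(standard_156))
fuzzy_dice = dice(encode(fuzzy_155), encode(fuzzy_156))
```

With the standard rows the two entities share no '1' bits, so their Dice index is 0. With the fuzzy rows they share 25 of 26 and 27 set bits, giving 50/53, about 0.94.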

Several points and findings arise from the repre- 
sentations shown above (Table 5). These are briefly 
highlighted below: 

1. In our small example, a nominal attribute (Marital 
Status) represented in standard binary representation 
cannot be used for classification. In contrast, 
the Fuzzy Binary Representation, with its large 
diversity of similarity results, enables better 
classification. Grouping entities with a similarity 
that is equal to or greater than 0.7 yields a class 
of entities 2, 3, 4 and 5, which represent Single, 
Divorced and Widowed, all belonging to the set 
"one person". 

2. For a continuous attribute (Height) represented in 
the standard binary representation, classification is 
not possible. In contrast, the Fuzzy Binary 
Representation, with its diversity of similarity results, 
once again enables better classification. Entities 
1 and 5 have absolute similarity (1), since for the 
Height attribute they are identical. Entities 2 and 
4 (similarity = 0.94) are very similar, since they 
represent the almost identical heights of 1.55 
and 1.56, respectively. Classification based on 
these two entities is possible due to the diversity of 
similarities. 

3. The same phenomena presented for a single 
attribute (Marital Status or Height) also exist for 
both attributes (Marital Status + Height) when 
taken together. Similarity greater than 0.8 is 
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Table 5. Entities similarity 
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used to group entities 2, 4 and 5, which represent 
"one person" around 1.56 meters height. 

Two important advantages of the novel Fuzzy Binary 
Representation detailed in the current work over 
the standard binary representation are suggested: (1) 
it is practically suitable for all attribute types, and (2) it 
improves classification results. 



FUTURE TRENDS 

The current work improves classification by adding new 
similarity knowledge to the standard representation of 
data. Further research can be conducted to calculate the 
interrelationship between the different attributes, i.e., 
the cross-similarities among attributes such as marital 
status and height. Understanding such interrelationships 
might further serve to refine the classification and data 
mining results. 

Another worthwhile research direction is helping 
the human domain expert to get the "right" similarities, 
and thus choose the "right" membership functions. A 
Decision Support System may provide a way in which 
to structure the similarity evaluation of the expert and 
make his/her decisions less arbitrary. 



CONCLUSION 

In the current paper, the problems of representing and 
classifying data in databases were addressed. The focus 
was on Binary Databases, which have been shown in 
recent years to have an advantage in classification and 
data mining. Novel aspects for representing fuzziness 
were shown and a measure of similarity for fuzzy 
data was developed and described. Such measures are 
required, as similarity calculations are at the heart of 
any classification algorithm. Classification examples 
were illustrated. 

The evaluation of similarity measures shows that the standard binary representation is useless when dealing with continuous attributes for classification. The Fuzzy Binary Representation remedies this drawback and yields promising classification based on continuous data attributes. In addition, adding fuzzy similarity was also shown to be useful for regular (nominal, ordinal) data to ensure better classification. In summary, fuzzy representation improves classification results for all attribute types.



KEY TERMS

Classification: The partitioning of a data set into 
subsets, so that the data in each subset (ideally) share 
some common traits - often proximity according to 
some defined similarity/distance measure. 

Data Mining: The process of automatically search- 
ing large volumes of data for patterns, using tools such 
as classification, association rule mining, clustering, 
etc. 

Database Binary Representation: A representation where a database is viewed as a two-dimensional matrix that relates entities (rows) to attribute values (columns). Entries in the matrix are either '1' or '0', indicating that a given entity has or lacks a given value.

Fuzzy Logic: An extension of Boolean logic dealing 
with the concept of partial truth. Fuzzy logic replaces 
Boolean truth values (0 or 1 , black or white, yes or no) 
with degrees of truth. 

Fuzzy Set: An extension of classical set theory. Fuzzy set theory, used in Fuzzy Logic, permits the gradual assessment of the membership of elements in relation to a set.

Membership Function: The mathematical function 
that defines the degree of an element's membership in 
a fuzzy set. Membership functions return a value in the 
range of [0,1], indicating membership degree. 

Similarity: A numerical estimate of the difference 
or distance between two entities. The similarity values 
are in the range of [0,1], indicating similarity degree. 






Harmony Search for Multiple Dam Scheduling 



Zong Woo Geem 

Johns Hopkins University, USA 




INTRODUCTION

A dam is a wall that holds water in, and the operation of multiple dams is a complicated decision-making process that can be posed as an optimization problem (Oliveira & Loucks, 1997). Traditionally, researchers have used mathematical optimization techniques with a linear programming (LP) or dynamic programming (DP) formulation to find the schedule.

However, most of the mathematical models are valid only for simplified dam systems. Accordingly, during the past decade, some metaheuristic techniques, such as the genetic algorithm (GA) and simulated annealing (SA), have gathered great attention among dam researchers (Chen, 2003) (Esat & Hall, 1994) (Wardlaw & Sharif, 1999) (Kim, Heo & Jeong, 2006) (Teegavarapu & Simonovic, 2002).

Lately, another metaheuristic algorithm, harmony search (HS), has been developed (Geem, Kim & Loganathan, 2001) (Geem, 2006a) and applied to various artificial intelligence problems, such as music composition (Geem & Choi, 2007) and the Sudoku puzzle (Geem, 2007).

The HS algorithm has also been applied to various engineering problems such as structural design (Lee & Geem, 2004), water network design (Geem, 2006b), soil stability analysis (Li, Chi & Chu, 2006), satellite heat pipe design (Geem & Hwangbo, 2006), offshore structure design (Ryu, Duggal, Heyl & Geem, 2007), grillage system design (Erdal & Saka, 2006), and hydrologic parameter estimation (Kim, Geem & Kim, 2001). The HS algorithm could be a competent alternative to existing metaheuristics such as GA because the former overcomes drawbacks of the latter, such as the building block theory (Geem, 2006a).

To test the ability of the HS algorithm in the multiple dam operation problem, this article introduces an HS model, applies it to a benchmark system, and then compares the results with those of a previously developed GA model.



BACKGROUND

Before this study, various researchers tackled the dam scheduling problem using phenomenon-inspired techniques.

Esat and Hall (1994) introduced a GA model for dam operation. They compared GA with the discrete differential dynamic programming (DDDP) technique. GA could overcome the drawback of DDDP, whose computing burden increases exponentially. Oliveira and Loucks (1997) proposed practical dam operating policies using an enhanced GA (real-coded chromosome, elitism, and arithmetic crossover). Wardlaw and Sharif (1999) tried other enhanced GA schemes and concluded that the best GA model for dam operation can be composed of real-value coding, tournament selection, uniform crossover, and modified uniform mutation. Chen (2003) developed a real-coded GA model for long-term dam operation, and Kim et al. (2006) applied an enhanced multi-objective GA, named NSGA-II, to a real-world multiple dam system. Teegavarapu and Simonovic (2002) used another metaheuristic algorithm, simulated annealing (SA), to solve the dam operation problem.

Although several metaheuristic algorithms have already been applied to the dam scheduling problem, the recently developed HS algorithm had not been applied to it before. Thus, this article deals with the pioneering application of the HS algorithm to the problem.



HARMONY SEARCH MODEL AND 
APPLICATION 

This article presents two major parts. The first part explains the structure of the HS model, and the second part applies the HS model to a benchmark problem.

Dam Scheduling Model Using HS

The HS model has the following formulation for multiple dam scheduling.

Maximize the benefits obtained by hydropower generation and irrigation

Subject to the following constraints:

1. Range of Water Release: the amount of water released from each dam should lie between the minimum and maximum amounts.

2. Range of Dam Storage: the amount of storage in each dam should lie between the minimum and maximum amounts.

3. Water Continuity: the amount of dam storage in the next stage should equal the amount in the current stage plus the amount of inflow minus the amount of water released.

The HS algorithm starts by filling the harmony memory (HM) with random scheduling vectors. The structure of the HM for dam scheduling is as follows:

$$
\mathbf{HM} =
\begin{bmatrix}
R_1^{1} & R_2^{1} & \cdots & R_N^{1} & Z(\mathbf{R}^{1}) \\
R_1^{2} & R_2^{2} & \cdots & R_N^{2} & Z(\mathbf{R}^{2}) \\
\vdots & \vdots & & \vdots & \vdots \\
R_1^{HMS} & R_2^{HMS} & \cdots & R_N^{HMS} & Z(\mathbf{R}^{HMS})
\end{bmatrix}
\tag{1}
$$

Each row stands for one solution vector, and each column stands for one decision variable (the water release amount at each stage in each dam). The objective function value is located at the end of each row. HMS (harmony memory size) is the number of solution vectors in the HM.

Based on the initial HM, a new schedule can be generated with the following function:

$$
R_i^{NEW} \leftarrow
\begin{cases}
R_i \in [R_i^{\min}, R_i^{\max}] & \text{w.p. } p_1 \\
R_i(k) \in \{R_i^{1}, R_i^{2}, \ldots, R_i^{HMS}\} & \text{w.p. } p_2 \\
R_i(k) + \Delta & \text{w.p. } p_3 \\
R_i(k) - \Delta & \text{w.p. } p_4
\end{cases}
\tag{2}
$$

where $R_i^{NEW}$ is the new water release amount for decision variable $i$; the first row on the right-hand side means that the new amount is chosen randomly from the total range; the second row means that the new amount is chosen from the HM; and the third and fourth rows mean that the new amount is a certain unit ($\Delta$) higher or lower than the original amount $R_i(k)$ obtained from the HM. The probabilities sum to one ($p_1 + p_2 + p_3 + p_4 = 1$).

If the newly generated vector, $\mathbf{R}^{NEW}$, is better than the worst harmony in the HM in terms of the objective function, the new harmony is included in the HM and the existing worst harmony is excluded from the HM.

If the HS model reaches MaxImp (the maximum number of function evaluations), the computation is terminated. Otherwise, another new harmony (= vector) is generated using one of the three above-mentioned mechanisms.

Applying HS to a Benchmark Problem

The HS model was applied to a popular multiple dam system, shown in Figure 1 (Wardlaw & Sharif, 1999).

Figure 1. Schematic of four dam system

The problem has 12 two-hour operating periods, and only dam 4 has an irrigation benefit because the outflows of the other dams are not directed to farms. The range of water releases is as follows:

$$
0.0 \le R_1 \le 3, \quad 0.0 \le R_2, R_3 \le 4, \quad 0.0 \le R_4 \le 7 \tag{3}
$$

The range of dam storages is as follows:

$$
0.0 \le S_1, S_2, S_3 \le 10, \quad 0.0 \le S_4 \le 15 \tag{4}
$$

The initial and final storage conditions are as follows:

$$
S_1(0), S_2(0), S_3(0), S_4(0) = 5 \tag{5}
$$

$$
S_1(12), S_2(12), S_3(12) = 5, \quad S_4(12) = 7 \tag{6}
$$

Table 1. One example of optimal schedules by HS

Time    Dam 1   Dam 2   Dam 3   Dam 4
0       1.0     4.0     0.0     0.0
1       0.0     1.0     0.0     2.0
2       0.0     2.0     4.0     7.0
3       2.0     0.0     4.0     7.0
4       3.0     3.0     4.0     7.0
5       3.0     3.0     4.0     7.0
6       3.0     4.0     4.0     7.0
7       3.0     4.0     4.0     7.0
8       3.0     4.0     4.0     7.0
9       3.0     4.0     4.0     7.0
10      3.0     4.0     4.0     0.0
11      0.0     3.0     0.0     0.0


There are only two inflows: 2 units to dam 1 and 3 units to dam 2.

$$
I_1 = 2, \quad I_2 = 3 \tag{7}
$$

Figure 2. Water release trajectory in each dam


Wardlaw and Sharif (1999) tackled this dam scheduling problem using an enhanced GA model (Population Size = 100; Crossover Rate = 0.70; Mutation Rate = 0.02; Number of Generations = 500; Number of Function Evaluations = 35,000; Binary, Gray, & Real-Value Representations; Tournament Selection; One-Point, Two-Point, & Uniform Crossovers; and Uniform and Modified Uniform Mutations). The GA model found a best near-optimal solution of 400.5, which is 99.8% of the global optimum (401.3).

The HS model was applied to the same problem with the following algorithm parameters: HMS = 30; HMCR = 0.95; PAR = 0.05; and MaxImp = 35,000. The HS model found five different global optimal solutions (HS1 - HS5) with an identical benefit of 401.3. Table 1 shows one example out of the five optimal water release schedules.

Figure 2 shows corresponding release trajectories 
in all dams. 
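
The schedule in Table 1 can be checked against constraints (3) to (7) with a short simulation. The sketch below is illustrative only and rests on one assumption that the text does not state, because Figure 1 is not reproduced here: dam 1 is taken to release into dam 4, and dam 2 into dam 3, which in turn releases into dam 4. The names `UPSTREAM`, `simulate` and `is_feasible` are hypothetical.

```python
# Release schedule from Table 1: rows are periods 0..11, columns are dams 1..4.
RELEASES = [
    [1.0, 4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 2.0, 4.0, 7.0],
    [2.0, 0.0, 4.0, 7.0],
    [3.0, 3.0, 4.0, 7.0],
    [3.0, 3.0, 4.0, 7.0],
    [3.0, 4.0, 4.0, 7.0],
    [3.0, 4.0, 4.0, 7.0],
    [3.0, 4.0, 4.0, 7.0],
    [3.0, 4.0, 4.0, 7.0],
    [3.0, 4.0, 4.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]
INFLOW = [2.0, 3.0, 0.0, 0.0]          # Eq. (7): natural inflows to dams 1 and 2
R_MAX = [3.0, 4.0, 4.0, 7.0]           # Eq. (3): release upper bounds
S_MAX = [10.0, 10.0, 10.0, 15.0]       # Eq. (4): storage upper bounds
S_START = [5.0, 5.0, 5.0, 5.0]         # Eq. (5)
S_END = [5.0, 5.0, 5.0, 7.0]           # Eq. (6)

# ASSUMPTION (not in the text): which dams release into which (0-based indices).
UPSTREAM = {0: [], 1: [], 2: [1], 3: [0, 2]}


def simulate(releases, start):
    """Apply water continuity period by period and return all storage states."""
    states = [list(start)]
    for r in releases:
        prev = states[-1]
        nxt = []
        for d in range(len(prev)):
            inflow = INFLOW[d] + sum(r[u] for u in UPSTREAM[d])
            nxt.append(prev[d] + inflow - r[d])
        states.append(nxt)
    return states


def is_feasible(releases, states):
    """Check release bounds, storage bounds and the final storage condition."""
    releases_ok = all(0.0 <= x <= hi for r in releases for x, hi in zip(r, R_MAX))
    storage_ok = all(0.0 <= s <= hi for st in states for s, hi in zip(st, S_MAX))
    return releases_ok and storage_ok and states[-1] == S_END


states = simulate(RELEASES, S_START)
print(states[-1], is_feasible(RELEASES, states))
```

Under the assumed topology the schedule ends at storages of 5, 5, 5 and 7 units and never leaves the storage bounds, so it satisfies Eqs. (3) to (6). The objective value itself is not recomputed, because the benefit coefficients are not given in this article.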

When the HS model was further tested with different algorithm parameter values, it found a better solution than that of the GA model (400.5) in seven cases out of eight.
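
To make the roles of the algorithm parameters concrete, one improvisation step following Eq. (2) and the subsequent memory update can be sketched as below. This is a minimal sketch, not the author's implementation. It assumes the common mapping of HMCR and PAR onto the four probabilities ($p_1 = 1 - HMCR$, $p_2 = HMCR(1 - PAR)$, $p_3 = p_4 = HMCR \cdot PAR / 2$), which the text does not spell out, and it clips adjusted values to the release range, which is also an assumption. The function names are hypothetical.

```python
import random


def improvise(hm, lower, upper, hmcr=0.95, par=0.05, delta=1.0, rng=random):
    """Generate one new release vector, one decision variable at a time."""
    new = []
    for i in range(len(lower)):
        if rng.random() >= hmcr:
            # Random selection from the total range (probability 1 - HMCR).
            value = rng.uniform(lower[i], upper[i])
        else:
            # Memory consideration: reuse the i-th value of a stored harmony.
            value = rng.choice(hm)[i]
            if rng.random() < par:
                # Pitch adjustment: move one unit up or down with equal chance.
                value += delta if rng.random() < 0.5 else -delta
                # ASSUMPTION: keep the adjusted value inside the allowed range.
                value = min(max(value, lower[i]), upper[i])
        new.append(value)
    return new


def update_memory(hm, scores, new, new_score):
    """Replace the worst stored harmony if the new one has a higher benefit."""
    worst = min(range(len(hm)), key=lambda j: scores[j])
    if new_score > scores[worst]:
        hm[worst] = new
        scores[worst] = new_score
        return True
    return False
```

A full run would repeat `improvise`, evaluate the benefit of the new vector, and call `update_memory` until MaxImp evaluations have been spent. Constraint handling for storage bounds is left out of the sketch.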



FUTURE TRENDS 

Building on the success of this study, future HS models should consider more complex dam scheduling problems with various real-world situations.

Also, algorithm parameter guidelines obtained from extensive experiments on the parameter values will be helpful to engineers in practice, because metaheuristic algorithms, including HS and GA, require many trials to find the best algorithm parameters.



CONCLUSION 

The music-inspired algorithm, HS, was successfully applied to the optimal scheduling problem of the multiple dam system, outperforming the results of GA. While the GA model obtained near-optimal solutions, the HS model found five different global optima under the same number of function evaluations.

Moreover, the HS model did not perform a sensitivity analysis of algorithm parameters, while the GA model tested many parameter values and different operation schemes. This could reduce the time and trouble of choosing parameter values in HS.



KEY TERMS

Evolutionary Computation: Solution approach 
guided by biological evolution, which begins with 
potential solution models, then iteratively applies al- 
gorithms to find the fittest models from the set to serve 
as inputs to the next iteration, ultimately leading to a 
model that best represents the data. 

Genetic Algorithm: Technique to search for exact or approximate solutions of optimization or search problems by using evolution-inspired phenomena such as selection, crossover, and mutation. The genetic algorithm is classified as a global search algorithm.

Harmony Search: Technique to search for exact or approximate solutions of optimization or search problems by using a music-inspired phenomenon (improvisation). Harmony search has three major operations: random selection, memory consideration, and pitch adjustment. Harmony search is classified as a global search algorithm.

Metaheuristics: Technique to find solutions by combining black-box procedures (heuristics). Here, 'meta' means 'beyond', and 'heuristic' means 'to find'.

Multiple Dam Scheduling: Process of developing an individual dam schedule in a multiple dam system. The schedule contains the water release amount at each time period while satisfying release limit, storage limit, and continuity conditions.

Optimization: Process of seeking to optimize 
(minimize or maximize) an objective function while 
satisfying all problem constraints by choosing the 
values of continuous or discrete variables. 

Soft Computing: Collection of computational techniques in computer science, especially in artificial intelligence, such as fuzzy logic, neural networks, chaos theory, and evolutionary algorithms.



INTRODUCTION

Neuro-fuzzy systems [Jang, 1997] [Abraham, 2005] are hybrid systems that combine the learning capacity of neural nets [Haykin, 1999] with the linguistic interpretation of fuzzy inference systems [Ross, 2004]. These systems have been evaluated quite intensively in machine learning tasks. This is mainly due to a number of factors: the applicability of learning algorithms developed for neural nets; the possibility of promoting implicit and explicit knowledge integration; and the possibility of extracting knowledge in the form of fuzzy rules. Most of the well-known neuro-fuzzy systems, however, present limitations regarding the number of inputs allowed or the limited (or nonexistent) ability to create their own structure and rules [Nauck, 1997] [Nauck, 1998] [Vuorimaa, 1994] [Zhang, 1995].

This paper describes a new class of neuro-fuzzy models, called Hierarchical Neuro-Fuzzy BSP Systems (HNFB). These models employ BSP (Binary Space Partitioning) of the input space [Chrysanthou, 1996] and have been developed to bypass the traditional drawbacks of neuro-fuzzy systems. This paper introduces the HNFB models based on supervised learning algorithms. These models were evaluated in many benchmark applications related to classification and time-series forecasting. A second paper, entitled Hierarchical Neuro-Fuzzy Systems Part II, focuses on hierarchical neuro-fuzzy models based on reinforcement learning algorithms.



BACKGROUND

Hybrid Intelligent Systems conceived by using techniques such as Fuzzy Logic and Neural Networks have been applied in areas where traditional approaches were unable to provide satisfactory solutions. Many researchers have attempted to integrate these two techniques by generating hybrid models that combine their advantages and minimize their limitations and deficiencies. With this objective, hybrid neuro-fuzzy systems [Jang, 1997] [Abraham, 2005] have been created.

Traditional neuro-fuzzy models, such as ANFIS [Jang, 1997], NEFCLASS [Nauck, 1997] and FSOM [Vuorimaa, 1994], have a limited capacity for creating their own structure and rules [Souza, 2002a]. Additionally, most of these models employ grid partitioning of the input space, which, due to the rule explosion problem, is more adequate for applications with a smaller number of inputs. When a greater number of input variables is necessary, the system's performance deteriorates.

Thus, Hierarchical Neuro-Fuzzy Systems have been devised to overcome these basic limitations. Different models of this class of neuro-fuzzy systems have been developed, based on supervised learning techniques.



HIERARCHICAL NEURO-FUZZY 
SYSTEMS 

This section presents the new class of neuro-fuzzy systems that are based on hierarchical partitioning.

Two subsets of hierarchical neuro-fuzzy systems (HNF) have been developed, according to the learning process used: supervised learning models (HNFB [Souza, 2002b] [Vellasco, 2004], HNFB⁻¹ [Gonçalves, 2006], HNFB-Mamdani [Bezerra, 2005]); and reinforcement learning models (RL-HNFB [Figueiredo, 2005a], RL-HNFP [Figueiredo, 2005b]). The focus of this paper is on the first subset of models, which are described in the following sections.



HIERARCHICAL NEURO-FUZZY BSP
MODEL

Basic Neuro-Fuzzy BSP Cell 

An HNFB cell is a neuro-fuzzy mini-system that 
performs fuzzy binary partitioning of the input 
space. The HNFB cell generates a crisp output after a 
defuzzification process. 

Figure 1(a) illustrates the cell's functionality, where $x$ represents the input variable; $\rho(x)$ and $\mu(x)$ are the membership functions low and high, respectively, which generate the antecedents of the two fuzzy rules; and $y$ is the crisp output. The linguistic interpretation of the mapping implemented by the HNFB cell is given by the following rules:

If $x \in \rho$ then $y = d_1$.

If $x \in \mu$ then $y = d_2$.



Each rule corresponds to one of the two partitions 
generated by BSP. Each partition can in turn be subdi- 
vided into two parts by means of another HNFB cell. 

The profiles of membership functions $\rho(x)$ and $\mu(x)$ are complementary logistic functions.

The output $y$ of an HNFB cell (defuzzification process) is given by the weighted average. Because the membership function $\rho(x)$ is the complement to 1 of the membership function $\mu(x)$, the following equation applies:

$$
y = \rho(x) * d_1 + \mu(x) * d_2 \quad \text{or} \quad y = \sum_{i=1}^{2} \alpha_i d_i \tag{1}
$$

where $\alpha_i$ symbolizes the firing level of the rule in partition $i$, given by $\alpha_1 = \rho(x)$ and $\alpha_2 = \mu(x)$. Each $d_i$ corresponds to one of the three possible consequents below:

A singleton: The case where $d_i$ = constant.

A linear combination of the inputs:

$$
d_i = \sum_{k=1}^{n} w_k x_k + w_0
$$

where $x_k$ is the system's $k$-th input; $w_k$ represents the weight associated with input $x_k$; $n$ is equal to the total number of inputs; and $w_0$ corresponds to a constant value.

The output of a stage of a previous level: The case where $d_i = y_j$, where $y_j$ represents the output of a generic cell $j$, whose value is also calculated by Eq. (1).
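
The cell described above can be sketched in a few lines. The sketch below is illustrative: the logistic profile with a slope `a` and a center `b` is an assumed parameterization (the text only says the two functions are complementary logistic functions), and `make_cell` is a hypothetical helper. A consequent may be a constant (singleton) or another cell, which covers the first and third kinds of consequent listed above.

```python
import math


def make_cell(var, d1, d2, a=1.0, b=0.0):
    """Build an HNFB-style cell that partitions input number `var`.

    d1 and d2 are the consequents of the 'low' and 'high' rules. Each one is
    either a number (singleton) or another cell (output of a previous stage).
    The slope `a` and center `b` of the logistic profile are assumptions.
    """
    def cell(inputs):
        mu = 1.0 / (1.0 + math.exp(-a * (inputs[var] - b)))  # 'high'
        rho = 1.0 - mu                                        # 'low'
        c1 = d1(inputs) if callable(d1) else d1
        c2 = d2(inputs) if callable(d2) else d2
        return rho * c1 + mu * c2                             # Eq. (1)
    return cell


# A two-level example: the 'low' partition of the root is refined by a sub-cell.
inner = make_cell(1, 2.0, 4.0)
root = make_cell(0, inner, 8.0)
print(root([0.0, 0.0]))
```

Because the two membership degrees always sum to one, the output of a cell with singleton consequents stays between `d1` and `d2`.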

HNFB Architecture 

An HNFB model may be described as a system that is made up of interconnections of HNFB cells. Figure 1(b) illustrates an HNFB system along with the respective partitioning of the input space. In this system, the initial partitions 1 and 2 ('BSP0' cell) have been subdivided; hence, the consequents of its rules are the outputs of BSP1 and BSP2, respectively. In turn, these subsystems have, as consequents, values $d_{11}$, $y_{12}$, $d_{21}$ and $d_{22}$, respectively. Consequent $y_{12}$ is the output of the 'BSP12' cell. The output of the system in Figure 1(b) is given by equation (2).

$$
y = \alpha_1 \left( \alpha_{11} d_{11} + \alpha_{12} \left( \alpha_{121} d_{121} + \alpha_{122} d_{122} \right) \right) + \alpha_2 \left( \alpha_{21} d_{21} + \alpha_{22} d_{22} \right) \tag{2}
$$

It must be stressed that, although each BSP cell divides the input space into only two fuzzy sets (low and high), the complete HNFB architecture divides the universe of discourse of each variable into as many partitions as necessary. The number of partitions is determined during the learning process. In Figure 1(c), for instance, the upper left part of the input space (partition 12 in gray) has been further subdivided by the horizontal variable x, resulting in three fuzzy sets for the complete universe of discourse of this specific variable.

Learning Algorithm 

The HNFB system has a training algorithm based on the gradient descent method for learning the structure of the model and, consequently, the linguistic rules. The parameters that define the profiles of the membership functions of the antecedents and consequents are regarded as fuzzy weights of the neuro-fuzzy system.

In order to prevent the structure from growing indefinitely, a non-dimensional parameter, named decomposition rate ($\delta$), was created. More details of this algorithm may be found in [Souza, 2002b] [Gonçalves, 2006].

Figure 1. (a) Interior of Neuro-Fuzzy BSP cell, (b) Example of HNFB system, (c) Input space partitioning of the HNFB system

The results obtained in classification and time 
series forecasting problems are presented in the Case 
Studies section. 



HIERARCHICAL NEURO-FUZZY BSP
FOR CLASSIFICATION

The original HNFB provides very good results for function approximation and time series forecasting. However, it is not ideal for pattern classification applications, since it has only one output and makes use of the Takagi-Sugeno inference method [Takagi, 1985], which reduces the interpretability of the rule base.

Therefore, a new hierarchical neuro-fuzzy BSP model dedicated to pattern classification and rule extraction, called the Inverted HNFB or HNFB⁻¹, has been developed, which is able to extract classification rules such as: If x is A and y is B then input-pattern belongs to class Z. This new hierarchical neuro-fuzzy model is called inverted because it applies the learning process of the original HNFB to generate the model's structure. After this first learning phase, the structure is inverted and the architecture of the HNFB⁻¹ model is obtained. The basic cell of this new inverted structure is described below.

Basic Inverted-HNFB Cell 

Similarly to the original HNFB model, a basic Inverted-HNFB cell is a neuro-fuzzy mini-system that performs fuzzy binary partitioning in a particular space according to the same membership functions $\rho$ and $\mu$. However, after a defuzzification process, the Inverted-HNFB cell generates two crisp outputs instead of one. Fig. 2(a) illustrates the interior of the Inverted-HNFB cell.

Considering that the membership functions are complementary, the outputs of an HNFB⁻¹ cell are given by $y_1 = \beta * \rho(x)$ and $y_2 = \beta * \mu(x)$, where $\beta$ corresponds to one of the two possible cases below:

$\beta$ is the input of the first cell: in this case $\beta = 1$.

$\beta$ is the output of a cell of a previous level: in this case $\beta = y_j$, where $y_j$ represents one of the two outputs of a generic cell $j$.
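
The two-output cell can be sketched in the same way as the original cell. The example below is a minimal sketch under the assumption of a logistic profile with slope `a` and center `b` (these parameter names are not from the text), and `inverted_cell` is a hypothetical function name. It also shows how the output of one cell is passed on as the input $\beta$ of a cell at the next level.

```python
import math


def inverted_cell(x, beta=1.0, a=1.0, b=0.0):
    """Return the two crisp outputs (beta * rho(x), beta * mu(x)) of a cell."""
    mu = 1.0 / (1.0 + math.exp(-a * (x - b)))  # membership in 'high'
    rho = 1.0 - mu                              # membership in 'low'
    return beta * rho, beta * mu


# First cell: beta = 1. Its 'high' output feeds a cell of the next level.
y_low, y_high = inverted_cell(0.4)
y_high_low, y_high_high = inverted_cell(-1.2, beta=y_high)
print(y_low + y_high_low + y_high_high)
```

Since $\rho$ and $\mu$ are complementary, the two outputs of any cell add up to its input $\beta$, so the leaf outputs of a whole inverted tree add up to one (up to floating-point rounding).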

Inverted-HNFB Architecture 

Fig. 2(b) presents an example of the original HNFB architecture obtained during the training phase of a database containing three distinct classes, while Fig. 2(c) shows how the HNFB⁻¹ model is obtained after the inversion process.

In the HNFB⁻¹ architecture shown in Fig. 2(c), it may be observed that the classification system has several outputs ($y_1$ to $y_5$), one for each existing leaf in the original HNFB architecture. The outputs of the leaf cells are calculated by means of the following equations (using complementary membership functions):

$$
y_1 = \rho_0 \cdot \rho_1 \tag{3}
$$

$$
y_2 = \rho_0 \cdot \mu_1 \cdot \rho_{12} \tag{4}
$$

$$
y_3 = \rho_0 \cdot \mu_1 \cdot \mu_{12} \tag{5}
$$

$$
y_4 = \mu_0 \cdot \rho_2 \tag{6}
$$

$$
y_5 = \mu_0 \cdot \mu_2 \tag{7}
$$

where $\rho_i$ and $\mu_i$ are the membership functions of cell BSP $i$.

HNFB⁻¹ System Outputs

After the inversion has been performed, the outputs are connected to T-conorm cells (OR operator) that define the classes (see Fig. 2(d)). The initial procedure for linking the leaf cells to the T-conorm neurons consists of connecting all leaf cells with all T-conorm neurons. Once these connections have been made, it is necessary to establish their weights. For the purpose of assigning these weights, a learning method based on the Least Mean Squares algorithm [Haykin, 1999] has been employed.

Figure 2. (a) HNFB⁻¹ basic cell, (b) Original HNFB architecture, (c) Inversion of the architecture shown in Fig. 2(b), (d) Connection of the inverted architecture to T-conorm cells

After the weights have been determined, the T-conorm operation (Limited Sum T-conorm operator [Ross, 2004]) is used for processing the output of the neuron. The final output of the HNFB⁻¹ system is specified by the highest output obtained among all the T-conorm neurons, determining the class to which the input pattern belongs.
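
The output stage just described can be illustrated with a short sketch. Two assumptions are made here that the text does not confirm: the Limited Sum T-conorm is taken to be the bounded sum, min(1, a + b), and each connection weight is taken to multiply the leaf output it carries. The weights are supplied by hand in the example rather than learned by Least Mean Squares, and the function names are hypothetical.

```python
def limited_sum(values):
    """Bounded-sum T-conorm over any number of arguments: min(1, sum)."""
    return min(1.0, sum(values))


def classify(leaf_outputs, weights):
    """Return (index of the winning class, outputs of all T-conorm neurons).

    weights[c][i] is the weight of the connection from leaf i to the
    T-conorm neuron of class c (ASSUMPTION: applied as a multiplier).
    """
    neuron_outputs = [
        limited_sum(w * y for w, y in zip(row, leaf_outputs))
        for row in weights
    ]
    winner = max(range(len(neuron_outputs)), key=lambda c: neuron_outputs[c])
    return winner, neuron_outputs


# Three leaves, three classes, one leaf per class (hand-picked weights).
label, outputs = classify(
    [0.1, 0.6, 0.3],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
print(label, outputs)
```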

Results obtained with the HNFB⁻¹ model, in different benchmark classification problems, are presented in the Case Studies section.



HIERARCHICAL NEURO-FUZZY BSP
MAMDANI

The Hierarchical Neuro-Fuzzy BSP Mamdani (HNFB-Mamdani), like HNFB⁻¹, was also developed to enhance the interpretability of hierarchical neuro-fuzzy systems. However, since HNFB⁻¹ is dedicated to classification problems, a more general model was devised. The HNFB-Mamdani employs the Mamdani inference method [Jang, 1997] in the rules' consequents, and can be applied in control systems, pattern classification, forecasting, and rule extraction.



HNFB-Mamdani Architecture 

The HNFB-Mamdani architecture is formed by the interconnection of HNFB⁻¹ cells in a binary tree structure and is divided into three basic modules: the input partitioning structure; the weighted connections from the binary structure leaf cells ($d_i$) to the T-conorm neurons ($T_i$); and the defuzzification process.

The first two modules are identical to the HNFB⁻¹ architecture, except that each T-conorm neuron is associated with a fuzzy set $M_i$ of the consequent. All leaf cells are connected to all T-conorm neurons. Each connection has an associated weight, whose value is also established by the Least Mean Squares algorithm.

The consequent of a fuzzy rule in the HNFB-Mamdani model is a fuzzy set represented by a triangular membership function. The total number of fuzzy sets associated with the output variable is specified by the user.

Defuzzification Method 

The defuzzification process selected for the HNFB-Mamdani model is the weighted average of the maximum values. Figure 3 illustrates the defuzzification process for a model with three output fuzzy sets.

Figure 3. Defuzzification process (leaf cells, T-conorm neurons, output fuzzy sets $M_1$, $M_2$, $M_3$, and defuzzification)






Hierarchical Neuro-Fuzzy Systems Part I 



The output $y_j$ is then calculated by Eq. (8).

$$
y_j = \frac{\sum_{i=1}^{n} \alpha_i^{j} * C_i}{\sum_{i=1}^{n} \alpha_i^{j}} \tag{8}
$$

where:

$y_j$: output of the HNFB-Mamdani for input pattern $j$;
$\alpha_i^{j}$: output value of the $i$-th T-conorm neuron ($T_i$) for input pattern $j$;
$C_i$: value in the universe of discourse of the output variable where the $M_i$ fuzzy set presents the maximum value;
$*$: product operator;
$n$: total number of fuzzy sets associated with the output variable.
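
As an illustration of the weighted average of the maximum values, the sketch below computes the crisp output for one input pattern from the T-conorm neuron outputs and the peak positions of the output fuzzy sets. The function name and the numbers in the example are hypothetical, and raising an error when no neuron fires is a choice made for this sketch, not something the text specifies.

```python
def defuzzify(alphas, centers):
    """Weighted average of the peak positions of the output fuzzy sets.

    alphas[i]  -- output of the i-th T-conorm neuron for the current pattern
    centers[i] -- point where the i-th output fuzzy set reaches its maximum
    """
    if len(alphas) != len(centers):
        raise ValueError("alphas and centers must have the same length")
    total = sum(alphas)
    if total == 0.0:
        # ASSUMPTION: an undefined output is reported rather than guessed.
        raise ValueError("no T-conorm neuron is active for this pattern")
    return sum(a * c for a, c in zip(alphas, centers)) / total


# Three triangular output sets peaking at 0, 5 and 10 (made-up values).
print(defuzzify([0.2, 0.5, 0.3], [0.0, 5.0, 10.0]))
```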



CASE STUDIES

In order to evaluate the performance of the supervised HNFB models, two benchmark classification databases and six load time series from utilities of the Brazilian electrical energy sector were selected.

Pattern Classification

Pattern classification aims to determine to which group of a pre-determined set an input pattern belongs. Two benchmark applications were selected among those most frequently employed in the area of machine learning. The results obtained with the proposed HNFB models were compared to the ones described in [Gonçalves, 2006]. In order to generate the training and test sets, the total set of patterns was randomly divided into two equal parts. Each of these two sets was alternately used either as a training or as a test set. Table 1 summarizes the average classification performance obtained with both test sets. The performance of the HNFB models is better than that of the other models, except for the HNFB-Mamdani case. Since HNFB-Mamdani is a general-purpose model, it tends to provide inferior results when compared to application-specific models, such as Inverted-HNFB and HNFB-Class. On the other hand, HNFB and HNFQ [Souza, 2002a] are also general-purpose models but still provide superior performance to HNFB-Mamdani. This is due to the Takagi-Sugeno inference method used by those models, which is usually more accurate than the Mamdani inference method [Bezerra, 2005]. The disadvantage of the Takagi-Sugeno method is its reduced interpretability.

Table 1. Comparison of the average classification performance

Model           Iris       Wine
NN              -          95.20 %
KNN             -          96.70 %
FSS             -          92.80 %
BSS             -          94.80 %
MFS1            -          97.60 %
MFS2            -          97.90 %
C4.5            94.00 %    -
FID3.1          96.00 %    -
NEFCLASS        96.00 %    -
HNFB1           98.67 %    97.80 %
HNFB2           98.67 %    97.80 %
HNFQ            98.67 %    98.88 %
HNFB-Inverted   98.67 %    99.44 %
HNFB-Class1     98.67 %    98.87 %
HNFB-Class2     97.33 %    98.88 %
HNFB-Mamdani    95.00 %    95.77 %

where: NN = nearest-neighbor, KNN = k-nearest-neighbor, FSS = nearest-neighbor with forward sequential selection of features, BSS = nearest-neighbor with backward sequential selection of features, MFS = Multiple Feature Subsets, C4.5, FID3.1, NEFCLASS, HNFB1 (fixed selection), HNFB2 (adaptive selection), HNFQ, Inverted-HNFB, HNFB-Class1 (fixed selection) and HNFB-Class2 (adaptive selection). References to all these models are provided in [Gonçalves, 2006].

Electric Load Forecasting

This experiment made use of data related to the monthly electric load of six utilities of the Brazilian electrical energy sector.

The results obtained with the HNFB models were compared with the Backpropagation algorithm, with statistical techniques such as Holt-Winters and Box & Jenkins, and with Bayesian Neural Nets (BNN) [Bishop, 1995], trained by Gaussian approximation and by the MCMC method. Table 2 presents the performance results in terms of the Mean Absolute Percentage Error (MAPE).

Table 2. Monthly load prediction errors (MAPE) for different models

Utility   HNFB-Mamdani  HNFB    Backpropagation  Box & Jenkins  Holt-Winters  BNN (Gaussian)  BNN (MCMC)
COPEL     1.77%         1.17%   1.57%            1.63%          1.96%         1.45%           1.16%
CEMIG     1.39%         1.12%   1.47%            1.67%          1.75%         1.29%           1.28%
LIGHT     2.41%         2.22%   3.57%            4.02%          2.73%         1.44%           2.23%
FURNAS    3.08%         3.76%   5.28%            5.43%          4.55%         1.33%           3.85%
CERJ      2.79%         1.35%   3.16%            3.24%          2.69%         1.50%           1.33%
E.PAULO   1.42%         1.17%   1.58%            2.23%          1.85%         0.79%           0.78%
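
For reference, the error measure used in Table 2 is conventionally computed as the mean of the absolute relative errors, expressed as a percentage. The following minimal sketch shows that computation; the load values in the example are hypothetical and are not taken from the experiments reported here.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Every actual value must be non-zero, because each error is divided by it.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical monthly loads and forecasts, for illustration only.
print(mape([100.0, 200.0], [110.0, 190.0]))
```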

It can be observed that the general performance of 
HNFB models is usually superior to the results provided 
by statistical methods. The results obtained with BNNs 
are generally better than with HNFB models. However, 
according to [Tito, 1999], the training time with BNN 
was about 8 hours. This was a much longer period than 
the time required by the HNFB models to perform the 
same task, which was of the order of tens to hundreds 
of seconds, on similar equipment. Additionally, the data 
used in the HNFB models were not treated in terms of 
their seasonal aspects, nor were they made stationary 
as was the case of the BNN tested in [Tito, 1999]. 



FUTURE TRENDS 

As can be seen from the results presented, HNFB models provide very good performance in different applications. To improve the performance of the HNFB-Mamdani model, which provided the worst results among the supervised HNFB models, the model is being extended to allow the use of different types of output fuzzy sets (such as Gaussian and trapezoidal) and to include an algorithm that optimizes the total number of output fuzzy sets.



CONCLUSION 

The objective of this article was to introduce a new class of neuro-fuzzy models that aims to overcome the weak points of conventional neuro-fuzzy systems. The results obtained by the HNFB models showed that they perform well as classifiers of database patterns and as time series forecasters. These models are able to create their own structure and allow the extraction of knowledge in the form of linguistic fuzzy rules.



KEY TERMS

Artificial Neural Networks: Composed of sev- 
eral units called neurons, connected through synaptic 
weights, which are iteratively adapted to achieve the 
desired response. Each neuron performs a weighted sum 
of its inputs, which is then passed through a nonlinear 
function that yields the output signal. ANNs have the 
ability to perform a non-linear mapping between their 
inputs and outputs, which is learned by a training 
algorithm. 

Bayesian Neural Networks: Multi-layer neural 
networks that use training algorithms based on statistical 
Bayesian inference. BNNs offer a number of important 
advantages over the standard Backpropagation learning 
algorithm: confidence intervals can be assigned to the 
predictions generated by a network; the values of 
regularization coefficients can be selected using only 
training data; and different models can be compared using 
only the training data, which deals with the issue of model 
complexity without the need for cross validation. 

Binary Space Partitioning: The space is succes- 
sively divided in two regions, in a recursive way. This 
partitioning can be represented by a binary tree that 
illustrates the successive n-dimensional space sub-di- 
visions in two convex subspaces. This process results 
in two new subspaces that can be later partitioned by 
the same method. 







Fuzzy Logic: Can be used to translate, in mathematical 
terms, the imprecise information expressed by a set 
of linguistic IF-THEN rules. Fuzzy Logic studies the 
formal principles of approximate reasoning and is based 
on Fuzzy Set Theory. It deals with intrinsic imprecision, 
associated with the description of the properties of a 
phenomenon, and not with the imprecision associated 
with the measurement of the phenomenon itself. While 
classical logic is of a bivalent nature (true or false), 
fuzzy logic admits multivalence. 

Machine Learning: Concerned with the design and 
development of algorithms and techniques that allow 
computers to "learn". The major focus of machine 
learning research is to automatically extract useful 
information from historical data, by computational 
and statistical methods. 



Pattern Recognition: A sub-topic of machine learn- 
ing, which aims to classify input patterns into a specific 
class of pre-defined groups. The classification is usually 
based on the availability of a set of patterns that have 
already been classified. Therefore, the resulting learning 
strategy is based on supervised learning. 

Supervised Learning: A machine learning tech- 
nique for creating a function from training data, which 
consist of pairs of input patterns as well as the desired 
outputs. Therefore, the learning process depends on the 
existence of a "teacher" that provides, for each input 
pattern, the real output value. The output of the function 
can be a continuous value (called regression), or a class 
label of the input object (called classification). 
INTRODUCTION 

This paper describes a new class of neuro-fuzzy models, 
called Reinforcement Learning Hierarchical Neuro-Fuzzy 
Systems (RL-HNF). These models employ BSP (Binary Space 
Partitioning) and Politree partitioning of the input space 
[Chrysanthou, 1992] and have been developed to bypass 
traditional drawbacks of neuro-fuzzy systems such as ANFIS 
[Jang, 1997], NEFCLASS [Kruse, 1995] and FSOM 
[Vuorimaa, 1994]: the reduced number of allowed inputs and 
the poor capacity to create their own structure and rules. 

These new models, named Reinforcement Learn- 
ing Hierarchical Neuro-Fuzzy BSP (RL-HNFB) and 
Reinforcement Learning Hierarchical Neuro-Fuzzy 
Politree (RL-HNFP), descend from the original HNFB 
that uses Binary Space Partitioning (see Hierarchical 
Neuro-Fuzzy Systems Part I). Combining hierarchical 
partitioning with the Reinforcement Learning (RL) 
methodology yields a new class of Neuro-Fuzzy Systems 
(SNF) that, in addition to learning its structure 
automatically, autonomously learns the actions to be taken 
by an agent, with no need for a priori information (number 
of rules, fuzzy rules and sets) about the learning process. 
These characteristics are an important advantage over 
existing learning systems for intelligent agents: in 
applications involving continuous and/or high-dimensional 
environments, traditional Reinforcement Learning methods 
based on lookup tables (tables that store value functions 
for a small or discrete state space) are no longer feasible, 
since the state space becomes too large. 



This second part on hierarchical neuro-fuzzy systems 
focuses on the use of the reinforcement learning process. 
The first part presented HNFB models based on supervised 
learning methods. The RL-HNFB and RL-HNFP models were 
evaluated in a benchmark control application and in a 
simulated Khepera robot environment with multiple 
obstacles. 



BACKGROUND 

The model described in this paper was developed based 
on an analysis of the limitations in existing models and 
of the desirable characteristics for RL-based learn- 
ing systems, particularly in applications involving 
continuous and/or high dimensional environments 
[Jouffe, 1998] [Sutton, 1998] [Barto, 2003] [Satoh, 2006]. 
Thus, the Reinforcement Learning Hierarchical Neuro- 
Fuzzy Systems have been devised to overcome these 
basic limitations. Two different models of this class of 
neuro-fuzzy systems have been developed, based on 
reinforcement learning techniques. 



HIERARCHICAL NEURO-FUZZY 
SYSTEMS 

This section presents the new class of neuro-fuzzy 
systems that are based on hierarchical partitioning. As 
mentioned in the first part, two sub-sets of hierarchical 
neuro-fuzzy systems have been developed, accord- 
ing to the learning process used: supervised learning 
models (HNFB [Souza,2002][Vellasco,2004], HNFB⁻¹ 
[Gonçalves,2006], HNFB-Mamdani [Bezerra,2005]); 
and reinforcement learning models (RL-HNFB 
[Figueiredo,2005a], RL-HNFP [Figueiredo,2005b]). 
The focus of this article is on the second sub-set of 
models. These models are described in the following 
sections. 



REINFORCEMENT LEARNING 
HIERARCHICAL NEURO-FUZZY 
MODELS 

The RL-HNFB and RL-HNFP models are composed 
of one or more standard cells, called RL-neuro-fuzzy-BSP 
(RL-NFB) and RL-neuro-fuzzy-Politree 
(RL-NFP), respectively. The following sub-sections 
describe the basic cells, the hierarchical structures and 
the learning algorithm. 

Reinforcement Learning Neuro-Fuzzy 
BSP and Politree Cells 

An RL-NFB cell is a mini-neuro-fuzzy system that 
performs binary partitioning of a given space in 
accordance with the $\rho$ and $\mu$ membership functions. 
In the same way, an RL-NFP cell is a mini-neuro-fuzzy 
system that performs $2^n$ partitioning of a given input 
space, also using complementary membership functions in 
each input dimension. The RL-NFB and RL-NFP cells 
generate a precise (crisp) output after the defuzzification 
process [Figueiredo,2005a][Figueiredo,2005b]. 

The RL-NFB cell has only one input ($x$) associated 
with it. The RL-NFP cell receives all the inputs that 
are being considered in the problem. For illustration 
purposes, figure 1(a) depicts a cell with two inputs, $x_1$ 
and $x_2$ (Quadtree partitioning), providing a simpler 
representation than the n-dimensional form of Politree. 
In figure 1(a) each partition is generated by the 
combination of two membership functions, $\rho$ (low) 
and $\mu$ (high), of each input variable. 

The consequents of the cell's poli-partitions may 
be of the singleton type or the output of a stage of a 
previous level. Although the singleton consequent is 
simple, it is not known beforehand, because each 
singleton consequent is associated with an action 
that has not been defined a priori. Each poli-partition 
has a set of possible actions ($a_1, a_2, \ldots, a_n$), as 
shown in figure 1(a), and each action is associated with a 
Q-value function. The Q-value is defined as the 
sum of the expected values of the rewards obtained 
by the execution of action $a$ in state $s$, in accordance 
with a policy $\pi$. For further details about RL theory, 
see [Sutton,1998]. 

The linguistic interpretation of the mapping imple- 
mented by the RL-NFP cell depicted in Figure 1(a) is 
given by the following set of rules: 

rule 1: If $x_1 \in \rho_1$ and $x_2 \in \rho_2$ then $y = a_1$ 
rule 2: If $x_1 \in \rho_1$ and $x_2 \in \mu_2$ then $y = a_2$ 
rule 3: If $x_1 \in \mu_1$ and $x_2 \in \rho_2$ then $y = a_3$ 
rule 4: If $x_1 \in \mu_1$ and $x_2 \in \mu_2$ then $y = a_4$ 

where consequent $a_i$ corresponds to one of the two 
possible consequents below: 

a singleton (fuzzy singleton consequent, or zero-order 
Sugeno): the case where $a_i$ = constant; 

the output of a stage of a previous level: the case 
where $a_i = y_m$, where $y_m$ represents the output of a 
generic cell 'm'. 
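The four rules of the two-input cell can be illustrated with a minimal sketch. It assumes a sigmoid shape for the low membership function, takes the high function as its complement, uses the product of memberships as the firing level of each rule, and defuzzifies by a weighted average of singleton consequents. The sigmoid shape, the product operator, the function names and the parameters `center` and `slope` are assumptions made for illustration and are not taken from the text.

```python
import math


def low(x, center=0.0, slope=1.0):
    """Assumed 'low' membership (rho): a decreasing sigmoid."""
    return 1.0 / (1.0 + math.exp(slope * (x - center)))


def high(x, center=0.0, slope=1.0):
    """'High' membership (mu), taken as the complement of rho."""
    return 1.0 - low(x, center, slope)


def cell_output(x1, x2, consequents):
    """Crisp output of a two-input cell with four partitions.

    consequents: [a1, a2, a3, a4], ordered as rules 1 to 4.
    """
    firing = [
        low(x1) * low(x2),    # rule 1
        low(x1) * high(x2),   # rule 2
        high(x1) * low(x2),   # rule 3
        high(x1) * high(x2),  # rule 4
    ]
    total = sum(firing)  # equals 1 for complementary memberships
    return sum(f * a for f, a in zip(firing, consequents)) / total
```

When a partition is subdivided, the corresponding entry of `consequents` would be the output of another cell rather than a constant.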

RL-HNFB and RL-HNFP Architectures 

RL-HNFB and RL-HNFP models can be created based 
on the interconnection of the basic cells. The cells form 
a hierarchical structure that results in the rules that 
compose the agent's reasoning. 

In the example of an architecture presented in 
figure 1(b), the poli-partitions 1, 3, 4, ..., m-1 have not 
been subdivided, having as consequents of their rules 
the values $a_1, a_3, a_4, \ldots, a_{m-1}$, respectively. On 
the other hand, poli-partitions 2 and m have been 
subdivided, so the consequents of their rules are the 
outputs ($y_2$ and $y_m$) of subsystems 2 and m, 
respectively. In turn, these subsystems have, as 
consequents, the values $a_{21}, a_{22}, \ldots, a_{2m}$ and 
$a_{m1}, a_{m2}, \ldots, a_{mm}$, respectively. Each 
$a_i$ corresponds to a zero-order Sugeno (singleton) 
consequent, representing the action that will be identified 
(among the possible actions), through reinforcement 
learning, as the most favorable for a certain state 
of the environment. It must be stressed that the decision 
on which partitions are subdivided is made 
automatically by the learning algorithm. 

The output of the system depicted in figure 1(b) 
(defuzzification) is given by equation (1). In this 
equation, $\alpha_i$ corresponds to the firing level of 
partition $i$ and $a_i$ is the singleton consequent of the 
rule associated with partition $i$. 

[Equation (1): system output after defuzzification] 

Figure 1. (a) RL-NFP cell; (b) RL-HNFP architecture 

RL-HNFB and RL-HNFP Learning 
Algorithm 

The learning process starts with the definition of the 
relevant inputs for the system/environment where the 
agent is and the sets of actions it may use in order to 
achieve its objectives. The agent must run many cycles 
to ensure learning in the system/environment where it 
is. A cycle is defined as the number of steps the agent 
takes in the environment, from the point where it 
starts to the target point. 

The RL-HNFB and RL-HNFP models employ the 
same learning algorithm. Each partition chooses an 
action from its set of actions; the resultant action is 
calculated by the defuzzification process and 
represents the action that will be executed by the agent's 
actuators. After the resultant action is carried out, the 
environment is read once again. This reading enables 
calculation of the environment reinforcement value that 
will be used to evaluate the action taken by the agent. 
The reinforcement is calculated for each partition of 
all active cells, according to its participation in the 
resulting action. Thus, the environment reinforcement 
calculated by the evaluation function is backpropagated 
from the root-cell to the leaf-cells. Next, the Q-values 
associated with the actions that have contributed to the 
resulting action are updated, based on the SARSA 
algorithm [Sutton, 1998]. More details can be found 
in [Figueiredo,2005b]. 

The RL-HNFB and RL-HNFP models have been 
evaluated in different control applications. Two of 
these control applications are presented in the next 
section. 
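The Q-value update mentioned above can be sketched with the standard tabular SARSA rule. This is a generic illustration rather than the authors' implementation: it does not reproduce the weighting of the reinforcement by each partition's participation, and the function names, the epsilon-greedy selection and the parameters `alpha`, `gamma` and `epsilon` are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = reward + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]
```

In this sketch `q` maps each state (or partition) to a list of Q-values, one per candidate action, and `a_next` is the action actually selected in the next state, which is what distinguishes SARSA from Q-learning.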



CASE STUDIES 

Cart-Centering 

The cart-centering problem [Koza,1992] is generally 
used as a benchmark in the area of evolutionary 
programming, where the force that is applied to the car is 
of the "bang bang" type [Koza,1992]. This problem 
was used mainly for the purpose of evaluating how well 
the RL-HNFB and RL-HNFP models would adapt to 
changes in the input variable domain without having 
to undergo a new training phase. 

The problem consists of parking, in the centre of a 
one-dimensional environment, a car with mass $m$ that 
moves along this environment due to an applied force 
$F$. The input variables are the position ($x$) of the car 
and its velocity ($v$). The objective is to park the car 
in position $x = 0$ with velocity $v = 0$. The equations 
of motion are (where the parameter $\tau$ represents the 
time unit): 

$x_{t+1} = x_t + \tau v_t$ 

$v_{t+1} = v_t + \tau F_t / m$ 

(2) 
The global reinforcement is calculated by equation 
(3) below: 

If ($x > 0$ and $v < 0$) or ($x < 0$ and $v > 0$): 

$R_{global} = k_1 e^{(-|distance\_objective|)} + k_2 e^{(-|velocity|)}$ 

Else: $R_{global} = 0$ 

(3) 



The evaluation function increases as the car gets 
closer to the centre of the environment with velocity 
zero. The $k_1$ and $k_2$ coefficients are constants greater 
than 1 used for adapting the reinforcement values to 
the model's structure. The values used for time unit 
and mass were $\tau = 0.02$ and $m = 2.0$. 
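The equations of motion and the global reinforcement can be put together in a short sketch. The time unit and mass are the values given in the text; the values of `K1` and `K2` are hypothetical (the text only states that they are greater than 1), the distance to the objective is taken as |x| because the objective is x = 0, and both exponents are assumed to be negative so that the reward grows as the car approaches the centre at low speed.

```python
import math

TAU = 0.02   # time unit, from the text
MASS = 2.0   # car mass, from the text
K1 = 2.0     # hypothetical value; the text only says k1 > 1
K2 = 2.0     # hypothetical value; the text only says k2 > 1


def step(x, v, force):
    """Advance position and velocity by one time unit (equation 2)."""
    return x + TAU * v, v + TAU * force / MASS


def global_reinforcement(x, v):
    """Equation (3): reward only when the car moves toward the centre."""
    if (x > 0 and v < 0) or (x < 0 and v > 0):
        return K1 * math.exp(-abs(x)) + K2 * math.exp(-abs(v))
    return 0.0
```

For example, starting at rest at x = 3 and applying the force -150 leaves the position unchanged after one step and sets the velocity to -1.5, after which the car is moving toward the centre and the reinforcement becomes positive.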



The stopping criterion is met when the differences 
between the position and velocity values and the 
objective (x=0 and v=0) are smaller than 
5% of the universe of discourse of the position and 
velocity inputs. 

Table 1 shows the average of the results obtained 
in 5 experiments for each configuration. The position 
and velocity limits columns refer to the limits imposed 
on the (position and velocity) state variables during 
learning and testing. The actions used in these 
experiments are: F = {-150, -75, -50, -30, -20, -10, -5, 0, 
5, 10, 20, 30, 50, 75, 150}. The size of the structure 
column shows the average number of cells at the end of 
each experiment, and the last column shows the average 
number of steps during the learning phase. The number of 
cycles was fixed at 1000. At each cycle, the car's 
starting points were x=-3 or x=3. 

Table 1. Results of the RL-HNFB and RL-HNFP models applied to the cart-centering problem 

Configuration   Position   Velocity   Size of the   Average steps in 
                Limits     Limits     Structure     learning phase 
RL-HNFB 1       |10|       |10|       195 cells     424 
RL-HNFB 2       |3|        |3|        340 cells     166 
RL-HNFP 1       |10|       |10|       140 cells     221 
RL-HNFP 2       |3|        |3|        251 cells     145 

Table 2. Testing results of the proposed models applied to the cart-centering problem 

                Average number of steps, by initial position 
Configuration   |3|        |2|        |1| 
RL-HNFB 1       387        198        141 
RL-HNFB 2       122        80         110 
RL-HNFP 1       190        166        99 
RL-HNFP 2       96         124        68 

As can be observed from Table 1, the RL-HNFP 
structure is smaller because each cell receives both 
input variables, while in the case of the RL-HNFB 
model, a different input variable is applied at each 
level of the BSP tree. 

Table 2 presents the results obtained for one of the 
5 experiments carried out at each configuration shown 
in Table 1 when the car starts out at points (-2,-1,1,2) 
which were not used in the learning phase. 

In the first configuration of each model, the results 
show that the broader the position and velocity limits 
are (in this case equal to |10|), the more difficult it is to 
learn. In these cases a small oscillation occurs around 
the central point. What actually happens is that the final 
velocity is very small but, after some time, it tends to 
move the car out of the convergence area, resulting in 
a peak of velocity in the opposite direction to correct 
the car's position. In the second configurations, fewer 
oscillations occur because the position and velocity 
limits were lowered to |3|. 

Khepera Robot 

The RL-HNFP model was also tested with a Khepera 
robot simulator [Figueiredo,2005b]. The model was 
tested in a square environment where the agent moved 
from one of the corners to reach the diametrically 
opposite corner. However, it could not pass through the 
centre of the environment because of an obstacle. 

The Khepera robot senses its environment using 8 
sensors grouped into 4: one ahead, one on each side and 
one behind. Its actions are executed via 2 independent 
motors with power varying between -20 and 20, one on 
the right side and the other on the left side. 

The RL-HNFP model was trained in one environment 
with a big central square obstacle, called environment 
I (Figure 2). It was tested in two environments: 
the same environment I (with a big central square 
obstacle) but with different initial positions; and 
another environment with three additional obstacles. 
This new multi-obstacle environment (environment 
II) comprises big central, small top-left, bottom-right 
and left obstacles (see Figure 3). 

In all experiments using both environments the 
robot's objective was to reach position (5, 5). In figures 
2 and 3 the white circles indicate the initial positions 
of the robot while the gray circles show their final 
positions. 




Figure 2. Tests in environment I 




Figure 3. Tests in environment II 
Figure 2 shows the experiments in environment 
I. The results refer to tests executed with a structure 
trained from a variety of positions different from the 
ones used in the tests, demonstrating that the acquired 
knowledge was generalized and that the results obtained 
met expectations. 

Figures 3(a), 3(b) and 3(c) show experiments in 
environment II. The results described in these figures 
refer to tests executed in the environment II with the 
knowledge acquired in the environment I. Note that 
the three additional obstacles do not affect the results 
in these experiments. 

As these figures demonstrate, the results indicate 
the good performance of the model. This is even 
more important when one considers that the number 
of learning cycles was very small (only 400 cycles). 
The result presented in figure 3(c) highlights this fact. 
In this case, the agent does not take the shortest path 
to the goal point. Instead, it circles around the 
environment at some points before finding its way to the 
goal point. To improve the robot's performance, the 
model should be executed for at least 2000 cycles, as 
already demonstrated in previous cases using this 
application [Figueiredo,2005b]. 



FUTURE TRENDS 

Regarding the Reinforcement Learning Hierarchical 
neuro-fuzzy models, some improvements are also 
under development. To improve their performance, tests 
with Eligibility Traces [Sutton, 1998] are planned. 
This method updates not only the value function of 
the current state, but also the value function 
of previous states within a predefined limit. 

Another proposal for improving the RL-HNFP model 
is the use of the WoLF principle (Win or Learn Fast) 
[Bowling,2002] to modify the learning rate that adjusts 
the policy. The WoLF principle consists of learning 
quickly when the agent is losing and more slowly when it 
is winning. It is also intended to evaluate these models 
using real robots. 

The RL-HNFP model is also being modified to be 
used in a cooperative multi-agent environment. In this 
environment the learning process is accomplished by 
sharing the acquired knowledge among the existing 
agents. 



CONCLUSION 

The objective of this paper was to introduce a new 
class of neuro-fuzzy models which aims to improve 
the weak points of conventional neuro-fuzzy systems. 
The RL-HNFB and RL-HNFP models belong to this 
new class of neuro-fuzzy systems, called Hierarchical 
Neuro-Fuzzy Systems. 

The RL-HNFB and RL-HNFP models were able to 
create and expand the structure of rules without any 
prior knowledge (fuzzy rules or sets); extract knowledge 
from the agent's direct interaction with large 
and/or continuous environments (through reinforcement 
learning), in order to learn which actions are to 
be carried out; and produce interpretable fuzzy rules, 
which compose the agent's intelligence to achieve its 
goal(s). The agent was able to generalize its actions, 
showing adequate behaviour when it was in 
states whose actions had not been specifically learned. 
This capacity increases the agent's autonomy. 
KEY TERMS 

Binary Space Partitioning: In this type of par- 
titioning, the space is successively divided in two 
regions, in a recursive way. This partitioning can be 
represented by a binary tree that illustrates the succes- 
sive n-dimensional space sub-divisions in two convex 
subspaces. The construction of this partitioning tree 
(BSP tree) is a process in which a subspace is divided 
by a hyperplane parallel to the coordinate axes. This 
process results in two new subspaces that can be later 
partitioned by the same method. 

Fuzzy Inference Systems: Fuzzy inference is the 
process of mapping from a given input to an output using 
fuzzy logic. The mapping then provides a basis from 
which decisions can be made, or patterns discerned. 
Fuzzy inference systems have been successfully applied 
in fields such as automatic control, data classification, 
decision analysis. 

Machine Learning: Concerned with the design and 
development of algorithms and techniques that allow 
computers to "learn". The major focus of machine 
learning research is to automatically extract useful 
information from historical data, by computational 
and statistical methods. 

Politree Partitioning: The Politree partitioning 
was inspired by the quadtree structure, which has been 
widely used in the area of image manipulation and 
compression. In Politree partitioning, the n-dimensional 
space is divided into $m = 2^n$ subdivisions. The Politree 
partitioning can be represented by a tree structure 
where each node is subdivided into $m$ leaves. 

Quadtree Partitioning: In this type of partitioning, 
the space is successively divided into four regions, in a 
recursive way. This partitioning can be represented 
by a quaternary tree that illustrates the successive 
space sub-divisions into four convex subspaces. 
The construction of this partitioning tree (Quadtree) 
is a process in which a subspace is divided by 
two hyperplanes parallel to the coordinate axes. This 
process results in four new subspaces that can be later 
partitioned by the same method. The limitation of 
Quadtree partitioning (fixed or adaptive) is that it 
works only in two-dimensional spaces. 

Reinforcement Learning: A sub-area of machine 
learning concerned with how an agent ought to take 
actions in an environment so as to maximize some 
notion of long-term reward. Reinforcement learning 
algorithms attempt to find a policy that maps states of 
the world to the actions the agent ought to take in those 
states. Unlike supervised learning, in this case 
there is no target value for each input pattern, only a 
reward based on how good or bad the action taken 
by the agent was in the existing environment. 

Sarsa: A variation of the Q-learning (Reinforcement 
Learning) algorithm based on model-free action 
policy estimation. SARSA assumes that the actions are 
chosen randomly with a predefined probability. 

WoLF: "Win or Learn Fast" is a method by 
[Bowling,2002] for changing the learning rate to encourage 
convergence in a multi-agent reinforcement learning 
scenario. 
INTRODUCTION 

Reinforcement learning (RL) deals with the problem of 
an agent that has to learn how to behave to maximize its 
utility by its interactions with an environment (Sutton 
& Barto, 1998; Kaelbling, Littman & Moore, 1996). 
Reinforcement learning problems are usually formalized 
as Markov Decision Processes (MDP), which 
consist of a finite set of states and a finite number of 
possible actions that the agent can perform. At any given 
point in time, the agent is in a certain state and picks 
an action. It can then observe the new state this action 
leads to, and receives a reward signal. The goal of the 
agent is to maximize its long-term reward. 

In this standard formalization, no particular structure 
or relationship between states is assumed. However, 
learning in environments with extremely large state 
spaces is infeasible without some form of generalization. 
Exploiting the underlying structure of a problem 
can effect generalization and has long been recognized 
as an important aspect in representing sequential 
decision tasks (Boutilier et al., 1999). 

Hierarchical Reinforcement Learning is the subfield 
of RL that deals with the discovery and/or exploitation 
of this underlying structure. Two main ideas come into 
play in hierarchical RL. The first one is to break a task 
into a hierarchy of smaller subtasks, each of which can 
be learned more quickly and easily than the whole problem. 
Subtasks can also be performed multiple times in the 
course of achieving the larger task, reusing accumulated 
knowledge and skills. The second idea is to use state 
abstraction within subtasks: not every task needs to be 
concerned with every aspect of the state space, so some 
states can actually be abstracted away and treated as 
the same for the purpose of the given subtask. 



BACKGROUND 

In this section, we will introduce the MDP formalism, 
where most of the research in standard RL has been 
done. We will then mention the two main approaches 
used for learning MDPs: model-based and model-free 
RL. Finally, we will introduce two formalisms that 
extend MDPs and are widely used in the Hierarchical 
RL field: semi-Markov Decision Processes (SMDPs) 
and Factored MDPs. 

Markov Decision Processes (MDPs) 

A Markov Decision Process consists of: 

• a set of states $S$ 

• a set of actions $A$ 

• a transition probability function $\Pr(s' \mid s, a)$, 
representing the probability of the environment 
transitioning to state $s'$ when the agent performs 
action $a$ from state $s$. It is sometimes written 
$T(s, a, s')$. 

• a reward function $E[r \mid s, a]$, representing the 
expected immediate reward obtained by taking 
action $a$ from state $s$. 

• a discount factor $\gamma \in (0, 1]$, which downweights 
future rewards and whose precise role will be 
clearer in the following equations. 

A deterministic policy $\pi: S \to A$ is a function that 
determines, for each state, what action to take. For 
any given policy $\pi$, we can define a value function 
$V^\pi$, representing the expected infinite-horizon 
discounted return to be obtained from following such a 
policy starting at state $s$: 

$V^\pi(s) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \ldots]$. 
Bellman (1957) provides a recursive way of determining 
the value function when the reward and 
transition probabilities of an MDP are known, called 
the Bellman equation: 

$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s') V^\pi(s')$, 

commonly rewritten as an action-value function or 
Q-function: 

$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') V^\pi(s')$. 

An optimal policy $\pi^*(s)$ is a policy that returns the 
action $a$ that maximizes the value function: 

$\pi^*(s) = \arg\max_a Q^*(s, a)$ 
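As an illustration of these equations, the sketch below evaluates a fixed policy by repeatedly applying the Bellman equation and then reads off the greedy action from the Q-function. The two-state MDP, its rewards, the discount factor of 0.9 and all names are hypothetical and serve only to make the recursion concrete.

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, sweeps=500):
    """Iterate the Bellman equation for a fixed policy.

    T[s][a] maps next states to probabilities; R[s][a] is the expected reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in T[s][policy[s]].items())
            for s in states
        }
    return V


def q_value(s, a, V, T, R, gamma=0.9):
    """Q(s, a) = R(s, a) + gamma * sum over s' of T(s, a, s') V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


# Hypothetical two-state MDP used only for illustration.
STATES = ["A", "B"]
T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}
policy = {"A": "go", "B": "stay"}
V = evaluate_policy(STATES, policy, T, R)
greedy = {s: max(T[s], key=lambda a: q_value(s, a, V, T, R)) for s in STATES}
```

Here the evaluated policy is already optimal, so the greedy actions coincide with it; in general, the greedy step with respect to the Q-function of a suboptimal policy yields an improved policy rather than the optimal one.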

States can be represented as a set of state variables 
or factors, representing different features of the 
environment: $s = \langle f_1, f_2, f_3, \ldots, f_n \rangle$. 

Learning in Markov Decision Processes 
(MDPs) 

The reinforcement-learning problem consists of de- 
termining or approximating an optimal policy through 
repeated interactions with the environment (i.e., based 
on a sample of experiences of the form <state - action 
- next state - reward>). 

There are three main approaches to learning such 
an optimal or near-optimal policy: 

• Policy-search methods: learn a policy directly 
via evaluation in the environment. 

• Model-free (or direct) methods: learn the policy 
by directly approximating the Q function with 
updates from direct experience. 

• Model-based (or indirect) methods: first learn 
the transition probability and reward functions, 
and use those to compute the Q function by means 
of, for example, the Bellman equations. 

Model-free algorithms are sometimes referred to as 
the Q-learning family of algorithms. See Sutton (1988) 
or Watkins (1989) for the first best-known examples. 
It is known that model-free methods make inefficient 
use of experience, but they do not require expensive 
computation to obtain the Q function and the 
corresponding optimal policy. 
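A minimal sketch of a model-free update, using the standard tabular Q-learning rule, shows how a single experience of the form <state, action, next state, reward> adjusts the Q function without any transition or reward model. The learning rate `alpha`, the discount `gamma`, the state and action names and the numeric values are assumptions chosen for illustration.

```python
def q_learning_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    target = reward + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


# Hypothetical experience tuple <state, action, next state, reward>.
q = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 2.0, "go": 0.0}}
updated = q_learning_update(q, "A", "go", 1.0, "B", alpha=0.5)
```

Each update costs a single lookup and a maximum over the actions of the next state, which is why these methods are computationally cheap but need many experiences to converge.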

Model-based methods make more efficient use of 
experience, and thus require less data, but they involve 
an extra planning step to compute the value function, 
which can be computationally expensive. Some well- 
known algorithms can be found in the literature (Sutton, 
1990; Moore & Atkeson, 1993; Kearns & Singh, 1998; 
and Brafman & Tennenholtz, 2002). 

Algorithms for reinforcement learning in MDP 
environments suffer from what is known as the curse 
of dimensionality: an exponential explosion in the total 
number of states as a function of the number of state 
variables. To cope with this problem, hierarchical 
methods try to break down the intractable state space 
into smaller pieces, which can be learned independently 
and reused as needed. To achieve this goal, changes 
need to be introduced to the standard MDP formal- 
ism. In the introduction we mentioned the two main 
ideas behind hierarchical RL: task decomposition and 
state abstraction. Task decomposition implies that the 
agent will not only be performing single-step actions, 
but also full subtasks which can be extended in time. 
Semi-Markov Decision Processes (SMDPs) will let 
us represent these extended actions. State abstraction 
means that, in certain contexts, certain aspects of the 
state space will be ignored, and states will be grouped 
together. Factored-state representations are one way of 
dealing with this. The following section introduces these 
two common formalisms used in the HRL literature. 

Beyond MDPs: SMDPs and 
Factored -State Representations 

We'll consider the limitations of the standard MDP 
formalism by means of an illustrative example. Imagine 
an agent whose task is to exit a multi-storey office 
building. The starting position of the agent is a certain 
office on a certain floor, and the goal is to reach the 
front door at ground level. To complete the task, the 
agent has to first exit the room, find its way through 
the hallways to the elevator, take the elevator to the 
ground floor, and finally find its way from the elevator 
to the exit. We would like to be able to reason in terms 
of subtasks (e.g., "exit room", "go to elevator", "go 
to floor X", etc.), each of them of different durations 
and levels of abstraction, each encompassing a series of 
lower-level or primitive actions. Each of these subtasks 
is also concerned with only certain aspects of the full 
state space: while the agent is inside the room, and the 
current task is to exit it, the floor the elevator is on, or 
whether the front door of the building is open or closed, 
is irrelevant. However, these features will become 
crucial later as the agent's subtask changes. 

Under the MDP formalization, time advances in 
discrete steps of constant, unit duration. This 
formulation does not allow the representation of 
temporally extended actions of varying durations, which 
are needed to represent the kind of higher-level actions 
identified in the example. The formalism of semi-Markov 
Decision Processes (SMDPs) enables this representation 
(Puterman, 1994). In SMDPs, the transition function 
is altered to represent the probability that action a from 
state s will lead to next state s' after t timesteps: 

$$\Pr(s', t \mid s, a)$$ 

The corresponding value function is now: 

$$V^{\pi}(s) = R(s, \pi(s)) + \sum_{s' \in S,\, t} \gamma^{t} \Pr(s', t \mid s, \pi(s)) \, V^{\pi}(s')$$ 

SMDPs also enable the representation of continuous 
time. For dynamic programming algorithms for 
solving SMDPs, see Puterman (1994) and Mahadevan 
et al. (1997). 
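
As an illustration of the SMDP value function, the following minimal Python sketch evaluates a fixed policy on a toy version of the office-building example. The state names, rewards, durations and the discount factor gamma are hypothetical values chosen for the illustration; the only idea taken from the text is that each transition carries a duration t and that the value of the next state is discounted by gamma raised to t.

```python
def evaluate_policy_smdp(P, R, gamma=0.9, sweeps=100):
    """Iterative policy evaluation for an SMDP under a fixed policy.

    P[s] maps (next_state, duration) -> probability for the action the
    policy takes in s; R[s] is the expected reward of that action.
    """
    V = {s: 0.0 for s in R}
    for _ in range(sweeps):
        for s in R:
            # Sum over next states and durations, discounting by gamma ** t.
            V[s] = R[s] + sum(
                prob * (gamma ** t) * V[s_next]
                for (s_next, t), prob in P.get(s, {}).items()
            )
    return V


# Hypothetical toy problem: office -> elevator -> exit (terminal).
P = {
    "office": {("elevator", 3): 1.0},                   # always takes 3 steps
    "elevator": {("exit", 2): 0.5, ("exit", 5): 0.5},   # 2 or 5 steps
}
R = {"office": -3.0, "elevator": -3.5, "exit": 0.0}

V = evaluate_policy_smdp(P, R)
print(V)
```

Because the durations enter only through the exponent of gamma, the same loop covers ordinary MDPs as the special case in which every duration equals 1.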

Factored-state MDPs deal with the fact that certain 
aspects of the state space are irrelevant for certain 
actions. In factored-state MDPs, state variables are 
decomposed into independently specified components, 
and transition probabilities are defined as a product 
of factor probabilities. A common way of represent- 
ing independence relations between state variables 
is through Dynamic Bayes Networks (DBNs). As an 
example, imagine that the state is represented by four 
state variables: $s = \langle f_1, f_2, f_3, f_4 \rangle$, and we know that for 
action a the value of variable $f_1$ in the next state only 
depends on the prior values of $f_1$ and $f_4$, $f_2$ depends on 
$f_2$ and $f_3$, and the others only depend on their own prior 
value. This transition probability in a Factored MDP 
would be represented as: 

$$\Pr(s' \mid s, a) = \Pr(f_1' \mid f_1, f_4, a) \, \Pr(f_2' \mid f_2, f_3, a) \, \Pr(f_3' \mid f_3, a) \, \Pr(f_4' \mid f_4, a)$$ 



For learning algorithms in factored-state MDPs, see 
Kearns & Koller (1999) and Guestrin et al. (2002). 
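
The factorization in the four-variable example can be sketched in a few lines of Python. The dependency structure (which variables each factor looks at) follows the example in the text; the binary variables and the 0.8/0.2 conditional probabilities are hypothetical, introduced only to make the sketch runnable.

```python
from itertools import product

# Parents of each state variable under action a (0-based indices):
# f1 depends on (f1, f4), f2 on (f2, f3), f3 on f3, f4 on f4.
PARENTS = {0: (0, 3), 1: (1, 2), 2: (2,), 3: (3,)}


def factor_prob(i, next_value, state):
    """Hypothetical factor Pr(f_i' | parents of f_i, a) for binary variables.

    The variable takes the logical OR of its parents with probability 0.8
    and the opposite value with probability 0.2.
    """
    likely = max(state[p] for p in PARENTS[i])
    return 0.8 if next_value == likely else 0.2


def transition_prob(next_state, state):
    """Pr(s' | s, a) as the product of the per-variable factors."""
    prob = 1.0
    for i, value in enumerate(next_state):
        prob *= factor_prob(i, value, state)
    return prob


s = (0, 1, 0, 1)
p_most_likely = transition_prob((1, 1, 0, 1), s)
total = sum(transition_prob(s2, s) for s2 in product((0, 1), repeat=4))
print(p_most_likely, total)
```

The point of the representation is its size: each factor is specified over at most two parent variables, rather than over all sixteen joint states.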



HIERARCHICAL REINFORCEMENT-LEARNING METHODS 

Different approaches and goals can be identified within 
the hierarchical reinforcement-learning subfield. Some 
algorithms are concerned with learning a hierarchical 
view of either the environment or the task at hand, 
while others are just concerned with exploiting this 
knowledge when provided as input. Some techniques 
try to learn or exploit temporally extended actions, 
abstracting together a set of actions that lead to the 
completion of a subtask or subgoal. Other methods 
try to abstract together different states, treating them 
as if they were equal from the point of view of the 
learning problem. 

We will briefly review a set of algorithms that use 
some combination of these approaches. We will also 
identify which of these methods are based on the model- 
free learning paradigm as opposed to those that try to 
construct a model of the environment. 

Options: Learning Temporally Extended 
Actions in the SMDP Framework 

Options make use of the SMDP framework to allow the 
agent to group together a series of actions (an option's 
policy) that lead to a certain state or set of states identified 
as subgoals. For each option, a set of valid start states 
is also identified, where the agent can decide whether 
to perform a single-step primitive action, or to make 
use of the option. We can think of options as pre-stored 
policies for performing abstract subtasks. 

A learning algorithm for options is described by 
Sutton, Precup & Singh (1999) and belongs to the 
model-free Q-learning family. In its current formulation, 
the options framework allows for two-level hierarchies 
of tasks, although it could potentially be generalized 
to multiple levels. End states (i.e., subgoals) are 
given as input to the algorithm. There is work devoted 
to discovering these subgoals and constructing useful 
options from them (Şimşek et al., 2005; Jong & 
Stone, 2005). 

While options have been shown to improve the 
learning time of model-free algorithms, it is not clear 
that there is an advantage in terms of learning time 
over model-based methods. Like any model-free method, 
though, they do not suffer from the computational cost 
involved in the planning step. It is still an open question 
whether options can be generalized to multiple-level 
hierarchies, and most of the work is empirical, with no 
theoretical bounds. 
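
To make the idea concrete, here is a minimal Python sketch of an option (valid start states, an internal policy and subgoal states) together with a Q-learning update in which the value of the next state is discounted by gamma raised to the number of steps the option took. The corridor environment, the option name and all numeric values are hypothetical; this is an illustration of the general scheme, not the exact algorithm of Sutton, Precup & Singh (1999).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    name: str
    start_states: frozenset          # states where the option may be invoked
    policy: Callable[[int], int]     # maps a state to a primitive action
    subgoals: frozenset              # states where the option terminates


def run_option(env_step, state, option, gamma):
    """Follow the option's policy until a subgoal is reached.

    Returns the discounted reward collected, the final state and the
    number of primitive steps taken.
    """
    total, discount, steps = 0.0, 1.0, 0
    while state not in option.subgoals:
        state, reward = env_step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
    return total, state, steps


def smdp_q_update(Q, s, option, reward, s_next, steps, available,
                  alpha=0.1, gamma=0.9):
    """Q-learning update for a temporally extended action."""
    best_next = max((Q[(s_next, o.name)] for o in available(s_next)),
                    default=0.0)
    target = reward + (gamma ** steps) * best_next
    Q[(s, option.name)] += alpha * (target - Q[(s, option.name)])


# Hypothetical corridor: states 0..4, action +1 moves right, reward -1 per step.
def env_step(state, action):
    return state + action, -1.0


go_to_door = Option("go_to_door", frozenset({0, 1, 2, 3}),
                    lambda s: 1, frozenset({4}))


def available(state):
    return [go_to_door] if state in go_to_door.start_states else []


Q = defaultdict(float)
reward, s_next, steps = run_option(env_step, 0, go_to_door, gamma=0.9)
smdp_q_update(Q, 0, go_to_door, reward, s_next, steps, available)
print(reward, s_next, steps, Q[(0, "go_to_door")])
```

From the learner's point of view the option behaves like a single action that happens to last several time steps, which is exactly what the SMDP formalism allows.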

MaxQ: Combining a Hierarchical Task 
Decomposition with State Abstraction 

MaxQ is also a model-free algorithm in the Q-learning 
family. It receives as input a multi-level hierarchical task 
decomposition, which decomposes the full underlying 
MDP into an additive combination of smaller MDPs. 
Within each task, abstraction is used so that state 
variables that are irrelevant for the task are ignored 
(Dietterich, 2000). 

The main drawback of MaxQ is that the hierarchy 
and abstraction have to be provided as input, and in 
its model-free form it misses opportunities for faster 
learning. 

DSHP: Model-Based Hierarchical 
Decomposition for Efficient Learning 
and Planning 

Deterministic Sample-Based Hierarchical Planning 
(DSHP) combines factored-state MDP representations, 
a MaxQ hierarchical task decomposition, and model- 
based learning to achieve provably efficient learning 
and planning in deterministic domains (Diuk, Strehl 
& Littman, 2006). 

While, as a model-based algorithm, DSHP allows 
for faster learning and planning, it still suffers from 
the problem that the hierarchy and abstraction have to 
be provided as input. 

HEXQ: Discovering Hierarchy 

As opposed to MaxQ, DSHP, or other methods that 
receive the hierarchical task decomposition as input, 
HEXQ tries to automatically discover it. HEXQ analyses 
traces of experience and identifies regions of the MDP 
with repeated characteristics. It uses this experience to 
build temporal and state abstractions, constructing a 
hierarchy of smaller interlinked MDPs. HEXQ is model-free 
and based on Q-learning (Hengst, 2002). 

HEXQ offers a promising method for discovering 
abstractions and hierarchies, but still suffers from a 
lack of any theoretical bounds or proofs. All the work 
using HEXQ has been empirical, and its general power 
still remains an open question. 

HAM-PHAM: Restricting the Class of 
Possible Policies 

Hierarchies of Abstract Machines (HAMs) also make 
use of the SMDP formalism. The main idea is to re- 
strict the class of possible policies by means of small 
nondeterministic finite-state machines, which constrain 
the sequences of actions that are allowed. Elements in 
HAMs can be thought of as small programs, which at 
certain points can decide to make calls to other lower-level 
programs (Parr & Russell, 1997; Parr, 1998). 
See also Programmable HAMs (PHAMs), an extension 
by Andre & Russell (2000). 

HAM provides an interesting approach to make 
learning and planning easier, but has also only been 
shown to work better in certain empirical examples. 



FUTURE TRENDS 

We expect to see most of the new work in the field 
of Hierarchical Reinforcement Learning tackling two 
areas: hierarchy and abstraction discovery, and transfer 
learning. We believe the main open question is how 
structure can be learned from experience, and once 
learned be applied to tasks and problems different from 
the original one. 

There is also promising but still little theoretical 
work currently being produced in the area, work that 
could prove the general power of different methods. 
Most of the work is empirical and only shown to work 
through experiments in small domains. 



CONCLUSION 

The goal of hierarchical reinforcement learning is to 
combat the "curse of dimensionality", the main obstacle 
in achieving scalable RL that can be applied to real-life 
problems, by means of hierarchical task decompositions 
and state abstraction. This active area of research has 
achieved mixed results, with algorithms and frameworks 
focusing on just one or two combinations of the different 
aspects of the problem. A single approach that 
can deal with structure discovery and its use, with both 
temporal and state abstraction, and that can provably 
learn and plan in polynomial time is still the main item 
in the research agenda of the field. 

KEY TERMS 

Factored-State Markov Decision Process: An 
extension to the MDP formalism used in Hierarchical 
RL where the transition probability is defined in terms 
of factors, allowing the representation to ignore certain 
state variables under certain contexts. 

Hierarchical Reinforcement Learning: A subfield 
of reinforcement learning concerned with the discovery 
and use of task decomposition, hierarchical control, 
temporal and state abstraction (Barto & Mahadevan, 
2003). 

Hierarchical Task Decomposition: A decomposition 
of a task into a hierarchy of smaller subtasks. 

Markov Decision Process: The most common 
formalism for environments used in reinforcement 
learning, where the problem is described in terms of 
a finite set of states, a finite set of actions, transition 
probabilities between states, a reward signal and a 
discount factor. 

Reinforcement Learning: The problem faced by 
an agent that learns behavior to maximize a utility 
measure from its interaction with the environment. 

Semi-Markov Decision Process: An extension to 
the MDP formalism that deals with temporally extended 
actions and/or continuous time. 

State-Space Generalization: The technique of 
grouping together states in the underlying MDP and 
treating them as equivalent for certain purposes. 

INTRODUCTION 

Artificial neural networks (ANNs) are systems which 
are derived from the field of neuroscience and are char- 
acterized by intensive arithmetic operations. These net- 
works display interesting features such as parallelism, 
classification, optimization, adaptation, generalization 
and associative memories. Since the pioneering work of 
McCulloch and Pitts (McCulloch & Pitts, 1943), there has 
been much discussion on the topic of ANN implementation, 
and a huge diversity of ANNs has been designed (C. Lindsey 
& T. Lindblad, 1994). The benefits of such implementations 
are well discussed in a paper by R. Lippmann (Richard P. 
Lippmann, 1984): "The great interest of building neural 
networks remains in the high speed processing that can 
be achieved through massively parallel implementation". 
In another paper, Clark S. Lindsey (C. S. Lindsey & 
Th. Lindblad, 1995) posed a real dilemma of hardware 
implementation: "Build a general, but probably expensive 
system that can be reprogrammed for several 
kinds of tasks like CNAPS for example? Or build a 
specialized chip to do one thing but very quickly, like 
the IBM ZISC Processor". To overcome this dilemma, 
most researchers agree that an ideal solution should 
combine the performance obtained using specific hardware 
implementation with the flexibility allowed by software 
tools and general purpose chips. 

Since their commercial introduction in the mid-1980s, 
and due to advances in both microelectronic 
technology and specific CAD tools, FPGA devices have 
progressed in an evolutionary and revolutionary way. 
The evolution process has allowed faster and bigger 
FPGAs, better CAD tools and better technical support. 
The revolution process concerns the introduction of high 
performance multipliers, microprocessors and DSP functions. 
This has a direct impact on the FPGA implementation of 
ANNs, and a lot of research has been carried out to 
investigate the use of FPGAs in ANN implementation 
(Amos R. Omondi & Jagath C. Rajapakse, 2006). 

Another attractive key feature of FPGAs is their 
flexibility, which can be obtained at different levels: 
exploitation of the programmability of FPGA, dynamic 
reconfiguration or run time reconfiguration (RTR), 
(Xilinx XAPP290, 2004) and the application of the 
design for reuse concept (Keating, Michael; Bricaud, 
Pierre, 2002). 

However, a big disadvantage of FPGAs is the low 
level hardware oriented programming model needed to 
fully exploit the FPGA's potential performance. 

High level VHDL based synthesis tools have been 
proposed to bridge the gap between the high level 
application requirements and the low level FPGA 
hardware, but these tools are not algorithm or 
application specific. Thus, special concepts need to be 
developed for automatic ANN implementation before 
using synthesis tools. 

In this paper, we present a high level design 
methodology for ANN implementation that attempts to build a 
bridge between the synthesis tool and the ANN design 
requirements. This method offers high flexibility in 
the design while meeting speed/area performance 
constraints. Three implementation choices for the 
back propagation based ANN are considered: 
the off-chip implementation, the global on-chip 
implementation and the dynamic reconfiguration 
of the ANN. 

To achieve our goal, a design for reuse strategy 
has been applied. To validate our approach, three case 
studies are considered using the Virtex-II and Virtex-4 
FPGA devices. A comparative study is done and new 
conclusions are given. 



BACKGROUND 

In this section, a theoretical presentation of the multilayer 
perceptron (MLP) based back propagation algorithm 
is given. Then, the works most closely related to 
high level design methodology and ANN 
frameworks are discussed. 

Theoretical Background of the Back 
Propagation Algorithm 

Back propagation is one of the well known algorithms 
used to train the MLP network 
in a supervised mode. The MLP is executed in three 
phases: the feed forward phase, the error calculation 
phase and the synaptic weight updating phase 
(Freeman, J. A. and Skapura, D. M., 1991). 

In the feed forward phase, a pattern $x$ is applied 
to the input layer and the resulting signal is forward 
propagated through the network until the final outputs 
have been calculated; for each $i$ (index of neuron) and 
$j$ (index of layer): 

$$u_i^j = \sum_k w_{ik}^j \, o_k^{j-1} \qquad (1)$$ 

$$o_i^j = f(u_i^j) = \frac{1}{1 + \exp(-u_i^j)} \qquad (2)$$ 

where $u_i^j$ is the sum of the inputs weighted by the 
synaptic weights (with $o_k^0 = x_k$ for the input pattern) 
and $o_i^j$ is the output of the sigmoid activation function. 

The error calculation step computes the local error 
$\delta$ for each layer, starting from the output back to the input: 

$$\delta_i^j = f'(u_i^j) \, (d_i - o_i^j) \qquad (3)$$ 

$$\delta_i^j = f'(u_i^j) \sum_k \delta_k^{j+1} \, w_{ki}^{j+1} \qquad (4)$$ 

where $d_i$ is the desired output and $f'$ is the derivative 
of $f$. Equation (3) applies to the output layer and 
equation (4) to the hidden layers. 

The weight update step computes the weight updates 
according to: 

$$w_{ik}^j(t + 1) = w_{ik}^j(t) + \Delta w_{ik}^j(t) \qquad (5)$$ 

$$\Delta w_{ik}^j(t) = \eta \, \delta_i^j \, o_k^{j-1} \qquad (6)$$ 

where $\eta$ is the learning factor, $\Delta w$ the variation of the 
weights and $j$ the index of the layer. 
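
The three phases can be followed in a short software sketch. The pure-Python function below performs one training step for a small bias-free MLP; the comments point to equations (1) to (6). The network size, initial weights, training pattern and learning factor are hypothetical values chosen for the illustration, and the sketch says nothing about how the phases are mapped to hardware.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def train_step(weights, x, d, eta=0.1):
    """One back propagation step for a bias-free MLP.

    weights[j][i][k] is the weight from neuron k of layer j to neuron i
    of layer j + 1. Returns the updated weights and the network output
    computed with the old weights.
    """
    # Feed forward phase: equations (1) and (2).
    outputs = [list(x)]
    for W in weights:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev)))
                        for row in W])

    # Error calculation phase. For the sigmoid, f'(u) = o * (1 - o).
    out = outputs[-1]
    deltas = [[o * (1 - o) * (t - o) for o, t in zip(out, d)]]   # eq. (3)
    for j in range(len(weights) - 1, 0, -1):
        W_next, delta_next = weights[j], deltas[0]
        hidden = outputs[j]
        back = [sum(W_next[i][k] * delta_next[i] for i in range(len(W_next)))
                for k in range(len(hidden))]
        deltas.insert(0, [o * (1 - o) * b
                          for o, b in zip(hidden, back)])        # eq. (4)

    # Weight update phase: equations (5) and (6).
    new_weights = [
        [[w + eta * dl * o for w, o in zip(row, prev)]
         for row, dl in zip(W, delta)]
        for W, delta, prev in zip(weights, deltas, outputs[:-1])
    ]
    return new_weights, out


# Hypothetical (2, 2, 1) network and a single training pattern.
weights = [[[0.1, -0.2], [0.3, 0.4]], [[0.5, -0.5]]]
x, d = [1.0, 0.0], [1.0]

new_weights, y_before = train_step(weights, x, d)
_, y_after = train_step(new_weights, x, d)
print(y_before[0], y_after[0])
```

After one update the output for this pattern moves closer to the desired value, which is the behaviour expected from a gradient step with a small learning factor.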

Background on ANN Frameworks 

The works most closely related to ANN frameworks are 
presented by (F. Schurmann et al., 2002), (M. Diepenhorst 
et al., 1999), and (J. Zhu et al., 1999). 

On the other hand, with the increasing complexity 
of FPGA circuits, core-based synthesis methodology 
is proposed as a new trend for efficient hardware 
implementation on FPGAs. In these tools, a library 
of pre-designed IP ("Intellectual Property") cores is 
proposed. Examples can be found in (Xilinx Core 
Generator reference) and (Opencores reference). 

In the core based design methodology, efficient 
reuse is derived from the parameterized design with 
VHDL and its many flexible constructs and characteristics 
(i.e. abstraction, encapsulation, inheritance 
and reuse through attributes, packages, procedures and 
functions). Besides this, the reuse concept is well suited 
to highly regular and repetitive structures such as neural 
networks. However, despite all these advantages, 
little attention has been paid to applying design for 
reuse to ANNs. 

In this context our paper presents a new high level 
design methodology based upon the use of the design 
for reuse concept for ANNs. 

In order to achieve this goal, the design must fulfill 
the following requirements (Keating, Michael; Bricaud, 
Pierre, 2002): 

• The design must be block-based. 
• The design must be reconfigurable to meet the 
requirements of many different applications. 
• The design must use standard interfaces. 
• The code must be synthesizable at the RTL 
level. 
• The design must be verified. 
• The design must have robust scripts and must be 
well documented. 



PRESENTATION OF THE PROPOSED 
DESIGN APPROACH 

The proposed design approach is shown in Fig. 1 as a 
process flow. The methodology 
is based on a top down design approach in which 
the designer/user is guided step by step through the design 
process of the ANN. 

First, the user is asked to select the dimension 
of the network. The next step involves selection of 
the ANN implementation choice: the off-chip 
implementation, the global on-chip implementation 
or implementation using run time reconfiguration 
(RTR). A core is then generated for each type of 
implementation. 

At this level, the user/designer can fix the parameters 
of the network, i.e. the number of neurons in each layer, 
synaptic values, multiplier type, data representation and 
precision. At the low level, all the IP cores that construct 
the neuron are generated automatically from the library 
of the synthesis tool, which in our case is MENTOR 
GRAPHICS (Mentor Graphics user guide reference), 
and which also integrates the Xilinx IP Core Generator. 
In addition, for each IP core, a graphical interface is 
generated to fix its parameters. Thus, the user/designer 
can change the performance and architecture of the network 
by changing the IP cores that are stored in the library. 
Then VHDL code at the register transfer level (RTL) 
is generated for synthesis. Functional simulation 
is required before synthesis. The result is a netlist file 
ready for place and route, followed by final FPGA 
prototyping on a board. Documentation is available at 
each level of the design process and the code is well 
commented. Thus, the design for reuse requirements are 
applied throughout the design process. In what follows, 
each implementation type is presented. 

Figure 1. The proposed design methodology 

The Feed Forward Off-Chip 
Implementation 

Fig. 2 shows a top view of the feed forward core, which 
is composed of a data path module and a control module. 
At the top level these two modules are represented 
by black boxes and only the neural network input and 
output signals are shown to the user/designer. 

By clicking inside the boxes, we can get access to 
the network architecture, which is composed of three 
layers represented by black boxes as shown in Fig. 3 
(left side). By clicking inside each box, we can get 
access to the layer architecture, which is composed of 
black boxes representing the neurons as shown in Fig. 
3 (right side); and by clicking inside each neuron's box 
we can get access to the neuron hardware architecture 
as shown in Fig. 4. 

Each neuron implements the accumulated weighted 
sum of equation (1) and the activation function of 
equation (2). As shown in Fig. 4, the hardware model 
of the neuron is mainly based on: 

• A memory circuit where the final values of the 
synaptic weights are stored 
• A multiplier circuit (MULT) which computes the 
product of the stored synaptic weights with the input 
data 
• An accumulator circuit (ACUM) which computes 
the sum of the above products 
• A circuit that approximates the activation function 
(for example, a linear function or a sigmoid function) 
• A multiplexer circuit (MUX) in the case of serial 
transfer between inputs in the same neuron 

The neural network architecture has the following 
properties: 

• Computation between layers is done serially 
• For the same layer, neurons are computed in 
parallel 
• For the same neuron, only one multiplier and one 
accumulator (MULT + ACUM = MAC) are used 
to compute the product sum 
• Each multiplier is connected to a memory. The 
depth of each memory is equal to the number of 
neurons constituting the layer 
• The whole network is controlled by a control unit 
module 

Each circuit that constructs the neuron is an IP 
("Intellectual Property") core that can be generated 
from the Xilinx Core Generator. 

Figure 2. The feed forward core module using the Mentor Graphics design tool 

Figure 3. The ANN architecture 

Figure 4. Equivalent hardware architecture of the neuron 

The feed forward control module is composed 
of three phases: control of the neuron, control of the 
layer and control of the network. Because the neurons 
work in parallel, control of the 
layer is similar to the control of the neuron plus the 
multiplexer's control. Control of the neuron is divided 
into four phases: start, initialization, synaptic 
multiplication/accumulation and storage of the weighted sum. 
The first state diagram designed for the feed forward 
control module was based on a Moore 
machine, in which the system varies only when its state 
changes. The drawback of this machine is that it is not 
generic. For example, (load=0, reset=0) allows the 
accumulator to add a value present at the input register. 
This accumulation must be repeated as many times 
as there are neurons in the previous layer. Thus, 
if we change the number of neurons from one layer to 
another, we have to change the whole state flow of the 
control module. To overcome this problem, the Moore 
machine is replaced by a Mealy machine to which 
we add a counter with a generic value M and 
a transition variable Max such that: 



ifoutput 
else 



-M 



> Max 
■ Max 



where the value of M is done equal to the number of 
neuron. 
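
The role of the counter and of the transition variable Max can be illustrated with a small software model of the neuron control. The Python sketch below steps through the four phases (start, initialization, multiplication/accumulation, storage); the state names and the way the counter is compared with M are assumptions made for the illustration, not the actual VHDL of the design, and M is assumed to be at least 1.

```python
def run_neuron(weights, inputs):
    """Software model of the neuron control with a generic counter.

    M is taken from the number of weights, so the same controller works
    for any number of neurons in the previous layer. Returns the
    accumulated weighted sum and the sequence of visited states.
    """
    M = len(weights)                 # generic value of the counter
    state, count, acc = "START", 0, 0.0
    trace = []
    while state != "DONE":
        trace.append(state)
        if state == "START":
            state = "INIT"
        elif state == "INIT":
            count, acc = 0, 0.0      # clear the counter and the accumulator
            state = "MAC"
        elif state == "MAC":
            acc += weights[count] * inputs[count]   # one multiply-accumulate
            count += 1
            max_flag = (count == M)  # transition variable Max
            state = "STORE" if max_flag else "MAC"
        elif state == "STORE":
            state = "DONE"           # the weighted sum is now available
    return acc, trace


acc3, trace3 = run_neuron([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
acc2, trace2 = run_neuron([0.5, 0.5], [2.0, 4.0])
print(acc3, trace3)
print(acc2, trace2)
```

Changing the number of inputs only changes how long the controller stays in the accumulation state; the state diagram itself is unchanged, which is the property the counter is meant to provide.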

By using this strategy, we obtain an architecture 
that has two important key features: genericity and 
flexibility. Genericity is related to the data word size, 
precision, and memory depth, which are kept as generic 
parameters at the top level of the VHDL description. 
The flexibility feature is related to the size of the 
network (the number of neurons in each layer): 
it is possible to add neurons by simple copy/paste of 
the neuron boxes or cores, and it is also possible to 
remove them by a simple cut operation on the boxes. It 
is also possible to use other IP cores from the library 
(for example, replacing a parallel MULT with a pipelined MULT) 
to change the performance of the network without 
changing the VHDL code. Thus, the design for reuse 
concept is applied. 

The Direct On-Chip Implementation 
Strategy 

In this section, we propose the equivalent architecture 
for implementation of the three successive phases of 
the back propagation algorithm. Fig. 5 depicts the 
proposed architecture, which is composed of a feed forward 
module, an Error-calculation module and an Update 
module. The set of the three modules is controlled 
by a global control unit. The feed forward module 
computes equations (1) and (2), the Error module 
computes equations (3) and (4), and the Update module 
computes equations (5) and (6). Each module exhibits 
a high degree of regularity, modularity 
and repeatability, which makes the whole ANN a good 
candidate for the application of the design for reuse 
concept. As in the off-chip implementation case, the 
control unit was first designed using a Moore 
machine that integrates control of the three modules: 
feed forward, error and update. In order to 
achieve reuse, we have replaced the Moore machine 
by a Mealy machine. Thus, the size of the network can 
be modified by simple copy/paste or remove operations 
on the boxes. 

The Run Time Reconfiguration Strategy 

Our strategy for run time reconfiguration proceeds in 
the following steps. First, the feed forward and the global 
control modules are configured. The results are stored 
in the Bus macro module of the Virtex FPGA device. 
In the next step, the feed forward module is removed 
from the FPGA and the Update and Error modules are 
configured. The generated results are stored in the Bus 
macro modules and the same procedure is applied to 
the next training example of the ANN. A more detailed 
description is given in (N. Izeboudjen et al., 2007). 

Performance Evaluation 

In this section, we discuss the performance of the three 
implementation choices of the back propagation 
algorithm. The parameters to be considered are the number 
of configurable logic blocks (CLB), the time response 
(TR) and the number of million connections per second 
(MCPS). A comparison of these parameters is done 
between the Virtex-II and Virtex-4 families. Functional 
simulation is achieved using the ModelSim simulator 
(ModelSim user guide reference). The RTL synthesis 
is achieved using the Mentor Graphics synthesis tool 
(Mentor Graphics synthesis tool user guide reference) 
and for final implementation, the ISE Foundation place 
and route (8.2) tool is used (ISE foundation user guide 
reference). 
Figure 5. Architecture of the BP algorithm 

Our first application is an ANN classifier that is 
used to classify coronary heart diseases. The network 
has been trained off chip using the MATLAB 6.5 tool. 
After training, the dimension of the network as well 
as the synaptic weights were fixed. The network has a 
dimension of (1, 8, 1) and the synaptic weights have a 
data width of 24 bits. For this application we selected 
the XC2V1000 and XC4VLX15 devices, of the 
Virtex-II and Virtex-4 families respectively. Synthesis results 
show that the XC2V1000 circuit consumes 99% of the 
CLBs, with a time response TR = 44.46 ns and 
MCPS = 360. The XC4VLX15 consumes 
82% of the CLBs, with TR = 26.76 ns and MCPS = 597. 
Thus, the XC4VLX15 achieves better performance in 
terms of area (a gain of 19% of CLBs), and both the 
speed ratio and the MCPS ratio are 1.6. 
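
The reported figures for this first application can be cross-checked with a few lines of arithmetic. The sketch below assumes that MCPS is the number of synaptic connections of the network divided by the time response; this definition is an assumption (it is not stated explicitly in the text), but for the (1, 8, 1) network it gives values close to the reported 360 and 597 MCPS.

```python
def connections(layer_sizes):
    """Number of synaptic connections of a fully connected, bias-free MLP."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def mcps(layer_sizes, response_time_ns):
    """Million connections per second, assuming MCPS = connections / TR."""
    # One connection per nanosecond equals 1000 million connections per second.
    return connections(layer_sizes) / response_time_ns * 1e3


network = (1, 8, 1)
mcps_virtex2 = mcps(network, 44.46)   # XC2V1000
mcps_virtex4 = mcps(network, 26.76)   # XC4VLX15
speed_ratio = 44.46 / 26.76

print(connections(network), mcps_virtex2, mcps_virtex4, speed_ratio)
```

Under this assumption the MCPS ratio is necessarily equal to the speed ratio for a fixed network, which is consistent with the two ratios being reported with the same value.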

Our second application is the classical (2, 2, 1) 
"XOR" ANN, which is used as a benchmark for non-linearly 
separable problems. The general on-chip learning 
implementation has been applied to the network. It should 
be mentioned that area constraints could not be met for 
the first devices, the XC2V1000 and the XC4VLX15, 
and we tried several devices until we settled on the 
XC4VLX80 for Virtex-4 and the XC2V8000 for Virtex-II. 
Synthesis results show that the XC2V8000 circuit 
consumes 22% of the CLBs, with a time response 
TR = 59.5 ns and MCPS = 202. The 
XC4VLX80 consumes 30% of the CLBs, with TR = 
47.93 ns and MCPS = 250. From these results we can 
conclude that with the Virtex-II family we gain 8% 
of CLBs in terms of area; this is due to the fact that the 
Virtex-II integrates more multipliers than the Virtex-4, 
in which the MAC component is integrated into the 
DSP48 (the XC4VLX80 has 80 MAC DSP blocks and the 
XC2V8000 has 168 block multipliers). But the Virtex-4 
circuit is faster than the Virtex-II and can achieve more 
MCPS (ratio of about 1.24). The on-chip implementation 
requires a lot of multipliers, which is why we recommend 
using it if the timing constraints are not critical. 

In the third application, three arbitrary networks are 
used to show the performance of RTR compared with the 
global implementation. These are a (3,3,3) network, 
a (16,16,16) network and a (16,64,8) network. The 
results show that when the network is large, 
it is difficult to implement the whole back propagation 
algorithm in one FPGA. With RTR we can achieve more 
than 30% reduction in area and more than 40% increase in 
speed and MCPS. 



FUTURE TRENDS 

The proposed ANN environment is still under construction. 
The design approach is based on the use of pre-designed 
IP cores which are generated from the Xilinx 
Core Generator tool. Our next objective is to enrich 
and enhance the library of IP cores, especially in 
the case of implementation of the activation function 
(sigmoid, linear transfer circuits), and to evaluate and 
compare the performance of the ANN with other 
pre-designed IP cores. 

We also plan to extend the reuse concept 
to other ANN algorithms (Kohonen, Hopfield 
networks). 

Concerning run-time reconfiguration (RTR), 
the next step is to integrate the RTR design approach 
with the PlanAhead design tool (PlanAhead user 
guide reference). 

As future work, we plan to evaluate and analyze the 
cost of the design for reuse concept applied to ANNs. 



CONCLUSION 

In this paper, we have presented a successful 
design approach for FPGA implementation of ANNs. 
We have applied a design for reuse strategy and 
parametric design to achieve our goal. The proposed 
methodology offers high flexibility because the size of 
the ANN can be changed by simple copy/remove of the 
neuron cores. In addition, the format, data widths and 
precision are treated as generic parameters. Thus, 
different applications can be targeted in a reduced design 
time. As for the three applications, the first conclusion 
is that the new Virtex-4 FPGA devices achieve faster 
networks compared to Virtex-II; but with regard to 
area, i.e. the number of CLBs, the Virtex-II is better. Thus, 
in our opinion, the Virtex-II is well suited as a platform 
to experiment with ANN implementations. This can help to 
give new directions for future work. 

KEY TERMS 

ASIC: Acronym for Application Specific Integrated 
Circuit. 

CLB: Acronym for Configurable Logic Block. 

FPGA: Acronym for Field Programmable Gate Array. 

High Level Synthesis: A top-down design methodology 
that transforms an abstract description, such as one 
written in the VHDL language, into a physical 
implementation. 

On-Chip Training: A term that designates the implementation 
of the three phases of the back propagation algorithm 
in one or several chips. 

Off-Chip Training: Training of the network is done 
using software tools like MATLAB, and only the feed 
forward phase (generalisation) is considered. 

RTL: Acronym for Register Transfer Level. 

Run Time Reconfiguration: A solution that makes it possible 
to use the smallest FPGA and to reconfigure it several 
times during the processing. Run time reconfiguration 
can be partial or global. 

VHDL: Acronym for Very High Speed Integrated 
Circuits Hardware Description Language. 

INTRODUCTION 

Biological vision processes are usually characterized 
by the following different phases: 

Awareness: natural or artificial agents operating in 
dynamic environments can benefit from a, possibly 
rough, global description of the surroundings. 
In humans this is referred to as peripheral vision, 
since it derives from stimuli coming from the 
edge of the retina. 

Attention: once an interesting object/event has 
been detected, higher resolution is required to 
set focus on it and plan an appropriate reaction. 
In humans this corresponds to the so-called foveal 
vision, since it originates from the center of the 
retina (fovea). 

Analysis: extraction of detailed information 
about objects of interest, their three-dimensional 
structure and their spatial relationships completes 
the vision process. Achievement of these goals 
requires at least two views of the surrounding scene 
with known geometrical relations. In humans, 
this function is performed exploiting binocular 
(stereo) vision. 

Computer Vision has often tried to emulate natural
systems or, at least, to take inspiration from them. In fact,
different levels of resolution are also useful in machine
vision. In the last decade a number of studies dealing
with multiple cameras at different resolutions have
appeared in the literature. Furthermore, ever-growing
computer performance and the ever-decreasing cost of
video equipment make it possible to develop systems
which rely mostly, or even exclusively, on vision for
navigating and reacting to environmental changes in
real time. Moreover, using vision as the sole sensory
input makes artificial perception closer to human
perception, unlike systems relying on other kinds of
sensors, and allows for the development of more direct
biologically-inspired approaches to interaction with the
external environment (Trullier 1997).

This article presents HOPS (Hybrid Omnidirectional
Pin-hole Sensor), a class of dual camera vision sensors
that try to emphasize the connection between machine
vision and biological vision.



BACKGROUND 

In the last decade some investigations on hybrid dual
camera systems have been performed (Nayar 1997;
Cui 1998; Adorni 2001; Adorni 2002; Adorni 2003;
Scotti 2005; Yao 2006). The joint use of a moving
standard camera and of a catadioptric sensor provides
these sensors with different and complementary
features: while the traditional camera can be used to
acquire detailed information about a limited region of
interest ("foveal vision"), the omnidirectional sensor
provides wide-range, but less detailed, information
about the surroundings ("peripheral vision"). Possible
applications for this class of vision systems are
video surveillance as well as mobile robot navigation
tasks. Moreover, their particular configuration
makes it possible to realize different strategies
to control the orientation of the standard camera; for
example, scattered focus on different objects makes it
possible to perform recognition/classification tasks, while
continuous movements make it possible to track any
interesting moving object. Three-dimensional reconstruction
based on stereo vision is also possible.
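
As a rough illustration of how the omnidirectional view could drive the orientation of the standard camera, the sketch below converts the pixel position of a detected object in the omnidirectional image into a pan angle. It assumes (these are assumptions, not details given in the text) that the mirror axis projects onto the image centre (u0, v0), that the pan axis of the standard camera coincides with the mirror axis, and that pan angle zero is aligned with the image u axis. Both function names are hypothetical.

```python
import math

def pan_angle_from_omni_pixel(u, v, u0, v0):
    """Azimuth (radians) of pixel (u, v) about the image centre (u0, v0).

    Assumption: in a central omnidirectional image whose mirror axis
    projects onto (u0, v0), this azimuth equals the azimuth of the
    observed scene point around the sensor axis.
    """
    return math.atan2(v - v0, u - u0)

def shortest_rotation(current, target):
    """Signed smallest rotation (radians) that takes `current` to `target`."""
    return (target - current + math.pi) % (2 * math.pi) - math.pi

# Hypothetical example: object detected at pixel (320, 340), image centre (320, 240).
target = pan_angle_from_omni_pixel(320, 340, 320, 240)
step = shortest_rotation(math.radians(350), target)
print(math.degrees(target), math.degrees(step))
```

Whether a positive angle means a clockwise or counter-clockwise rotation depends on the direction of the image v axis and on how the camera is mounted, so the sign convention would have to be checked on the actual device.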



HOPS: HYBRID OMNIDIRECTIONAL 
PIN-HOLE SENSOR 

This article is focused on the latest prototype of the
HOPS (Hybrid Omnidirectional Pin-hole Sensor) sensor
(Adorni 2001; Adorni 2002; Adorni 2003; Cagnoni
2007). HOPS is a dual camera vision system that
achieves a high-resolution 360-degree field of view as
well as 3D reconstruction capabilities. The effectiveness
of this hybrid sensor derives from the joint use of
a traditional camera and a central catadioptric camera
which both satisfy the single-viewpoint constraint.
Having two different viewpoints from which the world
is observed, the sensor can therefore act as a stereo
pair, finding effective applications in surveillance and
robot navigation.

To create a handy, versatile system that could meet
the requirements of the whole vision process in a wide
variety of applications, HOPS has been designed to
be considered as a single integrated object: one of the
most direct advantages offered by this is that, once it is
assembled and calibrated, it can be placed and moved
anywhere (for example in the middle of a room ceiling
or on a mobile robot) without any need for further
calibration.

Figure 1 shows the latest two HOPS prototypes. In
the one that has been used for the experiments reported
here, the traditional camera which, in this version, cannot
rotate, has been placed on top and can be pointed
downwards with an appropriate fixed tilt angle to obtain
a high-resolution view of a restricted region close to
the sensor. In the middle, one can see the catadioptric
camera consisting of a traditional camera pointing
upwards to a hyperbolic mirror hanging over it and
held by a Plexiglas cylinder. As can be observed, the
mirror can be moved up and down to permit optimal
positioning (Swaminathan 2001; Strelow 2001) during
calibration.

Moreover, to avoid undesired light reflections on 
the internal surface of the Plexiglas cylinder, a black 
needle has been placed on the mirror apex as suggested 
in (Ishiguro 2001). Finally, in the lower part, some 
circuits generate video synchronization signals and 
allow for external connections. 

The newer version of HOPS (see Figure 1, right)
overcomes some limitations of the present one. It
uses two digital high-resolution Firewire cameras,
in conjunction with mega-pixel lenses characterized
by a very low TV-distortion, to achieve better image
quality. Furthermore, in this new version the traditional
camera hangs from a stepper motor, controlled via a
USB interface, and is therefore able to rotate. This
time the traditional camera has been placed below the
catadioptric part: this keeps wires out of the field of
view of the omnidirectional image and, in surveillance
applications, also makes it possible to see the blind
area of the omnidirectional view caused by the
reflection of the camera on the mirror.

Figure 1. The two latest versions of the HOPS sensor: the one used for experiments (left) and the newest version
(right) which is currently being assembled and tested.

Sensor Calibration 

In order to extract metric information from
two-dimensional images, one must perform a calibration
of the camera and estimate the geometric parameters
needed to describe image formation. After calibration,
the relationships between points on the images and
their real positions in 3D space can be expressed
by mathematical equations which can be used to solve
metric problems.

Sensor calibration can be based on a standard
Photogrammetric Calibration (Kraus 1993; Zhang 2000)
using a heavily structured environment with grids of
points of known coordinates. First, the two cameras are
calibrated independently, before assembling them on the
sensor, to estimate their intrinsic parameters as well as
the radial distortion introduced by the optics. Then, the
mirror is accurately positioned with respect to the camera
in order to achieve single-viewpoint vision for the
catadioptric part of the sensor, as described by
(Benosman 2001). The last, but probably most important,
phase of the calibration is aimed at detecting the geometric
relationships between the traditional image and the
omnidirectional one: once again, a set of known points is
used to estimate the parameters of the mapping.

Notice that the relationships that were computed 
between the two views are constant in time because 
of the sensor structure. In this way, once the calibra- 
tion procedure is over, no external fixed references 
are needed any longer, and one can place the sensor 
anywhere and perform stereo vision tasks without 
needing any further calibration. 

Mirror to Camera Positioning 

To position the hyperbolic mirror with respect to the 
standard camera and achieve the single-viewpoint 
characteristic for the catadioptric part of the sensor, 
one can operate as follows. 

Supposing that the single-viewpoint constraint is
satisfied, and since the mirror profile is known, the
camera calibration data and some simple equations can
be used to calculate the expected projections of any
known 3D point set onto the omnidirectional image.
To verify the correctness of the relative mirror-to-camera
positioning, a calibration box has been built
with grids of known coordinates painted on its inner
walls. Hence, after placing the sensor into it, the mirror
can be manually moved until the grids appearing
on the image taken in real time match the theoretical
ones superimposed over it as they should appear if
the mirror had been correctly placed (see Figure 2).
This is a very cheap method which, however, yields
very good results.

Figure 2. Mirror position calibration: the sensor inside the calibration box (left) and the acquired omnidirectional
image (right) with the correct grid positions superimposed in white.
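
The "simple equations" mentioned above are not given in the text. The following is a minimal sketch of one way the expected projection of a known 3D point could be computed for an ideal hyperboloidal mirror, under stated assumptions: the mirror is the sheet (z + e)^2/a^2 - (x^2 + y^2)/b^2 = 1 with e = sqrt(a^2 + b^2), its internal focus is the origin of the sensor frame, an ideal pinhole camera sits at the external focus (0, 0, -2e) looking along +z, and lens distortion is ignored. The parameter names (a, b, f, u0, v0) and both function names are illustrative, not taken from the original work.

```python
import math

def mirror_point(X, a, b):
    """Point where the ray from world point X towards the internal focus
    (the origin) meets the mirror (z + e)^2/a^2 - (x^2 + y^2)/b^2 = 1."""
    x, y, z = X
    e = math.hypot(a, b)                      # focal half-distance
    rho = math.sqrt(x * x + y * y + z * z)    # distance of X from the viewpoint
    denom = a * rho - e * z
    if denom <= 0:
        raise ValueError("direction lies outside the mirror's field of view")
    lam = b * b / denom                       # scale factor along the ray
    return (lam * x, lam * y, lam * z)

def project_omni(X, a, b, f, u0, v0):
    """Expected pixel of X for an ideal pinhole at the external focus
    (0, 0, -2e), focal length f (pixels), principal point (u0, v0)."""
    px, py, pz = mirror_point(X, a, b)
    depth = pz + 2 * math.hypot(a, b)         # distance along the optical axis
    return (u0 + f * px / depth, v0 + f * py / depth)

# Hypothetical mirror and camera parameters, for illustration only.
print(project_omni((1000.0, 0.0, -200.0), a=3.0, b=4.0, f=500.0, u0=320.0, v0=240.0))
```

Projecting every grid point of the calibration box in this way would give the theoretical pattern to superimpose on the live image; a real implementation would also apply the radial distortion estimated during the independent camera calibration.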

Joint Camera Calibration 

To obtain a fully effective stereo system it is essential
to perform a joint camera calibration to extract information
about the relative positioning of the camera pair.
Usually, the internal reference frame of one of the two
cameras is chosen as the global reference for the pair:
since two different kinds of cameras are available, the
simplest choice is to set the omnidirectional camera's
frame as the global reference for the whole sensor.
Using once again the above-mentioned grids of points,
image pairs (omnidirectional and traditional) of grids
lying on different (parallel) planes with known relative
positions are acquired. Once the 3D coordinates of the
points, referred to the sensor reference frame, have
been estimated through the omnidirectional image,
solving for the geometric constraints between the point
projections in the traditional image makes it possible to
estimate the relative position of the traditional camera.

To take the standard camera rotation into consideration,
its position has to be described by a more complex
transformation than a simple fixed roto-translation: the
geometric and kinematic coupling between the two
cameras has to be understood and modeled with more
parameters. Obviously, this requires that images be
taken with the traditional camera in many different
positions.

After this joint camera calibration, HOPS can be 
used to perform metric measurements on the images, 
obtaining three-dimensional data referred to its own 
global reference frame: this means that no further 
calibrations are needed to take sensor displacements 
into account. 

Perspective Reprojections & Inverse 
Perspective Mapping 

One of the opportunities offered by a perspective image
is the possibility to apply an Inverse Perspective
Mapping (IPM) transformation (Little 1991) to obtain
a different image in which the information content is
homogeneously distributed among all pixels. Since central
catadioptric cameras are characterized by a single
viewpoint, the images acquired by them are perspective
images suitable to be used for IPM. Choosing a virtual
image plane as the new domain for the IPM, a perspective
reprojection similar to traditional images can be
obtained from part of those omnidirectional images.

Figure 3 shows a pair of images acquired by HOPS
and a perspective reconstruction of the omnidirectional
view obtained by applying an IPM to the area
seen by the traditional camera. As can be noticed,
the difference in resolution between the two perspective
views is considerable. Choosing a horizontal plane as
reference for the IPM, it is possible to obtain something
very similar to an orthographic view of that area, usually
referred to as a "bird's eye view". If the floor is used as
reference to perform IPM on both images, it is possible
to extract useful information about objects/obstacles
surrounding the system (Bertozzi 1998).
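
The floor-referenced IPM described above can be sketched for the simplest case, an ideal pinhole camera. The code below is only an illustration under assumptions that are not in the text: the camera is at height h above the floor, tilted down by an angle `tilt` from the horizontal, with focal length f in pixels and principal point (u0, v0); world X points to the camera's right, Y forward along the floor and Z up. All function names are hypothetical. For the omnidirectional image the same ray-plane intersection applies, with rays leaving the mirror's single viewpoint.

```python
import math

def floor_to_pixel(X, Y, h, tilt, f, u0, v0):
    """Pixel at which the floor point (X, Y, 0) is seen, or None if behind the camera."""
    s, c = math.sin(tilt), math.cos(tilt)
    xc = X                    # component along the camera's right axis
    yc = -Y * s + h * c       # component along the camera's down axis
    zc = Y * c + h * s        # component along the optical axis
    if zc <= 0:
        return None
    return (u0 + f * xc / zc, v0 + f * yc / zc)

def pixel_to_floor(u, v, h, tilt, f, u0, v0):
    """Floor point (X, Y) hit by the viewing ray of pixel (u, v), or None above the horizon."""
    s, c = math.sin(tilt), math.cos(tilt)
    dx, dy = (u - u0) / f, (v - v0) / f
    down = dy * c + s         # downward component of the viewing ray
    if down <= 0:
        return None
    t = h / down              # ray parameter at which the floor is reached
    return (t * dx, t * (c - dy * s))

def birds_eye_view(image, h, tilt, f, u0, v0, xs, ys, fill=0):
    """Resample `image` (a list of rows) on the floor grid xs x ys, nearest neighbour."""
    rows, cols = len(image), len(image[0])
    out = []
    for Y in ys:
        line = []
        for X in xs:
            p = floor_to_pixel(X, Y, h, tilt, f, u0, v0)
            value = fill
            if p is not None:
                u, v = round(p[0]), round(p[1])
                if 0 <= v < rows and 0 <= u < cols:
                    value = image[v][u]
            line.append(value)
        out.append(line)
    return out
```

Iterating over floor cells and looking up the source pixel, rather than pushing image pixels onto the floor, is a design choice that leaves no holes in the resulting bird's eye view.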

3D Reconstruction Tests 

To verify the correctness of the calibration process,
an estimation of the positions of points in
three-dimensional space can be performed along with other
tests. After capturing one image from each of the two
views, the points in the test pattern are automatically
detected and, for each one, the light rays from which it
was generated are computed based on the projection
model obtained during calibration. Since the estimated
homologous rays are usually skew lines, the shortest
segment joining the two rays can be found and its
midpoint used as an estimate of the point's 3D position.
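
The midpoint construction described above can be sketched as follows. This is a generic closest-points computation between two lines, not the authors' code; each ray is given by a point on it (for instance the viewpoint of the corresponding camera) and a direction vector, and the function name is hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(p1, d1, p2, d2, eps=1e-12):
    """Midpoint of the shortest segment joining the lines p1 + s*d1 and p2 + t*d2.

    Returns (midpoint, gap), where gap is the length of that segment,
    or None when the two rays are (nearly) parallel.
    """
    w = tuple(x - y for x, y in zip(p1, p2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    den = a * c - b * b
    if den <= eps * a * c:
        return None
    s = (b * e - c * d) / den                 # parameter of the closest point on ray 1
    t = (a * e - b * d) / den                 # parameter of the closest point on ray 2
    q1 = tuple(p + s * v for p, v in zip(p1, d1))
    q2 = tuple(p + t * v for p, v in zip(p2, d2))
    mid = tuple((x + y) / 2 for x, y in zip(q1, q2))
    gap = math.sqrt(sum((x - y) ** 2 for x, y in zip(q1, q2)))
    return mid, gap

# Two skew rays chosen by hand for illustration.
print(triangulate_midpoint((0, 0, 0), (1, 0, 0), (1, -1, 2), (0, 1, 0)))
```

The returned gap is a convenient sanity check: a large distance between the two rays suggests a wrong correspondence or a poor calibration. The sketch treats the rays as infinite lines, so a real implementation may also want to reject solutions with negative s or t.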

In Table 1, results obtained using a 4x3 point
test-pattern with 60 mm between point centers are
reported. Although the origin of the sensor reference
system is physically inaccessible and no high-precision
instruments were available, this pattern was placed as
accurately as possible 390 mm perpendicularly ahead
of the sensor itself (along the y direction in the chosen
reference frame) and centered along the x direction: the
z coordinates of the points in the top row were measured
to be equal to 55 mm. This set-up is reflected by the
estimated values for the first experiment reported in
Table 1. More relevantly, the mean distance between
points was estimated to be 59.45 mm with a standard
deviation σ = 1.14 mm: those values are fully compatible
with the resolution available for measuring distances
on the test pattern and with the mirror resolution (also
limited by image resolution).

In a second experiment, a test-pattern with six points
spaced by 110 mm, located about 1 m ahead, 0.25 m
to the right and a bit below the sensor, has been used.
In the lower part of Table 1 the estimated positions
are shown: the estimated mean distance was 109.09
mm with a standard deviation σ = 8.89 mm. In another
test with the same pattern located 1.3 m ahead, 0.6 m to
the left and 0.5 m below the sensor (see Figure 4) the
estimated mean distance was about 102 mm with a
standard deviation σ = 9.98 mm.

Figure 3. Omnidirectional image (above, left) and traditional image (above, right) acquired by HOPS. Below a
perspective reconstruction of part of the omnidirectional one is shown.

It should be noticed that, at those distances, taking
into account image resolution as well as the mirror
profile, the sensor resolution is of the same order of
magnitude as the errors obtained. Furthermore, the
method used to find the center of circles is sensitive to
luminance and contrast variations: substituting circles
with adjacent alternate black and white squares and
using a corner detector capable of sub-pixel accuracy
would probably yield better results.



FUTURE TRENDS 

A field which nowadays draws great interest is autonomous
vehicle navigation. Even if at the moment there
are still many problems to be solved before seeing
autonomous public vehicles, industrial applications
are already possible. Since results in omnidirectional
visual servoing and ego-motion estimation are also
applicable to hybrid dual camera systems, and many more
opportunities are offered by the presence of a second
high-resolution view, the use of such devices in this field
is desirable. Even if most applications of these systems
are related to surveillance, they could be applied
even more directly to robot-aided human activities,
since robots/vehicles involved in these situations are
less critical and their controllability is easier.

Table 1. 3D estimation results: the tables show the estimated positions obtained. The diagrams below them show
the estimated distances between points on the test-pattern. All values are in mm.

Figure 4. Omnidirectional image (left) and traditional image (right) acquired for a 3D stereo estimation test



CONCLUSIONS 

The Hybrid Omnidirectional Pin-hole Sensor (HOPS) 
dual camera system has been described. Since its joint 
camera calibration leads to a fully calibrated hybrid 
stereo pair from which 3D information can be extracted, 
HOPS suits several kinds of applications. For example, 
it can be used for surveillance and robot self-localiza- 
tion or obstacle detection, offering the possibility to 
integrate stereo sensing with peripheral/foveal active 
vision strategies: once objects or regions of interest 
are localized on the wide-range sensor, the traditional 
camera can be used to enhance the resolution with 
which these areas can be analyzed. 

Tracking of multiple objects/people relying on 
high-resolution images for recognition and access 
control or estimating velocity, dimensions and tra- 
jectories are some examples of surveillance tasks for 
which HOPS is suitable. Accurate obstacle detection, 
landmark localization, robust ego-motion estimation 
or three-dimensional environment reconstruction are 
other examples of possible applications related to 
(autonomous/holonomous) robot navigation in semi- 
structured or completely unstructured environments. 
Some preliminary experiments addressing both
surveillance and robot navigation tasks have been
performed with encouraging results.

KEY TERMS

Camera Calibration: A procedure used to obtain
geometrical information about image formation in a
specific camera, essential to relate metric distances on
the image to distances in the real world. In any case, some
a priori information is needed to reconstruct the third
dimension from only one image.

Holonomous Robot: A robot with an unconstrained
freedom of movement with no preferential direction.
This means that, from a standing position, it can move
as easily in any direction.

Inverse Perspective Mapping (IPM): A procedure
which allows the perspective effect to be removed
from an image by homogeneously redistributing the
information content of the image plane into a new
two-dimensional domain.

Lens Distortion: Optical errors in camera lenses,
usually due to mechanical misalignment of their parts,
which can cause straight lines in the observed scene to
appear curved in the captured image. The deviation between
the theoretical image and the actual one is mostly to
be attributed to lens distortion.

Pin-Hole Camera: A camera that uses a tiny hole 
(the pin-hole) to convey all rays from the observed 
scene to the image plane. The smaller the pin-hole, the 
sharper the picture. Pin-hole cameras achieve a poten- 
tially infinite depth of field. Because of its geometric 
simplicity, the "pin-hole model" is used to describe 
most traditional cameras. 

Single Viewpoint Constraint: When all incom- 
ing principal light rays of a lens intersect at a single 
point, an image with a non-distorted metric content is 
obtained. In this case all information contained in this 
image is seen from this view-point. 

Visual Servoing: An approach to robot control based
on visual perception: a vision system extracts information
from the surrounding environment to localize the
robot and consequently servo its position.







Hybrid Dual Camera Vision Systems 



Stefano Cagnoni

Università degli Studi di Parma, Italy

Monica Mordonini

Università degli Studi di Parma, Italy

Luca Mussi

Università degli Studi di Perugia, Italy

Giovanni Adorni

Università degli Studi di Genova, Italy



INTRODUCTION 

Many of the known visual systems in nature are char- 
acterized by a wide field of view allowing animals to 
keep the whole surrounding environment under control. 
In this sense, dragonflies are one of the best examples: 
their compound eyes are made up of thousands of sepa- 
rate light-sensing organs arranged to give nearly a 360° 
field of vision. However, animals with eyes on the sides
of their head have high periscopy but low binocularity,
that is, their views overlap very little. In contrast,
raptors' eyes have a central part that permits them to
see faraway details with impressive resolution, and
their views overlap by about ninety degrees. Those
characteristics allow for a globally wide field of view
and for accurate stereoscopic vision at the same time,
which in turn allows for determination of distance, leading
to the ability to develop a sharp, three-dimensional
image of a large portion of their view.

In mobile robotics applications, autonomous robots 
are required to react to visual stimuli that may come 
from any direction at any moment of their activity. In 
surveillance applications, the opportunity to obtain 
a field of view as wide as possible is also a critical 
requirement. For these reasons, a growing interest in
omnidirectional vision systems (Benosman 2001),
which is still a particularly intriguing research field, has
emerged. On the other hand, object/pattern recognition
and classification tasks impose opposite requirements,
high resolution, high accuracy and low distortion being
possibly the most important ones. Finally, extraction of
three-dimensional information can usually be achieved
by vision systems that combine the use of at least two
sensors at the same time.



This article presents the class of hybrid dual camera
vision systems. This kind of sensor, inspired by existing
visual systems in nature, combines an omnidirectional
sensor with a perspective moving camera. In this way
it is possible to observe the whole surrounding scene at
low resolution, while, at the same time, the perspective
camera can be directed to focus on objects of interest
with higher resolution.



BACKGROUND 

There are essentially two ways to observe a very wide
area. It is possible to use many cameras pointed at
non-overlapping areas or, conversely, a single camera with
a wide field of view. In the former case, the amount of
data to be analyzed is much bigger than in the latter one.
In addition, calibration and synchronization problems
for the camera network have to be faced. On the other
hand, in the second approach the system is cheaper and
easy to calibrate, while the analysis of a single image is
straightforward. In this case, however, the disadvantage
is a loss in the resolution at which object details are seen,
since a wider field of view is projected onto the same
area of the video sensor and thus described with the
same number of pixels as a normal one. This has been
clear since the mid 1990s, with the earlier experiments with
omnidirectional vision systems. Consequently, a number
of studies arose on omnidirectional sensors "enriched"
with at least one second source of environmental data,
to achieve wide fields of view without loss of
resolution. For example, some work oriented to robotics
applications has dealt with a catadioptric camera
working in conjunction with a laser scanner as, to cite
only a few recent examples, in (Kobilarov 2006; Mei 2006).
More surveillance-oriented work has involved
multi-camera systems, joining omnidirectional and
traditional cameras, while other work has dealt with
geometric aspects of hybrid stereo/multi-view relations,
as in (Sturm 2002; Chen 2003).
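
The loss of resolution mentioned above can be made concrete with a little arithmetic. The numbers in this sketch are hypothetical and chosen only for illustration: a perspective camera spreading a 45-degree field of view over 640 pixels, and an omnidirectional image in which the full 360 degrees of azimuth lie along a circle of radius 200 pixels.

```python
import math

def pixels_per_degree(n_pixels, fov_deg):
    """Average number of pixels available for each degree of field of view."""
    return n_pixels / fov_deg

# Hypothetical numbers, for illustration only.
perspective = pixels_per_degree(640, 45.0)            # 640-pixel-wide image, 45-degree lens
ring_radius_px = 200                                   # circle in the omnidirectional image
omni = pixels_per_degree(2 * math.pi * ring_radius_px, 360.0)
print(f"perspective: {perspective:.1f} px/deg, omnidirectional: {omni:.1f} px/deg")
```

With these assumed values the perspective view has roughly four times as many pixels per degree as the omnidirectional one, which is the gap a second, higher-resolution source of data is meant to close.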

The natural choice to develop a cheap vision system
with both omni-sight and high-detail resolution
is to couple an omnidirectional camera with a moving
traditional camera. In the following, we will focus on this
kind of system, usually called a "hybrid dual
camera system".

Omnidirectional Vision 

There are two ways to obtain omnidirectional images.
With a special kind of lens mounted on a standard
camera, called a "fisheye lens", it is possible to obtain
a field of view of up to about 180 degrees in both
directions. The widest fisheye lens ever produced featured
a 220-degree field of view. Unfortunately, it is very
difficult to design a fisheye lens that satisfies the single
viewpoint constraint. Although images acquired by
fisheye lenses may prove to be good enough for some
visualization applications, the distortion compensation
issue has not been solved yet, and the high unit cost is
a major drawback for their widespread application.

Combining a rectilinear lens with a mirror is the
other way to obtain omnidirectional views. In the
so-called "catadioptric lenses" a convex mirror is placed
in front of a rectilinear lens, achieving a field of view
possibly even larger than with a fisheye lens. Using
particularly shaped mirrors precisely placed with
respect to the camera, it is also possible to satisfy the
single viewpoint constraint and thus to obtain an image
which is perspectively correct. Moreover, catadioptric
lenses are usually cheaper than fisheye ones. In Figure
1 a comparison between these two kinds of lenses can
be seen.



OVERVIEW OF HYBRID DUAL CAMERA 
SYSTEMS 

The first work concerning hybrid vision sensors is
probably the one described in (Nayar 1997), referred
to as "Omnidirectional Pan/Tilt/Zoom System", where
the PTZ unit was guided by inputs obtained from the
omnidirectional view. The next year (Cui 1998) presented
a distributed system for indoor monitoring: a
peripheral camera was calibrated to estimate the distance
between a target and the projection of the camera on
the floor. In this way, they were able to precisely direct
the foveal sensor, of known position, to the target and
track it. A hybrid system for obstacle detection in robot
navigation was described in (Adorni 2001) a few years
later. In this work, a catadioptric camera was calibrated
along with a perspective one as a single sensor: the
calibration procedure made it possible to compute an Inverse
Perspective Mapping (IPM) (Little 1991) based on a
reference plane, the floor, for both images and hence,
thanks to the cameras' disparity, to detect obstacles
by computing the difference between the two images.
While this was possible only within the common field
of view of the two cameras, awareness or even tasks
such as ego-motion estimation were potentially possible
thanks to the omni-view. This system was further
improved and mainly tested in RoboCup¹ applications
(Adorni 2002; Adorni 2003; Cagnoni 2007). Figure 2
shows a pair of images acquired with such a system.

Figure 1. Comparison between image formation in fisheye lenses (left) and catadioptric lenses (right)

Figure 2. A pair of images acquired with the hybrid system described in (Cagnoni 2007). The omnidirectional
image (left) and the perspective image (right). The different resolution of the two images is clearly visible.
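
The obstacle detection idea attributed above to (Adorni 2001) relies on a simple property: after both images have been remapped onto the floor by IPM, points lying on the floor fall on the same cell in both remapped images, while anything rising above the floor is remapped differently from the two viewpoints. A minimal sketch of the comparison step follows; it assumes the two IPM images are already available as equally sized grids of grey levels, and the function name, the threshold and the sample data are purely illustrative.

```python
def obstacle_mask(ipm_a, ipm_b, threshold):
    """Mark the floor cells where the two IPM images disagree by more than `threshold`."""
    return [
        [abs(pa - pb) > threshold for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(ipm_a, ipm_b)
    ]

# Hand-made 3x4 grids: the views agree on the floor and differ where an object stands.
from_omni = [[10, 10, 10, 10],
             [10, 90, 90, 10],
             [10, 10, 10, 10]]
from_perspective = [[10, 10, 10, 10],
                    [10, 12, 10, 10],
                    [10, 10, 85, 10]]
for row in obstacle_mask(from_omni, from_perspective, threshold=30):
    print("".join("#" if cell else "." for cell in row))
```

A practical system would normally clean the resulting mask, for example by discarding isolated cells, before interpreting it as obstacles.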

Some recent work has concentrated on using dual
camera systems for surveillance applications. In (Scotti
2005), when an alarm is detected on the omnidirectional
sensor, the PTZ camera is triggered and
the two views start to track the target autonomously.
Acquired video sequences and other metadata, like
object classification information, are then used to
update a distributed database to be queried later by
users. Similarly in (Yao 2006), after the PTZ camera
is triggered by the omnidirectional one, the target is
tracked independently on the two views, but then a
modified Kalman filter is used to perform data fusion:
this approach achieves an improved tracking accuracy
and makes it possible to resolve occasional occlusions,
leading to a robust surveillance system.
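
The modified Kalman filter of (Yao 2006) is not detailed here. As a much simpler stand-in, the sketch below shows only the basic measurement-fusion step of a standard scalar Kalman filter, combining two independent estimates of the same quantity (for example one coordinate of the target as seen from each view) weighted by their variances. The function name and the numbers are hypothetical.

```python
def fuse(x1, var1, x2, var2):
    """Fuse two independent estimates of the same scalar quantity.

    This is the standard Kalman measurement update with x1 as the prior
    and x2 as the new measurement; it is not the modified filter of Yao 2006.
    """
    gain = var1 / (var1 + var2)        # how much to trust the second estimate
    x = x1 + gain * (x2 - x1)
    var = (1.0 - gain) * var1          # fused variance is below both inputs
    return x, var

# Hypothetical example: a coarse omnidirectional estimate and a finer PTZ estimate.
print(fuse(10.0, 9.0, 12.0, 1.0))
```

Because the fused variance is smaller than either input variance, the combined track is at least as certain as the better of the two views, and when one view is occluded the other can simply be used alone.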



FUTURE TRENDS 



Nowadays public order keeping, private property access
control and security video surveillance are reasons for
which we need to surveil wide areas of our environment.
Surveillance is an ever-growing market and automatic
surveillance is an interesting challenge: many projects
are oriented in this direction and in some of them an
important role is already played by hybrid dual camera
systems. The monitoring system installed between Eagle
Pass, Texas, and Piedras Negras, Mexico, by engineers
of the Computer Vision and Robotics Laboratory at
the University of California, San Diego, affiliated
with the California Institute for Telecommunications
and Information Technology, is an example of a very
complex surveillance system in which hybrid dual
camera systems are involved (Hagen 2006). Because
of the competitive cost, the compactness and the
opportunities offered by these systems, they are likely
to be used more and more in the future in intelligent
surveillance systems.

Another field attracting great interest is autonomous
vehicle navigation. Even if at the moment there
are still many problems to be solved before seeing
autonomous public vehicles, industrial applications are
already possible. Since omnidirectional visual servoing
and ego-motion estimation can actually be implemented
also using hybrid dual camera systems, and many more
opportunities are offered by the presence of a second
high-resolution view, their future involvement in this
field is desirable.



CONCLUSIONS 

The class of hybrid dual camera systems has been
described and briefly overviewed. The joint use of a
standard camera and of a catadioptric sensor provides
this kind of sensor with different and complementary
features: while the traditional camera can be used
to acquire detailed information about a limited region
of interest ("foveal vision"), the omnidirectional sensor
provides wide-range, but less detailed, information
about the surroundings ("peripheral vision").

Tracking of multiple objects/people relying on high- 
resolution images for recognition and access control 
or estimating object/people velocity, dimensions and 
trajectory are some examples of possible automatic 
surveillance tasks for which hybrid dual camera systems 
are suitable. Furthermore, their use in (autonomous)
robot navigation allows for accurate obstacle detection,
ego-motion estimation and three-dimensional
environment reconstruction. With one of these sensors
on board, a mobile robot can be provided with all the
information needed to navigate safely in a
dynamic environment.

KEY TERMS

Camera Calibration: A procedure used to obtain
geometrical information about image formation in
a specific camera. After calibration, it is possible to
relate metric distances on the image to distances in the
real world. In any case, a single image is not enough
to reconstruct the third dimension, and some a priori
information is needed to accomplish this.

Catadioptric Camera: A camera that uses catoptric
(reflective) elements, i.e. mirrors, in conjunction with
dioptric (refractive) lenses. Usually the purpose of these
cameras is to achieve a wider field of view than the one
obtained by classical lenses. Even if the field of view
of a lens could be improved with any convex surface
mirror, those of greater interest are conic, spherical,
parabolic and hyperbolic-shaped ones.

Central Catadioptric Camera: A camera that combines
lenses and mirrors to capture a wide field of view
through a central projection (i.e. a single viewpoint).
Most common examples use paraboloidal or hyperboloidal
mirrors. In the former case a telecentric lens is
needed to focus parallel rays reflected by the mirror
and there are no constraints for mirror to camera relative
positioning: the internal focus of the parabola acts as
the unique viewpoint; in the latter case it is possible to
use a normal lens, but mirror to camera positioning is
critical for achieving a single viewpoint: it is essential
that the principal point of the lens coincides with the
external focus of the hyperboloid to let the internal one
be the unique viewpoint for the observed scene.

Omnidirectional Camera: A camera able to see
in all directions. There are essentially two different
methods to obtain a very wide field of view: the older
one involves the use of a special type of lens, usually
referred to as a fisheye lens, while the other one uses
rectilinear lenses in conjunction with mirrors. Lenses
obtained in the latter case are usually called catadioptric
lenses and the camera-lens ensemble is referred to as a
catadioptric camera.

PTZ Camera: A camera able to pan left and right, 
tilt up and down, and zoom. It is usually possible to 
freely control its orientation and zooming status at a 
distance through a computer or a dedicated control 
system. 

Stereo Vision: A visual perception process that 
exploits two different views to achieve depth percep- 
tion. The difference between the two images, usually 
referred to as binocular disparity, is interpreted by the 
brain (or by an artificial intelligent system) as depth. 

Single Viewpoint Constraint: To obtain an image 
with a non-distorted metric content, it is essential that 
all incoming principal light rays of a lens intersect at a 
single point. In this case a fixed viewpoint is obtained 
and all the information contained in an image is seen 
from this point. 



ENDNOTE 

1 Visit http://www.robocup.org for more informa- 
tion. 






Hybrid Meta-Heuristics Based System for 
Dynamic Scheduling 




Ana Maria Madureira 

Polytechnic Institute of Porto, Portugal 



INTRODUCTION 

The complexity of current computer systems has led 
the software engineering, distributed systems and 
management communities to look for inspiration in 
diverse fields, e.g. robotics, artificial intelligence or 
biology, to find new ways of designing and managing 
systems. Hybridization and combination of different 
approaches seems to be a promising research field of 
computational intelligence focusing on the development 
of the next generation of intelligent systems. 

A manufacturing system is naturally dynamic, as shown
by the several kinds of random occurrences and
perturbations that affect working conditions and
requirements over time. In this kind of environment it
is important to be able to adapt existing schedules
efficiently and effectively, on a continuous basis, to
these disturbances while maintaining performance levels.
The application of Meta-Heuristics to this class of
dynamic scheduling problems seems promising.

In this article, we propose a hybrid Meta-Heuristic 
based approach for complex scheduling with several 
manufacturing and assembly operations, in dynamic 
Extended Job-Shop environments. Some self-adapta- 
tion mechanisms are proposed. 



BACKGROUND 

Scheduling Problem 

The planning of Manufacturing Systems frequently
involves solving a large number and variety of
combinatorial optimisation problems that have an
important impact on the performance of manufacturing
organisations. Examples are sequencing and scheduling
problems in manufacturing management, routing and
transportation, layout design and timetabling problems.



Scheduling can be defined as the assignment of 
time-constrained jobs to time-constrained resources 
within a pre-defined time framework, which represents 
the complete time horizon of the schedule. An admis- 
sible schedule will have to satisfy a set of constraints 
imposed on jobs and resources. A scheduling problem
can therefore be seen as a decision-making process that
determines when operations start and which resources
they use. A variety of characteristics and constraints
related to jobs and the production system, such as
operation processing times, release and due dates,
precedence constraints and resource availability, can
affect scheduling decisions (Leung, 2004) (Brucker,
2004) (Blazewicz, Ecker & Trystrams, 2005) (Pinedo,
2005).

Real-world scheduling requirements relate to complex
systems operated in dynamic environments. Such systems
are frequently subject to several kinds of random
occurrences and perturbations, such as new job arrivals,
machine breakdowns, employee sickness, job
cancellations, and changes in due dates and processing
times, which cause prepared schedules to become quickly
outdated and unsuitable. Scheduling in this kind of
environment is known as dynamic scheduling.

Dynamic scheduling problems may be classified as
deterministic, when release times and all other
parameters are known and fixed, or as non-deterministic,
when some or all system and job parameters are
uncertain, for example when jobs arrive randomly to
the system over time.

Traditional heuristic scheduling methods encounter
great difficulties when they are applied to some
real-world situations, for three main reasons.
Firstly, traditional scheduling methods use simplified
and deterministic theoretical models, in which all
problem data are known before scheduling starts.
However, many real-world optimization problems are
dynamic and non-deterministic, and changes may occur
continually. In practice, static scheduling is not
able to react dynamically and rapidly to dynamic
information that was not foreseen in the current
schedule.

Secondly, most of the approximation methods proposed
for the Job-Shop Scheduling Problem (JSSP) are
problem-oriented methods, i.e. developed specifically
for the problem under consideration. Some examples of
this class of methods are priority rules and the
Shifting Bottleneck procedure (Pinedo, 2005).

Finally, traditional scheduling methods are essen- 
tially centralized in the sense that all the computations 
are carried out in a central computing and logic unit. All 
the information concerning every job and every resource 
has to go through this unit. This centralized approach
is especially susceptible to problems of tractability,
because the number of interacting entities that must be
managed together is large and leads to a combinatorial
explosion, particularly since a detailed schedule is
generated over a long time horizon, and planning and
execution are carried out in discrete buckets of time.
Centralized scheduling systems are therefore large,
complex, and difficult to maintain and reconfigure. On
the other hand, many industrial and service processes
are inherently distributed. Consequently, traditional
methods are often too inflexible, costly, and slow to
satisfy the needs of real-world scheduling systems.

Because they exploit problem-specific characteristics,
classical optimisation methods are either insufficient
for the efficient resolution of those problems or
developed only for specific situations (Leung, 2004)
(Brucker, 2004) (Logie, Sabaz & Gruver, 2004)
(Blazewicz, Ecker & Trystrams, 2005) (Pinedo, 2005).

Meta-Heuristics 

As a major departure from classical techniques, a
Meta-Heuristic (MH) method implies a higher-level
strategy that controls lower-level heuristic methods.
Meta-Heuristics exploit not only the problem
characteristics but also ideas based on artificial
intelligence rationale, such as different types of
memory structures and learning mechanisms, as well as
analogies with optimization processes found in nature.

The interest of Meta-Heuristic approaches is that
they converge, in general, to satisfactory solutions
in an effective and efficient way (in terms of computing
time and implementation effort). The family of MH
includes, but is not limited to, Tabu Search, Simulated
Annealing, Soft Computing, Evolutionary Algorithms,
Adaptive Memory procedures, Scatter Search, Ant Colony
Optimization, Swarm Intelligence, and their hybrids.
For literature on this subject, see for example (Glover
& Gary, 2003) and (Gonzalez, 2007).

In the last decades, there has been significant
research interest in Meta-Heuristic approaches for
solving large real-world scheduling problems, which are
often complex, constrained and dynamic. Scheduling
algorithms that achieve good or near-optimal solutions
and can efficiently adapt them to perturbations are, in
most cases, preferable to those that achieve optimal
solutions but cannot perform such an adaptation. The
latter is the case with most algorithms for solving the
so-called static scheduling problem for different
settings of both single-machine and multi-machine
systems. This reality motivated us to concentrate on
tools that can deal with such dynamic, disturbed
scheduling problems, even though, due to the complexity
of these problems, it may not be possible to find
optimal solutions.

Several attempts have been made to modify algorithms
in order to tune them for optimization in a changing
environment. It was observed in all these studies that
a dynamic environment requires an algorithm to maintain
sufficient diversity for continuous adaptation to the
changes of the landscape. Although the interest in
optimization algorithms for dynamic optimization
problems is growing and a number of authors have
proposed an even greater number of new approaches, the
field lacks a general understanding as to suitable
benchmark problems, fair comparisons and measurement of
algorithm quality (Branke, 1999) (Cowling & Johanson,
2002) (Madureira, 2003) (Madureira, Ramos & Silva,
2004) (Aytug, Lawley, McKay, Mohan & Uzsoy, 2005).

In spite of all these previous attempts, the scheduling
problem is still known to be NP-complete. This fact
incites researchers to explore new directions.

Hybrid Intelligent Systems 

Hybridization of intelligent systems is a promising
research field of computational intelligence focusing
on combinations of multiple approaches to develop the
next generation of intelligent systems. An important
stimulus to research in the area of Hybrid Intelligent
Systems is the awareness that combined approaches
will be necessary if the remaining tough problems in
artificial intelligence are to be solved. Meta-Heuristics,
Bio-Inspired Techniques, Neural Computing, Machine
Learning, Fuzzy Logic Systems, Evolutionary Algorithms
and Agent-based Methods, among others, have been
established and have shown their strengths and
drawbacks. Recently, hybrid intelligent systems have
become popular due to their capabilities in handling
several real-world complexities involving imprecision,
uncertainty and vagueness (Boeres, Lima, Vinod &
Rebello, 2003) (Madureira, Ramos & Silva, 2004)
(Bartz-Beielstein, Blesa, Blum, Naujoks, Roli, Rudolph
& Sampels, 2007).



HYBRID META-HEURISTICS BASED 
SCHEDULING SYSTEM 

The purpose of this article is to describe a framework
based on the combination of two Meta-Heuristics,
Tabu Search (TS) and Genetic Algorithms (GA), with
constructive optimization methods for solving a class
of real-world scheduling problems in which the products
(jobs) to be processed have due dates, release times
and different assembly levels. This means that parts
to be assembled may be manufactured in parallel, i.e.
simultaneously.

The problem addressed in this work, which we call the
Extended Job-Shop Scheduling Problem (EJSSP), has
major extensions and differences in relation to the
classic Job-Shop Scheduling Problem. In this work, we
define a job as a manufacturing order for a final item,
which may be Simple or Complex. A Simple Final Item,
like a part, requires a set of operations to be
processed. Complex Final Items, which require the
processing of several operations on a number of parts
followed by assembly operations at several stages, are
also dealt with. Moreover, in practice, the scheduling
environment tends to be dynamic, i.e. new jobs arrive
at unpredictable intervals, machines break down, jobs
can be cancelled, and due dates and processing times
can change frequently (Madureira, 2003) (Madureira,
Ramos & Silva, 2004).

We start by focusing on the solution of dynamic
deterministic EJSSP problems. To solve these we
developed a framework leading to a dynamic scheduling
system whose fundamental scheduling tool is a hybrid
scheduling system with two main pieces of intelligence
(Figure 1).

One such piece combines a TS and GA based method
with a mechanism for inter-machine activity
coordination. The objective of this mechanism is to
coordinate the operation of machines, taking into
account the technological constraints of jobs, i.e. the
precedence relationships among job operations, in order
to obtain good schedules. The other piece is a dynamic
adaptation module that includes mechanisms for
neighbourhood/population regeneration under dynamic
environments, increasing or decreasing its size
according to new job arrivals or cancellations.




Figure 1. Hybrid meta-heuristics based scheduling system

A detailed description of the approach, its methods and
its application to concrete problems can be found in
Madureira (2003).

Pre-Processing Module 

The pre-processing module processes the input
information, namely the problem definition and the
instantiation of algorithm components and parameters,
such as the initial solution and neighbourhood
generation mechanisms, the size of the
neighbourhood/population, the tabu list attributes and
the tabu list length.

Hybrid Scheduling Module 

Initially, we decompose the deterministic EJSSP
problem into a series of deterministic Single
Machine Scheduling Problems (SMSP). We assume
the existence of different and known job release times
r_j, prior to which no processing of the job can be
done, and job due dates d_j. Based on these, release
dates and due dates are determined for each SMSP and,
subsequently, each such problem is solved independently
by a TS or a GA (considering a self-parameterization
issue). Afterwards, the solutions obtained for each
SMSP are integrated to obtain a solution to the main
EJSSP problem instance.
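
The decomposition step can be illustrated with a minimal Python sketch. The text does not state how operation release and due dates are derived, so the rule used here is an assumption: an operation is released once its predecessors in the job could have finished, and it is due early enough to leave room for its successors. The names `Operation`, `operation_windows` and `decompose` are hypothetical, and only jobs with a linear routing (no assembly) are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    machine: str
    duration: float


def operation_windows(ops, job_release, job_due):
    """Assumed rule: release = job release + time of predecessors,
    due = job due date - time of successors (linear routing only)."""
    total = sum(op.duration for op in ops)
    elapsed = 0.0
    windows = {}
    for op in ops:
        release = job_release + elapsed
        elapsed += op.duration
        due = job_due - (total - elapsed)
        windows[op.name] = (release, due)
    return windows


def decompose(jobs):
    """jobs: {job_id: (ops, release, due)} -> one operation list per machine."""
    smsp = {}
    for ops, release, due in jobs.values():
        windows = operation_windows(ops, release, due)
        for op in ops:
            smsp.setdefault(op.machine, []).append((op.name, *windows[op.name]))
    return smsp


jobs = {
    "A": ([Operation("a1", "M1", 2), Operation("a2", "M2", 3),
           Operation("a3", "M1", 1)], 0, 10),
}
smsp = decompose(jobs)
```

Each entry of `smsp` is then an independent single machine problem that a TS or GA could solve on its own.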

The integration of the SMSP solutions may give an
infeasible schedule for the EJSSP. This is why schedule
repairing may be necessary to obtain a feasible
solution. The repairing mechanism, named Inter-Machine
Activity Coordination Mechanism (IMACM), carries
this out. The repair is based on the coordination of
machine activity, taking into account job operation
precedence and other problem constraints. This is
done keeping the job allocation order on each machine
unchanged. The IMACM establishes the starting and
completion times for each operation. It ensures that
the starting time of each operation is the higher of
the following two values:

• the completion time of the immediately preceding
  operation in the job, if there is only one, or the
  highest of all if there are more;
• the completion time of the immediately preceding
  operation on the machine.
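
The start-time rule above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function `coordinate`, its dictionary-based inputs and the deadlock check are hypothetical, and release times are left out so that only the two values named in the rule are compared.

```python
def coordinate(machine_seq, preds, duration):
    """Time operations without changing the order on any machine.

    machine_seq: {machine: [op, ...]} fixed processing order per machine
    preds:       {op: [job predecessors]} (several for assembly operations)
    duration:    {op: processing time}
    """
    start, finish = {}, {}
    next_idx = {m: 0 for m in machine_seq}
    machine_free = {m: 0.0 for m in machine_seq}
    remaining = sum(len(seq) for seq in machine_seq.values())
    while remaining:
        progressed = False
        for m, seq in machine_seq.items():
            i = next_idx[m]
            if i == len(seq):
                continue                      # this machine is fully timed
            op = seq[i]
            job_preds = preds.get(op, [])
            if any(p not in finish for p in job_preds):
                continue                      # wait for job predecessors
            job_ready = max((finish[p] for p in job_preds), default=0.0)
            start[op] = max(job_ready, machine_free[m])
            finish[op] = start[op] + duration[op]
            machine_free[m] = finish[op]
            next_idx[m] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            raise ValueError("machine orders conflict with job precedence")
    return start, finish


machine_seq = {"M1": ["a1", "b2"], "M2": ["b1", "a2"]}
preds = {"a2": ["a1"], "b2": ["b1"]}
duration = {"a1": 3, "a2": 2, "b1": 4, "b2": 1}
start, finish = coordinate(machine_seq, preds, duration)
```

In this example `a2` must wait for machine M2 to finish `b1`, even though its job predecessor `a1` completes earlier, which is exactly the "higher of the two values" case.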



Dynamic Adaptation Module 

For non-deterministic problems some or all parameters
are uncertain, i.e. they are not fixed as we assumed in
the deterministic problem. Non-determinism of variables
has to be taken into account in real-world problems. To
generate acceptable solutions in such circumstances,
our approach starts by generating a predictive schedule
using the available information; then, if perturbations
occur in the system during execution, the schedule
may have to be modified or revised accordingly,
i.e. rescheduling/dynamic adaptation is performed.
Therefore, an important decision in this process is
whether and when rescheduling should happen. The
decision strategies for rescheduling may be grouped
into three categories: continuous, periodic and hybrid
rescheduling. In continuous rescheduling, rescheduling
is done whenever an event modifying the state of the
system occurs. In periodic rescheduling, the current
schedule is modified at regular time intervals,
taking into account the schedule perturbations that
have occurred. Finally, in hybrid rescheduling the
current schedule is modified at regular time intervals
only if some perturbation has occurred.
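
The three decision strategies can be written as a small predicate. This is a sketch that follows the definitions given in the paragraph above; the function name `should_reschedule` and its parameters are hypothetical and are not taken from the described system.

```python
def should_reschedule(strategy, event_pending, now, last_run, period):
    """Decide whether to reschedule at time `now`.

    event_pending: True if a perturbation occurred since the last run
    last_run:      time of the previous rescheduling
    period:        length of the regular interval (periodic and hybrid)
    """
    interval_elapsed = (now - last_run) >= period
    if strategy == "continuous":
        return event_pending                   # react to every event
    if strategy == "periodic":
        return interval_elapsed                # react on the clock only
    if strategy == "hybrid":
        return interval_elapsed and event_pending
    raise ValueError(f"unknown strategy: {strategy}")
```

For instance, with a period of 10 time units, an event arriving 3 units after the last run triggers rescheduling only under the continuous strategy.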

In the scheduling system for EJSSP, dynamic adaptation
is necessary due to two classes of events:

• Partial events, which imply variability in job or
  operation attributes such as processing times,
  due dates and release times.
• Total events, which imply variability in the
  neighbourhood structure, resulting from either new
  job arrivals or job cancellations.

While partial events only require redefining job
attributes and re-evaluating the objective function
of solutions, total events require a change in solution
structure and size, carried out by inserting or
deleting operations, as well as re-evaluation of the
objective function. Therefore, under a total event, the
modification of the current solution is imperative. In
this work, this is carried out by the mechanisms
described in (Madureira, Ramos & Silva, 2004) for SMSP.

Considering the processing times involved and the
high frequency of perturbations, rescheduling all jobs
from the beginning should be avoided. However, if
work has not yet started and time is available, then an
obvious and simple approach to rescheduling is to
restart the scheduling from scratch with a new
modified solution that takes the perturbation into
account, for example a new job arrival. When there is
not enough time to reschedule from scratch, or job
processing has already started, a strategy must be
used that adapts the current schedule according to the
kind of perturbation that has occurred.

The occurrence of a partial event requires the
redefinition of job attributes and a re-evaluation of
the schedule objective function. A change in a job due
date requires the re-calculation of the starting and
completion due times of all the operations of that job.
However, a change in an operation processing time only
requires re-calculation of the starting and completion
due times of the succeeding operations. A new job
arrival requires the definition of the corresponding
operation starting and completion times and a
regenerating mechanism to integrate all its operations
into the respective single machine problems. In the
presence of a job cancellation, the application of a
regenerating mechanism eliminates the job operations
from the SMSP where they appear. After the insertion or
deletion of positions, neighbourhood regeneration is
done by updating the size of the neighbourhood and
ensuring a structure identical to the existing one.
Then the scheduling module can apply the search process
for better solutions starting from the new modified
solution.
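
A minimal sketch of the two kinds of adaptation described above, assuming each single machine problem is held as an ordered list of operation identifiers. The objective used here, total tardiness, is an assumption made for illustration, since the text does not fix the objective function; `total_tardiness`, `cancel_job` and the dictionaries are hypothetical names, and release times are ignored.

```python
def total_tardiness(sequence, duration, due):
    """Objective assumed for illustration: sum of max(0, completion - due)."""
    clock, tardiness = 0.0, 0.0
    for op in sequence:
        clock += duration[op]                  # operations run back to back
        tardiness += max(0.0, clock - due[op])
    return tardiness


def cancel_job(sequences, job_of, job):
    """Total event: drop every operation of `job` from each machine sequence."""
    return {m: [op for op in seq if job_of[op] != job]
            for m, seq in sequences.items()}


duration = {"a1": 2, "b1": 3, "c1": 1}
due = {"a1": 2, "b1": 4, "c1": 10}
job_of = {"a1": "A", "b1": "B", "c1": "C"}
sequences = {"M1": ["a1", "b1", "c1"]}

before = total_tardiness(sequences["M1"], duration, due)
due["b1"] = 5                                  # partial event: due date changes
after = total_tardiness(sequences["M1"], duration, due)
reduced = cancel_job(sequences, job_of, "B")   # total event: job B is cancelled
```

The partial event leaves the sequence untouched and only changes the objective value, while the total event returns shorter sequences from which the search can resume.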

Job Arrival Integration Mechanism 

When a new job arrives to be processed, an integration 
mechanism is needed. This analyses the job precedence 
graph that represents the ordered allocation of machines 
to each job operation, and integrates each operation 
into the respective single machine problem. Two al- 
ternative procedures could be used for each operation: 
either randomly select one position to insert the new 
operation into the current solution/chromosome or use 
some intelligent mechanism to insert this operation in 
the schedules, based on job priority, for example. 
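
The two alternative insertion procedures can be sketched as below. The representation (a list of operation identifiers per machine) and the priority convention (a higher value means a more urgent job) are assumptions for illustration; `insert_random` and `insert_by_priority` are hypothetical names.

```python
import random


def insert_random(sequence, op, rng=random):
    """Insert `op` at a uniformly chosen position, including both ends."""
    pos = rng.randrange(len(sequence) + 1)
    return sequence[:pos] + [op] + sequence[pos:]


def insert_by_priority(sequence, op, priority):
    """Insert `op` before the first operation with a strictly lower priority."""
    for pos, other in enumerate(sequence):
        if priority[other] < priority[op]:
            return sequence[:pos] + [op] + sequence[pos:]
    return sequence + [op]                     # lowest priority goes last


priority = {"a1": 3, "b1": 2, "c1": 1, "x1": 2}
current = ["a1", "b1", "c1"]
by_priority = insert_by_priority(current, "x1", priority)
by_chance = insert_random(current, "x1", random.Random(0))
```

Both functions return a new list, so the current solution stays available if the modified one is rejected. The priority rule only keeps the sequence ordered by priority if it was ordered before the insertion.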

Job Elimination Mechanism

When a job is cancelled, an eliminating mechanism
must be implemented so that the corresponding
position/gene is deleted from the solutions.

Regeneration Mechanisms

After the integration/elimination of operations is
carried out, by inserting/deleting positions/genes in
the current solution/chromosome, population
regeneration is done by updating its size. The
population size for SMSP is proportional to the number
of operations.

After the dynamic adaptation process, the scheduling
method can be applied to search for better solutions
starting from the modified solution.

In this way we propose a hybrid system in which
some self-organization aspects can be considered in
accordance with the problem being solved: the method
and/or parameters can change at run-time, the MH used
can change according to the problem characteristics,
etc.



FUTURE TRENDS

Considering the complexity inherent to manufacturing
systems, dynamic scheduling is considered an excellent
candidate for the application of agent-based
technology. A natural evolution of the approach
proposed above is a Multi-agent Scheduling System
that assumes the existence of several Machine Agents
(which are decision-making entities) distributed inside
the Manufacturing System that interact and cooperate
with other agents in order to obtain optimal or
near-optimal global performance.

The main idea is that a global solution emerges from
the local, autonomous and often conflicting behaviours
of a community of machine agents that solve their
schedules locally and cooperate with other machine
agents (Madureira, Gomes & Santos, 2006). Agents must
be able to learn and manage their internal behaviours
and their relationships with other agents, by
cooperative negotiation in accordance with business
policies defined by the user manager. Some
self-organization aspects could be considered in
accordance with the problem being solved: the method
and/or parameters can change at run-time, the agents
can use different MH according to the problem
characteristics, etc.



CONCLUSION

This article proposes a system architecture that
combines and makes good use of the advantages of two
different Meta-Heuristics: Tabu Search and Genetic
Algorithms.

We believe that a new contribution to the resolution
of more realistic scheduling problems, the Extended
Job-Shop Problems, was described. The particularity of
our approach is the procedure used to schedule
operations: each machine first finds local optimal or
near-optimal solutions, and then interacts with other
machines through cooperation mechanisms as a way to
find an optimal global schedule in dynamic
environments.

The proposed system is prepared to use other Local
Search Meta-Heuristics and to drive schedules based
on practically any performance measure, and it is not
restricted to a specific type of scheduling problem.



KEY TERMS

Cooperation: The practice of individuals or enti- 
ties working together with common goals, instead of 
working separately in competition, and in which the 
success of one is dependent and contingent upon the 
success of the other.

Dynamic Scheduling Systems: Systems that are frequently
subject to several kinds of random occurrences and
perturbations, such as new job arrivals, machine
breakdowns, employee sickness, job cancellations and
changes in due dates and processing times, which cause
prepared schedules to become quickly outdated and
unsuitable.

Evolutionary Computation: A subfield of artificial
intelligence that involves techniques implementing
mechanisms inspired by biological evolution such as
reproduction, mutation, recombination, natural
selection and survival of the fittest.

Genetic Algorithms: Particular class of evolu- 
tionary algorithms that use techniques inspired by 
evolutionary biology such as inheritance, mutation, 
selection, and crossover. 

Hybrid Intelligent Systems: Software systems that
employ a combination of Artificial Intelligence
models, methods and techniques, such as
Evolutionary Computation, Meta-Heuristics,
Multi-Agent Systems, Expert Systems and others.



Meta-Heuristics: A class of powerful and practical
solution techniques for tackling complex, large-scale
combinatorial problems, efficiently producing
high-quality solutions.

Multi-Agent Systems: Systems composed of several
agents, collectively capable of solving complex
problems in a distributed fashion without the need for
each agent to know about the whole problem being
solved.

Scheduling: A decision-making process that determines
when operations start and which resources they use.
A variety of characteristics and constraints related
to jobs and machine environments (Single Machine,
Parallel Machines, Flow-Shop and Job-Shop) can affect
scheduling decisions.

Tabu Search: An approximation method, belonging
to the class of local search techniques, that enhances
the performance of a local search method by using
memory structures (Tabu List).



INTRODUCTION

Crying in babies is a primary communication function,
governed directly by the brain; any alteration in the
normal functioning of the baby's body is reflected
in the cry (Wasz-Hockert, et al., 1968). Based on the
information contained in the cry's wave, the infant's
physical state can be determined, and pathologies can
even be detected in very early stages of life
(Wasz-Hockert, et al., 1970).

To perform this detection, a Fuzzy Relational Neural 
Network (FRNN) is applied. The input features are 
represented by fuzzy membership functions and the 
links between nodes, instead of weights, are represented 
by fuzzy relations (Reyes, 1994). This paper, the
first of a two-part document, describes the architecture
of the Infant Cry Recognition System as well as the FRNN
model. Implementation and testing are reported in the
complementary paper.



BACKGROUND 

The pioneering works on infant cry were initiated by
Wasz-Hockert at the beginning of the 1960s. In one
of those works his research group showed that four
basic types of cry can be identified by listening: pain,
hunger, pleasure and birth. Further studies led to the
development of conceptual models that describe the
anatomical and physiological basis of the production
and neurological control of crying (Bosma, Truby &
Antolop, 1965). Later on, Wasz-Hockert (1970) applied
spectral analysis to identify several types of crying.
Other works showed that there exist significant
differences among the several types of crying, such as
the healthy infant's cry, the pain cry and the
pathological infant's cry. In one study, Petroni used
Neural Networks (Petroni, Malowany, Johnston, and
Stevens, 1995) to differentiate between pain and
no-pain crying. Cano directed several works devoted to
the extraction and automatic classification of acoustic
characteristics of infant cry. In one of those studies,
in 1999, Cano demonstrated the utility of Kohonen's
Self-Organizing Maps in the classification of Infant Cry
Units (Cano-Ortiz, Escobedo-Becerro, 1999) (Cano,
Escobedo and Coello, 1999). More recently, in (Orozco
& Reyes, 2003) we reported the classification of cry
samples from deaf and normal babies with feed-forward
neural networks. In 2004 Cano and his group (Cano,
Escobedo, Ekkel, 2004) reported a radial basis
network (RBN) used to find relevant aspects concerned
with the presence of Central Nervous System (CNS)
diseases. In (Suaste, Reyes, Diaz, and Reyes, 2004)
we showed the implementation of a Fuzzy Relational
Neural Network (FRNN) for Detecting Pathologies by
Infant Cry Recognition.

The study of connectionist models, also known
as Artificial Neural Networks (ANN), has enjoyed
a resurgence of interest after its demise in the 1960s.
Research was focused on evaluating new neural
networks for pattern classification, on training
algorithms using real speech data, and on determining
whether parallel neural network architectures can be
designed to perform efficiently the work required by
complex speech recognition algorithms (Lippmann, 1990).
In the connectionist approach, pattern classification is
done with a multi-layer neural network. A weight is
assigned to every link between neurons in contiguous
layers. In the input layer each neuron receives one of
the features present in the input pattern vectors. Each
neuron in the output layer corresponds to one speech
unit class (word or sub-word). The neural network
associates input patterns to output classes by modeling
the relationship between the two pattern sets. The
pattern is estimated or learned by the network from
a representative sample of input and output patterns
(Morgan and Scofield, 1991) (Pedrycz, 1991). In
order to stabilize the perceptron's behavior, many
researchers have tried to incorporate fuzzy set
theory into neural networks. The theory of fuzzy sets,
developed by Zadeh (Zadeh, 1965), has since
been used to generalize existing techniques and to
develop new algorithms in pattern recognition. Pal
(Pal, 1992a) suggested that, to enable systems to handle
real-life situations, fuzzy sets should be incorporated
into neural networks, and that the increase in the
amount of computation required by this incorporation
is offset by the potential for parallel computation
with high flexibility that fuzzy neural networks have.
Pal proposes how to do data fuzzification, the general
system architecture of a fuzzy neural network, and the
use of 3n-dimensional vectors to represent the fuzzy
membership values of the input features to the primary
linguistic properties low, medium, and high (Pal, 1992a)
(Pal and Mandal, 1992b). On the other hand, the
idea of using a relational neural network as a pattern
classifier was developed by Pedrycz and presented in
(Pedrycz, 1991). As a result of combining the
methodologies proposed by Pal and Pedrycz, in 1994
C. A. Reyes (1994) developed the hybrid model known
as the fuzzy relational neural network (FRNN).



THE AUTOMATIC INFANT CRY 
RECOGNITION PROCESS 

The infant cry automatic classification process is, in
general, a pattern recognition problem, similar to
Automatic Speech Recognition (ASR) (Huang, Acero, Hon,
2001). The goal is to take the wave from the infant's
cry as the input pattern, and at the end obtain the kind
of cry or pathology detected in the baby (Cano,
Escobedo and Coello, 1999) (Ekkel, 2002). Generally, the
process of Automatic Infant Cry Recognition is done in
two steps. The first step is known as signal processing,
or feature extraction, whereas the second is known as
pattern classification. In the acoustical analysis phase,
the cry signal is first normalized and cleaned, and then
it is analyzed to extract the most important features as
a function of time. The set of obtained features is
represented by a vector, which represents a pattern. The
set of all vectors is then used to train the classifier.
Later on, a set of unknown feature vectors is compared
with the acquired knowledge to measure the classification
output efficiency. Figure 1 shows the different stages
of the described recognition process.

Figure 1. Automatic infant cry recognition process

Cry Patterns Classification 

The vectors representing patterns, obtained in the
extraction stage, are later used in the classification
process. There are four basic schools for the solution
of the pattern classification problem: a) Pattern
comparison (dynamic programming), b) Statistical
Models (Hidden Markov Models, HMM), c) Knowledge-based
systems (expert systems), and d) Connectionist
Models (neural networks). In recent years, a new strong
trend of more robust hybrid classifiers has been
emerging. Some of the better known hybrid models result
from the combination of neural and fuzzy approaches
(Jang, 1993) (Lin Chin-Teng, and George Lee, 1996).
For the work shown here, we have implemented a
hybrid model of this type, called the Fuzzy Relational
Neural Network, whose parameters are found through
the application of genetic algorithms. We selected this
kind of model because of its adaptation, learning and
knowledge representation capabilities. Besides, one of
its main functions is to perform pattern recognition.

In an Automatic Infant Cry Classification System,
the goal is to identify a model of an unknown pattern
obtained after the original sound wave is acoustically
analyzed and its dimensionality reduced. So, in this
phase we determine the class or category to which
each cry pattern belongs. The collection of samples,
each of which is represented by a vector of n features,
is divided into two subsets: the training set and the
test set. First, the training set is used to teach the
classifier to distinguish between the different crying
types. Then the test set is used to determine how well
the classifier assigns the corresponding class to a
pattern by means of the classification scheme generated
during training.
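
The division into training and test sets, and the measurement of how well classes are assigned, can be sketched as follows. The split fraction, the fixed seed and the function names `split_samples` and `accuracy` are assumptions for illustration; the text does not state how the samples were partitioned.

```python
import random


def split_samples(samples, labels, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into training and test subsets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


def accuracy(predicted, expected):
    """Fraction of test patterns assigned to their correct class."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


samples = [[float(i), float(i) * 2] for i in range(10)]   # toy feature vectors
labels = ["pain" if i % 2 else "hunger" for i in range(10)]
train, test = split_samples(samples, labels)
```

Only the training pairs would be shown to the classifier; `accuracy` is then computed on its predictions for the held-out test pairs.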



THE FUZZY NEURAL NETWORK 
MODEL 

The system proposed in this work is based upon fuzzy
set operations in both the neural network's structure
and the learning process. Following Pal's idea of a
general recognizer (Pal, 1992a), the model is
divided into two main parts, one for learning and
another for processing, as shown in Figure 2.

Fuzzy Learning 

The fuzzy learning section is composed of three modules,
namely the Linguistic Feature Extractor (LFE),
the Desired Output Estimator (DOE), and the Neural
Network Trainer (NNT). The Linguistic Feature Extractor
takes training samples in the form of n-dimensional
vectors containing n features, and converts them to
Nn-dimensional vectors, where N is the number
of linguistic properties. In this case the linguistic
properties are low, medium, and high. The resulting
3n-dimensional vector is called the Linguistic Properties
Vector (LPV). In this way an input pattern
$F_i = [F_{i1}, F_{i2}, \ldots, F_{in}]$ containing n features may be represented
as (Pal and Mandal, 1992b)



$$F_i = \left[\mu_{low(F_{i1})}(F_i),\ \mu_{medium(F_{i1})}(F_i),\ \mu_{high(F_{i1})}(F_i),\ \ldots,\ \mu_{high(F_{in})}(F_i)\right]$$

The DOE takes each vector from the training
samples and calculates its membership to class k, in
an l-class problem domain. The vector containing the
class membership values is called the Desired Vector
(DV). Both the LPV and DV vectors are used by the Neural
Network Trainer (NNT), which takes them for training
the network.

The neural network has only one input and one
output layer. The input layer is formed by a set of
Nn neurons, with each of them corresponding to one
of the linguistic properties assigned to the n input
features. In the output layer there are l neurons, with
each node corresponding to one of the l classes; in
this implementation, each class represents one type
of crying. There is a link from every node in the input
layer to every node in the output layer. All the connections
are described by means of fuzzy relations
R: X × Y → [0, 1] between the input and output nodes.
The error is represented by the distance between the
actual output and the target or desired output. During
each learning step, once the error has been computed,
the trainer adjusts the relationship values or weights of
the corresponding connections, either until a minimum
error is obtained or a given number of iterations is
completed. The output of the NNT, after the learning
process, is a fuzzy relational matrix (R in Figure 1)
containing the knowledge needed to further map the
unknown input vectors to their corresponding class
during the classification process.

Figure 2. General architecture of the automatic infant cry recognition system



Fuzzy Processing

The fuzzy processing section is formed by three different
modules, namely the Linguistic Feature Extractor
(LFE), the Fuzzy Classifier (FC), and the Decision
Making Module (DMM). The LFE works as the one in
the learning phase, described in the previous section.
The output of this module is an LPV vector, which,
along with the fuzzy relational matrix R, is used by the
Fuzzy Classifier to obtain the actual outputs from
the neural network. The classifier applies the max-min
composition to calculate the output, which is a vector
containing the membership values of the input vector
to each of the classes. Finally, the Decision Making
Module selects the highest value from the classifier
and assigns the corresponding class to the testing vector.

Membership Functions



A membership function maps values in a domain to 
their membership value in a fuzzy set. Several kinds 
of membership functions are available. In the reported 
experiments triangular membership functions were 
used. According to Park, Cae, and Kandel (1992), the
use of more linguistic properties to describe a pattern
point makes a model more accurate, but too many can
make the description impractical. So, here we use seven
linguistic properties: very low, low, more or less low,
medium, more or less high, high, and very high.
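
As an illustration of how a Linguistic Feature Extractor could turn n features into an Nn-dimensional LPV with triangular membership functions, a minimal Python sketch follows. The placement of the triangles (evenly spaced over each feature's range, with neighbouring triangles overlapping) and the names `triangular`, `linguistic_properties_vector`, `f_min` and `f_max` are assumptions made for this example; the chapter does not specify them.

```python
import numpy as np


def triangular(x, left, center, right):
    """Triangular membership: 1 at `center`, falling linearly to 0 at `left` and `right`."""
    x = np.asarray(x, dtype=float)
    rising = (x - left) / (center - left)
    falling = (right - x) / (right - center)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)


def linguistic_properties_vector(features, f_min, f_max, n_props=3):
    """Map n features to an (n_props * n)-dimensional vector of memberships.

    Assumption: the triangles are evenly spaced over [f_min, f_max] of each
    feature; n_props must be at least 2.
    """
    features = np.asarray(features, dtype=float)
    f_min = np.asarray(f_min, dtype=float)
    f_max = np.asarray(f_max, dtype=float)
    scaled = (features - f_min) / (f_max - f_min)   # each feature scaled to [0, 1]
    centers = np.linspace(0.0, 1.0, n_props)        # e.g. low, medium, high
    width = centers[1] - centers[0]
    memberships = [triangular(scaled, c - width, c, c + width) for c in centers]
    # Order the result as [low(F1), medium(F1), high(F1), low(F2), ...]
    return np.stack(memberships, axis=1).ravel()
```

With `n_props=7` the same function would produce the seven properties listed above, again under the even-spacing assumption.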

Desired Membership Values 

Before defining the output membership function, we
define the equation to calculate the weighted distance
of the training pattern $F_i$ to the kth class in an l-class
problem domain, as in (Pal, 1992a):

$$z_{ik} = \sqrt{\sum_{j=1}^{n} \left[\frac{F_{ij} - o_{kj}}{v_{kj}}\right]^{2}}, \quad \text{for } k = 1, \ldots, l$$

where $F_{ij}$ is the jth feature of the ith pattern vector, $o_{kj}$
denotes the mean, and $v_{kj}$ denotes the standard deviation
of the jth feature for the kth class. The membership value
of the ith pattern to class k is defined as follows:

$$\mu_k(F_i) = \frac{1}{1 + \left(\dfrac{z_{ik}}{f_d}\right)^{f_e}}, \quad \mu_k(F_i) \in [0, 1]$$

where $f_e$ is the exponential fuzzy generator, and $f_d$ is
the denominational fuzzy generator controlling the
amount of fuzziness in this class-membership set. In
this case, the higher the distance of the pattern from
a class, the lower its membership to that class. Since
the training data have fuzzy class boundaries, a pattern
point usually belongs to more than one class at
different degrees.
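
The two formulas above can be sketched in a few lines of Python. This is only an illustration of the weighted distance and the class membership as described in the text; the function name `desired_vector` and the default values of `fd` and `fe` are hypothetical and not taken from the chapter.

```python
import numpy as np


def desired_vector(pattern, class_means, class_stds, fd=1.0, fe=2.0):
    """Membership of one pattern to each of the l classes.

    pattern:      shape (n,)   feature vector F_i
    class_means:  shape (l, n) mean of each feature per class
    class_stds:   shape (l, n) standard deviation of each feature per class
    fd, fe:       denominational and exponential fuzzy generators
                  (default values here are placeholders, not from the source)
    """
    pattern = np.asarray(pattern, dtype=float)
    means = np.asarray(class_means, dtype=float)
    stds = np.asarray(class_stds, dtype=float)
    # Weighted distance z_ik of the pattern to every class k
    z = np.sqrt((((pattern - means) / stds) ** 2).sum(axis=1))
    # Larger distance gives lower membership; values lie in (0, 1]
    return 1.0 / (1.0 + (z / fd) ** fe)
```

For a pattern that coincides with a class mean the distance is zero and the membership to that class is 1; memberships to the other classes decrease with distance.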

The Neural Network Trainer 

The neural network model discussed here is based on
the relational neural structure proposed by Pedrycz
(Pedrycz, 1991).

The Relational Neural Network (RNN): Let
$X = \{x_1, x_2, \ldots, x_n\}$ be a finite set of input nodes and let
$Y = \{y_1, y_2, \ldots, y_l\}$ represent the set of output nodes in an
l-class problem domain. When the max-min composition
operator, denoted $X \circ R$, is applied to a fuzzy set X
and a fuzzy relation R, the output is a new fuzzy set
Y given by

$$Y(y_j) = \max_{x_i}\left(\min\left(X(x_i),\ R(x_i, y_j)\right)\right) \qquad (1)$$

where X is a fuzzy set, Y is the resulting fuzzy set and R is
a fuzzy relation R: X × Y → [0, 1] describing all relationships
between input and output nodes. We will take the
whole neural network represented by expression (1) as
a collection of l separate n-input single-output cells.
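
Expression (1), together with the decision step described for the fuzzy processing section (selecting the class with the highest output), can be illustrated with a short Python sketch. The function names and the array layout (inputs along rows of R, classes along columns) are assumptions made for the example.

```python
import numpy as np


def max_min_composition(x, R):
    """Y(y_j) = max_i min(X(x_i), R(x_i, y_j)).

    x: shape (n,)    membership values at the input nodes
    R: shape (n, l)  fuzzy relational matrix with entries in [0, 1]
    """
    x = np.asarray(x, dtype=float)
    R = np.asarray(R, dtype=float)
    # Broadcast x over the columns of R, take the element-wise minimum,
    # then the maximum over the input nodes for every output node.
    return np.minimum(x[:, None], R).max(axis=0)


def classify(x, R):
    """Return the index of the winning class and the full output vector."""
    y = max_min_composition(x, R)
    return int(np.argmax(y)), y
```

Each column of R plays the role of one of the l single-output cells mentioned above, so the outputs can be computed independently of one another.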

Learning in a Fuzzy Neural Network: If the
actual response from the network does not match the
target pattern, the network is corrected by modifying
the link weights to reduce the difference between the
observed and target patterns. To measure the difference,
a performance index called the equality index is defined,
which is

$$T(y) \equiv Y(y) = \begin{cases} 1 + T(y) - Y(y), & \text{if } Y(y) > T(y) \\ 1 + Y(y) - T(y), & \text{if } Y(y) < T(y) \\ 1, & \text{if } Y(y) = T(y) \end{cases}$$

where T(y) is the target output at node y, and Y(y) is
the actual output at the same node. In a problem with
n input patterns, there are n input-output pairs $(x_{ij}, t_i)$,
where $t_i$ is the target value when the input is $x_{ij}$.
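
The equality index can be written directly from its three cases. The sketch below is illustrative only; the helper `mean_equality`, which averages the index over all output nodes, is an assumption added for the example and is not described in the chapter.

```python
def equality_index(target, actual):
    """Equality index between a target value T(y) and an actual output Y(y).

    Both values are expected in [0, 1]; the result is 1 when they agree
    and decreases linearly with their difference.
    """
    if actual > target:
        return 1.0 + target - actual
    if actual < target:
        return 1.0 + actual - target
    return 1.0


def mean_equality(targets, actuals):
    """Hypothetical helper: average equality index over all output nodes."""
    pairs = list(zip(targets, actuals))
    return sum(equality_index(t, y) for t, y in pairs) / len(pairs)
```

The three cases are equivalent to 1 - |T(y) - Y(y)|, so maximizing the index reduces the distance between actual and desired outputs.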

Parameters Updating: Pedrycz also proposes to
complete the process of learning separately for each
output node. The learning algorithm is a version of the
back-propagation algorithm. Let us consider an
n-input-L-output neural network having the following form

$$y_i = \left[\bigvee_{j=1}^{n} \left(a_j \wedge x_{ij}\right)\right] \vee v$$

where $a = [a_1, a_2, \ldots, a_L]$ is a vector containing all
the weights or relations, and $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]$ is the
vector with the values observed in the input nodes. The
parameters a and v are updated iteratively by taking
increments $\Delta a_m$ resulting from deviations between all
pairs $y_i$ and $t_i$ as follows

$$a(k+1) = a(k) + \Psi_1(k)\left[\frac{\Delta a(k+1)}{Nn} + \eta\,\frac{\Delta a(k)}{Nn}\right]$$

where k is the learning step. $\Psi_1$ and $\Psi_2$ are non-increasing
functions of k controlling the decreasing influence
of increments $\Delta a_m$. $\eta$ is the learning momentum
specifying the level of modification of the learning
parameters with regard to their values in the previous
learning step k. A way of determining the increments
$\Delta a_m$ is with regard to the mth coordinates of a, m = 1,
2, ..., L. The computation of the overall performance
index, and the derivatives to calculate the increments
for each coordinate of a and v, are explained in detail
in (Reyes, 1994). Once the training has been
terminated, the output of the trainer is the updated
relational matrix, which will contain the knowledge
needed to map unknown patterns to their corresponding
classes.



FUTURE TRENDS 

One unexplored possibility of improving the FRNN 
performance is the use of other fuzzy relational 
products instead of max-min composition. Moreover, 
membership functions have parameters which can be
optimized by genetic algorithms or any other optimizing
technique. Adequate parameters may improve learning
and recognition efficiency of the FRNN. 



CONCLUSIONS 

We have presented the development and implementa- 
tion of an AICR system as well as a powerful hybrid 
classifier, the FRNN, which is a model formed by the 
combination of fuzzy relations and artificial neural 
networks. The synergistic symbiosis obtained through
the fusion of both methodologies will be demonstrated.
In the related paper on applications of this model, we 
will show some practical results, as well as an improved 
model by means of genetic algorithms. 
KEY TERMS

Artificial Neural Networks: A network of many 
simple processors that imitates a biological neural 
network. The units are connected by unidirectional 
communication channels, which carry numeric data. 
Neural networks can be trained to find nonlinear re- 
lationships in data, and are used in applications such 
as robotics, speech recognition, signal processing or 
medical diagnosis. 

Automatic Infant Cry Recognition (AICR): A
process where the crying signal is automatically analyzed
to extract acoustical features, with the aim of determining
the infant's physical state or the cause of crying, or even
detecting pathologies in very early stages of life.

Back propagation Algorithm: Learning algorithm 
of ANNs, based on minimising the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 

Fuzzy Relational Neural Network (FRNN): A
hybrid classification model combining the advantages
of fuzzy relations with artificial neural networks.

Fuzzy Sets: A generalization of ordinary sets by 
allowing a degree of membership for their elements. 
This theory was proposed by Lotfi Zadeh in 1965.
Fuzzy sets are the base of fuzzy logic. 

Hybrid Intelligent System: A software system 
which employs, in parallel, a combination of methods 
and techniques from Soft Computing. 

Learning Stage: A process to teach classifiers to 
distinguish between different pattern types. 
INTRODUCTION

The Automatic Infant Cry Recognition (AICR) process is
basically a problem of pattern processing, very similar
to the Automatic Speech Recognition (ASR) process
(Huang, Acero, Hon, 2001). In AICR we first perform
acoustical analysis, where the crying signal is analyzed
to extract the most important acoustical features, such as
LPC, MFCC, etc. (Cano, Escobedo and Coello, 1999).
The obtained characteristics are represented by feature
vectors, and each vector represents a pattern. These
patterns are then classified into their corresponding
pathology (Ekkel, 2002). In the reported case we
automatically classify cries from normal, deaf and
asphyxiating infants.

We use a genetic algorithm to find several optimal
parameters needed by the Fuzzy Relational Neural
Network (FRNN) (Reyes, 1994), such as the number of
linguistic properties, the type of membership function,
the method to calculate the output and the learning
rate. The whole model has been tested on several data
sets for infant cry classification. The process, as well
as some results, is described.



BACKGROUND 

In the first part of this document a complete description
of the AICR system as well as of the FRNN is given.
So, for continuity, in this part we concentrate
on the description of the genetic algorithm and on
the implementation and testing of the whole system.



A genetic algorithm refers to a model introduced
and investigated by John Holland (Holland, 1975)
and by students of Holland (De Jong, 1975). Genetic
algorithms are often viewed as function optimizers,
although the range of problems to which genetic
algorithms have been applied is quite broad. Recently,
numerous papers and applications combining fuzzy
concepts and genetic algorithms (GAs) have appeared,
and there is increasing interest in the integration
of these two topics. In particular, there are a great
number of publications exploring the use of GAs for
developing or improving fuzzy systems, called genetic
fuzzy systems (GFSs) (Cordon et al., 2001;
Casillas, Cordon, del Jesus, Herrera, 2000).



EVOLUTIONARY DESIGN 

Within the evolutionary techniques, perhaps one of the
most popular is the genetic algorithm (GA) (Goldberg,
1989). Its structure presents analogies with the biological
theory of evolution, and is based on the principle
of the survival of the fittest individual (Holland,
1975). Generally, a genetic algorithm has five basic
components (Michalewicz, 1992): a representation
of potential solutions to the problem, a way to create
potential initial solutions, a fitness function that is in
charge of evaluating solutions, genetic operators that alter
the offspring's composition, and values for parameters
such as the size of the population, crossover probability,
mutation probability, number of generations and others.
Here we present the different features of the genetic
algorithm used to find a combination of parameters
for the FRNN.

Chromosomal Representation 

Binary encoding is commonly used in genetic algorithms,
and Holland (Holland, 1975) gave a theoretical
justification for its use. Holland argued that binary
encoding allows more schemata than a decimal
representation. A schema is a template that describes a
subgroup of strings that share certain similarities in
some positions throughout their length (Goldberg,
1989). The problem variables consist of the number of
linguistic properties, the type of membership function,
the classification method and the learning rate. We are
interested in having between 3 and 7 linguistic properties,
so the number of linguistic properties is encoded
into a binary string of 3 bits. The membership
function is represented as a 2-bit string, where [00]
decodes to the trapezoidal membership function, [01]
to the π function, [10] to the triangular
function, and [11] to the Gaussian membership
function. The classification methods are also coded
as a 2-bit string, where [00] represents the max-min
composition, [01] represents the geometrical mean and
[10] represents the relational square product. Finally,
the learning rate is represented as a binary string of 3
bits, where [000] decodes to a learning rate of 0.1,
[001] to 0.2, [010] to 0.31, [011] to 0.4,
and [100] to 0.5. A larger learning
rate is not desirable, so all other bit values are
ignored. The chromosome is obtained by concatenating
all the above strings. Figure 1 shows an example of
the chromosomal representation. The initial population is
generated from a random selection of chromosomes;
a population size of 50 was considered.
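
The encoding just described can be made concrete with a small decoder. The sketch below is written in Python for illustration (the authors report a Matlab implementation), and it assumes that the four fields are concatenated in the order in which they are listed above and that the 3-bit field for linguistic properties is read as a plain binary integer. Returning `None` for unused codes is also an assumption; the text only says that other bit values are ignored.

```python
MEMBERSHIP = {"00": "trapezoidal", "01": "pi", "10": "triangular", "11": "gaussian"}
METHOD = {"00": "max-min", "01": "geometrical mean", "10": "relational square product"}
LEARNING_RATE = {"000": 0.1, "001": 0.2, "010": 0.31, "011": 0.4, "100": 0.5}


def decode(chromosome):
    """Decode a 10-bit string into FRNN parameters, or None if a field is unused."""
    if len(chromosome) != 10 or set(chromosome) - {"0", "1"}:
        raise ValueError("expected a string of 10 bits")
    n_props = int(chromosome[0:3], 2)           # 3 bits: number of linguistic properties
    params = {
        "n_props": n_props if 3 <= n_props <= 7 else None,
        "membership": MEMBERSHIP[chromosome[3:5]],        # 2 bits, all codes used
        "method": METHOD.get(chromosome[5:7]),             # 2 bits, [11] unused
        "learning_rate": LEARNING_RATE.get(chromosome[7:10]),  # 3 bits, 5 codes used
    }
    return None if None in params.values() else params
```

For example, `decode("0110100001")` yields three linguistic properties, the π function, max-min composition and a learning rate of 0.2.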



Genetic Operations 

We use four genetic operations, namely elitism, roulette
wheel selection, crossover and mutation.

Elitism: In order to ensure that the members of the
population with the highest fitness values stay in the
next generation, we apply elitism. It has been demonstrated
(Rudolph, 1994) that a genetic algorithm must use elitism
to be able to show convergence. At each iteration
of the genetic algorithm we select the members with
the four highest fitness values and put them in the
next generation.

Selection: In the genetic algorithm the selection
process is probabilistic; that is, even the less fit
individuals have a certain chance of being selected.
There are many different selection approaches; we use
roulette wheel selection, where members of the population
have a probability of being selected that is directly
proportional to their fitness.

Crossover: In this work we use single-point crossover.
Observing the performance of different crossover operators,
De Jong (De Jong, 1975) concluded that,
although increasing the number of crossover points
affects the schemata from a theoretical perspective, in
practice this does not seem to have a significant impact.
Crossover is the principal operator in the genetic
algorithm. Based on some experiments we decided
to determine the crossover point randomly, and the
crossover probability was fixed at 0.8.

Mutation: This operator allows the introduction of new
chromosomal material in the population. We select a gene
randomly and replace it by its complement: a zero is changed
to a one and a one is changed to a zero. Some authors
suggest that a mutation probability equal to 1/L,
where L is the length of the bit string, is an acceptable
lower bound for the optimal mutation rate
(Bäck, 1993). In this work the mutation
probability is fixed at 0.05.
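
The four operators can be combined into one generation step, sketched below in Python. The probabilities (0.8 and 0.05) and the number of elite members (four) come from the text; everything else is an assumption for illustration, in particular that the mutation probability applies once per chromosome and that offspring fill the population after the elite members are copied.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitness))
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def crossover(a, b, rng, p_cross=0.8):
    """Single-point crossover at a random point, applied with probability p_cross."""
    if rng.random() < p_cross:
        point = rng.randint(1, len(a) - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b


def mutate(chrom, rng, p_mut=0.05):
    """With probability p_mut, flip one randomly chosen bit (assumed per chromosome)."""
    if rng.random() < p_mut:
        i = rng.randrange(len(chrom))
        return chrom[:i] + ("1" if chrom[i] == "0" else "0") + chrom[i + 1:]
    return chrom


def next_generation(population, fitness, rng, n_elite=4):
    """Keep the n_elite fittest members, then fill up with mutated offspring."""
    ranked = sorted(zip(fitness, population), reverse=True)
    new_pop = [ind for _, ind in ranked[:n_elite]]
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitness, rng)
        p2 = roulette_select(population, fitness, rng)
        for child in crossover(p1, p2, rng):
            new_pop.append(mutate(child, rng))
    return new_pop[:len(population)]
```

Passing an explicit `random.Random` instance as `rng` keeps runs reproducible, which helps when comparing parameter settings.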



Figure 1. Chromosomal representation 
Fitness Function

The objective function of our optimization problem
is called the fitness function. This function must be able
to penalize the solutions that are not good and reward
the good ones so they can propagate quickly (Coello,
1995). As a fitness function we use the classification
error given by the Fuzzy Relational Neural
Network. The fitness function is then defined by the
following equation

$$F = e_{FRNN}$$

In this case we define the classification error as
follows

$$e_{FRNN} = \frac{No.PM}{No.S}$$

where No.PM represents the number of perfect matches;
in other words, it represents the number of samples
classified correctly. The term No.S represents the total
number of samples given to the FC.



IMPLEMENTATION AND RESULTS 



Signal Processing

The analysis of the raw cry waveform provides the
information needed for its recognition. At the same
time, it discards unwanted information such as background
noise and channel distortion (Levinson and Roe, 1990).
Acoustic feature extraction is a
transformation of measured data into pattern data. Some
of the most important techniques used for analyzing cry
wave signals are: Discrete Fourier Transform (DFT),
cepstral processing, and Linear Prediction Analysis
(LPA) (Ainsworth, 1988; Schafer and Rabiner,
1990). The application of these techniques during signal
processing obtains the values of a set of acoustic features.
The features may be spectral coefficients, linear
prediction coefficients (LPC), Mel frequency cepstral
coefficients (MFCC), among others (Ainsworth,
1988). The set of values for n features may be represented
by a vector in an n-dimensional space. Each
vector represents a pattern.

For the present experiments we work with samples
of infant cries. The infant cries were collected from
recordings made directly by medical doctors; each
signal wave was then divided into segments of 1 second,
and each segment represents a sample. Acoustic features
were then obtained by means of techniques such as Linear
Prediction Coefficients (LPC) and Mel Frequency Cepstral
Coefficients (MFCC), using the freeware program
Praat v4.0.8 (Boersma and Weenink, 2002). Every
1-second sample is divided into frames of 50 milliseconds,
and from each frame we extract 16 coefficients;
this procedure generates vectors with 304 coefficients
per sample. In this paper we show the results obtained
with Mel Frequency Cepstral Coefficients.

In order to reduce the dimensions of the sample
vectors we apply Principal Component Analysis. The
FRNN and the genetic algorithm are implemented in
Matlab. We have a corpus of 157 samples of normal
infant cry, 340 of asphyxia infant cry, and 879 of
hypoacoustic (deaf) infant cry. We also have a corpus of
192 samples of pain and 350 samples of hunger crying.
We worked with a population of 50 individuals, and the
number of training epochs for the FRNN was set at three.
The initial population was randomly chosen. Only three
generations were needed for the genetic algorithm.
These values were set on the basis of the
observation of the results of several experiments.



Preliminary Results 

Three different classification experiments were made:
the first one consists of classifying deaf and normal
infant cry, the second one classifies infant
cry into the categories asphyxia and normal, and the
third one classifies hunger and pain crying. In each
task the training samples and the test samples are randomly
selected. The results of the model in the classification
of deaf and normal cry are given in Table 1.
Table 2 shows the results obtained in the second
classification task. Finally, Table 3 shows the results
in the classification of hunger and pain cry. In every
classification task the GA was run about 15 times, and
the reported results show the average of the best
classification in each experiment.
Table 1. Results of classifying deaf and normal cry

Characteristics                  Successful codification  Interpretation  Accuracy
Number of linguistic properties  011                      3               98%
Membership function              01                       π
Classification method            00                       max-min
Learning rate                    001                      0.2

Table 2. Results of classifying asphyxia and normal cry

Characteristics                  Successful codification  Interpretation    Accuracy
Number of linguistic properties  011                      3                 84%
Membership function              01                       π
Classification method            01                       geometrical mean
Learning rate                    010                      0.31


Table 3. Results of classifying hunger and pain cry

Characteristics                  Successful codification  Interpretation  Accuracy
Number of linguistic properties  111                      7               95.24%
Membership function              01                       π
Classification method            00                       max-min
Learning rate                    010                      0.31
Performance Comparison with Other Models

Reyes and Orozco (Orozco, Reyes, 2003) classified
cry samples from deaf and normal babies, obtaining
recognition results around 97.43%. Reyes et al. (Suaste,
Reyes, Diaz, Reyes, 2004) showed an implementation
of a linguistic fuzzy relational neural network to classify
normal and pathological infant cry, with percentages
of correct classification of 97.3% and 98%. Petroni,
Malowany, Johnston and Stevens (1995) classified cry
from normal babies to identify pain with artificial neural
networks, and report correct classification results
that range from 61% with cascade-correlation networks
up to 86.2% with feed-forward neural networks. In
(Lederman, 2002) Dror Lederman presents some
classification results for infants with respiratory distress
syndrome (RDS, related to asphyxia) versus healthy
infants. For the classification he used a Hidden Markov
Model architecture with 8 states and 5 Gaussians per state.
The reported result is 63% total mean correct
classification.



FUTURE TRENDS 

AICR systems may expand their utility by training them 
to recognize a larger number of pathologies. The first 
requirement to achieve this goal is to collect a suitable 
set of labeled samples for any target pathology. The 
GA presented here optimizes some parameters of the 
FRNN, but the model has more. So, other parameters 
can be added to the chromosomal representation in 
order to improve the model, like initial values of the 
relational matrix and of the bias vectors, number of 
training epochs, and the values of the exponential fuzzy 
generator and the denominational fuzzy generator used 
by the DOE. 



CONCLUSIONS 

The proposed genetic algorithm computes a selection 
of the number of linguistic properties, the membership 
function used to calculate the linguistic features, the 
method to calculate the output of the classifier in the 
fuzzy processing section and the learning rate of the 
FRNN. The solution obtained by the proposed genetic 
algorithm is a set of characteristics that the FRNN can
use to make the classification of infant cry. The use of
linguistic properties allows us to deal with the
impreciseness of infant cry and provides the classifier with
very useful information. By applying the linguistic
information and given the nature of the model, it is
not necessary to train through a high number
of learning epochs; a high number of iterations in the
genetic algorithm is not necessary either. The results of
classifying deaf and normal infant cry are very similar
to those of other models, but when we classify hunger and
pain the results are much better than those of other models.
KEY TERMS

Binary Chromosome: An encoding scheme representing
one potential solution to a problem, during a
searching process, by means of a string of bits.

Evolutionary Computation: A subfield of com- 
putational intelligence that involves combinatorial 
optimization problems. It uses iterative progress, such 
as growth or development in a population, which is 
then selected in a guided random search to achieve 
the desired end. Such processes are often inspired by 
biological mechanisms of evolution. 

Fitness Function: A function defined over the
genetic representation that measures the quality of the
represented solution. The fitness function is always
problem dependent.

Genetic Algorithms: A family of computational 
models inspired by evolution. These algorithms encode 
a potential solution to a specific problem on a simple 
chromosome-like data structure and apply recombi- 
nation operators to these structures so as to preserve 
critical information. Genetic algorithms are often 
viewed as function optimizers, although the range of 
problems to which genetic algorithms have been ap- 
plied is quite broad. 

Hybrid Intelligent System: A software system 
which employs, in parallel, a combination of methods 
and techniques mainly from subfields of Soft Com- 
puting. 

Signal Processing: The analysis, interpretation 
and manipulation of signals. Processing of such sig- 
nals includes storage and reconstruction, separation 
of information from noise, compression, and feature 
extraction. 

Soft Computing: A partnership of techniques which
in combination are tolerant of imprecision, uncertainty,
partial truth, and approximation, and whose role model
is the human mind. Its principal constituents are Fuzzy
Logic (FL), Neural Computing (NC), Evolutionary
Computation (EC), Machine Learning (ML) and
Probabilistic Reasoning (PR).
INTRODUCTION

Graphics Processing Units (GPUs) have been evolving
very fast, turning into high performance programmable
processors. Though GPUs have been designed to compute
graphics algorithms, their power and flexibility
make them a very attractive platform for general-purpose
computing. In recent years they have been
used to accelerate calculations in physics, computer
vision, artificial intelligence, database operations, etc.
(Owens, 2007).

In this paper an approach to general-purpose computing
with GPUs is presented, followed by a description
of artificial intelligence algorithms based on Artificial
Neural Networks (ANN) and Evolutionary Computation
(EC) accelerated using GPUs.



BACKGROUND 

General-Purpose Computation using Graphics Process- 
ing Units (GPGPU) consists in the use of the GPU as 
an alternative platform for parallel computing taking 
advantage of the powerful performance provided by 
the graphics processor (General-Purpose Computation 
Using Graphics Hardware Website; Owens, 2007). 

There are several reasons that justify the use of 
the GPU to do general-purpose computing (Luebke, 
2006): 

Last generation GPUs are very fast in comparison
with current processors. For instance, an NVIDIA
8800 GTX card has a computing capability of approximately
330 GFLOPS, whereas an Intel Core2
Duo 3.0 GHz processor has a capability of only
about 48 GFLOPS.

GPUs are highly programmable. In recent years
the programming capacities of graphics chips have
grown considerably, replacing fixed-function
engines with programmable ones, like pixel and
vertex engines. Moreover, this has led to the
appearance of high-level languages that ease their
programming.

GPU evolution is faster than CPU evolution. The
increase in GPU performance is nowadays from
1.7x to 2.3x per year, whereas in CPUs it is about
1.4x. The pressure exerted by the videogame market
is one of the main reasons for this evolution, as it
forces companies to evolve graphics hardware
continuously.

GPUs use high-precision data types. Although in the
very beginning graphics hardware was designed to
work with low-precision data types, at the present
time internal calculations are computed using 32-bit
floating point numbers.

Graphics cards have a low cost in relation to the
capacities that they provide. Nowadays, GPUs are
affordable for any user.

GPUs are highly parallel, and they can have multiple
processors that allow high-performance
parallel arithmetic calculations.

Nevertheless, there are some obstacles. First, not all
algorithms fit the GPU's programming model,
because GPUs are designed to compute highly intensive
parallel algorithms (Harris, 2005). Second, there are
difficulties in using GPUs, due mainly to:

The GPU's programming model is different from
the CPU's.

GPUs are designed for graphics algorithms and,
therefore, for graphics programming. The implementation
of general-purpose algorithms on a GPU is quite
different from traditional implementations.

Some limitations or restrictions exist in programming
capacities. Most functions in GPU programming
languages are very specific and dedicated to
making calculations in graphics algorithms.

GPU architectures are quite variable due to
their fast evolution and the incorporation of new
features.

Therefore it is not easy to port an algorithm developed
for CPUs to run on a GPU.

Overview of the Graphics Pipeline 

Nowadays GPUs make their computations following a
common structure called the Graphics Pipeline. The
Graphics Pipeline (Akenine-Moller, 2002) is composed of a
set of stages that are executed sequentially inside the
GPU, allowing the computing of graphics algorithms.
Recent hardware is made up of four main elements.
First, the vertex processors receive vertex arrays
from the CPU and make the necessary transformations
from their positions in space to their final positions on the
screen. Second, the primitive assembly builds graphics
primitives (for instance, triangles) using information
about connectivity between different vertices. Third, in
the rasterizer, those graphics primitives are discretized
and turned into fragments. A fragment represents a
potential pixel and contains the necessary information
(color, depth, etc.) to generate the final color of a pixel.
Finally, in the fragment processors, fragments become
pixels whose final color is written to a target buffer,
which can be the screen buffer or a texture.

At present, GPUs have multiple vertex and fragment
processors that compute operations in parallel.
Both are programmable using small pieces of code
called vertex and fragment programs, respectively.
In recent years different high-level programming
languages have been released, like Cg/HLSL (Mark, 2003;
HLSL Shaders) or GLSL (OpenGL Shading Language
Information Site), that make the programming
of those processors easier.

The GPU Programming Model 

There is a big difference between programming CPUs
and GPUs, due mainly to their different programming
models. GPUs are based on the stream programming
model (Owens, 2005a; Luebke, 2006; Owens, 2007),
where all data are represented as streams. A stream can
be defined as an ordered set of data of the same type. A
kernel operates on full streams, taking input data
from one or more streams to produce one or more
output streams. The main characteristic of a kernel is
that it operates on the whole stream instead of on
individual elements. The typical use of a kernel is the
evaluation of a function over each element of an input
stream, which is called a map operation. Other kernel
operations are expansions, reductions, filters, etc. (Buck,
2004; Horn, 2005; Owens, 2007). The outputs generated by
a kernel are always based on its input streams, which
means that, inside the kernel, the calculations made on
one element never depend on the other elements. In the
stream programming model, applications are built by
connecting multiple kernels. An application can be
represented as a dependency graph where each node is a
kernel and each edge represents a data stream between
kernels (Owens, 2005b; Lefohn, 2005).
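
As a rough illustration of the stream model described above, the
following Python sketch treats a list as a stream and plain functions
as kernels. The helper names (map_kernel, reduce_kernel) and the sample
data are assumptions made for this example only. The code runs
sequentially on the CPU, so it shows how kernels are chained through
streams, not how a GPU actually schedules them.

```python
from functools import reduce


def map_kernel(fn, stream):
    """Map kernel: evaluate fn on every element independently."""
    return [fn(x) for x in stream]


def reduce_kernel(fn, stream, initial):
    """Reduction kernel: combine a whole stream into a single value."""
    return reduce(fn, stream, initial)


# An application is a chain of kernels connected by streams.
input_stream = [1.0, 2.0, 3.0, 4.0]
squared_stream = map_kernel(lambda x: x * x, input_stream)
sum_of_squares = reduce_kernel(lambda a, b: a + b, squared_stream, 0.0)
```

Because the map kernel never lets one element see another, each call to
fn could in principle be executed by a different processor.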

The behavior of the graphics pipeline is similar to the
stream programming model. Data flows through each
stage, and the output of one stage feeds the next. Stream
elements (vertex or fragment arrays) are processed
independently by kernels (vertex or fragment programs),
and their output can be received in turn by other
kernels.

The stream programming model allows efficient
computation, because kernels operate on independent
elements from a set of input streams and can be
processed by hardware such as the GPU, which processes
vertex or fragment streams in parallel. This enables
parallel computing without the complexity of traditional
parallel programming models.

Computational Resources on GPU 

Different computational resources are available to
implement an algorithm on the GPU (Harris,
2005; Owens, 2007). On the one hand, current GPUs have
two different parallel programmable processors: vertex
and fragment processors. Vertex processors compute
vertex streams (points with associated properties like
position, color, normal, etc.). A vertex processor applies
a vertex program to transform each input vertex to its
position on the screen. Fragment processors compute
fragment streams. They apply a fragment program to
each fragment to calculate the final color of the pixel. In
addition to using the attributes of each fragment, those
processors can access other data streams, such as
textures, when they are generating each pixel. Textures
can be seen as an interface to read-only memory.

Another resource available on the GPU is the rasterizer.
It generates fragments from triangles built from
vertex and connectivity information. The rasterizer
makes it possible to generate an output data set from a
smaller input one, because it interpolates the properties
of each vertex that belongs to a triangle (like color,
texture coordinates, etc.) for each generated fragment.

One of the essential features of GPUs is
render-to-texture. It allows storing the pixels generated
by the fragment processors in a texture instead of a
screen buffer. At the moment this is the only mechanism
for directly obtaining output data from GPU computing.
Render-to-texture cannot be thought of as an interface to
read-write memory, because a fragment processor
can read data from a texture multiple times
but can write to it just once, at the end of each
fragment's processing.



ARTIFICIAL INTELLIGENCE 
ALGORITHMS ON GPU 

Using the stream programming model and the
resources provided by graphics hardware, Artificial
Intelligence algorithms can be parallelized and therefore
accelerated. The parallel and computationally intensive
nature of this kind of algorithm makes them
good candidates for implementation on the GPU.

Consider the evolution process of genetic algorithms,
where a fitness value needs to be computed for
each individual. The population can be considered a
data stream and the fitness function a kernel that
processes this stream. On the GPU, for instance, the data
stream must be represented as a texture, whereas the
kernel must be implemented as a fragment program. Each
individual's fitness would be obtained in an output
stream, also represented by a texture, produced by
means of the render-to-texture feature.
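
The mapping just described can be sketched on the CPU as follows. This
is only an analogy written in Python, not GPU code: the population
"texture" is a nested list, the "fragment program" is an ordinary
function, and the names sphere_fitness and render_to_texture, as well
as the choice of the sphere function as fitness, are assumptions made
for the example.

```python
def sphere_fitness(chromosome):
    """Hypothetical fitness kernel: sum of squared genes (lower is better)."""
    return sum(gene * gene for gene in chromosome)


def render_to_texture(kernel, texture):
    """Apply the kernel to every texel and collect the results in a new
    grid of the same shape, as render-to-texture would."""
    return [[kernel(texel) for texel in row] for row in texture]


# A 2x2 "texture" in which every texel stores one individual.
population_texture = [
    [(0.0, 0.0), (1.0, 2.0)],
    [(3.0, 0.0), (-1.0, -1.0)],
]
fitness_texture = render_to_texture(sphere_fitness, population_texture)
```

Each texel is read once and written once, which matches the read-many,
write-once restriction of render-to-texture.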

Recently some work has been carried out, mainly
on parallelizing ANN and EC algorithms, as described in
the following sections.

Artificial Neural Networks 

Bohn (1998) used GPGPU to reduce training time
in Kohonen's feature maps. In this case, the bigger
the map, the greater the time reduction obtained with the
GPU. On 128x128 maps, time was similar on the
CPU and the GPU, but on 512x512 maps the GPU
was almost 3.5 times faster than the CPU, increasing to
5.3 times faster on 1024x1024 maps. This was one of
the first implementations of GPGPU, made on a
non-programmable graphics system, a Silicon Graphics
Infinite Reality workstation.

Later, with programmable hardware, Oh (2004)
used the GPU to accelerate the computation of
the output of a multilayer perceptron ANN. The developed
system was applied to pattern recognition, obtaining a
20x lower computing time than the CPU implementation.

Considering another kind of ANN, Zhongwen
(2005) used GPGPU to reduce computing time in
training Self-Organizing Maps (SOMs). The bigger
the SOM, the greater the reduction. Whereas
with 128x128 neuron maps computing time was
similar on the CPU and the GPU, with 512x512 neuron
maps training was 4x faster using the GPU
implementation.

Bernhard (2005) used the GPU to simulate the Spiking
Neuron model. This ANN model both requires highly
intensive calculations and has a parallel nature, so it
fits GPGPU computation very well. The authors made
different implementations depending on the neural network
application. In the first case, an image segmentation
algorithm was implemented using a locally-excitatory
globally-inhibitory Spiking Neural Network (SNN).
In this experiment, the authors obtained results up to
10x faster. In the second case, SNNs were used for image
segmentation using an algorithm based on histogram
clustering, where the ANN minimized the objective
function. Here the speed was also improved up to 10
times.

Seoane (2007) showed that the training of multilayer
perceptron ANNs using GAs can be accelerated. GPGPU
techniques for ANN computing made it up to 11 times
faster.

The company Evolved Machines (Evolved Machines
Website) uses the high performance of GPUs to
simulate neural computation, obtaining results up
to 100x faster than CPU computation.

Evolutionary Computation 

In EC-related work, Yu (2005) describes how parallel
genetic algorithms can be mapped to low-cost graphics
hardware. In their approach, chromosomes and fitness
values are stored in textures. Fitness calculation and
genetic operators were implemented using fragment
programs on the GPU. Different population sizes applied
to the Colville minimization problem were used for
testing, with larger populations yielding greater time
reductions. For a 128x128 population,
the GPU computed the genetic operators 11.8 times faster
than the CPU, whereas for a 512x512 population that
rate increased to 20.1. For fitness function computation,
the rates were 7.9 and 17.1, respectively.

In another work, Wong (2006) implemented Hybrid
Genetic Algorithms on the GPU, incorporating the Cauchy
mutation operator. All algorithm steps were implemented
in graphics hardware, except random number
generation. In this approach, a pseudo-deterministic
method was proposed for the selection process, allowing
significant running-time reductions. The GPU
implementation was 3x faster than the CPU one.

Fok (2007) showed how to implement evolutionary
algorithms on the GPU. Since the crossover operators of
GAs require more complex calculations than mutation
operators, the authors studied a GPU implementation of
Evolutionary Programming, which uses only mutation
operators. Tests were run using the Cauchy distribution on
5 different optimization problems, obtaining results
between 1.25 and 5 times faster.
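
To make the idea of a mutation-only step concrete, here is a minimal
CPU sketch of Cauchy mutation in Python. It is not the implementation
of Wong (2006) or Fok (2007); the function names, the scale parameter
and the sample population are assumptions made for this example. Every
gene is perturbed independently, which is what makes the operator a
natural map kernel.

```python
import math
import random


def cauchy_sample(rng, scale=1.0):
    """Draw from a Cauchy distribution by inverting its CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def mutate_population(population, rng, scale=1.0):
    """Add independent Cauchy noise to every gene of every individual."""
    return [[gene + cauchy_sample(rng, scale) for gene in individual]
            for individual in population]


rng = random.Random(0)  # fixed seed so the run is repeatable
parents = [[0.0, 0.0], [1.0, -1.0], [2.5, 3.5]]
offspring = mutate_population(parents, rng, scale=0.1)
```

The heavy tails of the Cauchy distribution occasionally produce large
jumps, which is the usual motivation for preferring it to Gaussian
noise in this setting.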



FUTURE TRENDS 

Nowadays GPUs are very powerful and they are evolving
quite fast. On the one hand, there are more and more
programmable elements in GPUs; on the other,
programming languages are becoming full-featured.
There are more and more implementations of different
kinds of general-purpose algorithms that take advantage
of these features.

In the Artificial Intelligence field the number of
developments is rather low, in spite of the large number
of current algorithms and their high computing
requirements. Using GPUs to extend existing
implementations seems very promising. For instance, some
examples of speeding up ANN simulations have been shown;
however, there are no works on accelerating training times.
Likewise, the same ideas can be applied to implement other
kinds of ANN architectures or AI techniques, such as
genetic programming, a field where there are no
developments either.



CONCLUSION 

This paper has introduced general-purpose programming
on GPUs. They have been shown to be powerful
parallel processors whose programming capabilities
allow them to be used for general-purpose,
computationally intensive algorithms. Based on this idea,
existing implementations of AI models such as ANN or EC
on GPUs have been described, with a considerable
reduction in computing time.

General-purpose computing on the GPU and its use to
accelerate AI algorithms provide great advantages,
being an essential contribution in applications where
computing time is a decisive factor.

KEY TERMS

Fragment: Potential pixel containing all the necessary
information (color, depth, etc.) to generate the
final pixel color.

Fragment Processor: Graphics system element
that receives as input a set of fragments and processes
them to obtain pixels, writing them to a target buffer.
Present GPUs have multiple fragment processors that work
in parallel and can be programmed using fragment
programs.

Graphics Pipeline: Three-dimensional graphics-oriented
architecture, composed of several stages that
run sequentially.

Graphics Processing Unit (GPU): Electronic de- 
vice designed for graphics rendering in computers. Its 
architecture is specialized in graphics calculations. 

General-Purpose Computation on GPUs (GPGPU):
Trend in computing dedicated to
implementing general-purpose algorithms on graphics
devices, called GPUs. At the moment, the high
programmability and performance of GPUs allow
developers to run classical algorithms on these devices to
speed up non-graphics applications, especially those
algorithms with a parallel nature.

Pixel: Abbreviation of Picture Element, used to refer
to the points of a graphic image.

Rasterizer: Graphics Pipeline element that, from
graphics primitives, provides the appropriate fragments to
a target buffer.

Render-to-Texture: GPU feature that allows storing
the fragment processor output in a texture instead
of a screen buffer.

Stream Programming Model: Parallel programming
model based on defining, on the one hand,
sets of input and output data, called streams, and, on
the other hand, computationally intensive operations,
called kernel functions, to be applied sequentially on the
streams.

Texture: In computer graphics, a
digital image used to modify the appearance of a
three-dimensional object. The operation that wraps
a texture around an object is called texture mapping.
In GPGPU, a texture can be considered
a data stream.

Vertex: In computer graphics, a
clearly defined point in three-dimensional space, which
is processed by the Graphics Pipeline. Relationships can
be established between vertices (like triangles)
to assemble structures that define a three-dimensional
object. In GPGPU, a vertex array can be
considered a data stream.

Vertex Processor: Graphics system component
that receives as input a set of 3D vertices and processes
them to obtain 2D screen positions. Present GPUs have
multiple vertex processors that work in parallel and can
be programmed using vertex programs.

INTRODUCTION

A Bayesian Network (BN) establishes a relationship between
graphs and probability distributions. In the past, BNs
were mainly used for knowledge representation and
reasoning. Recent years have seen numerous successful
applications of BNs in classification, among which
the Naive Bayes classifier was found to be surprisingly
effective in spite of its simple mechanism (Langley,
Iba & Thompson, 1992). It is built upon the strong
assumption that different attributes are independent
of each other. Despite its many advantages, a
major limitation of the Naive Bayes classifier
is that real-world data may not always satisfy the
independence assumption among attributes. This strong
assumption can make the prediction accuracy of the
Naive Bayes classifier highly sensitive to correlated
attributes. To overcome this limitation, many approaches
have been developed to improve the performance of
the Naive Bayes classifier.
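
The independence assumption can be made concrete with a small Python
sketch for discrete attributes. Under the assumption, the posterior of
a class c given attribute values a_1, ..., a_n is proportional to
P(c) multiplied by the product of the P(a_i | c). The function names,
the toy weather-style data and the absence of any smoothing are choices
made for this illustration only; they do not come from the works cited
here.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and, per class, the values of each attribute."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (attribute index, class)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def posterior(model, row):
    """Return P(class | row) assuming attributes are independent
    given the class (no smoothing is applied)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(c)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / n  # P(a_i | c)
        scores[label] = score
    norm = sum(scores.values())
    if norm == 0.0:  # every class was ruled out by an unseen value
        return {label: n / total for label, n in class_counts.items()}
    return {label: s / norm for label, s in scores.items()}


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"),
        ("rainy", "mild"), ("sunny", "mild"), ("rainy", "mild")]
labels = ["no", "no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
result = posterior(model, ("sunny", "mild"))
```

If two attributes were copies of each other, their evidence would be
multiplied in twice, which is why correlated attributes can distort
the prediction.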

This article gives a brief introduction to the approaches
that attempt to relax the independence
assumption among attributes or use certain pre-processing
procedures to make the attributes as independent
of each other as possible. Previous theoretical and
empirical results have shown that the performance of
the Naive Bayes classifier can be improved significantly
by using these approaches, while the computational
complexity also increases to a certain extent.



BACKGROUND

The Naive Bayes classifier, also called simple Bayesian
classifier, is essentially a simple BN. Since no structure
learning is required, it is very easy to construct and
implement a Naive Bayes classifier. Despite its simplicity,
the Naive Bayes classifier is competitive with other
more advanced and sophisticated classifiers such as
decision trees (Friedman, Geiger & Goldszmidt, 1997).
Owing to these advantages, the Naive Bayes classifier
has gained great popularity in solving different
classification problems. Nevertheless, its independence
assumption among attributes is often violated in the
real world. Fortunately, many approaches have been
developed to alleviate this problem.

In general, these approaches can be divided into
two groups. One attempts to relax the independence
assumption of the Naive Bayes classifier, e.g. Semi-Naive
Bayes (SNB) (Kononenko, 1991), Searching for
dependencies (Pazzani, 1995), the Tree Augmented
Naive Bayes (TAN) (Friedman, Geiger & Goldszmidt,
1997), SuperParent Tree Augmented Naive Bayes
(SP-TAN) (Keogh & Pazzani, 1999), Lazy Bayes Rule
(LBR) (Zheng & Webb, 2000) and Aggregating
One-Dependence Estimators (AODE) (Webb, Boughton &
Wang, 2005).

The other group attempts to use certain pre-processing
procedures to select or transform the attributes,
which can be more suitable for the assumption of the
Naive Bayes classifier. Feature selection can be
implemented by greedy forward search (Langley &
Sage, 1994) and Decision Trees (Ratanamahatana &
Gunopulos, 2002). The transformation techniques
include Principal Component Analysis (PCA) (Gupta,
2004), Independent Component Analysis (ICA)
(Prasad, 2004) and CC-ICA (Bressan & Vitria, 2002).
The next section describes the main ideas of the two
groups of techniques in a broad way.



IMPROVING THE NAIVE BAYES 
CLASSIFIER 

This section introduces the two groups of approaches that
have been used to improve the Naive Bayes classifier.
In the first group, the strong independence assumption
is relaxed by restricted structure learning. The second
group selects some major (and approximately
independent) attributes from the original attributes or
transforms them into new attributes, which can
then be used by the Naive Bayes classifier.

Relaxing the Independence Assumption 

Relaxing the independence assumption means that
dependencies are taken into account when constructing
the network. To consider the dependencies between
attributes, Kononenko (Kononenko, 1991) proposed
the Semi-Naive Bayes classifier (SNB), which joins
attributes based on the theorem of Chebyshev.
Medical diagnostic data were used to compare
the performance of the SNB and the NB. It was found
that the results in two domains were identical, while in
the other two domains SNB slightly improved the
performance. Nevertheless, this method may cause
overfitting problems. Another limitation of the SNB is
that the number of parameters grows exponentially with
the number of attributes that need to be
joined. In addition, the exhaustive search
for attributes to join may increase the computational time.
Pazzani (Pazzani, 1995) used Forward Sequential Selection
and Joining (FSSJ) and Backward Sequential
Elimination and Joining (BSEJ) to search for dependencies
and join the attributes. The two methods were tested
on UCI data, and BSEJ was found to provide the most
improvement.

Friedman et al. (Friedman, Geiger & Goldszmidt,
1997) found that Kononenko's and Pazzani's methods
can be represented as an augmented Naive Bayes
network, which includes some subgraphs. They
restricted the network to be a Tree Augmented Naive
Bayes (TAN) that spans all attributes and can
be learned by tree-structure learning algorithms. The
results on problems from the UCI repository
showed that the TAN classifier outperforms the Naive
Bayes classifier. It is also competitive with C4.5 while
maintaining computational simplicity. However,
the use of the TAN classifier is limited to
problems with discrete attributes. For problems
with continuous attributes, these attributes must be
prediscretized. To address this problem, Friedman et
al. (Friedman, Goldszmidt & Lee, 1998) extended TAN
to deal with continuous attributes via parametric and
semiparametric conditional probabilities. Keogh &
Pazzani (Keogh & Pazzani, 1999) proposed a variant
of the TAN classifier, SP-TAN, which could result
in better performance than TAN. The performance of
SP-TAN is also competitive with the Lazy Bayes Rule
(LBR), in which lazy learning techniques are used
in the Naive Bayes classifier (Zheng & Webb, 2000;
Wang & Webb, 2002).

Although LBR and SP-TAN have outstanding performance
on the testing data, the main disadvantage of
the two methods is their high computational
complexity. Aggregating One-Dependence Estimators
(AODE), developed by Webb et al. (Webb, Boughton
& Wang, 2005), avoids model selection, which may
reduce computational complexity and lead to lower
variance. These advantages have been demonstrated by
empirical results. It has also been found empirically
that the average prediction accuracy of AODE is
comparable to that of LBR and SP-TAN but with lower
variance. Therefore, AODE might be more suitable for
small datasets due to its lower variance.

Using Pre-Processing Procedures 

In general, the pre-processing procedures for the
Naive Bayes classifier include feature selection and
transformation of the original attributes. The Selective
Bayes classifier (SBC) (Langley & Sage, 1994) deals
with correlated features by including only some
attributes in the final classifier. It uses a greedy
method to search the space and forward selection to
select the attributes. In their study, six UCI datasets
were used to compare the performance of the Naive Bayes
classifier, SBC and C4.5. It was found that selecting
attributes can improve the performance of the Naive
Bayes classifier when there are redundant attributes.
In addition, SBC was found to be competitive with C4.5
on the datasets for which C4.5 outperforms the
Naive Bayes classifier. The study by Ratanamahatana
& Gunopulos (Ratanamahatana & Gunopulos, 2002)
applied C4.5 to select the attributes for the Naive Bayes
classifier. Interestingly, experimental results showed
that the attributes selected by C4.5 can make the
Naive Bayes classifier outperform C4.5 on
a number of datasets.
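
The greedy forward search mentioned above can be sketched in a few
lines of Python. This is a generic illustration and not the exact
procedure of Langley & Sage (1994): in practice the score would be an
estimate of classifier accuracy, whereas here toy_score is a
hypothetical stand-in in which "age_copy" is redundant with "age" and
every attribute carries a small cost.

```python
def greedy_forward_selection(attributes, score):
    """Start from the empty set and repeatedly add the attribute that
    improves the score most; stop when no addition improves it."""
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # greedy: earlier choices are never reconsidered
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical score: informative attributes add value, a redundant
    copy adds nothing once the original is present."""
    weights = {"age": 0.30, "income": 0.20, "noise": 0.0}
    chosen = set(subset)
    value = sum(weights.get(a, 0.0) for a in chosen)
    if "age_copy" in chosen and "age" not in chosen:
        value += 0.30
    return value - 0.01 * len(chosen)


selected, final_score = greedy_forward_selection(
    ["age", "age_copy", "income", "noise"], toy_score)
```

In this toy run the redundant and uninformative attributes are left
out because adding them never raises the score.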

Transforming the attributes is another useful
pre-processing procedure for the Naive Bayes classifier.
Gupta (Gupta, 2004) found that Principal Component
Analysis (PCA) helped to improve the classification
accuracy and reduce the computational complexity.
Prasad (Prasad, 2004) applied Independent Component
Analysis (ICA) to all the training data and found
that the Naive Bayes classifier integrated with
ICA performed better than C4.5 and IB1 integrated
with ICA. Bressan and Vitria (Bressan & Vitria, 2002)
proposed the class-conditional ICA (CC-ICA) as a
pre-processing procedure for the Naive Bayes classifier,
and found that the CC-ICA based Naive Bayes classifier
outperformed the pure Naive Bayes classifier.

Based on the UCI datasets, a detailed comparative
study of PCA, ICA and CC-ICA for the Naive Bayes
classifier has been carried out by Fan & Poh (Fan & Poh,
2007). PCA attempts to transform the original data
into a new uncorrelated dataset, while ICA attempts
to transform them into a new dataset with independent
attributes. Class-conditional ICA (CC-ICA), proposed
by Bressan and Vitria (2002), is built upon the idea that
ICA is used to make the attributes as independent as
possible for each class. In this way, the new attributes
are better suited than those from PCA and ICA
to satisfying the independence assumption of the
Naive Bayes classifier.
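
The decorrelating effect of PCA can be checked numerically. The sketch
below handles only the two-attribute case, where the principal axes of
the 2x2 covariance matrix can be found from a single rotation angle, so
no linear algebra library is needed. The function names and the sample
data are assumptions made for this illustration. Note that removing
correlation does not by itself guarantee statistical independence,
which is the stronger property ICA aims at.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pca_2d(xs, ys):
    """Center two attributes and rotate them onto their principal axes."""
    mx, my = mean(xs), mean(ys)
    var_x, var_y = covariance(xs, xs), covariance(ys, ys)
    cov_xy = covariance(xs, ys)
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    c, s = math.cos(theta), math.sin(theta)
    first = [(x - mx) * c + (y - my) * s for x, y in zip(xs, ys)]
    second = [-(x - mx) * s + (y - my) * c for x, y in zip(xs, ys)]
    return first, second


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
pc1, pc2 = pca_2d(xs, ys)
```

The original attributes are strongly correlated, while the covariance
between pc1 and pc2 is zero up to rounding error.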

The datasets were limited to continuous datasets
because of the requirements of the three pre-processing
procedures. The results showed that all three
pre-processing procedures can improve the performance
of the Naive Bayes classifier. This is likely due to the
fact that transforming the attributes can weaken the
dependence among different attributes. In addition,
the discrepancy between the performance of ICA
and PCA integrated with the Naive Bayes classifier
is not large. This may be an indication that PCA and
ICA are competitive in improving the performance of the
Naive Bayes classifier. When the number of attributes
became larger, the three pre-processing procedures
improved the performance of the Naive Bayes
classifier by a greater margin.

From the methodological point of view, the CC-ICA 
pre-processing procedure seems to be more plausible 
than PCA and ICA for Naive Bayes classifier (Bressan 
and Vitria, 2002; Vitria, Bressan, & Radeva, 2007). The 
experimental results by Fan & Poh (Fan & Poh, 2007) 
also showed that CC-ICA integrated with the Naive 
Bayes classifier outperforms PCA and ICA integrated 
with the Naive Bayes classifier in terms of classification 
accuracy. However, CC-ICA requires more training 
data to ensure that there are enough training data for 
each class. It is therefore suggested that the choice of 
a suitable pre-processing procedure should depend on 
the characteristics of datasets, e.g. the sample size for 
each class. 



FUTURE TRENDS 

With the development of algorithms for learning BNs,
relaxing the independence assumption is promising for
improving the performance of the Naive Bayes classifier.
However, relaxing the independence assumption
all the way to an unrestricted BN is not appropriate.
Friedman et al. (Friedman, Geiger, & Goldszmidt, 1997)
compared the Naive Bayes classifier and Bayesian Network
and found that using an unrestricted BN did not improve the
accuracy. On the contrary, it even reduced the accuracy
in some domains. Therefore, other restricted BNs may
be used for improving the performance while keeping
the simplicity of the Naive Bayes classifier. An effective
and simple learning algorithm is also important for
improving the performance.

On the other hand, with the development of al- 
gorithms for machine learning, more pre-processing 
procedures are expected to be developed for selecting 
or transforming the attributes. One possible way to 
get better performance is to combine feature selection 
with transformation techniques to do the pre-process- 
ing procedures. Among the alternative techniques for 
doing pre-processing procedures, the most promising 
one might be ICA. The reason is that the motivation of 
the pre-processing procedures is to derive the attributes 
satisfying the independence assumption for the Naive 
Bayes classifier while the objective of ICA is to find 
the independent components. However, there are also 
some limitations on the use of ICA, e.g. the require- 
ments of continuous datasets and a large number of 
training samples. How to overcome these limitations 
is therefore a potential area for future research. 



CONCLUSION 

This article briefly discusses the techniques that
can be used to improve the performance of the Naive
Bayes classifier. The general idea is to overcome the
limitation of the strong independence assumption of the
Naive Bayes classifier. Relaxing the strong assumption
is a natural way and has been studied from different
viewpoints. All the approaches to relaxing the assumption
discussed in the article are restricted Bayesian Networks,
which are still the most practicable techniques. In
addition, pre-processing procedures are also very useful
for making the attributes satisfy the independence
assumption. However, using these approaches increases the
computational complexity to a certain extent. It would be
useful to model correlations among appropriate attributes
that can be captured by a simple restricted structure with
good performance.

KEY TERMS

Decision Trees: A decision tree is a classifier in the
form of a tree structure, where each node is either a
leaf node or a decision node. A decision tree can be
used to classify an instance by starting at the root of
the tree and moving through it until reaching a leaf node,
which provides the classification of the instance. A
well-known and frequently used decision tree algorithm
is C4.5.

Forward Selection and Backward Elimination:
A forward selection method starts with the empty
set and successively adds attributes, while a backward
elimination process begins with the full set and
removes unwanted ones.

Greedy Search: At each point in the search, the 
algorithm considers all local changes to the current 
set of attributes, makes its best selection, and never 
reconsiders this choice. 

Independent Component Analysis (ICA): Inde- 
pendent component analysis (ICA) is a newly developed 
technique for finding hidden factors or components to 
give a new representation of multivariate data. ICA 
could be thought of as a generalization of PCA. PCA 
tries to find uncorrelated variables to represent the 
original multivariate data, whereas ICA attempts to 
obtain statistically independent variables to represent 
the original multivariate data. 

Naive Bayes Classifier: The Naive Bayes classifier,
also called simple Bayesian classifier, is essentially a
simple Bayesian Network (BN). There are two underlying
assumptions in the Naive Bayes classifier. First,
all attributes are independent of each other given the
classification variable. Second, all attributes are directly
dependent on the classification variable. The Naive Bayes
classifier computes the posterior of the classification
variable given a set of attributes by using the Bayes rule
under the conditional independence assumption.

Principal Component Analysis (PCA): PCA is
a popular tool for multivariate data analysis, feature
extraction and data compression. Given a set of
multivariate measurements, the purpose of PCA is to find a
set of variables with less redundancy. The redundancy
is measured by correlations between data elements.

UCI Repository: A repository of databases,
domain theories and data generators that are used by the
machine learning community for the empirical analysis
of machine learning algorithms.

In this chapter we discuss how fuzzy logic extends the
envelope of the main data mining tasks: clustering,
classification, regression and association rules. We begin
by presenting a formulation of data mining using
fuzzy logic attributes. Then, for each task, we provide a
survey of the main algorithms and a detailed description
(i.e. pseudo-code) of the most popular algorithms.



INTRODUCTION 

There are two main types of uncertainty in supervised 
learning: statistical and cognitive. Statistical uncer- 
tainty deals with the random behavior of nature and 
all existing data mining techniques can handle the 
uncertainty that arises (or is assumed to arise) in the 
natural world from statistical variations or randomness. 
Cognitive uncertainty, on the other hand, deals with 
human cognition. 

Fuzzy set theory, first introduced by Zadeh in 1965,
deals with cognitive uncertainty and seeks to overcome 
many of the problems found in classical set theory. 
For example, a major problem faced by researchers of 
control theory is that a small change in input results 
in a major change in output. This throws the whole 
control system into an unstable state. In addition there 
was also the problem that the representation of subjec- 
tive knowledge was artificial and inaccurate. Fuzzy 
set theory is an attempt to confront these difficulties 
and in this chapter we show how it can be used in data 
mining tasks. 



BACKGROUND 

Data mining is a term coined to describe the process
of sifting through large and complex databases to
identify valid, novel, useful, and understandable
patterns and relationships. Data mining involves the
inferring of algorithms that explore the data, develop
the model and discover previously unknown patterns.
The model is used for understanding phenomena from
the data, analysis and prediction. The accessibility and
abundance of data today makes knowledge discovery
and data mining a matter of considerable importance
and necessity.

We begin by presenting some of the basic concepts 
of fuzzy logic. The main focus, however, is on those 
concepts used in the induction process when dealing 
with data mining. Since fuzzy set theory and fuzzy 
logic are much broader than the narrow perspective 
presented here, the interested reader is encouraged to 
read Zimmermann (2005). 

In classical set theory, a certain element either be- 
longs or does not belong to a set. Fuzzy set theory, on 
the other hand, permits the gradual assessment of the 
membership of elements in relation to a set. 

Let U be a universe of discourse, representing a
collection of objects denoted generically by u. A fuzzy
set A in a universe of discourse U is characterized by
a membership function $\mu_A$ which takes values in the
interval [0, 1], where $\mu_A(u) = 0$ means that u is
definitely not a member of A and $\mu_A(u) = 1$ means
that u is definitely a member of A.

The above definition can be illustrated on the 
vague set of Young. In this case the set U is the set 
of people. To each person in U, we define the degree 
of membership to the fuzzy set Young. The member- 
ship function answers the question "to what degree is 
person u young?". The easiest way to do this is with a 
membership function based on the person's age. For 
example Figure 1 presents the following membership 
function: 



$$
\mu_{Young}(u) =
\begin{cases}
0 & age(u) > 32 \\
1 & age(u) < 16 \\
\dfrac{32 - age(u)}{16} & \text{otherwise}
\end{cases}
\tag{1}
$$

Figure 1. Membership function for the young set





Given this definition, John, who is 18 years old, has a
degree of youth of 0.875. Philip, 20 years old, has a degree
of youth of 0.75. Unlike probability theory, degrees
of membership do not have to add up to 1 across all
objects and therefore either many or few objects in the
set may have high membership. However, an object's
membership in a set (such as "young") and the set's
complement ("not young") must still sum to 1.

The main difference between classical set theory
and fuzzy set theory is that the latter admits partial
set membership. A classical or crisp set, then, is a fuzzy
set that restricts its membership values to {0,1}, the
endpoints of the unit interval. Membership functions
can be used to represent a crisp set. For example, Figure
2 presents a crisp membership function defined as:

$$
\mu_{CrispYoung}(u) =
\begin{cases}
0 & age(u) > 22 \\
1 & age(u) \le 22
\end{cases}
\tag{2}
$$

Figure 2. Membership function for the crisp young set
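
As a minimal sketch of Equations 1 and 2, the two membership functions can be written directly in Python. The function names and the third test person are illustrative assumptions, not part of the chapter; the boundary case age = 22 is assigned to the crisp set here.

```python
def mu_young(age: float) -> float:
    """Fuzzy membership in "Young" (Equation 1)."""
    if age < 16:
        return 1.0
    if age > 32:
        return 0.0
    return (32 - age) / 16


def mu_crisp_young(age: float) -> float:
    """Crisp membership in "Young" (Equation 2): values restricted to {0, 1}.

    Assumption: age == 22 is treated as a member of the set.
    """
    return 1.0 if age <= 22 else 0.0


# "Anna" is a hypothetical extra example; John and Philip come from the text.
for name, age in [("John", 18), ("Philip", 20), ("Anna", 25)]:
    young = mu_young(age)
    # Membership in the complement "not young" is 1 - membership in "young".
    print(name, young, 1 - young, mu_crisp_young(age))
```

The fuzzy function degrades gradually between 16 and 32, whereas the crisp function jumps from 1 to 0 at a single age.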



In regular classification problems, we assume that
each instance takes one value for each attribute and
that each instance is classified into only one of the
mutually exclusive classes. To illustrate how fuzzy
logic can help data mining tasks, we introduce the
problem of modeling the preferences of TV viewers. In
this problem there are 3 input attributes: A = {Time of
Day, Age Group, Mood}. The classification can be the
movie genre that the viewer would like to watch, such
as C = {Action, Comedy, Drama}. All the attributes are
vague by definition. For example, people's feelings of
happiness, indifference, sadness, sourness and grumpiness
are vague, without any crisp boundaries between
them. Although the vagueness of "Age Group" or
"Time of Day" can be avoided by indicating the exact
age or exact time, a rule induced with a crisp decision
tree may then have an artificial crisp boundary, such
as "IF Age < 16 THEN action movie". But how about
someone who is 17 years of age? Should this viewer
definitely not watch an action movie? The viewer's
preferred genre may still be vague. For example, the viewer
may be in a mood for both comedy and drama movies.
Moreover, the association of movies into genres may
also be vague. For instance, the movie "Lethal Weapon"
(starring Mel Gibson and Danny Glover) is considered
to be both a comedy and an action movie.

Fuzzy concepts can be introduced into a classical data
mining task if at least one of the attributes is fuzzy. In
the example described above, both input and target
attributes are fuzzy. Formally the problem is defined
as follows: Each class $c_j$ is defined as a fuzzy set on
the universe of objects U. The membership function
$\mu_{c_j}(u)$ indicates the degree to which object u belongs
to class $c_j$. Each attribute $a_i$ is defined as a linguistic
attribute which takes linguistic values from
$dom(a_i) = \{v_{i,1}, v_{i,2}, \ldots, v_{i,|dom(a_i)|}\}$.
Each linguistic value $v_{i,k}$ is also a
fuzzy set defined on U. The membership $\mu_{v_{i,k}}(u)$
specifies the degree to which object u's attribute $a_i$ is $v_{i,k}$.
Recall that the membership of a linguistic value can
be subjectively assigned or transferred from numerical
values by a membership function defined on the range
of the numerical value.



Typically, before one can incorporate fuzzy concepts 
into a data mining application, an expert is required to 
provide the fuzzy sets for the quantitative attributes, 
along with their corresponding membership functions 
(Mitra and Pal, 2005). Alternatively the appropriate 
fuzzy sets are determined using fuzzy clustering. 



MAIN FOCUS OF THE CHAPTER 

Fuzzy Supervised Learning 

In this section we survey supervised methods that in- 
corporate fuzzy sets. Supervised methods are methods 
that attempt to discover the relationship between input 
attributes and a target attribute (sometimes referred to 
as a dependent variable). The relationship discovered is 
represented in a structure referred to as a model. Usu- 
ally models describe and explain phenomena, which 
are hidden in the dataset and can be used for predicting 
the value of the target attribute knowing the values of 
the input attributes. 

It is useful to distinguish between two main types of
supervised models: classification models (classifiers) and
regression models. Regression models map the input
space into a real-valued domain. For instance, a regressor
can predict the demand for a certain product given its
characteristics. On the other hand, classifiers map the
input space into pre-defined classes.

Fuzzy set theoretic concepts can be incorporated at
the input, the output, or into the backbone of the classifier.
The data can be presented in fuzzy terms and the output
decision may be provided as fuzzy membership values
(Peng, 2004). In this chapter we will concentrate on
fuzzy decision trees. The interested reader is encouraged
to read also about soft regression (Shnaider et
al., 1997) and neuro-fuzzy methods (Mitra and Hayashi, 2000;
Nauck, 1997).

A decision tree is a predictive model which can be used
to represent classifiers. Decision trees are frequently
used in applied fields such as finance, marketing,
engineering and medicine. Decision trees are self-explanatory:
there is no need to be an expert in data mining in order
to follow a certain decision tree.

There are several algorithms for induction of fuzzy
decision trees (Olaru and Wehenkel, 2003). Most of
them extend existing decision tree methods, such as
Fuzzy-CART (Jang, 1994) and Fuzzy-ID3 (Cios and
Sztandera, 1992; Maher and Clair, 1993). Another complete
framework for building a fuzzy tree, including several
inference procedures based on conflict resolution in
rule-based systems and efficient approximate reasoning
methods, was presented in (Janikow, 1998).

In this section we will focus on the algorithm pro- 
posed in Yuan and Shaw (1995). This algorithm can 
handle the classification problems with both fuzzy 
attributes and fuzzy classes represented in linguistic 
fuzzy terms. It can also handle other situations in a 
uniform way where numerical values can be fuzzified 
to fuzzy terms and crisp categories can be treated as 
a special case of fuzzy terms with zero fuzziness. The 
algorithm uses classification ambiguity as fuzzy en- 
tropy. The classification ambiguity directly measures 
the quality of classification rules at the decision node. 
It can be calculated under fuzzy partitioning and mul- 
tiple fuzzy classes. 

When a certain attribute is numerical, it needs to 
be fuzzified into linguistic terms before it can be used 
in the algorithm (Hong et al., 1999). The fuzzification 
process can be performed manually by experts or can 
be derived automatically using some sort of cluster- 
ing algorithm. Clustering groups the data instances 
into subsets in such a manner that similar instances 
are grouped together; different instances belong to 
different groups. The instances are thereby organized 
into an efficient representation that characterizes the 
population being sampled. 
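
A minimal sketch of manual fuzzification with piecewise-linear membership functions follows. The corner points for "young" (0,0,16,32) and "early adulthood" (16,32,48) are those of the age attribute example in this chapter; the corners for "middle-aged" and "old age", and all function names, are assumptions made only for illustration.

```python
import math


def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy set with corners a <= b <= c <= d."""
    if b <= x <= c:          # plateau
        return 1.0
    if x <= a or x >= d:     # outside the support
        return 0.0
    if x < b:                # rising edge
        return (x - a) / (b - a)
    return (d - x) / (d - c)  # falling edge


def triangle(x, a, b, c):
    """A triangular set is a trapezoid whose plateau is the single point b."""
    return trapezoid(x, a, b, b, c)


AGE_TERMS = {
    "young": lambda x: trapezoid(x, 0, 0, 16, 32),
    "early adulthood": lambda x: triangle(x, 16, 32, 48),
    # The corners below are assumed; the chapter does not list them.
    "middle-aged": lambda x: triangle(x, 32, 48, 64),
    "old age": lambda x: trapezoid(x, 48, 64, math.inf, math.inf),
}


def fuzzify_age(age):
    """Map a numerical age to its membership grade in every linguistic term."""
    return {term: mu(age) for term, mu in AGE_TERMS.items()}


print(fuzzify_age(18))
```

With these corners an 18-year-old is "young" to degree 0.875 and in "early adulthood" to degree 0.125, so a single numerical value activates two neighboring linguistic terms.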



One can use a simple algorithm to generate a set
of membership functions on numerical data. Assume
attribute $a_i$ has numerical value x from the domain X.
We can cluster X to k linguistic terms $v_{i,j}$, j = 1,...,k. The
size of k is manually predefined. Figure 3 illustrates
the creation of four groups defined on the age attribute:
"young", "early adulthood", "middle-aged" and "old
age". Note that the first set ("young") and the last
set ("old age") have a trapezoidal form which can be
uniquely described by the four corners. For example,
the "young" set could be represented as (0,0,16,32). In
between, all other sets ("early adulthood" and "middle-aged")
have a triangular form which can be uniquely
described by the three corners. For example, the set
"early adulthood" is represented as (16,32,48).

Figure 3. Membership function for various groups in the age attribute

The induction algorithm of the fuzzy decision tree
measures the classification ambiguity associated with
each attribute and splits the data using the attribute with
the smallest classification ambiguity. The classification
ambiguity of attribute $a_i$ with linguistic terms $v_{i,j}$,
j = 1,...,k on fuzzy evidence S, denoted as $G(a_i | S)$,
is the weighted average of classification ambiguity
calculated as:

$$ G(a_i | S) = \sum_{j=1}^{k} w(v_{i,j} | S) \cdot G(v_{i,j} | S) \tag{3} $$

where $w(v_{i,j} | S)$ is the weight which represents the
relative size of $v_{i,j}$ and is defined as:

$$ w(v_{i,j} | S) = \frac{M(v_{i,j} | S)}{\sum_{k} M(v_{i,k} | S)} \tag{4} $$

The classification ambiguity of $v_{i,j}$ is defined as

$$ G(v_{i,j} | S) = g\left(p(C | v_{i,j})\right) $$

which is measured based on the possibility distribution
vector

$$ p(C | v_{i,j}) = \left(p(c_1 | v_{i,j}), \ldots, p(c_{|C|} | v_{i,j})\right) $$

Given $v_{i,j}$, the possibility of classifying an object to
class $c_l$ can be defined as:

$$ p(c_l | v_{i,j}) = \frac{S(v_{i,j}, c_l)}{\max_{k} S(v_{i,j}, c_k)} \tag{5} $$



where S(A,B) is the fuzzy subsethood that measures the
degree to which A is a subset of B. The subsethood
can be used to measure the truth level of
classification rules. For example, given a classification
rule such as "IF Age is Young AND Mood is Happy
THEN Comedy", we have to calculate
$S(Young \cap Happy, Comedy)$ in order to measure the truth
level of the classification rule.

The function g(p) is the possibilistic measure of
ambiguity or nonspecificity and is defined as:

$$ g(p) = \sum_{i=1}^{|p|} \left(p^*_i - p^*_{i+1}\right) \cdot \ln(i) \tag{6} $$

where

$$ p^* = \left(p^*_1, \ldots, p^*_{|p|}\right) $$

is the permutation of the possibility distribution p
sorted such that $p^*_i \ge p^*_{i+1}$, with $p^*_{|p|+1} = 0$.
All the above calculations
are carried out at a predefined significance level $\alpha$. An
instance is taken into consideration in a certain branch
$v_{i,j}$ only if its corresponding membership is greater
than $\alpha$. This parameter is used to filter out insignificant
branches.
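
The quantities in Equations 3 to 6 can be sketched in a few lines of Python. Two points are assumptions rather than statements of the chapter: the cardinality M of a fuzzy set is taken as the sum of its membership grades, and the subsethood is taken as S(A,B) = M(A ∩ B) / M(A) with the minimum as intersection. The significance level is omitted for brevity, and all names are illustrative.

```python
import math


def ambiguity(p):
    """Nonspecificity g(p) of a possibility distribution (Equation 6)."""
    ranked = sorted(p, reverse=True) + [0.0]   # convention: p*_{n+1} = 0
    return sum((ranked[i] - ranked[i + 1]) * math.log(i + 1)
               for i in range(len(p)))


def subsethood(a, b):
    """Assumed definition: S(A, B) = M(A ∩ B) / M(A), min as intersection."""
    size = sum(a)
    return sum(min(x, y) for x, y in zip(a, b)) / size if size > 0 else 0.0


def possibility(term, classes):
    """Normalized possibility of each class given a linguistic term (Equation 5)."""
    s = [subsethood(term, c) for c in classes]
    top = max(s)
    return [x / top if top > 0 else 0.0 for x in s]


def attribute_ambiguity(terms, classes):
    """G(a_i | S): size-weighted average over linguistic terms (Equations 3-4)."""
    sizes = [sum(t) for t in terms]
    total = sum(sizes)
    return sum((m / total) * ambiguity(possibility(t, classes))
               for t, m in zip(terms, sizes) if m > 0)


# Each list holds the membership grades of the same two objects.
classes = [[1.0, 0.0], [0.0, 1.0]]
informative = [[1.0, 0.0], [0.0, 1.0]]   # terms separate the classes
uninformative = [[1.0, 1.0]]             # one term covers everything
print(attribute_ambiguity(informative, classes))
print(attribute_ambiguity(uninformative, classes))
```

An attribute whose terms separate the classes perfectly has ambiguity 0, while an attribute that says nothing about two equally possible classes has ambiguity ln 2, so the induction step would prefer the former.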

After partitioning the data using the attribute with the 
smallest classification ambiguity, the algorithm looks 
for nonempty branches. For each nonempty branch, 
the algorithm calculates the truth level of classifying 
all instances within the branch into each class. The 
truth level is calculated using the fuzzy subsethood 
measure S(A,B). 

If the truth level of one of the classes is above a
predefined threshold $\beta$ then no additional partitioning
is needed and the node becomes a leaf in which all
instances will be labeled with the class with the highest
truth level. Otherwise the procedure continues in a
recursive manner. Note that small values of $\beta$ will lead
to smaller trees with the risk of underfitting. A higher
$\beta$ may lead to a larger tree with higher classification
accuracy. However, at a certain point, higher values of $\beta$
may lead to overfitting.

In a regular decision tree, only one path (rule) can 
be applied for every instance. In a fuzzy decision tree, 
several paths (rules) can be applied for one instance. In 
order to classify an unlabeled instance, the following 
steps should be performed: 

Step 1: Calculate the membership of the instance
for the condition part of each path (rule). This
membership will be associated with the label
(class) of the path.

Step 2: For each class calculate the maximum
membership obtained from all applied rules.

Step 3: An instance may be classified into several
classes with different degrees based on the
membership calculated in Step 2.
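
The three steps above can be sketched as follows. The sketch assumes that the membership of a rule's condition part is the minimum over its conditions (a common choice for the fuzzy AND; the chapter does not fix the operator), and the rules and grades are hypothetical.

```python
def classify(instance, rules):
    """Return the degree to which an instance belongs to each class.

    instance: {attribute: {linguistic term: membership grade}}
    rules:    list of (conditions, label), where conditions is a list of
              (attribute, linguistic term) pairs along one path of the tree.
    """
    degrees = {}
    for conditions, label in rules:
        # Step 1: membership of the instance in the condition part of the
        # path (assumption: fuzzy AND is modelled by the minimum).
        firing = min(instance[attr].get(term, 0.0) for attr, term in conditions)
        # Step 2: keep the maximum membership per class over all rules.
        degrees[label] = max(degrees.get(label, 0.0), firing)
    # Step 3: the instance belongs to several classes with different degrees.
    return degrees


# Hypothetical viewer and rules for the TV-genre example.
viewer = {
    "Age Group": {"young": 0.875, "early adulthood": 0.125},
    "Mood": {"happy": 0.6, "sad": 0.4},
}
rules = [
    ([("Age Group", "young"), ("Mood", "happy")], "Comedy"),
    ([("Age Group", "young"), ("Mood", "sad")], "Drama"),
    ([("Age Group", "early adulthood")], "Action"),
]
print(classify(viewer, rules))
```

Unlike a crisp tree, every path contributes, so the viewer receives a graded preference for all three genres instead of a single label.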

Fuzzy Clustering 

The goal of clustering is descriptive, that of classification 
is predictive. Since the goal of clustering is to discover 
a new set of categories, the new groups are of inter- 
est in themselves, and their assessment is intrinsic. In 
classification tasks, however, an important part of the 
assessment is extrinsic, since the groups must reflect 
some reference set of classes. 

Clustering groups data instances into subsets in such
a manner that similar instances are grouped together,
while different instances belong to different groups.
The instances are thereby organized into an efficient
representation that characterizes the population being
sampled. Formally, the clustering structure is represented
as a set of subsets $C = C_1, \ldots, C_k$ of S, such that:

$$ S = \bigcup_{i=1}^{k} C_i $$

and $C_i \cap C_j = \emptyset$ for $i \ne j$. Consequently, any instance
in S belongs to exactly one subset.

Traditional clustering approaches generate partitions;
in a partition, each instance belongs to one and
only one cluster. Hence, the clusters in a hard clustering
are disjoint. Fuzzy clustering (Nasraoui and
Krishnapuram, 1997; Shnaider et al., 1997) extends this
notion and suggests a soft clustering schema. In this
case, each pattern is associated with every cluster using
some sort of membership function; namely, each cluster
is a fuzzy set of all the patterns. Larger membership
values indicate higher confidence in the assignment
of the pattern to the cluster. A hard clustering can be
obtained from a fuzzy partition by using a threshold
on the membership value.

The most popular fuzzy clustering algorithm is the
fuzzy c-means (FCM) algorithm. FCM is an iterative
algorithm whose aim is to find cluster centers
(centroids) that minimize a dissimilarity function. To
accommodate the introduction of fuzzy partitioning,
the membership matrix U is randomly initialized
according to Equation 7.

$$ \sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n \tag{7} $$

The algorithm minimizes a dissimilarity (or distance)
function which is given in Equation 8:

$$ J(U, c_1, c_2, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \tag{8} $$

where $u_{ij}$ is between 0 and 1; $c_i$ is the centroid of cluster
i; $d_{ij}$ is the Euclidean distance between the i-th centroid and
the j-th data point; and m is a weighting exponent.

To reach a minimum of the dissimilarity function
two conditions must hold. These are given in Equation 9 and
Equation 10, where $x_j$ denotes the j-th data point.

$$ c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{9} $$

$$ u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}} \tag{10} $$




By iteratively updating the cluster centers and the
membership grades for each data point, FCM
moves the cluster centers to the "right" location
within a data set. However, FCM is not guaranteed
to converge to an optimal solution, and the random
initialization of U may have a lasting effect on the
final performance.
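
A compact NumPy sketch of the FCM iteration is given below. It alternates the standard centroid and membership updates that minimize the dissimilarity function; the function name, stopping rule, default parameters and the tiny data set are assumptions chosen for illustration.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means on data X of shape (n, d); returns (centers, U).

    U has shape (c, n); column j holds the memberships of data point j.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)            # memberships of a point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # centroid update
        # d[i, j]: Euclidean distance between centroid i and data point j.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                    # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)          # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(centers, 2))
print(np.round(U, 3))
```

Because the result depends on the random initial U, running the sketch with several seeds and keeping the solution with the lowest dissimilarity is a common safeguard.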

Fuzzy Association Rules 

Association rules are rules of the kind "70% of the
customers who buy wine and cheese also buy grapes".
While the traditional field of application is market basket
analysis, association rule mining has since been applied to
various other fields, which has led to a number of
important modifications and extensions.

A fuzzy association algorithm is proposed in Komem
and Schneider (2005). The quantitative values are first
transformed into a set of membership grades, by using
predefined membership functions. Every membership
grade represents the agreement of a quantitative value
with a linguistic term. In order to avoid discriminating
the importance level of data, each point must have a
membership grade of 1 in one membership function. Thus,
the membership functions of each attribute produce
a continuous line of $\mu = 1$. Additionally, in order to
diagnose the bias direction of an item from the center
of a membership function region, almost every point
gets another membership grade, lower than 1,
in a neighboring membership function region. Thus, each end
of a membership function region is touching, close to,
or slightly overlapping an end of another membership
function (except the outside regions, of course).

By this mechanism, as point "a" moves right, further
from the center of the region "middle", it gets a higher
value of the label "middle-high", in addition to the
value 1 of the label "middle".



FUTURE TRENDS 

Some of the challenges of using fuzzy theory in data
mining tasks include the following:

1. Incorporation of domain knowledge for improving
the fuzzy modeling.

2. Developing methods for presenting fuzzy data
models to the end-users.

3. Efficient integration of fuzzy logic in data mining
tools.

4. A hybridization of fuzzy sets with data mining
techniques.



CONCLUSIONS 

This chapter discussed how fuzzy logic can be used
to solve several different data mining tasks, namely
classification, clustering, and discovery of association
rules. The discussion focused mainly on one representative
algorithm for each of these tasks.

There are at least two motivations for using fuzzy
logic in data mining, broadly speaking. First, as
mentioned earlier, fuzzy logic can produce more abstract
and flexible patterns, since many quantitative features
are involved in data mining tasks. Second, the crisp
usage of metrics is better replaced by fuzzy sets that
can reflect, in a more natural manner, the degree of
belongingness/membership to a class or a cluster.



KEY TERMS

Association Rules: Techniques that find in a database
conjunctive implication rules of the form "X and
Y implies A and B."

Attribute: A quantity describing an instance. An 
attribute has a domain defined by the attribute type, 
which denotes the values that can be taken by an at- 
tribute. 

Classifier: A structured model that maps unlabeled
instances to a finite set of classes.

Clustering: The process of grouping data instances
into subsets in such a manner that similar instances are
grouped together into the same cluster, while different
instances belong to different clusters.

Data Mining: The core of the KDD process, involv- 
ing the inferring of algorithms that explore the data, 
develop the model, and discover previously unknown 
patterns. 

Fuzzy Logic: A type of logic that recognizes more
than simple true and false values. With fuzzy logic,
propositions can be represented with degrees of
truthfulness and falsehood; thus it can deal with imprecise
or ambiguous data. Boolean logic is considered to be
a special case of fuzzy logic.

Instance: A single object of the world from which 
a model will be learned, or on which a model will be 
used. 

Knowledge Discovery in Databases (KDD): A
nontrivial exploratory process of identifying valid,
novel, useful, and understandable patterns from large
and complex data repositories.



INTRODUCTION

Several unsupervised learning topics have been
extensively studied, with wide applications, for decades
in the literature of statistics, signal processing, and
machine learning. The topics are mutually related and
certain connections have been partly discussed, but
a systematic overview is still needed. This article
provides a unified perspective via a general framework
of independent subspaces, with different topics
featured by differences in choosing and combining
three ingredients. Moreover, an overview is made via
three streams of studies. One consists of those on the
widely studied principal component analysis (PCA) and
factor analysis (FA), featured by second order
independence. The second consists of studies on a higher
order independence featured independent component
analysis (ICA), binary FA, and nonGaussian FA. The
third is called mixture based learning, which combines
individual jobs to fulfill a complicated task. The extensive
literature makes it impossible to provide a complete
review. Instead, we aim at sketching a roadmap for
each stream, with attention to those topics missing
in the existing surveys and textbooks, and limited to
the authors' knowledge.



A GENERAL FRAMEWORK OF 
INDEPENDENT SUBSPACES 

A number of unsupervised learning topics are featured
by how they handle a fundamental task. As shown in
Fig. 1(b), every sample $x$ is projected into $\hat{x}$ on a
manifold, and the error $e = x - \hat{x}$ of using $\hat{x}$ to represent
$x$ is minimized collectively on a set of samples. One
widely studied situation is that the manifold is a subspace
represented by linear coordinates, e.g., spanned by three
linearly independent basis vectors $a_1, a_2, a_3$ as shown in
Fig. 1(a). So, $\hat{x}$ can be represented by its projection
$y^{(j)}$ on each basis vector, i.e.,

$$ \hat{x} = \sum_{j=1}^{3} y^{(j)} a_j \quad \text{or} \quad x = \hat{x} + e = Ay + e, \quad y = [y^{(1)}, y^{(2)}, y^{(3)}]^T. \tag{1} $$

Typically, the error $e = x - \hat{x}$ is measured by the
square norm, which is minimized when $e$ is orthogonal
to $\hat{x}$. Collectively, the minimization of the average error
$\|e\|^2$ on a set of samples or its expectation $E\|e\|^2$ is featured
by those natures given at the bottom of Fig. 1(a).
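
The orthogonality property can be checked numerically. The following sketch, with an arbitrarily chosen basis A and sample x, computes the least-squares coordinates y and confirms that the residual e is orthogonal to the projection; the names are illustrative.

```python
import numpy as np


def project(A, x):
    """Least-squares projection of x onto the subspace spanned by the columns of A.

    Returns the coordinates y, the projection x_hat = A y and the error e.
    """
    y, *_ = np.linalg.lstsq(A, x, rcond=None)
    x_hat = A @ y
    return y, x_hat, x - x_hat


# Two basis vectors spanning a plane in three dimensions (illustrative values).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])

y, x_hat, e = project(A, x)
print("y     =", y)
print("x_hat =", x_hat)
print("e     =", e)
print("A^T e =", A.T @ e)   # numerically zero: e is orthogonal to the subspace
```

Since e is orthogonal to every basis vector, it is in particular orthogonal to x_hat, which is the condition stated for the square norm.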

Generally, the task consists of three ingredients, as
shown in Fig. 2. The first is how the error $e = x - \hat{x}$ is
measured. Different measures define different projections.
The square norm $d = \|e\|^2$ applies to a homogeneous
medium between $x$ and $\hat{x}$. Other measures are needed
for inhomogeneous mediums. In Fig. 1(c), a non-orthogonal
but still linear projection is considered via
$d = \|e\|_B^2 = e^T \Sigma_e^{-1} e$ with $\Sigma_e^{-1} = B^T B$, as if $e$ is first mapped
to a homogeneous medium by a linear mapping $Be$ and
then measured by the square norm. Shown at the bottom
of Fig. 1(c) are the natures of this $\min \|e\|_B^2$. Being
considerably different from those of $\min \|e\|^2$, more
assumptions have to be imposed externally.
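
A small sketch of the non-orthogonal projection follows. Minimizing the squared norm of Be over the coordinates y is an ordinary least-squares problem in the B-mapped space; the matrices below are arbitrary illustrative values.

```python
import numpy as np


def project_weighted(A, x, B):
    """Coordinates y minimizing ||B (x - A y)||^2, and the resulting error e."""
    # Map both the basis and the sample by B, then solve ordinary least squares.
    y, *_ = np.linalg.lstsq(B @ A, B @ x, rcond=None)
    return y, x - A @ y


A = np.array([[1.0], [0.0]])           # a single basis vector
x = np.array([1.0, 1.0])
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])             # illustrative mapping, Sigma_e^{-1} = B^T B

y, e = project_weighted(A, x, B)
print("y =", y, "e =", e)
print("A^T e         =", A.T @ e)              # not zero: projection is oblique
print("A^T B^T B e   =", A.T @ B.T @ B @ e)    # zero: orthogonal after mapping by B
```

The error is no longer orthogonal to the subspace in the original coordinates; orthogonality holds only with respect to the inner product induced by B^T B.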

The second ingredient is a coordinate system, via
either linear vectors in Fig. 1(a)&(c) or a set of curves
on a nonlinear manifold in Fig. 1(b). Moreover, there
is the third ingredient that imposes certain structure
to further constrain how y is distributed within the
coordinates, e.g., by the nature (d).

The differences in choosing and combining the 
three ingredients lead to different approaches. We use 
the name "independent subspaces" to denote those 
structures with the components of y being mutually 
independent, and get a general framework for accom- 
modating several unsupervised learning topics. 

Subsequently, we summarize them via three
streams of studies by considering
$d = \|e\|_B^2 = e^T \Sigma_e^{-1} e$ and two special cases,
three types of independence structure, and whether
there is temporal structure among samples,
varying from one linear coordinate system to
multiple linear coordinate systems at different
locations, as shown in Fig. 2.



STUDIES FEATURED BY SECOND
ORDER INDEPENDENCE

We start by considering independently and
identically distributed (i.i.d.) samples, linear coordinates and
an independent structure of a Gaussian $p(y^{(j)} \mid \theta^{(j)})$, with
the projection measure varying as illustrated within the
first column of the table in Fig. 2. We encounter factor
analysis (FA) in the general case $d = \|e\|_B^2 = e^T B^T B e$. At
the special case $B = \sigma_e^{-1} I$, the linear coordinates span
a principal subspace of data. Further imposing $A^T A = I$
and requiring the columns of A to be given by the first m
principal components (PCs), i.e., eigenvectors that
correspond to the largest eigenvalues of $\Sigma = (B^T B)^{-1}$,
it becomes equivalent to PCA. Moreover, at the
degenerated case e = 0, y = xW de-correlates components
of y, e.g., performing a pre-whitening as encountered
in signal processing.
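
As a hedged illustration of the two special cases, the sketch below computes the first m principal components from the eigen-decomposition of the sample covariance (batch PCA, not an adaptive rule) and a pre-whitening map of the form y = xW. Function names and the synthetic data are assumptions.

```python
import numpy as np


def pca_basis(X, m):
    """Top-m eigenvalues and eigenvectors of the sample covariance of X (n, d)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]         # indices of the m largest
    return eigvals[order], eigvecs[:, order]


def prewhiten(X, m):
    """De-correlate and rescale: y = x W with W = A diag(eigvals)^(-1/2)."""
    eigvals, A = pca_basis(X, m)
    W = A / np.sqrt(eigvals)                      # scales column j by 1/sqrt(lambda_j)
    return (X - X.mean(axis=0)) @ W


rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ mixing            # correlated synthetic samples

eigvals, A = pca_basis(X, m=2)
Y = prewhiten(X, m=2)
print(np.round(A.T @ A, 6))                       # identity: orthonormal columns
print(np.round(Y.T @ Y / len(Y), 6))              # identity: whitened components
```

The columns of A satisfy A^T A = I, and the components of y are uncorrelated with unit variance.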

We summarize studies on Roadmap A. The first
stream originated 100 years ago. The first adaptive
learning one is the Oja rule that finds the 1st PC (i.e., the
eigenvector that corresponds to the largest eigenvalue
of $\Sigma$), without explicitly estimating $\Sigma$. Extended
to find multi-PCs, one way is featured by either an
asymmetrical or a sequential implementation of the
1st-PC rule, but it suffers from error accumulation. Details
are referred to Refs. 5,6,67,76,96 in (Xu, 2007a). The
other way is finding multi-PCs symmetrically, e.g., the Oja
subspace rule. Further studies are summarized into the
following branches:

MCA, Dual Subspace, and TLS Fitting 

In (Xu, Krzyzak & Oja, 1991), a dual pattern recognition
is suggested by considering both the principal subspace
and its complementary subspace, as well as both the
multiple PCs and their complementary counterparts, i.e., the
components that correspond to the smallest eigenvalues
of $\Sigma$ (i.e., the row vectors of U in Fig. 2). Moreover,
the first adaptive rule is proposed by eqn. (11a) in (Xu,
Krzyzak & Oja, 1991) to get the component that
corresponds to the smallest eigenvalue of $\Sigma$, under the name
minor component analysis (MCA) first coined by Xu,
Oja & Suen (1992), and it is also used for implementing a
total least square (TLS) curve fitting. Subsequently, this
topic was brought to the signal processing literature
by Gao, Ahmad & Swamy (1992), motivated
by a visit of Gao to Xu's office where Xu introduced
him to the result of Xu, Oja & Suen (1992). Thereafter,
adaptive MCA learning for TLS filtering became a popular
topic in signal processing; see (Feng, Bao & Jiao, 1998)
and Refs. 24,30,58,60 in (Xu, 2007a).
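
The link between the minor component and TLS fitting can be illustrated with a batch computation (not the adaptive rule cited above): the eigenvector of the sample covariance with the smallest eigenvalue is the normal of the TLS line. Names and data are illustrative assumptions.

```python
import numpy as np


def tls_line(points):
    """Total least squares line fit for 2-D points of shape (n, 2).

    Returns (normal, offset) such that the line is {p : normal @ p = offset}.
    """
    mean = points.mean(axis=0)
    cov = np.cov((points - mean).T, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    normal = eigvecs[:, 0]                   # minor component
    return normal, normal @ mean


# Noise-free points on the line y = 2x + 1.
xs = np.arange(5.0)
points = np.column_stack([xs, 2.0 * xs + 1.0])

normal, offset = tls_line(points)
print("normal =", normal, "offset =", offset)
print("residuals =", points @ normal - offset)   # numerically zero
```

TLS minimizes perpendicular distances to the line, which is why the direction of least variance, rather than a regression coefficient, defines the fit.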

It was also suggested in (Xu, Krzyzak & Oja, 1992)
that an implementation of PCA or MCA is made by
switching the updating sign in the above eqn. (11a). Efforts
were subsequently made to examine the existing
PCA rules on whether they remain stable after such
a sign switching. These jobs usually need tedious
mathematical analyses of ODE stability, e.g., Chen &
Amari (2001). An alternative way is turning an optimization
of a PCA cost into a stable optimization of an
induced cost for MCA, e.g., the LMSER cost is turned
into one for the subspace spanned by multiple MCs (Xu,
1994, see Refill, Xu2007a). A general method is 
further given by eqns(24-26) in (Xu, 2003) and then 
discussed in (Xu, 2007a). 

LMSER Learning and Subspace Tracking 

A new adaptive PCA rule is derived from the gradient
∇E2(W) for a least mean square error reconstruction
(LMSER) (Xu, 1991), with the first proof proposed
on global convergence of the Oja subspace rule, a task
that was previously regarded as difficult. It was
shown mathematically and experimentally that LMSER
improves the Oja rule by further comparative studies,
e.g., see (Karhunen, Pajunen & Oja, 1998) and
(Refs. 14,15,48,54,71,72, Xu 2007a). Two years after
(Xu, 1991), this E2(W) was used for signal subspace tracking
via a recursive least square technique (Yang, 1993),
then followed by others in the signal processing literature
(Refs. 33&55, Xu 2007a). Also, PCA and subspace
analysis can be performed by other theories or costs
(Xu, 1994a&b). The algebraic and geometric properties
were further analyzed on one of them, namely relative
uncertainty theory (RUT), by Fiori (2000&04, see
Refs. 25,29, Xu 2007a). Moreover, the NIC criterion
for subspace tracking is actually a special case of this
RUT, which can be observed by comparing eqn. (20)
in (Miao & Hua, 1998) with the equation of Pe at the
end of Sec. III.B in (Xu, 1994a).
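
The following is a sketch only, under the assumption that the linear LMSER cost for one sample takes the common reconstruction form ||x - W^T W x||^2 with W of size m x d; the article itself does not spell the cost out. Under that assumption the gradient step contains the Oja subspace term plus one additional term, which the code makes explicit. All names are illustrative.

```python
import numpy as np


def lmser_cost(W, x):
    """Assumed per-sample cost: squared error of reconstructing x from y = W x."""
    e = x - W.T @ (W @ x)
    return float(e @ e)


def oja_subspace_step(W, x, lr):
    """Oja subspace rule: W <- W + lr * y e^T, with y = W x and e = x - W^T y."""
    y = W @ x
    e = x - W.T @ y
    return W + lr * np.outer(y, e)


def lmser_step(W, x, lr):
    """Gradient descent step on lmser_cost (the factor 2 is absorbed into lr).

    Compared with the Oja subspace rule it has the extra term (W e) x^T.
    """
    y = W @ x
    e = x - W.T @ y
    return W + lr * (np.outer(y, e) + np.outer(W @ e, x))


# Check the analytic gradient against central finite differences.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
x = rng.normal(size=4)

analytic = -2.0 * (lmser_step(W, x, 1.0) - W)     # gradient of the cost
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (lmser_cost(W + step, x) - lmser_cost(W - step, x)) / (2 * eps)

print(np.abs(analytic - numeric).max())           # close to zero
```

The finite-difference check confirms only that the step is the gradient of the assumed cost; it says nothing about convergence, which is the subject of the analyses cited above.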

Principal Subspace vs. Multi-PCs 

The Oja subspace rule does not truly find the multi-PCs
due to a rotation indeterminacy. Interestingly, it is
demonstrated experimentally that adding a sigmoid
function makes LMSER approximate the multi-PCs
well (Xu, 1991). Working at Harvard in the late summer
of 1991, Xu became aware of Brockett (1991) and thus
extended the Brockett flow of $n \times n$ orthogonal matrices
to that of $n \times n_1$ orthogonal matrices with $n > n_1$, from
which two learning rules for truly finding the multi-PCs are
obtained through modifying the LMSER rule and the Oja
subspace rule. The two rules were included as eqns
(13)&(14) in Xu (1993), which was submitted in 1991,
and are independent of and also different from Oja
(1992). Recently, Tanaka (2005) unified these rules into
one expression controlled by one parameter, and a
comparative study was made to show that eqn (14) in
(Xu, 1993) turned out to be the most promising one.

Figure 3. [Roadmap diagram. Recoverable box labels: LMSER
cost, first global convergence proof on Oja subspace rule, and
adaptive LMSER for subspace (Xu, 1991); weighted LMSER rule
for multi-PCA (see Refs. 22,78, Xu 2007a); other theories for
subspace and multi-PCA: mini-distorted reflection, maximum
relative uncertainty theory (RUT), max-variation (Xu, 1994a);
multi-factor analysis (Thurston, 1945; see Ref. 86, Xu 2007a)
and theoretical exposition (Anderson & Rubin, 1956; see Ref. 3,
Xu 2007a); maximum likelihood (ML) factor analysis (FA) by EM
algorithm (Rubin & Thayer, 1976) and a special case of FA
revisited under the name of SPCA or PPCA (Tipping & Bishop,
1999; Roweis, 1998; see Refs. 75, 84, Xu 2007a); adaptive EM
algorithm and adaptive BYY learning algorithm with automatic
selection on factors, and a criterion for selecting the number
of factors (Xu, 2001a&b, 2003, 2007c); temporal FA and adaptive
EM algorithm (Xu, 2000).]

Note: due to limited space, it is impossible to put all the references into the reference list of this article. The
problem is solved with the help of (Xu, 2007a) by citing papers in its reference list, which has 123 entries.
E.g., "see Refs. 22,78, Xu 2007a" means "see the entries [22] [78] in the reference list of (Xu, 2007a)".

Adaptive Robust PCA 

In the statistics literature, robust PCA was proposed to
resist outliers via a robust estimator on $\Sigma$. Xu & Yuille
(1992&95) generalized the rules of Oja, LMSER, and
MCA into robust adaptive learning by statistical physics,
related to the Huber M-estimators. Also, the PCA costs
in (Xu, 1994b) are extended to robust versions in Tab. 2
of (Xu, 1994a). Thereafter, efforts have been further
made, including its use in computer vision, e.g., see
(Refs. 9,21,45,52, Xu 2007a).

On Roadmap A, another branch consists of advances
on FA, which includes PCA as its special case
at $\Sigma_e = \sigma_e^2 I$. In the past decade, there has been a renewed
interest in FA: not only has the EM algorithm for FA been
brought to implementing PCA, but an adaptive EM
algorithm and other advances have also been developed with
the help of the Bayesian Ying Yang (BYY) harmony learning.



SUBSPACES OF HIGHER ORDER 
INDEPENDENCE 

Noticing the table in Fig. 2, we proceed as $p(y^{(j)} \mid \theta^{(j)})$
becomes nonGaussian in the last two columns.
Shown at the upper-left corner of Roadmap B, the
degenerated case e = 0 leads to the problem of solving x =
Ay from samples of x and an independence constraint

$$ p(y) = \prod_{j} p(y^{(j)}). $$

One way is solving induced nonlinear algebraic
equations. Another way is called independent component
analysis (ICA), tackled in the following four
branches:

• Seeking extremes of the higher order cumulants
of y.
• Using nonlinear Hebbian learning for removing
higher order dependences among components of
y, actually from which ICA studies originate.
• Optimizing a cost that is based on
$p(y) = \prod_{j} p(y^{(j)})$
directly. As shown on Roadmap B, the same
updating equation is reached from several aspects,
with actual differences coming from pre-specifying
the nonlinearity of $f(y^{(j)})$. One works when
the source components of $y^*$ are all subgaussians
while the other works when the components of
$y^*$ are all supergaussians. This problem is solved
by learning jointly W and $f(y^{(j)})$ via a parametric
model. It is further found that a rough estimate of
each source is already enough, which motivates
the so-called one-bit-matching conjecture that was
recently proved to be true mathematically (Xu,
2007b).
• Implementing nonlinear LMSER (Xu, 1991&93).
Details are referred to Roadmap B.

Here, we add clarifications on two previous confusions.
One relates to an omission of the origin of nonlinear
LMSER. This has already been clarified
in (Karhunen, Pajunen & Oja, 1998; Hyvarinen,
Karhunen & Oja, 2001; Plumbley & Oja, 2004),
clearly spelling out that the nonlinear E2(W) and
its adaptive gradient rule were both first proposed
in (Xu, 1991&93). The second confusion
is that ICA is usually regarded as a counterpart
of PCA. As stated in (Xu, 2001b&03)
and observed from the table in Fig. 2, ICA by y
= xW is actually an extension of de-correlation
analysis, in any combinations of PCs and MCs.
The counterpart of MCA is minor ICA (M-ICA)
while the counterpart of PCA is principal ICA
(P-ICA).

In fact, the concept 'principal' emerges from $e_t = x_t - Ay_t \neq 0$. 
As shown within the table in Fig.2 and on 
the rightmost column on Roadmap B, as p(y^ ln a) ) 







Independent Subspaces 



becomes nonGaussian ones, FA is extended to a binary 
FA (BFA) if y is binary, and a nonGaussian FA (NFA) 
if y is real but nonGaussian. Similar to FA performing 
PCA at $\Sigma_e = \sigma_e^2 I$, both BFA and NFA come to 
perform a P-ICA at $\Sigma_e = \sigma_e^2 I$. 

Observing the first box in this column, for $e_t = x_t - Ay_t \neq 0$ 
we need to seek an appropriate nonlinear map 
y = f(x). It usually has no analytical solution but needs 
an expensive computation to approximate. As discussed 
in (Xu, 2003), nonlinear LMSER uses a sigmoid 
nonlinearity $y_t^{(j)} = s(z_t^{(j)})$, $z_t = x_t W$, to avoid computing costs 
and approximately implements a BFA for a Bernoulli 
$p(y^{(j)})$ with a probability $p_j = \frac{1}{N}\sum_{t=1}^{N} s(z_t^{(j)})$ and an NFA for 
$p(y^{(j)})$ with a pseudo uniform distribution on $(-\infty, +\infty)$, 
as well as a nonnegative ICA (Plumbley & Oja, 2004) 
when $p(y^{(j)})$ is on $[0, +\infty)$. However, further quantitative 
analysis is needed for this approximation. 

Without approximation, the EM algorithm has been 
developed for maximum likelihood learning since 1997, 
though it still suffers from expensive computing costs. Favorably, 
further improvements have also been achieved by the 
BYY harmony learning. Details are referred to the 
rightmost column on Roadmap B. 



Next, we move to multiple subspaces at different 
locations as shown in Fig.2. Studies are summarized on 
Roadmap C, categorized according to one key point, i.e., 
a scheme $p_{\ell,t}$ that allocates a sample $x_t$ to different 
subspaces. This $p_{\ell,t}$ is based on two issues. 

One is a local measure on how suitable the $\ell$-th subspace is 
for representing $x_t$. The other is a mechanism 
that summarizes the local measures of subspaces to 
yield $p_{\ell,t}$. One typical mechanism is the one that emerges 
in the EM algorithm for the maximum likelihood or 
Bayesian learning, where $x_t$ is fractionally allocated 
among subspaces in proportion to their local measures. 
Another typical mechanism is that $x_t$ is nonlinearly 
allocated to one or more winners via a competition based 
on the local measures, e.g., as in the classic competitive 
learning and the rival penalized competitive learning 
(RPCL). 
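
The allocation mechanisms described above can be contrasted with a minimal Python sketch. It assumes the local measures are already available as nonnegative numbers (for instance, the likelihood of a sample under each subspace). The function names and the RPCL penalty value are hypothetical and serve illustration only; they are not taken from the cited papers.

```python
def soft_allocation(local_measures):
    """EM-style allocation: the sample is split among subspaces
    in proportion to their local measures (fractions sum to one)."""
    total = sum(local_measures)
    return [m / total for m in local_measures]


def winner_take_all(local_measures):
    """Classic competitive learning: the best-matching subspace
    receives the whole sample, all others receive nothing."""
    winner = max(range(len(local_measures)), key=lambda l: local_measures[l])
    return [1.0 if l == winner else 0.0 for l in range(len(local_measures))]


def rpcl_allocation(local_measures, penalty=0.05):
    """RPCL-style allocation (illustrative): the winner learns the sample,
    while the runner-up (the rival) gets a small negative weight, i.e. it
    is slightly pushed away. The penalty value is an assumption."""
    order = sorted(range(len(local_measures)),
                   key=lambda l: local_measures[l], reverse=True)
    p = [0.0] * len(local_measures)
    p[order[0]] = 1.0
    if len(order) > 1:
        p[order[1]] = -penalty
    return p
```

With local measures `[0.2, 0.5, 0.3]`, the soft scheme returns fractions, the winner-take-all scheme selects the second subspace, and the RPCL-style scheme additionally penalizes the third subspace as the rival.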

Also, a scheme $p_{\ell,t}$ may come from blending both 
types of mechanisms, as in the BYY harmony 
learning. Details are referred to (Xu, 2007c) and its 
two http-sites. 



FUTURE TRENDS 



TEMPORAL AND LOCALIZED 
EXTENSIONS 

We further consider temporal samples shown at the 
bottom of the rightmost column on both Roadmap A 
and Roadmap B, via embedding a temporal structure 
in p(yf I \i f) . A typical one is using 



11? 



y'Of,^), Yf = {y? T }^ 



e.g., a linear regression 

to turn a model (e.g., one in the table of Fig.2) into 
temporal extensions. Information is carried over time 
in two ways. One is computing \if by the regres- 
sion, with learning on Pt made through the gradient 
with respect to j j by a chain rule. The second is 
computing jp(yf | ^pCY^dY^ and getting the 
gradient with respect to J j . Details are referred to Xu 
(2000&01a&03). 



Another important task is how to determine the number 
k of subspaces and the dimension $m_\ell$ of each subspace. 
It is called model selection, usually implemented in 
two phases. First, a set of candidates is considered 
by enumerating k and $m_\ell$, with unknown parameters 
estimated by the maximum likelihood learning. Second, 
the best among the candidates is selected by one of the 
criteria, such as AIC, CAIC, SIC/BIC/MDL, Cross 
Validation, etc. However, this two-phase implementation 
is computationally very expensive. Moreover, 
the performance will degenerate considerably when 
the sample size is finite while k and $m_\ell$ are not too 
small. 

One trend is letting model selection be made 
automatically during learning, i.e., on a candidate 
with k and $m_\ell$ initially being large enough, learning 
not only determines unknown parameters but also 
automatically shrinks k and $m_\ell$ to appropriate ones. 
Two such efforts are RPCL and the BYY harmony 
learning. Details are referred to (Xu, 2007c) and its 
two http-sites. 

Also, there are open issues on $x = Ay + e$, $e \neq 0$, 
with components of y mutually independent in higher 
order statistics. Some are listed below: 

• Which part of unknown parameters in x = Ay + e 
can be determined uniquely? 
• Under which conditions can the independence 
$p(y) = \prod_{j} p(y^{(j)})$ be ensured in concept? Can it be further 
achieved by a learning algorithm? 
• In what sense can both ensuring 
$p(y) = \prod_{j} p(y^{(j)})$ and the best reconstruction of x by $\hat{x} = Ay$ be 
achieved simultaneously? If not, what is the best 
nonlinear y = f(x) in terms of both 
$p(y) = \prod_{j} p(y^{(j)})$ and $e \neq 0$? 
Can such a best be obtained analytically or via 
an effective computing? 

Figure 5. [Roadmap chart on multiple subspaces; only fragments of its boxes survive 
extraction. Recoverable entries include local PCA and local MCA by a simplified EM 
(Xu, 1994b), local MCA by MML, local FA (Xu, 2007c), local TFA (Xu, 2004), local BFA, 
NFA and LMSER (Xu, 2001b; Xu, 2002), improved competitive ICA (Xu, 2002), mixtures of 
FA (Ghahramani & Hinton, 1996), mixtures of probabilistic PCA (Tipping & Bishop, 1999), 
MCA co-integration (Xu & Leung, 1998), probabilistic MCA (Williams & Agakov, 2002), 
ICA mixture (Lee, Lewicki, & Sejnowski, 2000), and variational mixtures (Ghahramani & 
Beal, 2000; Utsugi & Kumagai, 2001; Choudrey & Roberts, 2003); see (Xu, 2007a) for the 
numbered references.] 



CONCLUSION 



Studies of three closely related unsupervised learning 
streams have been overviewed in an extensive scope 
and from a systematic perspective. A general framework 
of independent subspaces is presented, from 
which a number of learning topics are summarized 
via different features of choosing and combining the 
three basic ingredients. 

KEY TERMS 

BYY Harmony Learning: It is a statistical learning 
theory for a two-pathway featured intelligent system 
via two complementary Bayesian representations of 
the joint distribution on the external observation and 
its inner representation, with both parameter learning 
and model selection determined by a principle that 
the two Bayesian representations are in best harmony. 
See http://www.scholarpedia.org/article/Bayesian_Ying_Yang_Learning. 

Factor Analysis: A set of samples $\{x_t\}_{t=1}^{N}$ is 
described by a linear model $x = Ay + \mu + e$, where $\mu$ is a 
constant, y and e are both Gaussian and mutually 
uncorrelated, and components of y are called factors 
and are mutually uncorrelated. Typically, the model is 
estimated by the maximum likelihood principle. 

Independence Subspaces: It refers to a family of 
models, each of which consists of one or several subspaces. 
Each subspace is spanned by linearly independent 
basis vectors and the corresponding coordinates are 
mutually independent. 

Least Mean Square Error Reconstruction (LMSER): 
For an orthogonal projection $\hat{x}_t$ onto a subspace 
spanned by the column vectors of a matrix W, maximizing 
$\frac{1}{N}\sum_{t=1}^{N}\|\hat{x}_t\|^2$ subject to $W^T W = I$ is equivalent to 
minimizing the mean square error $\frac{1}{N}\sum_{t=1}^{N}\|x_t - \hat{x}_t\|^2$ by 
using the projection $\hat{x}_t = WW^T x_t$ as reconstruction of 
$x_t$, which is reached when W spans the same subspace 
spanned by the PCs. 

Minor Component (MC): Being orthogonally 
complementary to the PC, the solution of 
$\min_{w^T w = 1} J(w) = \frac{1}{N}\sum_{t=1}^{N}(w^T x_t)^2 = w^T \Sigma w$ is the MC, 
while the m-MCs are referred to the columns of W 
that minimizes $J(W) = \frac{1}{N}\sum_{t=1}^{N}\|W^T x_t\|^2 = \mathrm{Tr}[W^T \Sigma W]$ 
subject to $W^T W = I$. 

Principal Component (PC): For samples $\{x_t\}_{t=1}^{N}$ 
with a zero mean, its PC is a unit vector w originated 
at zero with a direction along which the average of the 
orthogonal projection by every sample is maximized, 
i.e., $\max_{w^T w = 1} J(w) = \frac{1}{N}\sum_{t=1}^{N}(w^T x_t)^2 = w^T \Sigma w$; the 
solution is the eigenvector of the sample covariance 
matrix $\Sigma = \frac{1}{N}\sum_{t=1}^{N} x_t x_t^T$, corresponding to the largest 
eigenvalue. Generally, the m-PCs are referred to 
the m orthonormal vectors as the columns of W that 
maximizes $J(W) = \frac{1}{N}\sum_{t=1}^{N}\|W^T x_t\|^2 = \mathrm{Tr}[W^T \Sigma W]$. 
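
The PC and MC definitions can be checked numerically with a short Python sketch. It uses batch power iteration on the sample covariance, which is a stand-in chosen for brevity and not one of the adaptive learning rules surveyed in this article. The function names, the iteration count and the shift by the trace (used to turn the smallest eigenvalue into the largest) are assumptions of this sketch.

```python
import math


def covariance(samples):
    """Sample covariance (1/N) * sum_t x_t x_t^T for zero-mean samples."""
    n, d = len(samples), len(samples[0])
    return [[sum(x[i] * x[j] for x in samples) / n for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, iters=500):
    """Unit eigenvector for the largest eigenvalue of a symmetric PSD matrix."""
    d = len(matrix)
    w = [1.0 / (i + 1) for i in range(d)]  # non-symmetric start vector
    for _ in range(iters):
        v = [sum(matrix[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w


def principal_component(samples):
    """Direction maximizing w^T Sigma w subject to w^T w = 1."""
    return power_iteration(covariance(samples))


def minor_component(samples):
    """Direction minimizing w^T Sigma w, found as the top eigenvector
    of trace(Sigma) * I - Sigma (same eigenvectors, reversed order)."""
    sigma = covariance(samples)
    d = len(sigma)
    trace = sum(sigma[i][i] for i in range(d))
    shifted = [[(trace if i == j else 0.0) - sigma[i][j] for j in range(d)]
               for i in range(d)]
    return power_iteration(shifted)
```

For samples spread mostly along the first axis, `principal_component` returns a vector close to that axis and `minor_component` returns the orthogonal one.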

Rival Penalized Competitive Learning: It is a 
development of competitive learning with the help of an 
appropriate balance between participating and leaving 
mechanisms, such that an appropriate number of 
agents or learners will be allocated to learn multiple 
structures underlying observations. See 
http://www.scholarpedia.org/article/Rival_Penalized_Competitive_Learning. 

Total Least Square (TLS) Fitting: Given samples 
$\{z_t\}_{t=1}^{N}$, $z_t = [y_t, x_t^T]^T$, instead of finding a vector w 
to minimize the error $\frac{1}{N}\sum_{t=1}^{N}\|y_t - w^T x_t\|^2$, the TLS 
fitting is finding an augmented vector $\tilde{w} = [w^T, c]^T$ such 
that the error $\frac{1}{N}\sum_{t=1}^{N}\|\tilde{w}^T z_t\|^2$ is minimized subject 
to $\tilde{w}^T \tilde{w} = 1$; the solution is the MC of $\{z_t\}_{t=1}^{N}$. 
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INTRODUCTION 

Learning systems depend on three interrelated com- 
ponents: topologies, cost/performance functions, and 
learning algorithms. Topologies provide the constraints 
for the mapping, and the learning algorithms offer the 
means to find an optimal solution; but the solution is 
optimal with respect to what? Optimality is character- 
ized by the criterion and in neural network literature, this 
is the least addressed component, yet it has a decisive 
influence in generalization performance. Certainly, the 
assumptions behind the selection of a criterion should 
be better understood and investigated. 

Traditionally, least squares has been the benchmark 
criterion for regression problems; considering classifi- 
cation as a regression problem towards estimating class 
posterior probabilities, least squares has been employed 
to train neural network and other classifier topologies 
to approximate correct labels. The main motivation to 
utilize least squares in regression simply comes from 
the intellectual comfort this criterion provides due to 
its success in traditional linear least squares regression 
applications - which can be reduced to solving a sys- 
tem of linear equations. For nonlinear regression, the 
assumption of Gaussianity for the measurement error 
combined with the maximum likelihood principle could 
be emphasized to promote this criterion. In nonpara- 
metric regression, least squares principle leads to the 
conditional expectation solution, which is intuitively 
appealing. Although these are good reasons to use the 
mean squared error as the cost, it is inherently linked to 
the assumptions and habits stated above. Consequently, 
there is information in the error signal that is not cap- 
tured during the training of nonlinear adaptive systems 
under non-Gaussian distribution conditions when one 
insists on second-order statistical criteria. This argu- 
ment extends to other linear-second-order techniques 
such as principal component analysis (PCA), linear 
discriminant analysis (LDA), and canonical correlation 



analysis (CCA). Recent work tries to generalize these 
techniques to nonlinear scenarios by utilizing kernel 
techniques or other heuristics. This begs the question: 
what other alternative cost functions could be used 
to train adaptive systems and how could we establish 
rigorous techniques for extending useful concepts 
from linear and second-order statistical techniques 
to nonlinear and higher-order statistical learning 
methodologies? 



BACKGROUND 

This seemingly simple question is at the core of recent 
research on information theoretic learning (ITL) conducted 
by the authors, as well as research by others on 
alternative optimality criteria for robustness to outliers 
and faster convergence, such as different $L_p$-norm 
induced error measures (Sayed, 2005), the epsilon-insensitive 
error measure (Scholkopf & Smola, 2001), 
Huber's robust M-estimation theory (Huber, 1981), or 
Bregman's divergence based modifications (Bregman, 
1967). Entropy is an uncertainty measure that generalizes 
the role of variance in Gaussian distributions by 
including information about the higher-order statistics 
of the probability density function (pdf) (Shannon & 
Weaver, 1964; Fano, 1961; Renyi, 1970; Csiszar & 
Korner, 1981). For on-line learning, information theoretic 
quantities must be estimated nonparametrically 
from data. A nonparametric expression that is differentiable 
and easy to approximate stochastically will 
enable importing useful concepts such as stochastic 
gradient learning and backpropagation of errors. The 
natural choice is kernel density estimation (KDE) 
(Parzen, 1967), due to its smoothness and asymptotic 
properties. The plug-in estimation methodology (Gyorfi 
& van der Meulen, 1990) combined with definitions 
of Renyi (Renyi, 1970) provides a set of tools that are 
well-tuned for learning applications - tools suitable 



Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. 



Information Theoretic Learning 



for supervised and unsupervised, off-line and on-line 
learning. Renyi's definition of entropy for a random 
variable X is 



$$H_\alpha(X) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx \qquad (1)$$



This generalizes Shannon's linear additivity postulate 
to exponential additivity resulting in a parametric 
family. Dropping the logarithm for optimization simplifies 
algorithms. Specifically of interest is the quadratic 
entropy ($\alpha = 2$), because its sample estimator requires 
only one approximation (the density estimator itself) 
and an analytical expression for the integral can be 
obtained for kernel density estimates. Consequently, a 
sample estimator for quadratic entropy can be derived 
for Gaussian kernels of standard deviation $\sigma$ on an iid 
sample set $\{x_1, \ldots, x_N\}$ as the sum of pairwise sample 
(particle) interactions (Principe et al, 2000): 



$$H_2(X) = -\log\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j)\right) \qquad (2)$$

The pairwise interaction of samples through the 
kernel intriguingly provides a connection to entropy 
of particles in physics. Particles interacting through 
information forces (as in the N-body problem in physics) 
can employ computational techniques developed for 
simulating such large scale systems. The use of entropy 
in training multilayer structures can be studied in the 
backpropagation of information forces framework 
(Erdogmus et al, 2002). The quadratic entropy estimator 
was employed in measuring divergences between 
probability densities and blind source separation 
(Hild et al, 2006), blind deconvolution (Lazaro et al, 
2005), and clustering (Jenssen et al, 2006). Quadratic 
expressions with mutual-information-like properties 
were introduced based on the Euclidean and Cauchy-Schwartz 
distances (ED/CSD). These offer computational 
simplicity and statistical stability 
in optimization (Principe et al, 2000). 

Following the principles of information potential 
and force, the pairwise-interaction 
estimator is generalized to use arbitrary kernels and 
any order $\alpha$ of entropy. The stochastic information 
gradient (SIG) is developed (Erdogmus et al, 2003) to 
train adaptive systems with a complexity comparable 
to the LMS (least-mean-square) algorithm - essential 
for training complex systems with large data sets. 
Supervised and unsupervised learning are unified under 
information-based criteria. It becomes possible to minimize 
error entropy in supervised regression, to maximize output 
entropy for unsupervised learning (factor analysis), to 
minimize mutual information between the outputs of a system 
to achieve independent components, or to maximize 
mutual information between the outputs and the desired 
responses to achieve optimal subspace projections in 
classification. Systematic comparisons of 
ITL with conventional MSE in system identification 
verified the advantage of the technique for nonlinear 
system identification and blind equalization of 
communication channels. Relationships with instrumental 
variables techniques were discovered and led to the 
error-whitening criterion for unbiased linear system 
identification in noisy-input-output data conditions 
(Rao et al, 2005). 
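
The pairwise-interaction estimator of quadratic entropy in Eq. (2) can be sketched in a few lines of Python for scalar samples. The function names are hypothetical, and the $O(N^2)$ double loop is written for clarity rather than speed.

```python
import math


def gaussian(u, sigma):
    """1-D Gaussian kernel with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def information_potential(samples, sigma):
    """(1/N^2) * sum_i sum_j G_{sigma*sqrt(2)}(x_i - x_j).

    The kernel width sigma*sqrt(2) arises because the integral of the
    product of two Gaussians of width sigma is a Gaussian of that width."""
    n = len(samples)
    s = sigma * math.sqrt(2.0)
    return sum(gaussian(xi - xj, s) for xi in samples for xj in samples) / (n * n)


def quadratic_entropy(samples, sigma):
    """Renyi's quadratic entropy estimate: minus the log of the information potential."""
    return -math.log(information_potential(samples, sigma))
```

Tightly clustered samples interact strongly, which gives a large information potential and hence a low entropy estimate; widely spread samples give the opposite.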



SOME IDEAS IN AND APPLICATIONS 
OF ITL 

Kernel Machines and Spectral Clustering: KDE has 
been motivated by the smoothness properties inherent to 
reproducing kernel Hilbert spaces (RKHS). Therefore, 
a practical connection between KDE-based ITL, kernel 
machines, and spectral machine learning techniques 
was imminent. This connection was realized and exploited 
in recent work that demonstrates an information 
theoretic framework for pairwise similarity (spectral) 
clustering, especially normalized cut techniques (Shi 
& Malik, 2000). Normalized cut clustering is shown 
to determine an optimal solution that maximizes the 
CSD between clusters (Jenssen, 2004). This connection 
immediately allows one to approach kernel machines 
from a density estimation perspective, thus providing 
a robust method to select the kernel size, a problem 
still investigated by some researchers in the kernel 
and spectral techniques literature. In our experience, 
kernel size selection based on suitable criteria aimed 
at obtaining the best fit to the training data - using 
Silverman's regularized squared error fit (Silverman, 
1986) or leave-one-out cross-validation maximum 
likelihood (Duin, 1976), for instance - has proved to 
be a convenient, robust, and accurate technique that 
avoids many of the computational complexity and load 
issues. Local data spread based modifications resulting 
in variable-width KDE are also observed to be more 
robust to noise and outliers. 
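
As an illustration of leave-one-out maximum likelihood kernel size selection, the following Python sketch scores each candidate width by the log-likelihood of every sample under a KDE built from the remaining samples. It assumes scalar data and a user-supplied grid of candidate sizes; the function names and the floor that guards the logarithm are choices of this sketch, not part of the cited method.

```python
import math


def gaussian(u, sigma):
    """1-D Gaussian kernel with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def loo_log_likelihood(samples, sigma):
    """Sum over i of log p_{-i}(x_i), where p_{-i} is the KDE without x_i."""
    n = len(samples)
    total = 0.0
    for i, xi in enumerate(samples):
        density = sum(gaussian(xi - xj, sigma)
                      for j, xj in enumerate(samples) if j != i) / (n - 1)
        total += math.log(max(density, 1e-300))  # floor avoids log(0)
    return total


def select_kernel_size(samples, candidates):
    """Pick the candidate width with the highest leave-one-out likelihood."""
    return max(candidates, key=lambda s: loo_log_likelihood(samples, s))
```

A width that is too small leaves each held-out sample with almost no support, while one that is too large flattens the density, so an intermediate value is preferred.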

An illustration of ITL clustering by maximizing the 
CSD between the two estimated clusters is provided 
in Figure 1. The samples are labeled to maximize 

$$D_{CS}(p,q) = -\log\frac{\langle p,q\rangle_f}{\sqrt{\langle p,p\rangle_f\,\langle q,q\rangle_f}} \qquad (3)$$

where p and q are KDE for two candidate clusters, f is 
the overall data KDE and the weighted inner product 
to measure angular distance between clusters is 

$$\langle p,q\rangle_f = \int p(x)\,q(x)\,f^{-1}(x)\,dx \qquad (4)$$

When estimated using a weighted KDE variant, this 
criterion becomes equivalently 

$$D_{CS}(p,q) \approx -\log\frac{\sum_{x_i\in p,\;y_j\in q} K_{1/f}(x_i,y_j)}{\sqrt{\sum_{x_i\in p,\;x_j\in p} K_{1/f}(x_i,x_j)\;\sum_{y_i\in q,\;y_j\in q} K_{1/f}(y_i,y_j)}} \qquad (5)$$

where $K_{1/f}$ is an equivalent kernel generated from the 
original kernel K (Gaussian here). One difficulty with 
kernel machines is their nonparametric nature: the 
requirement to solve for the eigendecomposition of a 
large positive-definite matrix of size N x N, for N 
training samples. The solution is a weighted sum of 
kernels evaluated over each training sample, thus the 
test procedure for each novel sample involves evaluating 
the sum of N kernels: $y_{test} = \sum_{k=1}^{N} \alpha_k K(x_{test} - x_k)$. 
The Fast Gauss Transform (FGT) (Greengard, 1991), 
which uses polynomial expansions for a Gaussian 
(or other) kernel, has been employed to overcome this 
difficulty. FGT carefully selects a few center points 
around which truncated Hermite polynomial expansions 
approximate the kernel machine. FGT still requires 
heavy computational load in off-line training (minimum 
$O(N^2)$, typically $O(N^3)$). The selection of expansion 
centers is typically done via clustering (e.g., Ozertem 
& Erdogmus, 2006). 
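
Returning to the clustering criterion of Eq. (3), the Python sketch below estimates a Cauchy-Schwartz divergence between two scalar sample sets with Gaussian KDEs. It is a simplified, unweighted variant: the inner products omit the $f^{-1}$ weighting of Eq. (4), so it illustrates the angular-distance idea and is not the weighted estimator of Eq. (5). Function names are hypothetical.

```python
import math


def gaussian(u, sigma):
    """1-D Gaussian kernel with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def cross_potential(xs, ys, sigma):
    """Estimate of the inner product <p, q> of two Gaussian KDEs:
    the average of G_{sigma*sqrt(2)}(x_i - y_j) over all pairs."""
    s = sigma * math.sqrt(2.0)
    return sum(gaussian(x - y, s) for x in xs for y in ys) / (len(xs) * len(ys))


def cs_divergence(xs, ys, sigma):
    """-log( <p,q> / sqrt(<p,p> <q,q>) ); zero when both sets coincide."""
    pq = cross_potential(xs, ys, sigma)
    pp = cross_potential(xs, xs, sigma)
    qq = cross_potential(ys, ys, sigma)
    return -math.log(pq / math.sqrt(pp * qq))
```

Well separated clusters yield a large divergence, which is why maximizing it over candidate labelings favors compact, distinct groups.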

Figure 1. Maximum CSD clustering of two synthetic benchmarks: training and novel test data (left), KDE using 
Gaussian kernels with Silverman-kernel-size (center), and spectral projections of data on two dominant 
eigenfunctions of the kernel (right). The eigenfunctions are approximated using the Nystrom formula. 

Correntropy as a Generalized Similarity Metric: 
The main feature of ITL is that it preserves the universe 
of concepts we have in neural computing, but allows 
the adaptive system to extract more information from 
the data. For instance, the general Hebbian principle is 
reduced into a second order metric in traditional artificial 
neural network literature (input-output product), thus 
becoming a synonym for second order statistics. The 
learning rule that maximizes output entropy (instead 
of output variance), using SIG with Gaussian kernels, 
is $\Delta w(n) = \eta\,(x(n) - x(n-1))\,(y(n) - y(n-1))$ (Erdogmus et 
al, 2002), which still obeys the Hebbian principle, 
yet extracts more information from the data (leading 
to the error-whitening criterion for input-noise robust 
learning). 
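
A single step of the entropy-maximizing Hebbian rule stated above can be written as follows. The sketch assumes a linear unit $y = w^T x$ with both outputs evaluated at the current weights; these modeling choices, the function name and the step size are assumptions of the example.

```python
def sig_hebbian_step(w, x_now, x_prev, eta):
    """One update w <- w + eta * (x(n) - x(n-1)) * (y(n) - y(n-1)),
    where y = w . x is the output of a linear unit."""
    y_now = sum(wi * xi for wi, xi in zip(w, x_now))
    y_prev = sum(wi * xi for wi, xi in zip(w, x_prev))
    dy = y_now - y_prev
    return [wi + eta * (xn - xp) * dy for wi, xn, xp in zip(w, x_now, x_prev)]
```

Compared with the plain Hebbian product of input and output, the rule correlates increments of the input with increments of the output. As with other Hebbian rules, some form of weight normalization would likely be needed in practice to keep the weights bounded.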

ITL quantifies global properties of the data, but 
will it be possible to apply it to functions, specifically 
those in RKHS? A concrete example is on similarity 
between random variables, which is typically expressed 
as second order correlation. Correntropy generalizes 
similarity to include higher order moment information. 
The name indicates the strong relation to correlation, 
but also stresses the difference - the average over the 
lags (for random processes) or over dimensions (for 
multidimensional random variables) is the information 
potential, i.e. the argument of second order Renyi's 
entropy. For random variables X and Y with joint density 
p(x,y), correntropy is defined as 

$$V(X,Y) = \iint \delta(x-y)\,p(x,y)\,dx\,dy \qquad (6)$$

and measures how dense the two random variables are 
along the line x=y in the joint space. Notice that it is 
similar to correlation, which also asks the same question 
in a second moment framework. However, correntropy 
is local to the line x=y, while correlation is quadratically 
dependent upon distances of samples in the joint 
space. Using a KDE with Gaussian kernels 

$$V(X,Y) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_i - y_i) \qquad (7)$$
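
The sample estimator of correntropy in Eq. (7) is straightforward to sketch in Python for paired scalar observations. The function names are hypothetical.

```python
import math


def gaussian(u, sigma):
    """1-D Gaussian kernel with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def correntropy(xs, ys, sigma):
    """(1/N) * sum_i G_sigma(x_i - y_i) for paired samples (x_i, y_i)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of samples")
    return sum(gaussian(x - y, sigma) for x, y in zip(xs, ys)) / len(xs)
```

Because the kernel decays quickly away from the line x=y, a single grossly mismatched pair only removes its own contribution from the average, whereas it would dominate a second-order measure such as the mean squared difference.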



Figure 2. Maximum mutual information projection versus kernel LDA test ROC results on hand-written digit 
recognition shown in terms of type-1 and type-2 errors (left); ROC results ($P_{detect}$ vs $P_{false}$) compared for various 
techniques on sonar data (right). Both data sets are from the UCI Machine Learning Repository (2007). 

Correntropy is a positive-definite function, thus 
defines a RKHS. Unlike correlation, this RKHS is nonlinearly 
related to the input, because all moments of the 
random variable are included in the transformation. 
It is possible to analytically solve for least squares 
regression and principal components in this space, 
yielding nonlinear fits in input space. Correntropy 
induced metric (CIM) behaves as the $L_2$-norm for small 
distances and progressively approaches the $L_1$-norm 
and then converges to $L_0$ at infinity. Thus robustness to 
outliers is automatically achieved and equivalence to 
Huber's robust estimation can be proven (Santamaria, 
2006). Unlike conventional kernel methods, correntropy 
solutions remain in the same dimensionality as the input 
vector. This might indicate built-in regularization 
properties, yet to be explored. 
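
The change of behavior of the CIM with distance can be observed numerically. The sketch below assumes the commonly stated form $CIM(X,Y) = \sqrt{G_\sigma(0) - V(X,Y)}$, which is not spelled out in this article and should be checked against the cited literature; the function names are hypothetical.

```python
import math


def gaussian(u, sigma):
    """1-D Gaussian kernel with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def cim(xs, ys, sigma):
    """Correntropy induced metric, assuming CIM = sqrt(G(0) - V(X, Y))
    with V the sample correntropy of the paired observations."""
    v = sum(gaussian(x - y, sigma) for x, y in zip(xs, ys)) / len(xs)
    return math.sqrt(max(gaussian(0.0, sigma) - v, 0.0))  # clamp rounding noise
```

For errors much smaller than the kernel size the value grows in proportion to the error, as an $L_2$-type distance does, while for very large errors it saturates at $\sqrt{G_\sigma(0)}$, so that an outlier counts roughly once regardless of its magnitude.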

Nonparametric Learning in the RKHS: It is 
possible to obtain robust solutions to a variety of 
problems in learning using the nonparametric and local 
nature of KDE and its relationship with RKHS theory. 
Recently, we explored the possibility of designing 
nonparametric solutions to the problem of identifying 
nonlinear dimensionality reduction schemes that maintain 
maximal discriminative information in a pattern 
recognition problem (quite appropriately measured 
by the mutual information between the data and the 
class labels as agreed upon by many researchers). 
Using the RKHS formalism and based on the KDE, 
results were obtained that consistently outperformed 
the alternative rather heuristic kernel approaches such 
as kernel PCA and kernel LDA (Scholkopf & Smola, 
2001). The conceptual oversight in the latter two is that 
both PCA and LDA procedures are most appropriate 
for Gaussian distributed data (although they are acceptable for 
other symmetric unimodal distributions and are commonly 
but possibly inappropriately used for arbitrary 
data distributions). 

Clearly, the distribution of the data in the kernel 
induced feature space could not be Gaussian for all typically 
exploited kernel selections (such as the Gaussian 
kernel). Since these kernels are usually translation invariant, 
the data is, in principle, mapped to an infinite 
dimensional hypersphere on which the data could not 
have been Gaussian distributed (nor symmetrically 
distributed in general for the ideal kernel for a given 
problem, since these are positive definite functions). 
Consequently, the hasty use of kernel extensions of 
second-order techniques is not necessarily optimal 
in a meaningful statistical sense. These 
techniques have nevertheless found successful applications in various 
problems; however, their suboptimality is clear from 
comparisons with more carefully designed solutions. 
In order to illustrate how drastic the performance difference 
could be, we present a comparison of a mutual 
information based nonlinear nonparametric projection 
approach (Ozertem et al, 2006) and kernel LDA in a 
simplified two-class handwritten digit classification 
case study and a sonar mine detection case study. The 
ROC curves of both algorithms on the test set after being 
trained with the same data are shown in Figure 2. The 
kernel is assumed to be a circular Gaussian with size 
set to Silverman's rule-of-thumb. For the sonar data, we 
also include a KDE-based approximate Bayes classifier 
and linear LDA for reference. In this example, KLDA 
performs close to mutual information projections, as 
observed occasionally. 



FUTURE TRENDS 

Nonparametric Snakes, Principal Curves and 
Surfaces: More recently, we have been investigating 
the application of KDE and RKHS to nonparametric 
clustering, principal curves and surfaces. Interesting 
mean-shift-like fixed-point algorithms have been 
obtained; specifically interesting are the concepts of 
nonparametric snakes (Ozertem & Erdogmus, 2007) 
and local principal manifolds (Erdogmus & Ozertem, 
2007) that we developed recently. The nonparametric 
snake approach overcomes the principal difficulties 
experienced by snakes (active contours) for image 
segmentation, such as low capture range, data curvature 
inhomogeneity, and noisy and missing edge 
information. Similarly, the local conditions for determining 
whether a point is in a principal manifold or 
not provide guidelines for designing fixed point and 
other iterative learning algorithms for identifying such 
important structures. 

Figure 3. Nonparametric snake after convergence from an initial state that was located at the boundary of the 
guitar image rectangle (left). The global principal curve of a mixture of ten Gaussians obtained according to 
the local subspace maximum definition for principal manifolds (right). 

Specifically in nonparametric snakes, we treat the 
edgemap of an image as samples and the values of the 
edginess as weights to construct a weighted KDE, from 
which a fixed point iterative algorithm can be devised 
to detect the boundaries of an object in the background. 
The designed algorithm can be easily made robust to 
outlier edges, converges very fast, and can penetrate 
into concavities, while not being trapped into the object 
at missing edge localities. The guitar image in Figure 
3 emphasizes these advantages as the image exhibits 
both missing edges and concavities, while background 
complexity is trivially low as that was not the main 
concern in this experiment - the variable width KDE 
easily avoids textured obstacles. The algorithm could 
be utilized to detect the ridge-boundary of a structure 
in data sets of any dimension in other applications. 

In defining principal manifolds, we avoided the 
traditional least-squares error reconstruction type criteria, 
such as Hastie's self-consistent principal curves 
(Hastie & Stuetzle, 1989), and proposed a local subspace 
maximum definition for principal manifolds inspired 
by differential geometry. This definition lends itself to 
a uniquely defined principal manifold hierarchy such 
that one can use inflation and deflation to obtain a 
d-dimensional principal manifold from a (d+1)-dimensional 
principal manifold. The rigorous and local definition 
lends itself to easy algorithm design and multiscale 
principal structure analysis for probability densities. 
We believe that in the near future, the community 
will be able to prove maximal information preserving 
properties of principal manifolds obtained using this 
definition in a manner similar to mean-shift clustering 
solving for minimum information distortion clustering 
(Rao et al, 2006) and maximum likelihood modelling 
achieving minimum Kullback-Leibler divergence 
asymptotically (Carreira-Perpinan & Williams, 2003; 
Erdogmus & Principe, 2006). 



CONCLUSION 

The use of information theoretic learning criteria in 
neural networks and other adaptive system solutions 
have so far clearly demonstrated a number of advan- 
tages that arise due to the increased information content 
of these measures relative to second-order statistics 
(Erdogmus & Principe, 2006). Furthermore, the use of 
kernel density estimation with smooth kernels yields 
continuous and differentiable criteria suitable for 
iterative descent/ascent-based learning. The nonparametric 
nature of KDE and its variants (such as variable-size 
kernels) simultaneously provides robustness, global 
optimization through kernel annealing, and data modeling 
flexibility in designing neural networks and learning 
algorithms for a variety of benchmark problems. Due to 
lack of space, detailed mathematical treatments cannot be 
provided in this article; the reader is referred to the 
literature for details. 

KEY TERMS 

Cauchy-Schwarz Distance: An angular density 
distance measure in the Euclidean space of probability 
density functions that approximates information theo- 
retic divergences for nearby densities. 

Correntropy: A statistical measure that estimates 
the similarity between two or more random variables 
by integrating the joint probability density function 
along the main diagonal of the vector space (line along 
ones). It relates to Renyi's entropy when averaged over 
sample-index lags. 



Information Theoretic Learning: A technique 
that employs information theoretic optimality criteria 
such as entropy, divergence, and mutual information 
for learning and adaptation. 

Information Potentials and Forces: Physically 
intuitive pairwise particle interaction rules that emerge 
from information theoretic learning criteria and govern 
the learning process, including backpropagation in 
multilayer system adaptation. 

Kernel Density Estimate: A nonparametric tech- 
nique for probability density function estimation. 

Mutual Information Projections: Maximally 
discriminative nonlinear nonparametric projections 
for feature dimensionality reduction based on the 
reproducing kernel Hilbert space theory. 

Renyi Entropy: A generalized definition of entropy 
that stems from modifying the additivity postulate and 
results in a class of information theoretic measures that 
contain Shannon's definitions as special cases. 

Stochastic Information Gradient: Stochastic 
gradient of nonparametric entropy estimate based on 
kernel density estimation. 



INTRODUCTION 

This chapter focuses on the analysis and classification 
of arrhythmias. An arrhythmia is any cardiac rhythm 
other than the normal sinus rhythm, caused by alterations 
in the formation and/or conduction of the impulses. 
In pathological conditions, the depolarization process 
can be initiated outside the sinoatrial (SA) node, and 
several kinds of extra-systolic or ectopic beats can 
appear. 

In addition, electrical impulses can be blocked, 
accelerated, or diverted along alternate pathways, and 
their origin can change from one heart beat to the next, 
giving rise to several types of blocks and anomalous 
connections. In both situations, the ECG can show changes 
in the signal morphology or in the duration of its waves 
and intervals, as well as the absence of one of the waves. 

This work focuses on the development of intelligent 
classifiers in the area of biomedicine, and specifically 
on the problem of diagnosing cardiac diseases from the 
electrocardiogram (ECG) or, more precisely, on the 
differentiation of the types of atrial fibrillation. 
First, we study the ECG and the processing needed to work 
with it for this specific pathology. To this end, we study 
different ways of eliminating, as far as possible, any 
activity that does not originate in the atria. We study 
and reproduce the ECG processing methodologies and the 
characteristics extracted from the electrocardiograms by 
the researchers who obtained the best results in the 
PhysioNet Challenge, in which ECG recordings were 
classified according to the type of atrial fibrillation 
(AF) they showed. We extract a large number of 
characteristics, some used by these researchers and 
others that we consider important for this distinction. 
A new method based on evolutionary algorithms is then 
used to select the most relevant characteristics and to 
obtain a classifier capable of distinguishing the 
different types of this pathology. 



BACKGROUND 

The electrocardiogram (ECG) is a diagnostic tool that 
measures and records the electrical activity of the heart 
in exquisite detail (Lanza 2007). Interpretation of these 
details allows diagnosis of a wide range of heart 
conditions. The QRS complex is the most striking waveform 
within the electrocardiogram (Figure 1). Since it reflects 
the electrical activity within the heart during the 
ventricular contraction, the time of its occurrence as 
well as its shape provide much information about the 
current state of the heart. Due to its characteristic shape, 
it serves as the basis for the automated determination 
of the heart rate, as an entry point for classification 
schemes of the cardiac cycle, and often also in 
ECG data compression algorithms. 

A normal QRS complex is 0.06 to 0.10 s (60 to 100 ms) 
in duration. To obtain an ECG signal that contains only 
atrial (auricular) activity, free of ventricular activity, 
we will analyse and compare the performance of two 
different approaches: 

1. Remove the activity of the QRS complex by subtracting 
   from the signal a morphological average of that 
   activity for every heart beat. 

2. Detect the TQ segments between heart beats (zones free 
   of ventricular activity) and analyse only the data from 
   those segments. 

Figure 1. Diagram of the QRS complex 
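
The first approach (subtracting a morphological average of every beat) can be illustrated with a minimal sketch. It assumes that the R-peak positions are already known and that a fixed window around each peak covers the ventricular complex; the function name `average_beat_subtraction` and its parameters are hypothetical, and the methods cited in the literature additionally align and scale the template for each beat.

```python
def average_beat_subtraction(ecg, r_peaks, half_width):
    """Subtract the mean beat template from a window around each R peak.

    ecg        : list of samples (single channel)
    r_peaks    : sample indices of the detected R peaks (assumed known)
    half_width : samples kept on each side of a peak (assumed fixed)
    """
    # Use only beats whose window lies completely inside the signal.
    beats = [r for r in r_peaks
             if r - half_width >= 0 and r + half_width < len(ecg)]
    if not beats:
        return list(ecg)

    width = 2 * half_width + 1
    # Morphological average: sample-wise mean over all beat windows.
    template = [sum(ecg[r - half_width + i] for r in beats) / len(beats)
                for i in range(width)]

    # Remove the average ventricular activity from every beat.
    residual = list(ecg)
    for r in beats:
        for i in range(width):
            residual[r - half_width + i] -= template[i]
    return residual
```

Samples outside the beat windows are returned unchanged, so the residual still contains the TQ segments used by the second approach.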



A great variety of algorithms exists for extracting the 
atrial activity from the electrocardiogram, such as the 
Thakor method (a recurrent adaptive filter structure), 
adaptive filtering of the whole band, methods based on 
neural networks, spatio-temporal cancellation methods, and 
methods based on wavelets or on Principal Component 
Analysis (PCA) (Castells et al. 2004, Gilad-Bachrach 
et al. 2004, Petrutiu et al. 2004). 

A fundamental step in any of these approaches is 
the detection of the QRS complex in every heart beat. 
Software QRS detection has been a research topic for 
more than 30 years. Once the QRS complex is identified, 
we have a starting point for implementing different 
techniques for QRST removal. Figure 2 shows how the QRST 
complex is automatically detected. This is the first step 
in the analysis of the ECG. 

Figure 2. Segments detected by the algorithm on the two channels of a recording. The end of the T wave is 
shown in green and the onset of the Q wave in red. Each stretch between the end of a T wave (green) and the 
onset of the next Q wave (red) is therefore a segment of atrial activity. The QRST complex is automatically 
detected with good precision. 

The study of techniques for extracting features from 
ECG signals is a common task in the implementation of any 
automatic signal classification system. When carrying out 
this sub-task, it is very important to analyse the 
different research results available in the literature. 

One relevant feature, obtained in the frequency domain, is 
the Dominant Atrial Frequency (DAF), an index of atrial 
activity that measures the dominant frequency in the 
spectrum of the atrial activity signal. For each ECG 
record, the maximum energy peak of this spectrum is 
located, and its frequency is the one that dominates the 
spectrum (Cantini et al. 2004). Other useful features are 
the RR distance and the outputs of different filters in 
the 4-10 Hz range, obtained with a first-order Butterworth 
filter. The MUSIC (Multiple Signal Classification) method 
of order 12 can be used to calculate the pseudo-periodogram 
of the signal. To obtain more robust estimates, the signal 
can be split into variable-length, non-overlapping windows 
and the frequency spectrum analysed on each of them. Also 
worth noting are the Welch method, the Choi-Williams 
transform, and some heuristic methods used by cardiology 
experts (Atrial Fibrillation, 2007). 
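
As an illustration of the Dominant Atrial Frequency, the following sketch searches the 4-10 Hz band for the frequency of maximum energy using a plain periodogram. This is a simplification chosen for brevity, not the MUSIC or Welch estimators mentioned above; the function name, the band limits as default arguments and the 0.1 Hz search step are assumptions made for the example.

```python
import math


def dominant_frequency(signal, fs, f_min=4.0, f_max=10.0, step=0.1):
    """Return the frequency (Hz) of maximum periodogram power in [f_min, f_max].

    signal : list of samples of the atrial activity signal
    fs     : sampling frequency in Hz
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]       # remove the DC component

    best_f, best_power = f_min, -1.0
    steps = int(round((f_max - f_min) / step))
    for k in range(steps + 1):
        f = f_min + k * step
        # Direct evaluation of the Fourier sum at frequency f.
        re = sum(v * math.cos(2 * math.pi * f * i / fs)
                 for i, v in enumerate(centered))
        im = sum(v * math.sin(2 * math.pi * f * i / fs)
                 for i, v in enumerate(centered))
        power = (re * re + im * im) / n
        if power > best_power:
            best_f, best_power = f, power
    return best_f
```

Restricting the search to the band plays the role of the 4-10 Hz filtering described in the text; applying the function to consecutive non-overlapping windows and combining the results would give the more robust estimate mentioned above.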



GENETIC PROGRAMMING 

Genetic programming (GP) can be understood as an 
extension of the genetic algorithm (GA) (Zhao, 2007). 
GP began as an attempt to discover how computers 
could learn to solve problems in different fields, such as 
automatic design, function approximation, classification, 
robotic control and signal processing, without being 
explicitly programmed to do so (Koza, 2003). GP has also 
been used extensively and satisfactorily in bio-medical 
applications (Lopes, 2007). The primary differences 
between GAs and GP can be summarised as follows: a) GP 
typically codes solutions as tree-structured, 
variable-length chromosomes, while GAs generally use 
chromosomes of fixed length and structure; b) GP 
typically incorporates a domain-specific syntax that 
governs acceptable (or meaningful) arrangements of 
information on the chromosome, whereas GA chromosomes 
are typically syntax free. 

The field of program induction using a tree-structured 
approach was first clearly defined by Koza (Koza, 2003). 
The following steps summarise the search procedure used 
with GP. 

1. Create an initial population of programs, randomly 
   generated as compositions of the function and 
   terminal sets. 

2. WHILE termination criterion not reached DO 

   (a) Execute each program to obtain a performance 
       (fitness) measure representing how well each 
       program performs the specified task. 

   (b) Use a fitness-proportionate selection method 
       to select programs for reproduction to the 
       next generation. 

   (c) Use probabilistic operators (crossover and 
       mutation) to combine and modify components 
       of the selected programs. 

3. The fittest program represents a solution to the 
   problem. 
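
The search procedure can be sketched on a toy symbolic-regression task. Everything specific in this example is an assumption made for illustration: the function set {+, -, *}, the terminal set {x, 1.0, 2.0}, the fitness 1/(1 + error), the operator probabilities, the size limit and the fixed number of generations used as termination criterion. Programs are represented as nested tuples.

```python
import operator
import random

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth=3):
    """Step 1: a random composition of the function and terminal sets."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return tree                              # numeric constant
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))


def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + size(tree[1]) + size(tree[2])


def random_subtree(rng, tree):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = tree[rng.choice((1, 2))]
    return tree


def replace_random(rng, tree, new):
    """Return a copy of tree in which one randomly chosen node is replaced."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return new
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, replace_random(rng, left, new), right)
    return (op, left, replace_random(rng, right, new))


def fitness(tree, cases):
    error = sum(abs(evaluate(tree, x) - y) for x, y in cases)
    return 1.0 / (1.0 + error)


def run_gp(cases, pop_size=60, generations=30, seed=0, max_size=31):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]             # step 1
    best = max(population, key=lambda t: fitness(t, cases))
    for _ in range(generations):                                         # step 2
        scores = [fitness(t, cases) for t in population]                 # (a)
        parents = rng.choices(population, weights=scores, k=pop_size)    # (b)
        offspring = []
        for parent in parents:                                           # (c)
            r = rng.random()
            if r < 0.7:      # crossover: graft a subtree of another parent
                donor = random_subtree(rng, rng.choice(parents))
                child = replace_random(rng, parent, donor)
            elif r < 0.8:    # mutation: graft a freshly generated subtree
                child = replace_random(rng, parent, random_tree(rng, 2))
            else:            # plain reproduction
                child = parent
            offspring.append(child if size(child) <= max_size else parent)
        population = offspring
        best = max(population + [best], key=lambda t: fitness(t, cases))
    return best                                                          # step 3
```

For example, `run_gp([(x, x * x + x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)])` returns the fittest program found. Remembering the best program seen so far is a convenience added here; the steps listed above simply take the fittest program of the final population.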



A NEW INTELLIGENT CLASSIFIER 
BASED ON GENETIC PROGRAMMING 
FOR ECG. 

In the articles we have studied, the authors did not use 
any algorithmic method to classify the electrocardiograms 
(Cantini et al. 2004, Lemay et al. 2004). Instead, they 
applied simple methods that base the classification on 
the discriminating capacity of one single characteristic 
or of pairs of characteristics (through a graphic 
representation) (Hayn et al. 2004). Nevertheless, the fact 
that a single characteristic cannot by itself separate a 
group of patterns into the different categories does not 
mean that it cannot reach a high classification rate when 
combined with others. Due to the large number of 
characteristics obtained from the ECG, a method to 
classify the patterns was needed, together with a way of 
selecting the optimal subgroup of characteristics for 
classification, since the large number of available 
characteristics introduces noise into the search for the 
optimal classifier. In total, 55 different characteristics 
were used, taken from the papers (Cantini et 
al. 2004, Lemay et al. 2004, Hayn et al. 2004, Mora et 
al. 2004). Other papers in the literature have used 
soft-computing methods to analyse the ECG (Wiggins 
et al. 2008, Lee et al. 2007, Yu et al. 2007). 

In this paper, a new intelligent algorithm based on 
genetic programming (one paradigm of the soft-computing 
area) is proposed that simultaneously selects the best 
features and builds the classifier for the problem of 
classifying the spontaneous termination of atrial 
fibrillation. In this algorithm, genetic programming is 
used to search for a good classifier at the same time as 
it searches for an optimal subgroup of characteristics. 
The algorithm maintains a population of classifiers, each 
associated with a fitness value that indicates how well 
it classifies. Each classifier is made up of: 

1. A binary vector of characteristics, which indicates 
   with 1's the characteristics it uses. 

2. A multitree with as many trees as there are classes in 
   the data set of the problem. Every tree z distinguishes 
   between class z (giving a positive output) and the rest 
   of the classes (negative output). Furthermore, each 
   tree is associated with the values p_j (frequency of 
   failures) and w_j (frequency of successes). The trees 
   are made up of function nodes {+, -, *, /, 
   trigonometric functions (sine, cosine, etc.), 
   statistical functions (minimum, maximum, average)} and 
   terminal nodes {constants and features}. Their 
   translation to a mathematical formula is immediate. 
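
A minimal sketch of this representation follows, assuming trees are stored as nested tuples and features are named "f0", "f1", and so on. The class name `MultitreeClassifier`, the protected division and the rule of choosing the class whose tree gives the largest output are assumptions made for the example; the text instead attaches the success and failure frequencies to each tree, which are omitted here.

```python
import math
from dataclasses import dataclass
from typing import List, Union

# A terminal is a constant or a feature name "f<i>"; a function node is a
# tuple (operator, child, ...).
Tree = Union[float, str, tuple]

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if abs(b) > 1e-9 else 1.0,   # protected division
    "sin": math.sin,
    "cos": math.cos,
}


def evaluate(tree, features):
    """Translate a tree into its mathematical formula and evaluate it."""
    if isinstance(tree, str):                 # feature terminal, e.g. "f3"
        return features[int(tree[1:])]
    if not isinstance(tree, tuple):           # numeric constant
        return float(tree)
    op, *children = tree
    return OPS[op](*(evaluate(child, features) for child in children))


@dataclass
class MultitreeClassifier:
    mask: List[int]      # binary vector: 1 marks a characteristic in use
    trees: List[Tree]    # tree z separates class z from the other classes

    def predict(self, features):
        # Characteristics that are switched off are hidden from the trees.
        masked = [x if m else 0.0 for x, m in zip(features, self.mask)]
        outputs = [evaluate(tree, masked) for tree in self.trees]
        # Assumed conflict rule: the class whose tree answers most positively.
        return max(range(len(outputs)), key=outputs.__getitem__)
```

For instance, `MultitreeClassifier(mask=[1, 0], trees=[("-", "f0", 1.0), ("-", 1.0, "f0")])` uses only the first characteristic and assigns class 0 when it exceeds 1.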



The algorithm consists of a loop in which, in each 
iteration, a new population is formed from the previous 
one through the genetic operators. The classifiers with 
the highest fitness have a greater chance of 
participating, so the quality of the population tends to 
improve over successive generations. The proposed 
algorithm is composed of the following building blocks: 

1. Fitness function. The fitness function combines 
   the double objective of achieving a good classification 
   and a small subgroup of characteristics: 

   \[ \mathrm{Fitness} = f \cdot \left( 1 + \alpha \, e^{-\beta / n} \right) \tag{1} \]

   In this equation, f is the number of successful 
   classifications made by the trees, \beta is the 
   cardinality of the feature subset used, n is the total 
   number of features, and \alpha is a parameter that 
   determines the relative importance assigned to correct 
   classification and to the size of the feature subset, 
   calculated as: 

   \[ \alpha = C \left( 1 - \frac{gen}{TotalGen} \right) \tag{2} \]

   where C is a constant, TotalGen is the number of 
   generations for which the proposed genetic algorithm is 
   evolved, and gen is the current generation number. 
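
The following sketch evaluates the fitness function, reading Equations (1) and (2) as Fitness = f (1 + α e^(-β/n)) with α = C (1 - gen/TotalGen). This reading, the function names and the default value C = 0.1 are assumptions made for the example.

```python
import math


def alpha(gen, total_gen, c=0.1):
    """Equation (2): the weight of the subset-size term decays linearly to 0."""
    return c * (1.0 - gen / total_gen)


def fitness(f, beta, n, gen, total_gen, c=0.1):
    """Equation (1): classification successes f, rewarded for small subsets.

    f    : number of successful classifications
    beta : number of features used by the classifier
    n    : total number of available features
    """
    return f * (1.0 + alpha(gen, total_gen, c) * math.exp(-beta / n))
```

With 55 available features, a classifier that uses 5 of them scores higher in early generations than one with the same number of successes that uses 50; in the last generation the bonus vanishes and only the classification successes count.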



Figure 3. An example of a crossover operation in the proposed multitree classifier. (a) and (b) are initially the 
classifiers P1 and P2. Figures (c) and (d) show the results of the crossover operator. 

2. Reproduction operator: a classifier chosen 
   proportionally to its fitness passes on, intact, to 
   the next generation. 

3. Mutation operator: a classifier is selected randomly 
   and nodes of one of its trees are changed, giving 
   more probability to the worst trees. 

4. Crossover operator: homogeneous crossover 
   (between classifiers with the same characteristics) and 
   heterogeneous crossover (between classifiers with a 
   similar subgroup). It exchanges subtrees and trees 
   between the classifiers. Figure 3 shows the behaviour 
   of this operator. 

It was considered useful to assess the characteristics 
first, and to use this assessment when a subgroup is 
assigned to a classifier. This is performed in the 
following steps: 

• Each characteristic is given a probability of being 
  assigned to the initial subgroup of the classifier that 
  is proportional to its assessment. 

• G-flip was used to assess the characteristics 
  (Gilad-Bachrach et al. 2004). G-flip is a greedy search 
  algorithm for maximizing an evaluation function 
  that takes into account the number of features selected. 
  The algorithm repeatedly iterates over the feature set 
  and updates the set of chosen features. In each 
  iteration it decides whether to remove or add the 
  current feature to the selected set by evaluating 
  the margin term of the evaluation function with 
  and without this feature. This algorithm is similar 
  to the zero-temperature Monte-Carlo (Metropolis) 
  method. It converges to a local maximum of the 
  evaluation function, as each step increases its 
  value and the number of possible feature sets is 
  finite. 

• The proposed methodology devalues bad characteristics 
  in groups with a large number of characteristics, thus 
  accelerating their convergence to good groups of 
  characteristics and good classification results. 
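
The greedy flipping scheme described above can be sketched for an arbitrary evaluation function. The function name `g_flip`, the fixed visiting order of the features and the cap on the number of passes are assumptions made for the example; the margin-based evaluation function of the cited work is not reproduced and must be supplied by the caller.

```python
def g_flip(n_features, evaluate, max_passes=100):
    """Greedy search: flip one feature at a time, keep strict improvements.

    n_features : number of candidate features, indexed 0 .. n_features - 1
    evaluate   : function mapping a set of feature indices to a score
    """
    selected = set()
    best = evaluate(selected)
    for _ in range(max_passes):
        changed = False
        for feature in range(n_features):
            candidate = selected ^ {feature}     # add it if absent, else remove it
            score = evaluate(candidate)
            if score > best:                     # accept only if the value increases
                selected, best, changed = candidate, score, True
        if not changed:                          # local maximum reached
            break
    return selected
```

Because a flip is accepted only when the score strictly increases and the number of feature subsets is finite, the loop always stops at a local maximum, which is the convergence argument given in the text.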



SIMULATION RESULTS 

We have used and compared two different new intelligent 
classifiers. The first one is an online feature selection 
algorithm using genetic programming. The proposed genetic 
programming methodology simultaneously selects a good 
subset of features and constructs a classifier using the 
selected features for the problem of ECG classification. 
We have designed new genetic operators in order to produce 
a robust and precise algorithm. The other classifier is 
based on the hybridization of a feature selection 
algorithm and a neural network system based on a kernel 
method (Support Vector Machine). 

We have four classification tasks: 

• Event A: to distinguish recordings of type N (Group 
  N: non-terminating AF, defined as AF that was not 
  observed to have terminated for the duration of the 
  long-term recording, at least an hour following the 
  segment) from recordings of type T (Group T: AF that 
  terminates immediately, within one second, after the 
  end of the record). 
• Event B: to distinguish recordings of type S (Group S: 
  AF that terminates one minute after the end of the 
  record) from those of type T. 
• Event C: to distinguish recordings of type N from a 
  second group that includes recordings of types S and T. 
• Event D: to separate the three types of recordings 
  simultaneously. 

Groups N, T and S are distributed across a learning set 
(consisting of 10 labelled records from each group) and 
two test sets. Test set A contains 30 records, of which 
about one-half are from group N and the remainder are 
from group T. Test set B contains 20 records, 10 from 
each of groups S and T. Table 1 shows the simulation 
results (in % of classification) for different methods 
and for the evolutive algorithm proposed for ECG 
classification. 

Table 1. Comparison of different approaches (standard deviation in brackets) 

Method     Infogain               New evolutive algorithm   Kernel method (Support Vector     Relief 
           (Molina et al. 2002)   for classification        Machine) (Scholkopf et al. 2002)  (Kononenko 1994) 
Task       Best   Median          Best   Median             Best   Median                     Best   Median 
Event A    93     91 (±2)         100    98 (±2)            100    98 (±2)                    72     64 (±8) 
Event B    70     66 (±4)         95     81 (±14)           80     68 (±12)                   80     74 (±6) 
Event C    96     88 (±6)         89     83 (±6)            84     75 (±9)                    74     68 (±6) 
Event D    68     62 (±4)         85     80 (±5)            83     77 (±6)                    53     49 (±4) 



FUTURE TRENDS 

Signal processing for bio-medical problems is an exciting 
and growing field. The rapid development of powerful 
microcomputers has promoted the widespread application of 
software for electrocardiogram analysis, of QRS detection 
algorithms in cardiological devices, and of automatic 
classifiers. 

However, an important research field for the coming 
years will be the hybridization of new intelligent 
techniques, such as genetic algorithms and genetic 
programming, or other paradigms from soft-computing (fuzzy 
logic, neural networks, SVM, etc.), to improve the 
behaviour of standard classification algorithms for the 
diagnosis of different cardiological pathologies. 



CONCLUSIONS 

In this paper, a new online feature selection algorithm 
using genetic programming has been proposed as a 
classifier for the spontaneous termination of atrial 
fibrillation. Our genetic programming methodology 
automatically selects the required features while 
designing the multitree classifier. 

Different genetic operators have been designed for the 
multitree classifier and, for better performance of the 
classifier, the initialization process generates solutions 
using smaller feature subsets that have previously been 
selected with a greedy search algorithm (G-flip) for 
maximizing the evaluation function. The effectiveness of 
the proposed scheme is demonstrated on a real problem: 
the classification of the spontaneous termination of 
atrial fibrillation. At this point, it is important to 
note that the use of different characteristics gives 
different classification results, as observed by the 
authors working on this challenge. The selection of the 
features extracted from an electrocardiogram has a 
strong influence on the problem to be solved and on the 
behaviour of the classifier. Therefore, it is important to 
develop a general tool, able to deal with different 
cardiac illnesses, that can select the most appropriate 
features in order to obtain an automatic classifier. As 
can be observed, the proposed methodology gives very good 
results compared with the winner of the PhysioNet and 
Computers in Cardiology Challenge 2004, even though it has 
been developed in a general way to solve different 
classification problems. 



KEY TERMS 

Arrhythmia: Arrhythmias are disorders of the 
regular rhythmic beating of the heart. Arrhythmias 
can be divided into two categories: ventricular and 
supraventricular. 

Atrial Fibrillation: The atrial fibrillation (AF) is 
the sustained arrhythmia that is most frequently found 
in clinical practice, present in 0.4% of the total popu- 
lation. Its frequency increases with age and with the 
presence of structural cardiopathology. AF is especially 
prevalent in the elderly, affecting 2-5% of the popula- 
tion older than 60 years and 10 percent of people older 
than 80 years. 

Electrocardiogram: The electrocardiogram (ECG) 
is a diagnostic tool that measures and records the 
electrical activity of the heart. 

Feature Selection: Feature selection is a process 
frequently used in classification algorithms, wherein 
a subset of the features available from the data is 
selected for the classifier. The best subset contains 
the smallest number of dimensions or features that 
contribute most to a correct classification process. 

Genetic Algorithm: Genetic Algorithms (GA) are 
a way of solving problems by mimicking the same 
processes mother nature uses. They use the same com- 
bination of selection, recombination and mutation to 
evolve a solution to a problem. 

Genetic Programming: Genetic Programming 
(GP) evolves a solution in the form of a Lisp program 
using an evolutionary, population-based search 
algorithm that extends the fixed-length concepts 
of genetic algorithms. 

Soft-Computing: Refers to a collection of different 
paradigms (such as fuzzy logic, neural networks, 
simulated annealing, genetic algorithms and other 
computational techniques) that focus on analysing, 
modelling and discovering information in very complex 
problems. 

Support Vector Machine (SVM): A special type of 
neural network that performs classification by 
constructing an N-dimensional hyperplane that separates 
the data into two categories. 



INTRODUCTION 

The concept of agent has been successfully used in a 
wide range of applications such as robotics, e-commerce, 
agent-assisted user training, military transport 
and health care. The origin of this concept can be traced 
to 1977, when Carl Hewitt proposed the idea of an 
interactive object called an actor. This actor was defined 
as a computational agent that has a mail address 
and a behaviour (Hewitt, 1977). Actors receive messages 
from other actors and carry out their tasks 
concurrently. 

A single agent is rarely sufficient to carry out a 
relatively complex task. The usual approach consists of a 
society of agents, called a Multiagent System (MAS), which 
communicate, collaborate and coordinate with one another 
in pursuit of a goal. 

The purpose of this chapter is to analyze the aspects 
related to the application of MAS to System Engineer- 
ing and Robotics, focusing on those approaches that 
combine MAS with other Artificial Intelligence (AI) 
techniques. 



BACKGROUND 

There is no definition of the term agent that is accepted 
by every researcher. In fact, agent researchers have 
offered a variety of definitions, each explicating his 
or her particular use of the word. An extensive list of 
these definitions can be found in (Franklin and Graesser, 
1996). Reproducing that list falls outside the scope of 
this chapter. However, we include some of them in order 
to illustrate how heterogeneous these definitions are. 



"Autonomous agents are computational systems that 
inhabit some complex dynamic environment, sense and 
act autonomously in this environment, and by doing 
so realize a set of goals or tasks for which they are 
designed." (Maes, 1995, p. 108) 

"Autonomous agents are systems capable of autonomous, 
purposeful action in the real world." (Brustoloni, 
1991, p. 265) 

"An agent is anything that can be viewed as perceiving 
its environment through sensors and acting upon that 
environment through effectors." (Russell and Norvig, 
1995, p. 31) 

Despite the existing plethora of definitions, agents 
are often characterized simply by describing their 
features (long life, autonomy, reactivity, proactivity, 
collaboration, ability to perform in a dynamic and 
unpredictable environment, etc.). Given these 
characteristics, users can delegate to agents tasks 
designed to be carried out without human intervention, 
for instance, as personal assistants that learn from 
their user. 

In most applications, a standalone agent is not 
sufficient for carrying out the desired task: agents are 
forced to interact with other agents, forming a MAS. 
Due to their capacity for flexible autonomous action, 
MAS can deal with open (or at least highly dynamic 
or uncertain) environments. Moreover, MAS 
can effectively manage situations where distributed 
systems are needed: the problem being solved is itself 
distributed, the data are geographically distributed, the 
system has many components and a huge amount of content, 
the system must accommodate future extensions, etc. A 
researcher could use a single agent to implement all the 
tasks. Nevertheless, this type of macroagent represents a 
bottleneck for the speed, reliability and management of 
the system. 

It is clear that the design of a MAS is more complex 
than that of a single agent. Apart from the code that 
deals with the task itself, a developer needs to 
implement those aspects related to communication and 
negotiation among the agents and to their organization in 
the system. Nevertheless, it has been shown that MAS 
offer more than they cost (Cockburn, 1996) (Gonzalez, 
2006) (Gonzalez, 2006b) (Gyurjyan, 2003) (Seilonen, 
2005). 



MAS, AI AND SYSTEM ENGINEERING 

An important topic in System Engineering is the 
process control problem. It can be defined as that of 
manipulating the input variables of a dynamic system 
in an attempt to influence the output variables in a 
desired fashion, for example, to achieve certain values 
or certain rates (Jacquot, 1981). In this context, as in 
other engineering disciplines, there are many relevant 
formalisms and standards whose description is outside 
the scope of this chapter. The interested reader can 
find an introductory presentation of these aspects in 
(Jacquot, 1981). 

Despite their advantages, there are few approaches 
to the application of MAS technology to process 
automation (far fewer than in other fields such as the 
manufacturing industry). Some reasons for this lack 
of application can be found in (Seilonen, 2005): 

• Process automation requires run-time specifications 
  that are difficult to reach with current agent 
  technology. 

• The parameters in the automation process design 
  are usually interconnected in a strict way, so 
  it is highly difficult to decompose the task into 
  agent behaviors. 

• There is a lack of parallelism to be modeled through 
  agents. 

In spite of these difficulties, some significant 
approaches to the application of MAS to process control 
can be distinguished: 

• In one interesting approach, communication techniques 
  among agents are used as a mechanism for integrating 
  independently designed systems. An example of this 
  approach is the ARCHON (Architecture for Cooperative 
  Heterogeneous on-line systems) architecture (Cockburn, 
  1996), which has been used in at least three engineering 
  domains: Electricity Transportation, Electricity 
  Distribution and Particle Accelerator Control. 
  In ARCHON, each application program (known 
  as an Intelligent System) is provided with a layer 
  (called the Archon Layer) that allows it to transfer 
  data/messages to other Intelligent Systems. 

• A second approach consists of those systems that 
  implement closed-loop control. In this respect, we cite 
  the work of (Velasco et al., 1996) on the control of a 
  thermal power plant. 

• A different proposal consists of complementing a 
  pre-existing process automation system with agent 
  technology. In other words, the agent technology 
  complements the automation system rather than replacing 
  it. The agent system is an additional layer that 
  supervises the automation system and reconfigures it 
  when necessary. Seilonen et al. also propose a 
  specification of a BDI-model-based agent platform for 
  process automation (Seilonen, 2005). 

• V. Gyurjyan et al. (2003) propose a controller 
  system architecture with the ability to combine 
  heterogeneous processes and/or control systems 
  in a homogeneous environment. This architecture 
  (based on the FIPA standard) develops the agents 
  as a level of abstraction and uses a description 
  of the control system in a language called COOL 
  (Control Oriented Ontology Language). 

• Tetiker et al. (2006) propose a decentralized 
  multi-layered agent structure for the control of 
  distributed reactor networks, where local control 
  agents individually decide on their own objectives, 
  allowing the framework to achieve multiple local 
  objectives concurrently at different parts of the 
  network. On top of that layer, a global observer 
  agent continuously monitors the system. 

• Horling, Lesser et al. (2006) describe a soft 
  real-time control architecture designed to address 
  temporal and ordering constraints, shared resources 
  and the lack of a complete and consistent world 
  view. Drawing on challenges encountered in a real-time 
  distributed sensor allocation environment, the 
  system is able to generate schedules respecting 
  temporal, structural and resource constraints, to 
  merge new goals with existing ones, and to detect 
  and handle unexpected results from activities. 

• Another proposal for a real-time control architecture 
  is CIRCA (A Cooperative Intelligent Real-Time 
  Control Architecture) by Musliner, Durfee and 
  Shin (1993), which uses separate AI and real-time 
  subsystems to address the problems for which 
  each is designed. 

In this context, we proposed a MAS (called 
MASCONTROL) for the identification and control of 
processes, whose design follows the FIPA specifications 
(FIPA, 2007) regarding architecture, communication and 
protocols. This MAS implements a self-tuning regulator 
(STR) scheme, so it is not a new general control 
algorithm but a new approach to its development. Its 
main contribution consists of showing that a controller, 
through the use of MAS and ontologies expressed in OWL 
(Web Ontology Language), can control systems in an 
autonomous way, using actions whose description is 
available, for example, on the web, and can read from it 
(without knowing it a priori) the logic of how to carry 
out the control. In this context, our experience is that 
agents do not offer any advantage if they are not 
intelligent, and ontologies represent an intelligent way 
to manage knowledge since they provide the common format 
in which agents can express that knowledge. Two important 
advantages of their use are extensibility and 
communication with other agents sharing the same 
language. These advantages are shown in the particular 
case of open systems, that is, when different MAS from 
different developers interact (Gonzalez, 2006). 

As a STR, our MAS tries to carry out the processes 
of identification and control of a plant. We consider 
that this model can be properly managed by a MAS 
due to two main reasons: 

A STR scheme contains modules that are con- 
ceptually different, such as the direct interaction 
with the plant to control, identification of the 
system and determination of the best values for 
the controller parameters. 
It is possible to carry out the calculations in a par- 
allel way. For instance, several transfer functions 
could be explored simultaneously. Thus, several 
agents can be launched in different computers, 
taking advantage of the possibility of parallelism 
provided by the MAS. 
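
The two loops of an STR can be illustrated with a minimal sketch. It is not the MASCONTROL implementation: it assumes a hypothetical first-order plant y[k+1] = a*y[k] + b*u[k], uses recursive least squares for the identification step (the chapter uses ANN training instead), and uses a simple one-step-ahead control law for the design step. All names and numeric values are illustrative.

```python
def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive least-squares step for the model y = theta . phi."""
    P_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
             P[1][0] * phi[0] + P[1][1] * phi[1]]
    denom = forgetting + phi[0] * P_phi[0] + phi[1] * P_phi[1]
    gain = [P_phi[0] / denom, P_phi[1] / denom]
    error = y - (theta[0] * phi[0] + theta[1] * phi[1])
    theta = [theta[0] + gain[0] * error, theta[1] + gain[1] * error]
    P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(2)]
         for i in range(2)]
    return theta, P


def run_self_tuning_regulator(steps=60, a_true=0.8, b_true=0.5):
    """Control the assumed plant y[k+1] = a*y[k] + b*u[k] with unknown a, b."""
    theta = [0.0, 1.0]                    # initial guess for [a, b]
    P = [[1000.0, 0.0], [0.0, 1000.0]]    # large covariance: low confidence
    y = 0.0
    outputs, references = [], []
    for k in range(steps):
        r = 1.0 if (k // 10) % 2 == 0 else -1.0   # square-wave set point
        a_hat, b_hat = theta
        if abs(b_hat) < 1e-3:             # avoid dividing by a near-zero gain
            b_hat = 1e-3
        u = (r - a_hat * y) / b_hat       # design step: one-step-ahead control
        y_next = a_true * y + b_true * u  # plant response (inner loop)
        theta, P = rls_update(theta, P, [y, u], y_next)  # identification step
        y = y_next
        outputs.append(y)
        references.append(r)
    return theta, outputs, references


if __name__ == "__main__":
    estimate, outputs, references = run_self_tuning_regulator()
    print("estimated [a, b]:", estimate)
    print("last output vs. set point:", outputs[-1], references[-1])
```

In a MAS version of this scheme, the identification step and the design step would naturally belong to different agents, and several candidate models could be estimated by separate agents at the same time.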

Another innovative aspect of this work is the use of
artificial neural networks (ANN) for the identification
and for the determination of the controller parameters.
ANN and STR present clear analogies: training a neural
network consists of finding the best values for the
weights of the network, while an STR must optimize
the parameters of a model (identification) or of a
controller. Because of this similarity, we have applied
ANN training methods to control problems. Here, ANN
are used for two purposes: optimizing the parameters of
a model of the unknown system and optimizing the
parameters of a controller. The resulting system can
thus be seen as a hybrid intelligent system for a
real-time application. The interested reader can find a
more detailed description of the system in (Gonzalez,
2006b).

It is important to remark that this framework can be
used with any identification and control algorithm.
We have tested the MAS on several different plants
and obtained proper behavior. However, because of the
transmission rate and the optimization time, the designed
MAS should only be used to control processes that are
not excessively fast, in accordance with the first
restriction stated above. Nevertheless, we expect to have
shown an example of how the other two restrictions
(strong interdependency of the parameters and lack
of parallelism) can be overcome.

As can be seen, the mentioned restrictions often
become serious obstacles to the application of MAS
to Engineering Systems. In this framework, Fuzzy rules
are a very common way to define single-agent
behaviours (Hoffmann, 2003). Unfortunately, defining
the rules is cumbersome in most cases. To ease the
difficult task of generating adequate rules, several
automatic algorithms have been proposed, including
new rule extraction approaches based on connectionist
models. Among them, Neuro-Fuzzy systems have proven
to be a way to obtain the rules that takes advantage of
the learning properties of Neural Networks and of the
way Fuzzy rules express knowledge (Mitra and Hayashi,
2000).

In this context, several applications have been
developed. In Robotics, Lee and Qian (1998) describe a
two-component system for picking up moving objects
for a vibratory feeder, and Kiguchi (2004) proposes a
hierarchical neuro-fuzzy controller for a robotic
exoskeleton that assists the motion of physically weak
persons, such as elderly, disabled and injured persons.
As a particular case, a system for the detection and
identification of road markings is presented in this
chapter. This system has been mounted on a vehicle, as
can be seen in Figure 1.

This system is based on infrared technology and
a classification tool based on a Neuro-Fuzzy System.
A particular feature of this kind of task is that
detection and classification have to be done in real
time. Hence, the time consumed by the hardware and the
processing algorithms is critical for taking the right
decision within the time frame in which it is relevant.
For an inexpensive and fast system, infrared technology
is a good alternative in this kind of application.
Accordingly, and taking the time limitations into
account, a device based on infrared technology is
combined with different techniques for extracting
convenient Fuzzy rules (Marichal, 2006). It is important
to remark that the extraction and interpretation of
rules have generated great interest in recent years
(Guillaume, 2007).

The final purpose is to achieve a MAS in which each
agent does its work as fast as possible, overcoming
the temporal limitations of MAS pointed out by
Seilonen (2005). In this context, we would like to
highlight some approaches that apply MAS to decision
fusion in distributed sensor systems, in particular
that of Yu and Sycara (2006). To achieve such a MAS,
it is necessary to obtain the rules for each agent.
Furthermore, an in-depth analysis of the rules has to
be carried out, minimizing their number and establishing
the mapping between the rules and the different
scenarios.

The approach used in the case shown is based on
designing rules for each situation found by the vehicle.
In fact, each different scenario should be expressed
by its own rules. This feature gives more flexibility in
designing the desired MAS. For that reason, separating
the rules according to the kind of road marking helps
in this purpose. Table 1 shows the result of this process
for the infrared system shown in Figure 1. Note that the
reference values are the values associated with each road
marking, the range is the interval in which the output
values of the resulting Fuzzy system can lie for a
particular sign and, finally, the rules are indicated by
an order number.

Figure 1. Infrared system under the vehicle

Table 1. Rules extracted by the neuro-fuzzy approach

                  Arrow   Right Arrow   Yield         Forward-right Arrow   Other Rules
Range             [0 2)   [2 4)         [4 6)         [6 8]                 [-1 0]
Reference Value   1       3             5             7
Rules             6, 7    8, 9, 10,     13, 14, 15,   19, 20, 21, 22,       1, 2, 3,
                          11, 12        16, 17, 18    23, 24, 25            4, 5
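
The relation between ranges, reference values and road markings described for Table 1 amounts to an interval lookup on the output of the Fuzzy system. The sketch below is only an illustration: the pairing of labels with intervals is an assumption read from the column order of Table 1, and the function name is hypothetical.

```python
# (lower bound, upper bound, reference value, label).
# The label assigned to each interval is an assumption based on the
# column order of Table 1, not a confirmed mapping.
RANGES = [
    (0.0, 2.0, 1, "arrow"),
    (2.0, 4.0, 3, "right arrow"),
    (4.0, 6.0, 5, "yield"),
    (6.0, 8.0, 7, "forward-right arrow"),
]


def classify_marking(output):
    """Map the output of the Fuzzy system to (label, reference value)."""
    top = RANGES[-1][1]
    for low, high, reference, label in RANGES:
        # Intervals are half-open, except the last one, which is closed.
        if low <= output < high or (high == top and output == top):
            return label, reference
    return None  # output outside every road-marking range


if __name__ == "__main__":
    for value in (1.2, 4.0, 8.0, 9.5):
        print(value, classify_marking(value))
```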

It is important to remark that the obtained rules need
to be interpreted. In this way, it is possible to
associate these rules with different situations and to
generate new rules better suited to the particular case
under consideration. Hence, the agents related to the
detection and classification of the signs can be
expressed by this set of Fuzzy rules. Moreover, the
agents in charge of taking decisions based on the
information provided by the detection and classification
of a particular road marking can incorporate these
rules. The problems in the task decomposition process
pointed out by Seilonen (2005) can be simplified in this
way. On the other hand, although the design of behaviors
is very important, the issues related to co-operation
among agents are also essential. In this context, the
work of Howard et al. (2007) can be cited.



FUTURE TRENDS 

As technology provides faster and more efficient
computers, the application of AI techniques to MAS
is expected to become increasingly popular. This
improvement in computing capacity, together with some
emerging techniques (meta-level accounting, schedule
caching, variable time granularities, etc.) (Horling,
Lesser et al., 2006), will allow other AI methods that
currently cannot be applied in the field of System
Engineering to be introduced efficiently in the near
future.

In our opinion, another important feature to explore
is the improvement of MAS communication. It is also
advisable to look for more efficient MAS protocols
and standards, in addition to those aspects related to
new hardware features. These improvements would
allow, for example, the development of operational
real-time tele-operated applications.



CONCLUSION 

The application of MAS to Engineering Systems and
Robotics is an attractive platform for the convergence
of various AI technologies. This chapter shows, in
summarized form, how different AI techniques (ANN,
Fuzzy rules, Neuro-Fuzzy systems) have been
successfully included in MAS technology in the field
of System Engineering and Robotics. These techniques
can also overcome some of the drawbacks traditionally
described for MAS application, in particular the highly
difficult decomposition of the task into agent
behaviors and the lack of parallelism to be modeled
through agents.

However, present-day MAS technology does not
completely fulfill the severe real-time requirements
that are implicit in automation processes. Thus, until
technology provides faster and more efficient
computers, our opinion is that the application of AI
techniques in MAS needs to be optimized for real-time
systems, for example by extracting convenient Fuzzy
rules and minimizing their number.



KEY TERMS

Artificial Neural Network: An organized set of 
many simple processors called neurons that imitates 
a biological neural configuration. 

FIPA: Stands for "Foundation for Intelligent
Physical Agents", an IEEE Computer Society standards
organization that promotes agent-based technology
and the interoperability of its standards with other
technologies.

MultiAgent System: System composed of several 
agents, usually designed to cooperate in order to reach 
a goal. 

Neuro-Fuzzy: Hybrids of Artificial neural networks 
and Fuzzy Logic. 

Ontology: Set of classes, relations, functions, etc. 
that represents knowledge of a particular domain. 

Real-Time System: System with operational dead- 
lines from event to system response. 

Self-Tuning Regulator: Type of adaptive control
system composed of two loops, an inner loop (process
and ordinary linear feedback regulator), and an outer
loop (recursive parameter estimator and design
calculation which adjusts its parameters).



INTRODUCTION

The query answering system carries out data selection,
data preparation, pattern discovery and pattern
development as agent-based processes within the
multi-agent system, and it is designed to ensure
communication between agents and their effective
operation within the multi-agent system. The system is
designed to process and evaluate fuzzy, incomplete
information by means of the fuzzy SQL query method.
The fuzzy approach makes the modelled system
intelligent, and the learning process lets it make
predictions about the future.

The system operates through a process in which the
agents within the multi-agent system filter and evaluate
both the knowledge in their databases and the knowledge
they receive externally, according to certain criteria.
The system uses two types of knowledge. The first is the
data existing in the agent databases within the system;
the second is the data that agents receive from the
outer world and that is not included in the evaluation
criteria. Upon receiving data from the outer world, the
agent first evaluates it in the knowledgebase, then
evaluates it for use in the rule base, and finally
applies an evaluation process to the rule base in order
to store the knowledge in the task base. Meanwhile, the
agent also completes the learning process.

This paper presents an intelligent query answering
mechanism, a process in which the agents within
the multi-agent system filter and evaluate both the
knowledge in databases and the knowledge received
externally by the agents. The following sections
include the necessary literature review and the query
answering approach, followed by the future trends and
the conclusion.



BACKGROUND

The query answering system in the agents takes fuzzy
SQL queries from the agents, then creates and optimizes
a query plan that involves the multiple data sources of
the whole multi-agent system. Accordingly, it controls
the execution of the task to generate the data set. The
query operation is the basic function of query
answering; through it, the most important function of
the system is fulfilled. This study also discusses the
peer-to-peer network structure and the SQL structure,
as well as the query operation.

Query operations have been applied in various fields.
For example, selecting the related knowledge in a web
environment has been evaluated in terms of the
relational concept in databases. A relational database
system particularly assists the system in making
evaluations for decisions about the future and in making
the right decisions with the fuzzy logic approach
(Raschia & Mauaddib, 2002; Tatarinov et al., 2003;
Galindo et al., 2001; Bosc et al.; Chaudhry et al., 1999;
Saygin et al., 1999; Turgay et al., 2006).

Query operation has mostly been used for choosing
the related information in a web environment (Jim &
Suciu, 2001; He et al., 2004). The data mining approach
has been used in the dynamic site discovery process,
through data preparation and type recognition in
complex schema matching with correlation values
in query interfaces and query schemas (Nambiar &
Kambhampati, 2006; Necib & Freytag, 2005). Query
processing within a peer-to-peer network structure with
SQL structure has been discussed in general terms
(Cybenko et al., 2004; Bernstein et al., 1981). Query
processing and databases have been reviewed in relation
to relational databases (Genet & Hinze, 2004;
Halashek-Wiener et al., 2006). Fuzzy sets were proposed
by Zadeh (1965), and the division of features into
various linguistic values has been widely used in
pattern recognition and in fuzzy inference systems.
Kubat et al. (2004) reviewed how frequently the fuzzy
logic approach appears among operations research and
artificial intelligence methods in discrete
manufacturing. Data processing within multi-agent
systems can be grouped as static or dynamic: the
evaluation of existing data by the system can be
referred to as a static structure, while the evaluation
of new or possible data within the system can be
referred to as a dynamic structure. The studies on the
static structure can be described as the query process
of database management (McClean, Scotney, Rutjes &
Hartkamp, 2003), and the studies on the dynamic
structure as concerning the agent system as a whole
(Purvia, Cranefield, Bush & Carter, 2000; Hoschek, 2002;
Doherty, Lukaszewicz, & Szalas, 2004; Turgay, 2006).



AGENT BASED QUERY ANSWERING
SYSTEM

The query process lists the knowledge with the desired
characteristics in compliance with the required
condition, while query answering finds the knowledge
conforming to the required conditions and responds to
the related message in the form of knowledge. In
particular, a well-defined query answering process within
multi-agent systems provides communication among
agents, the sharing of knowledge, and the effective
performance of data processing and learning
activities. The system is able to process incomplete
or fuzzy knowledge intelligently with the fuzzy SQL
query approach.

The distributed query answering mechanism was
proposed as a cooperative agent-based solution for
information management with fuzzy SQL queries. A
multi-agent approach to information management
includes features such as:

Concurrency
Distributed computation
Modularity
Cooperation



Figure 1 represents each agent's query answering
mechanism. When data is received by the system,
the query variables are chosen by the query and then
the data related to the fuzzy SQL query are proposed.
The obtained result is represented as the answer
knowledge in the agent, and thus the process is
completed.

The data are classified by the fuzzy query approach,
depending on fuzzy relations and importance levels.
The rule base of the system is formed after a query
and evaluation. The task base structure of the system
is updated by the mechanism in line with the obtained
fuzzy rules, which ensures that the system makes
appropriate and correct decisions and acts
intelligently.

Figure 1. Model driven framework for query answering mechanism in a multi-agent system



Operation Mechanism of Agent Based
Fuzzy Query Answering System

The agent does the following:

Step1: receives the task knowledge from the related agent
Step2: does the fuzzification of knowledge
Step3: determines fuzzy grade values according to knowledge features
Step4: determines the knowledge in compliance with the criteria through fuzzy SQL commands
Step5: sends the obtained task or rule to the related agent
Step6: performs the answering operation
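
The fuzzification, grading and selection steps listed for the agent can be sketched as follows. This is a minimal illustration under stated assumptions: the attribute "price", the linguistic labels, the triangular membership shapes and the function names are all hypothetical, and the selection function only imitates what a fuzzy SQL condition with a minimum grade would do.

```python
def triangular(x, left, peak, right):
    """Triangular membership function returning a grade in [0, 1]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic labels for a numeric attribute "price".
LABELS = {
    "cheap": (0.0, 20.0, 60.0),
    "medium": (40.0, 70.0, 100.0),
    "expensive": (80.0, 120.0, 160.0),
}


def fuzzify(value):
    """Fuzzification: grade of the value for every label."""
    return {label: triangular(value, *shape) for label, shape in LABELS.items()}


def fuzzy_select(records, attribute, label, threshold):
    """Keep the records whose grade for the label reaches the threshold,
    ordered by decreasing grade."""
    answer = []
    for record in records:
        grade = fuzzify(record[attribute])[label]
        if grade >= threshold:
            answer.append((record["id"], grade))
    return sorted(answer, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    knowledge = [
        {"id": "k1", "price": 20},
        {"id": "k2", "price": 40},
        {"id": "k3", "price": 70},
    ]
    print(fuzzy_select(knowledge, "price", "cheap", 0.5))
```

The list returned by `fuzzy_select` plays the role of the answer knowledge that the agent sends to the related agent.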



The agent based query answering system involves
three main stages: knowledge processing, query
processing and agent learning (see Figure 2). The
operation of these stages is described in detail below.

Knowledge Processing

This is the stage where the agent receives knowledge
from the external environment and makes the necessary
preparations before the query. The criteria and
keywords to be used in evaluating the received data
are defined in this stage, which can also be called
pre-query. The keywords, concepts, attributes and
relationship knowledge to be analysed by the agent are
determined in this stage, before the query.

Figure 2. Suggested system model for each agent

In this system, the behaviour structure of the
intelligent query answering system is formed. During
system modelling, the perception model plays an
important role: it treats the signals, data and
knowledge coming from the external environment as
input, which gives the learning module a more
understandable structure. This input modelling defines
the perception set as $\langle A_{i,x}, A_i, 0 \rangle$,
where $A_{i,x}$ denotes the percept x that agent i
receives from the external environment. Table 1 includes
the nomenclature of the agent based query answering
system. The multi-agent system consists of more than one
agent. The agent set is $A = \{A_1, A_2, \ldots, A_i\}$.
The knowledge set is $K = \{K_1, K_2, \ldots, K_y\}$.
The knowledgebase is <Definition of Knowledge,
Attribute, Dependency Situation, Agent> (in Table 1
and Figure 3).

The rule set is $R = \{R_1, R_2, \ldots, R_x\}$. The rule
base is <Definition of Rule, Attribute, Dependency
Situation, Agent>. The task set is
$T = \{T_1, T_2, \ldots, T_j\}$. The task base is
<Definition of Task, Attribute, Dependency Situation,
Agent>.

When data arrives from the external environment,
it is perceived as the input
$\langle A_{i,x}, A_i, 0 \rangle$. When x is perceived by
agent i, it is referred to as $A_{i,x}$. This input can
also be used in the knowledgebase, rule base and task
base. The mechanism of intelligent query answering
should achieve the following goals, which were
determined as a result of processing and evaluating the
information coming to the knowledgebase:

Goal definition
Data selection
Data preparation

Goal definition 
Data selection 
Data preparation 

Query Processing 

The agent performs two types of query in the process 
of defining keywords, concepts or attributes during 
knowledge processing. The first is external query, which 
is realized among the agents, while the second is the 
internal query, where the agent scans the knowledge 
within itself. During these query processes, the fuzzy 
SQL approach is applied. 

Feature-Attribute At and relation Re are elements 
formed among the components within the system. 
These elements are the databases of knowledgebase, 
rule base and task base. While attribute refers to agent 
specifications, Resource includes not only raw data 
externally received but also knowledgebase, rule base 
and task base which each agent possesses. 




Table 1. The nomenclature of agent based query answering system

Symbol                      Meaning
$A$                         agent set $\{A_1, A_2, \ldots, A_i\}$ of i agents
$T$                         task set $\{T_1, T_2, \ldots, T_j\}$ of j tasks
$A_{i,x}$                   percept x of agent i
$\bigcup_{k}^{m} T_{jk}$    task sets j of agent i; refers to the continuing subsets from situation k to m
$L_{i,m}$                   learning situation m of agent i
$Q_{i,n}$                   querying situation n of agent i
$At_i$                      attribute situation of agent i
$R_{i,r}$                   decision situation r of agent i
$K_{i,y}$                   knowledgebase y of agent i
$R_{i,x}$                   rule base x of agent i
$T_{i,t}$                   task base t of agent i






Intelligent Query Answering Mechanism in Multi Agent Systems 



$A_i = \{At_i, Re(K_{i,y}, R_{i,x}, T_{i,t})\}$

Let $P(At)$ denote the set of all possibility
distributions that may be defined over the domain of an
attribute $At$. A fuzzy relation $R$ with schema
$At_1, At_2, \ldots, At_n$, where each $At_k$ is an
attribute, is defined as
$R = P(At_1) \times P(At_2) \times \ldots \times P(At_n) \times D$,
where $D$ is a system-supplied attribute for membership
degree with domain [0,1] and $\times$ denotes the cross
product.

Each data value $v$ of the attribute is associated with
a possibility distribution defined over the domain of
the attribute and has a membership function denoted
by $\mu_v(x)$. If the data value is crisp, its
possibility distribution is defined by

$$\mu_v(x) = \begin{cases} 1 & \text{if } x = v \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$



Like standard SQL, queries in fuzzy SQL are specified
in a select statement of the following form:

    SELECT  Attributes
    FROM    Relations
    WHERE   Selection Conditions



The semantics of a fuzzy SQL query is defined
based on the satisfaction degrees of the query
conditions. Consider a predicate $X \,\theta\, Y$ in a
WHERE clause. The satisfaction degree, denoted by
$d(X \,\theta\, Y)$, is evaluated for the values of X
and Y. Let the value of X be $v_1$ and that of Y be
$v_2$. Then,

$$d(X \,\theta\, Y) = \max_{x,y} \left( \min\left( \mu_{v_1}(x), \mu_{v_2}(y), \mu_\theta(x, y) \right) \right) \qquad (2)$$

where x and y are crisp values in the common domain
over which $v_1$ and $v_2$ are defined (Yang et al.,
2001). The function $\mu_\theta$ compares the variables
in terms of their degree of satisfaction. When the
satisfaction degree is evaluated for X and Y, the
former takes the value $v_1$ and the latter the value
$v_2$.
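
The satisfaction degree of a fuzzy predicate can be computed directly over a discrete domain. The sketch below follows the max-min definition given for the WHERE clause; the dictionary representation of possibility distributions and the "approximately equal" comparator with its tolerance are assumptions made for illustration.

```python
def satisfaction_degree(mu_v1, mu_v2, mu_theta):
    """Max over (x, y) of min(mu_v1(x), mu_v2(y), mu_theta(x, y)).

    mu_v1 and mu_v2 map crisp domain values to possibility degrees;
    mu_theta(x, y) returns the degree to which x and y satisfy the
    comparison operator.
    """
    return max(
        min(px, py, mu_theta(x, y))
        for x, px in mu_v1.items()
        for y, py in mu_v2.items()
    )


def crisp(v):
    """Possibility distribution of a crisp value: 1 at v, 0 elsewhere."""
    return {v: 1.0}


def equals(x, y):
    """Crisp equality as a comparator."""
    return 1.0 if x == y else 0.0


def approximately_equal(x, y, tolerance=5.0):
    """Hypothetical fuzzy comparator: degree falls linearly with |x - y|."""
    return max(0.0, 1.0 - abs(x - y) / tolerance)


if __name__ == "__main__":
    about_25 = {20: 0.5, 25: 1.0, 30: 0.5}   # fuzzy value "about 25"
    print(satisfaction_degree(about_25, crisp(27), approximately_equal))
    print(satisfaction_degree(crisp(3), crisp(3), equals))
```

With two crisp operands and a crisp comparator the degree is either 0 or 1, which is the behaviour of an ordinary SQL condition; for the fuzzy example above the degree is about 0.6.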



As shown in Figure 2, bids are taken as a set, the
frequencies of the received bids are fixed, and the
bids are then decomposed into groups. The decomposed
bids are stored in the databases of the multi-agent
system. The information in the databases is fuzzified,
and the interrelations within it are determined in
terms of weight and importance level.

Agent Learning Process 

This is the process in which the system learns the
knowledge obtained from a query as a rule or a task. The
system fulfils not only the task but also the learning
process (Figure 3). During its activities, the agent
system learns by processing the data arriving from the
external environment according to the defined aim. The
learning algorithm shows the variability of the system
status (Table 2).

In the learning process, with the help of query
processing, candidate rules are determined by taking
the fuzzy dimension attributes and the measure
attributes into consideration. Therefore, it can be
said that the system has a hierarchical order from
knowledgebase to rule base and from rule base to
task base.

Table 2. The query answering mechanism's learning analysis algorithm

Algorithm Learning Analysis

Input: A relational view that contains a set of records and the questions for influence analysis.
Output: An efficient association rule.
Step1: Specifies the fuzzy dimension attribute and the measure attribute.
Step2: Identifies the fuzzy dimension item sets and calculates the support coefficient.
Step3: Identifies the measure item sets and calculates the support coefficient.
Step4: Constructs sets of candidate rules, and computes the confidence and aggregate value.
Step5: Obtains a rule at the granularity level with greatest confidence, and forms a rule at the aggregation level with largest abstract value of the measure attribute.
Step6: Computes the assertions at different levels, exits if comparable (i.e., there is no inconsistency found in semantics at different levels).
Step7: Generates rules from the refined measure item sets and forms the framework of the rule.
Step8: Constructs the final rule as a task for the related agent.
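
The support and confidence computations that the learning analysis algorithm relies on can be sketched with the usual association-rule definitions. This is a simplified illustration, not the authors' procedure: records are assumed to carry a single linguistic label per attribute (crisp assignment of fuzzy labels), and the attribute names and data are hypothetical.

```python
def support(records, items):
    """Fraction of records containing every (attribute, label) pair."""
    hits = sum(1 for r in records if all(r.get(a) == v for a, v in items))
    return hits / len(records)


def confidence(records, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0
    return support(records, antecedent + consequent) / base


def best_rule(records, candidates, min_support=0.0):
    """Return (confidence, antecedent, consequent) for the candidate rule
    with the greatest confidence among those reaching min_support."""
    scored = [
        (confidence(records, a, c), a, c)
        for a, c in candidates
        if support(records, a + c) >= min_support
    ]
    return max(scored, key=lambda s: s[0]) if scored else None


if __name__ == "__main__":
    view = [
        {"price": "cheap", "demand": "high"},
        {"price": "cheap", "demand": "high"},
        {"price": "cheap", "demand": "low"},
        {"price": "expensive", "demand": "low"},
    ]
    rules = [
        ([("price", "cheap")], [("demand", "high")]),
        ([("price", "expensive")], [("demand", "low")]),
    ]
    print(best_rule(view, rules))
    print(best_rule(view, rules, min_support=0.5))
```

Raising `min_support` trades confidence for coverage: a rule that holds in every matching record may still be discarded if it matches too few records.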



Figure 3. The way the input perceived by the agent is processed



FUTURE TRENDS

The future tasks of the system will be realized once the
system performs query answering more quickly, thanks
to the distributed, autonomous, intelligent and
communicative agent structure of the suggested agent
based fuzzy query answering system. In the fuzzy
approach, the system will first examine and group the
relational data in the databases of the agents with
fuzzy logic, and will then shape the rule base of the
system by applying the fuzzy logic method to these data.
After the related rule is chosen, the rule base of the
system will be designed and the decision mechanism of
the system will operate. Therefore, the relational
database structure and the system behaviour are
important in determining the primary characteristics of
the system and in terms of data cleaning.

For future research, it is noted that the design of 
fuzzy databases involves not just modelling the data 
but also modelling operations on the data. Relational 
databases support only limited data types, while fuzzy 
and possibility databases allow a much larger number 
of comparatively complex data types (e.g., possibility 
distributions). This suggests that it might be fruitful to 
employ object-oriented database technology to allow 
explicit modelling of complex data types. 

The incorporation of fuzziness into distributed 
events can be performed as a future study. Finally, 
due to frequent changes in the positions and status of 
objects in an active mobile database environment, the 
issue of temporality should be considered by adapting 
the research results of temporal database systems area 
into active mobile databases. 



CONCLUSION 

This paper discusses a variety of issues in adapting
fuzzy database concepts to an active multi-agent
database system that incorporates active rules in a
multi-computing environment. This study shows how
fuzziness can be introduced into different aspects of
rule execution, from event detection to coupling modes.
As an initial step, the calculation of membership
degrees for various types of composite events has been
explained. Coupling modes have been determined
dynamically by using the strengths of events and the
reliabilities of conditions, which are calculated via
membership functions. Strengths of events and condition
reliabilities have also been shown to be useful for
condition and action status. The partitioning of the
rule set into multi-agent system events has been
discussed as an example of inter-rule fuzziness.
Similarity based event detection has been introduced to
active multi-agent databases, which is an important
contribution from the perspective of performance.



KEY TERMS

Agent: A system that fulfils independent functions,
perceives the outer world and establishes links with
other agents through its software.

Flexible Query: A query that incorporates some elements
of natural language so as to allow a simple and
powerful expression of subjective information needs.

Fuzzy SQL (Structured Query Language): An extension
of the SQL language that allows flexible conditions to
be written in queries. FSQL allows the use of
linguistic labels defined on any attribute.

Fuzzy SQL Query: Fuzzy SQL allows the system
to make flexible queries about crisp or fuzzy attributes
in fuzzy relational data or knowledge.

Intelligent Agent: A sophisticated intelligent computer
program that is situated, independent, reactive,
proactive and flexible, recovers from failure and
interacts with other agents.

Multi-Agent System: A flexible, incorporated network
of software agents that interact to solve problems that
are beyond the individual capacities or knowledge of
each problem solver.

Query: Carries out the scanning of the data with the
required specifications.

Query Answering: Answers a user query with the
help of a single database or multiple databases in the
multi-agent system.

System: A set of components considered to act as
a single goal-oriented entity.



INTRODUCTION

Artificial Neural Networks (ANNs) are based on the
behaviour of the brain, so they can be considered
intelligent systems. ANNs are built on the model of a
brain and include its main component, the neuron.
Moreover, the neurons are connected so that they
interact with each other to acquire the intended
intelligence. Finally, like any brain, an ANN needs
memory, which in this model is provided by its weights.

From this point of view, we can affirm that these
systems are able to learn difficult tasks. In this
article, the task to learn is to distinguish between the
presence and the absence of a reflected signal, called
the target, in a Radar environment dominated by clutter.
The clutter comprises all the signals reflected from
other objects in the Radar environment that are not the
desired target. Noise is also considered in this
environment because it is always present in
communications systems.



BACKGROUND 

ANNs, as intelligent systems, are able to detect
known targets in adverse Radar conditions. These
conditions involve one of the most difficult kinds of
clutter, coherent Weibull clutter. This is possible
because ANNs trained in a supervised way can
approximate the Neyman-Pearson (NP) detector (De la
Mata-Moya, 2005; Vicen-Bueno, 2006; Vicen-Bueno,
2007), which is commonly used in Radar systems design.
This detector maximizes the probability of detection
(Pd) while keeping the probability of false alarm (Pfa)
lower than or equal to a given value (VanTrees, 1997).
The detection of targets in the presence of clutter is
the main problem in Radar detection systems. Many
clutter models have been proposed in the literature
(Cheikh, 2004), and one of the most widely used is the
Weibull model (Farina, 1987a; DiFranco, 1980).

The research presented in (Farina, 1987b) established the optimum
detector for target and clutter with arbitrary Probability Density
Functions (PDFs). Because analytical expressions for the optimum
detector cannot be obtained, only suboptimum solutions were proposed.
The Target Sequence Known A Priori (TSKAP) detector is one of them and
is taken as the reference for the experiments. These solutions also
involve implementation problems, some of which make them
non-realizable.

As mentioned above, one kind of ANN, the MultiLayer Perceptron (MLP),
is able to approximate the NP detector when it is trained in a
supervised way to minimize the Mean Square Error (MSE) (Ruck, 1990,
Jarabo, 2005). For this reason, MLPs have been applied to the
detection of known targets in different Radar environments (Gandhi,
1997, Andina, 1996).



INTELLIGENT RADAR DETECTORS 
BASED ON ARTIFICIAL NEURAL 
NETWORKS 

This section starts with a discussion of the models selected for the
target, clutter and noise signals. For these models, the optimum and
suboptimum detectors are presented; they are taken as a reference for
the experiments. Next, the intelligent detector proposed in this work
is presented. It is based on ANNs, and its structure and parameters
are analyzed. Finally, results are obtained for the detectors under
study in order to analyze their performance.

Signal Models: Target, Clutter and Noise 

The Radar is assumed to collect N pulses in a scan, so the input
vectors (z) presented to the detector are composed of N complex
samples. Under hypothesis H0 (target absent), z is composed of N
samples of clutter and noise. Under hypothesis H1 (target present), a
known target characterized by a fixed amplitude (A) and phase (θ) for
each of the N pulses is added to the clutter and noise samples. A
Doppler frequency of 0.5·PRF is assumed in the target model, where
PRF is the Pulse Repetition Frequency of the Radar system.

The noise is modelled as a coherent white Gaussian complex process of
unity power, i.e., a power of 1/2 for each of the in-phase and
quadrature components. The clutter is modelled as a coherent
correlated sequence with Gaussian Autocorrelation Function (ACF),
whose complex samples have a modulus with a Weibull PDF:



$$p(|w|) = a\, b^{-a}\, |w|^{a-1}\, e^{-(|w|/b)^{a}} \qquad (1)$$



where |w| is the modulus of the coherent Weibull se- 
quence and a and b are the skewness (shape) and scale 
parameters of a Weibull distribution, respectively. 
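
As a quick numerical check of the Weibull PDF in (1), the following minimal sketch evaluates it on a grid and verifies that it integrates to approximately one. The function name `weibull_pdf`, the scale value b=1 and the integration grid are illustrative assumptions; only the shape value a=1.2 is taken from the experiments described later.

```python
import numpy as np

def weibull_pdf(w_abs, a, b):
    """Weibull PDF of the clutter modulus |w| with shape a and scale b, as in (1)."""
    w_abs = np.asarray(w_abs, dtype=float)
    return a * b ** (-a) * w_abs ** (a - 1.0) * np.exp(-(w_abs / b) ** a)

# Illustrative values: a = 1.2 as in the experiments, b = 1 chosen arbitrarily.
a, b = 1.2, 1.0
grid = np.linspace(1e-9, 60.0, 600_001)
pdf = weibull_pdf(grid, a, b)

# Trapezoidal rule written out explicitly; the area should be close to 1.
area = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(grid)))
```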

The NxN autocorrelation matrix of the clutter is 
given by 



$$(M_c)_{h,k} = P_c\, \rho_c^{|h-k|^{2}}\, e^{\,j 2\pi (h-k) f_c / PRF} \qquad (2)$$



where the indexes h and k vary from 1 to N, P_c is the clutter power,
ρ_c is the one-lag correlation coefficient and f_c is the Doppler
frequency of the clutter.
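
The autocorrelation matrix of the clutter can be built directly from its definition. The sketch below is only an illustration: the function name, the one-lag coefficient 0.9 and the zero clutter Doppler are assumed values, and the Doppler frequency is passed already normalized by the PRF, which is an assumption of this sketch.

```python
import numpy as np

def clutter_acf_matrix(n_pulses, p_c, rho_c, f_c_over_prf):
    """N x N clutter autocorrelation matrix with a Gaussian-shaped ACF.

    p_c: clutter power, rho_c: one-lag correlation coefficient,
    f_c_over_prf: clutter Doppler frequency normalized by the PRF (assumed).
    """
    idx = np.arange(1, n_pulses + 1)
    diff = idx[:, None] - idx[None, :]            # h - k for every matrix entry
    magnitude = p_c * rho_c ** (diff ** 2)        # P_c * rho_c^{|h-k|^2}
    phase = np.exp(1j * 2.0 * np.pi * diff * f_c_over_prf)
    return magnitude * phase

# Illustrative values only (rho_c and f_c are not given in the text).
m_c = clutter_acf_matrix(3, p_c=1000.0, rho_c=0.9, f_c_over_prf=0.0)
```

The diagonal holds the clutter power, and the matrix is Hermitian, as expected for an autocorrelation matrix.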

The relationship between the Weibull distribution parameters and P_c
is



$$P_c = \frac{2 b^{2}}{a}\, \Gamma\!\left(\frac{2}{a}\right) \qquad (3)$$



where Γ(·) is the Gamma function. 
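
The clutter power of a Weibull-distributed modulus is its second moment, b²Γ(1+2/a), which equals (2b²/a)Γ(2/a). A minimal sketch of this relationship and of its inversion (obtaining the scale b from a desired clutter power) follows; the function names are hypothetical.

```python
from math import gamma, sqrt

def clutter_power(a, b):
    """Clutter power for Weibull shape a and scale b: (2 b^2 / a) * Gamma(2 / a)."""
    return (2.0 * b ** 2 / a) * gamma(2.0 / a)

def scale_from_power(p_c, a):
    """Scale parameter b that yields clutter power p_c for shape a."""
    return sqrt(p_c * a / (2.0 * gamma(2.0 / a)))

# Example: scale needed for a clutter power of 1000 (30 dB) with a = 1.2.
b = scale_from_power(1000.0, 1.2)
```

For a=2 the expression reduces to P_c=b², the Rayleigh case.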

The model used to generate coherent correlated Weibull sequences
consists of two blocks in cascade: a correlator filter and a
NonLinear MemoryLess Transformation (NLMLT) (Farina, 1987a). To
obtain the desired sequence, a coherent white Gaussian sequence is
correlated with the filter designed according to (2) and (3). The
NLMLT block, according to (1), then gives the sequence the desired
Weibull distribution. In this way, a coherent sequence with the
desired correlation and PDF is obtained.
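
The two-block structure can be sketched as follows. This is not the generator of (Farina, 1987a): the correlator is replaced here by a Cholesky factor of a Gaussian-shaped correlation matrix with zero Doppler, and the NLMLT is assumed to be the modulus mapping b·(|g|²)^(1/a) applied to unit-power complex Gaussian samples g, which keeps the phase and yields a Weibull modulus. The transformation also alters the correlation, an effect the source's filter design accounts for and this sketch does not. All names and the value ρ_c=0.9 are illustrative.

```python
import numpy as np
from math import gamma

def coherent_weibull_clutter(n_vectors, n_pulses, a, p_c, rho_c, rng):
    """Sketch of a correlator followed by a memoryless nonlinear transformation."""
    # Block 1 (correlator, assumed form): colour white complex Gaussian samples
    # so that their correlation is rho_c**((h-k)**2) with unit power.
    idx = np.arange(n_pulses)
    corr = rho_c ** ((idx[:, None] - idx[None, :]) ** 2)
    chol = np.linalg.cholesky(corr)
    white = (rng.standard_normal((n_vectors, n_pulses))
             + 1j * rng.standard_normal((n_vectors, n_pulses))) / np.sqrt(2.0)
    gauss = white @ chol.T
    # Block 2 (NLMLT, assumed form): |g|^2 is exponential with unit mean, so
    # b * (|g|^2)**(1/a) has a Weibull PDF with shape a and scale b.
    b = np.sqrt(p_c * a / (2.0 * gamma(2.0 / a)))
    modulus = b * (np.abs(gauss) ** 2) ** (1.0 / a)
    return modulus * np.exp(1j * np.angle(gauss))

rng = np.random.default_rng(0)
clutter = coherent_weibull_clutter(50_000, 3, a=1.2, p_c=1000.0, rho_c=0.9, rng=rng)
mean_power = float(np.mean(np.abs(clutter) ** 2))   # should be close to p_c
```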

Taking into consideration that the complex noise 
samples are of unity variance (power), the following 
power relationships are considered for the study: 

Signal to Noise Ratio: SNR = 10 log10(A^2) 
Clutter to Noise Ratio: CNR = 10 log10(P_c) 
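
Combining these definitions with the target model described above, the target amplitude and clutter power follow from the SNR and CNR in dB, and the known target sequence can be written explicitly. In this sketch the function names and the indexing of pulses from zero are assumptions; the Doppler of 0.5·PRF is the value stated in the text.

```python
import numpy as np

def amplitude_from_snr_db(snr_db):
    """Invert SNR = 10 log10(A^2) for unit-power noise."""
    return 10.0 ** (snr_db / 20.0)

def clutter_power_from_cnr_db(cnr_db):
    """Invert CNR = 10 log10(P_c) for unit-power noise."""
    return 10.0 ** (cnr_db / 10.0)

def target_vector(n_pulses, amplitude, theta, doppler_over_prf=0.5):
    """Known target: fixed amplitude and phase, rotated by the Doppler shift per pulse."""
    n = np.arange(n_pulses)                      # pulse index, starting at 0 (assumed)
    return amplitude * np.exp(1j * (theta + 2.0 * np.pi * doppler_over_prf * n))

amplitude = amplitude_from_snr_db(20.0)          # SNR of 20 dB
p_c = clutter_power_from_cnr_db(30.0)            # CNR of 30 dB
target = target_vector(2, amplitude, theta=0.0)
```

With a Doppler frequency of half the PRF, consecutive target samples simply alternate in sign.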

Neyman-Pearson Detectors: Optimum 
and Suboptimum Detectors 

The problem of optimum Radar detection of targets in clutter is
explored in (Farina, 1987a) for the case where both are time
correlated and have arbitrary PDFs. The optimum detector scheme is
built around two non-linear estimators of the disturbances under both
hypotheses, which minimize the MSE. The detection of Gaussian
correlated targets in Gaussian correlated clutter plus noise is
studied there, but for the cases where the hypotheses are
non-Gaussian distributed, only suboptimum solutions are studied.

The proposed detectors basically consist of two channels. The upper
channel is matched to the condition that the sequence to be detected
is the sum of the target plus clutter in the presence of noise
(hypothesis H1), while the lower one is matched to the detection of
clutter in the presence of noise (hypothesis H0).

Figure 1. Target sequence known a priori detector 

For the detection problem considered in this paper, the suboptimum
detection scheme (TSKAP) shown in figure 1 is taken. Considering that
the CNR is very high (CNR>>1), the inverse of the NLMLT is assumed to
transform the Weibull clutter into Gaussian clutter, so the Linear
Prediction Filter (LPF) is a linear filter of order N-1. Then, the
NLMLT transforms the filter output into a Weibull sequence. Besides
being suboptimum, this scheme presents two important drawbacks:

1. The prediction filters have N-1 memory cells that must contain
suitable information to predict correct values for the N samples of
each input pattern. So N+(N-1) pulses are necessary to decide if the
target is present or not.

2. The target sequence must be subtracted from the input of the H1
channel.

It makes no sense to subtract the target component before deciding
whether this component is present or not. So, in practical cases,
this makes the scheme non-realizable.

Intelligent Radar Detectors 

In order to overcome the drawbacks of the scheme described in the
previous section, a detector based on an MLP with log-sigmoid
activation functions in its hidden and output neurons, followed by a
hard limit threshold at its output, is proposed. Since MLPs have been
proven to approximate the NP detector when minimizing the MSE
(Jarabo, 2005), it can be expected that the MLP-based detector
outperforms the suboptimum scheme proposed in (Farina, 1987a).
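
The structure of such a detector can be sketched as a forward pass through one hidden layer followed by a threshold comparison. The weights below are random placeholders rather than trained values, and the ordering of the real inputs (all in-phase samples, then all quadrature samples) is an assumption of this sketch.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid activation, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_detect(z, w_hidden, b_hidden, w_out, b_out, threshold):
    """Return (decision, mlp_output) for one complex observation vector z."""
    x = np.concatenate([np.real(z), np.imag(z)])     # complex samples as real inputs
    hidden = logsig(w_hidden @ x + b_hidden)
    output = float(logsig(w_out @ hidden + b_out)[0])
    return int(output > threshold), output            # hard limit threshold

# Placeholder (untrained) weights for a 6/5/1 network and an arbitrary input.
rng = np.random.default_rng(1)
w_hidden, b_hidden = rng.standard_normal((5, 6)), rng.standard_normal(5)
w_out, b_out = rng.standard_normal((1, 5)), rng.standard_normal(1)
z = np.array([1.0 + 2.0j, -0.5 + 0.3j, 0.2 - 1.1j])
decision, y = mlp_detect(z, w_hidden, b_hidden, w_out, b_out, threshold=0.5)
```

Sweeping `threshold` over (0, 1) trades detections for false alarms, which is how a desired Pfa is set once the network is trained.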

MLPs have been trained to minimize the MSE using two algorithms:
back-propagation (BP) with varying learning rate and momentum
(Haykin, 1999) and Levenberg-Marquardt (LM) with varying adaptive
parameter (Bishop, 1995). While BP is based on the steepest descent
method, LM is based on an approximation to the Newton method that is
designed specifically for minimizing the MSE. For MLPs which have up
to a few hundred weights (W), the LM algorithm is more efficient than
BP with variable learning rate or the conjugate gradient algorithms,
and it is able to converge in many cases where the other two
algorithms fail (Hagan, 1994). In each iteration, the LM algorithm
uses information about the error surface (an estimation of the WxW
Hessian matrix) to find the minimum, which makes it faster than the
previous ones.
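
The core of an LM iteration can be illustrated on a small least-squares problem. In this sketch the Hessian is estimated as JᵀJ from the Jacobian J of the residuals, and the adaptive parameter μ is divided or multiplied by 10 depending on whether the step reduces the error. That adaptation rule, the function names and the toy linear model are assumptions for illustration, not the training setup used in the experiments.

```python
import numpy as np

def lm_step(weights, residual_fn, jacobian_fn, mu):
    """One Levenberg-Marquardt update: w - (J^T J + mu I)^-1 J^T e."""
    e = residual_fn(weights)
    jac = jacobian_fn(weights)
    hessian_est = jac.T @ jac                        # W x W Hessian estimate
    gradient = jac.T @ e
    return weights - np.linalg.solve(hessian_est + mu * np.eye(len(weights)), gradient)

def lm_fit(weights, residual_fn, jacobian_fn, mu=1e-2, n_iter=50):
    """Iterate LM steps, adapting mu according to the success of each step."""
    sse = float(np.sum(residual_fn(weights) ** 2))
    for _ in range(n_iter):
        candidate = lm_step(weights, residual_fn, jacobian_fn, mu)
        cand_sse = float(np.sum(residual_fn(candidate) ** 2))
        if cand_sse < sse:                           # accept: behave more like Newton
            weights, sse, mu = candidate, cand_sse, mu / 10.0
        else:                                        # reject: behave more like gradient descent
            mu *= 10.0
    return weights

# Toy problem: fit y = w0 * x + w1 to noiseless data generated with w = (2, 1).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0
residuals = lambda w: w[0] * x + w[1] - y
jacobian = lambda w: np.column_stack([x, np.ones_like(x)])
w_fit = lm_fit(np.zeros(2), residuals, jacobian)
```

Because the linear system is W×W, the cost per iteration grows quickly with the number of weights, which is consistent with LM being attractive mainly for small networks.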

Cross-validation is used with both training algorithms, where the
training and validation sets are synthetically generated. Moreover, a
new set (test set) of patterns is generated to test the trained MLP
and to estimate the Pfa and Pd by Monte Carlo simulation. All the
patterns of the three sets are generated under the same conditions
(SNR, CNR and a parameter of the Radar problem) in order to study the
capabilities of the MLP plus hard limit thresholding working as a
detector.

MLPs are initialized using the Nguyen-Widrow method (Nguyen, 1999)
and, in all cases, the training process is repeated ten times to
ensure that the performance of all the MLPs is similar on average.
Once all the MLPs are trained, the best MLP in terms of the MSE
estimated with the validation set is selected, in order to avoid the
problem of getting stuck in local minima at the end of the training.

The architecture of the MLP considered for the experiments is I/H/O,
where I is the number of MLP inputs, H is the number of neurons in its
hidden layer and O is the number of MLP outputs. As the MLPs work with
real arithmetic, if the input vector (z) is composed of N complex
samples, the MLP will have 2N inputs (N in-phase and N quadrature
components). The number of MLP independent elements (weights) to
solve the problem is W=(I+1)·H+(H+1)·O, including the bias of each
neuron.
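
The weight count can be evaluated directly from this expression. The helper below is a hypothetical convenience function that simply applies the formula to a few I/H/O architectures.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs):
    """W = (I + 1) * H + (H + 1) * O, counting one bias per neuron."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

w_6_05_1 = mlp_weight_count(6, 5, 1)     # 41
w_6_20_1 = mlp_weight_count(6, 20, 1)    # 161
w_4_20_1 = mlp_weight_count(4, 20, 1)    # 121
w_4_30_1 = mlp_weight_count(4, 30, 1)    # 181
```

Under this formula, the values W=121 and W=181 correspond to 20 and 30 hidden neurons with I=4 inputs, while I=6 inputs give 161 and 241 weights.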

Results 

The performance of the detectors described in the previous sections
is shown in terms of Receiver Operating Characteristic (ROC) curves.
They give the estimated Pd for a desired Pfa, whose values are
obtained by varying the output threshold of the detector. The
experiments presented are made for an integration of two pulses
(N=2). So, in order to test the TSKAP detector correctly, observation
vectors (also called patterns throughout the text) of length 3
(N+(N-1)) complex samples are generated, due to the memory
requirements of the TSKAP detector (N-1 pulses).
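
A ROC curve of this kind can be estimated by counting threshold crossings of the detector output under each hypothesis. The following minimal sketch uses a handful of made-up output values purely to show the mechanics; the function name is hypothetical.

```python
import numpy as np

def roc_points(outputs_h0, outputs_h1, thresholds):
    """Estimate (Pfa, Pd) pairs by sweeping the output threshold."""
    outputs_h0 = np.asarray(outputs_h0, dtype=float)
    outputs_h1 = np.asarray(outputs_h1, dtype=float)
    pfa = np.array([np.mean(outputs_h0 > t) for t in thresholds])   # false alarms under H0
    pd = np.array([np.mean(outputs_h1 > t) for t in thresholds])    # detections under H1
    return pfa, pd

# Made-up detector outputs, for illustration only.
pfa, pd = roc_points([0.1, 0.2, 0.3, 0.4], [0.35, 0.6, 0.8, 0.9], [0.25, 0.5])
```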

The a priori probabilities of the H0 and H1 hypotheses are assumed to
be equal. Three sets of patterns are generated for each experiment:
training, validation and test sets. The first and the second ones
have 5·10^3 patterns each. The third one has 2.5·10^6 patterns, so
the error in the estimation of the Pfa and the Pd is lower than 10%
of the estimated values in the worst case (Pfa=10^-4). The patterns
of all the sets are synthetically generated under the same
conditions. These conditions involve typical values (Farina, 1987a,
DiFranco, 1980, Farina, 1987b) for the SNR (20 dB), the CNR (30 dB)
and the a parameter (a=1.2) of the Weibull-distributed clutter.
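
The 10% figure can be checked with a simple binomial argument. This sketch assumes that "error" means the relative standard deviation of the Monte Carlo estimate and that, with equal priors, half of the test patterns belong to H0; both are interpretations, not statements from the text.

```python
from math import sqrt

def relative_std_error(p, n_trials):
    """Relative standard deviation of a binomial proportion estimate."""
    return sqrt((1.0 - p) / (n_trials * p))

n_test = 2_500_000
n_h0 = n_test // 2                            # equal priors: half the patterns under H0
rel_err = relative_std_error(1e-4, n_h0)      # worst case, Pfa = 1e-4
```

With 1.25·10^6 patterns under H0 and Pfa=10^-4, about 125 false alarms are expected, which gives a relative error of roughly 1/sqrt(125), just under 9%.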

The MLP architecture used to generate the MLP-based detector is
6/H/1. The number of MLP outputs (O=1) is established by the problem
(binary detection). The number of hidden neurons (H) is studied in
this work. The number of MLP inputs (I=6) is established according to
the following criterion: a total of 6 inputs (2(N+(N-1))) is selected
so that the MLP-based detector can be compared with the TSKAP
detector under the same conditions, i.e., when both detectors have
the same available information (3 pulses for an integration of N=2
pulses). This case is considered because of the memory requirements
of the TSKAP detector.

Figure 2 shows the results of a study in which 3 pulses are used by
the MLP-based detector to take the final decision, according to the
criterion described above. The study shows the influence of the
training algorithm and of the MLP size, i.e., the number of
independent elements (W weights) that the ANN has to solve the
problem. Two important aspects have to be noted. The first one is
related to the training algorithm. As can be observed, the
performance achieved with a small MLP (6/05/1) is very similar for
both training algorithms (LM and BP). But when the MLP size is
greater, for instance 6/10/1, the performance achieved with the LM
algorithm is better than that achieved with BP, because the LM
algorithm is more efficient than BP at finding the minimum of the
error surface. Moreover, MLP training with LM is faster than training
with BP, because the number of training epochs can be reduced by an
order of magnitude. The second aspect is related to the MLP size. As
can be observed, when 20 or more hidden neurons are used, LM no
longer improves on BP as it did with 10 hidden neurons. Moreover,
from 20 (W=121 weights) to 30 (W=181 weights) hidden neurons, the
performance tends to a maximum value (independently of the training
algorithm used), i.e., almost no performance improvement is achieved
with more weights. So, an MLP-based detector with 20 hidden neurons
achieves an appropriate performance with low complexity.

A comparison between the performance achieved with the TSKAP detector
and with the MLP-based detector of size 6/20/1 trained with the BP
and LM algorithms is shown in figure 3. Two observations can be made.
The first one is that the performance of the MLP-based detector is
practically independent of the training algorithm when compared with
the results obtained for the TSKAP detector. The second one is that
the 6/20/1 MLP-based detector is always better than the TSKAP
detector when they are compared under the same conditions of
availability of information, i.e., with 3 (N+(N-1)) pulses available
to decide. Under these conditions, and comparing figures 2 and 3, it
can be observed that a 6/05/1 MLP-based detector is enough to
outperform the TSKAP one.

The differences observed between the TSKAP and MLP-based detectors
appear because the first one is a suboptimum detector while the
second one approximates the optimum one, although it will always be
worse than the optimum detector. This cannot be demonstrated, because
an analytical expression for the optimum detector cannot be obtained
when detecting targets in the presence of Weibull-distributed
clutter.

Figure 2. MLP-based detector performances for different structure
sizes (6/H/1) and different training algorithms: (a) BP and (b) LM

Figure 3. TSKAP and MLP-based detectors performances for MLP size
6/20/1 trained with BP and LM algorithms



FUTURE TRENDS 

Two different future trends can be mentioned. The first one is
related to ANNs and the second one to research in Radar detectors. In
the first trend, it is worth highlighting research in areas such as
ensembles of ANNs, committee machines based on ANNs and other ways of
combining the intelligence of different ANNs, such as MLPs, Radial
Basis Functions and others. Moreover, new trends try to find
different ways to train ANNs. In the second trend, several
researchers are trying to find different ways to build Radar
detectors in order to improve their performance. Several solutions
have been proposed, but they depend on the Radar environment
considered. Detectors based on signal processing tools seem to be the
most appropriate, but the intelligent detector presented here is a
new way of working, which can bring good solutions to these problems.
This is possible because of the ability of ANNs to adapt to almost
any kind of Radar conditions and problems.


CONCLUSION 

After the study carried out, several conclusions can be drawn. The LM
training algorithm achieves better MLP-based detectors than the BP
one. No performance improvement is obtained by training MLPs with the
LM or BP algorithms when their sizes are greater than 6/20/1. But the
great advantage of LM over BP is its faster training for small MLPs
(a few hundred weights), i.e., the MLPs considered in this study.
Finally, the MLP-based detector works better than the TSKAP one when
both work with the same available information (N+(N-1)=3 pulses), a
condition imposed by the memory requirements of the TSKAP one. In
those cases, low complexity MLP-based detectors can be obtained,
because a 6/05/1 MLP has enough intelligence to obtain better
performance than the TSKAP one.



KEY TERMS 

Artificial Neural Networks (ANNs): A network 
of many simple processors ("units" or "neurons") that 
imitates a biological neural network. The units are 
connected by unidirectional communication channels, 
which carry numeric data. Neural networks can be 
trained to find nonlinear relationships in data, and are 
used in applications such as robotics, speech recogni- 
tion, signal processing or medical diagnosis. 

Backpropagation Algorithm: Learning algorithm of ANNs, based on
minimising the error obtained from the comparison between the desired
outputs and the ANN outputs after the application of a set of network
inputs. The weights are updated according to the gradient of the
error function evaluated at the point of the input space that
corresponds to the input to the ANN.

Knowledge Extraction: Explicitation of the internal 
knowledge of a system or set of data in a way that is 
easily interpretable by the user. 

Intelligence: It is a property of mind that encom- 
passes many related abilities, such as the capacities to 
reason, plan, solve problems, think abstractly, compre- 
hend ideas and language, and learn. 

Levenberg-Marquardt Algorithm: Similar to the Backpropagation
algorithm, but with the difference that the weight update uses an
estimate of the Hessian matrix of the error function. This matrix
provides information about several directions in which to move in
order to find the minimum of the error function, instead of the
single local gradient direction given by the backpropagation
algorithm.

Probability Density Function: The statistical func- 
tion that shows how the density of possible observations 
in a population is distributed. 

Radar: Acronym of Radio Detection and Ranging. In short, a Radar
emits an electromagnetic wave that is reflected by the target and
other objects present in its observation space. The Radar then
receives these reflected waves (echoes) and analyzes them in order to
decide whether a target is present or not.







Intelligent Software Agents Analysis in 
E-Commerce I 



Xin Luo 

The University of New Mexico, USA 

Somasheker Akkaladevi 

Virginia State University, USA 



INTRODUCTION 

Equipped with sophisticated information technology 
infrastructures, the information world is becoming more 
expansive and widely interconnected. Internet usage 
is expanding throughout the web-linked globe, which 
stimulates people's need for desired information in a 
timely and convenient manner. Electronic commerce 
activities, powered by Internet growth, are increasing 
continuously. It is estimated that online retail will reach 
nearly $230 billion and account for 10% of total U.S. 
retail sales by 2008 (Johnson et al. 2003). In addition, 
e-commerce entailing business-to-business (B2B), 
business-to-customer (B2C) and customer-to-customer 
(C2C) transactions is spawning new markets such as 
mobile commerce. 

By increasing the degree and sophistication of automation, commerce
becomes much more dynamic, personalized, and context sensitive for
both buyers and sellers. Software agents were first used several
years ago to filter information, match people with similar interests,
and automate repetitive behavior (Maes et al. 1999). In recent years,
agents have been applied to the arena of e-commerce, triggering a
revolutionary change in the way we conduct online transactions in
B2B, B2C, and C2C. Researchers argue that the potential of the
Internet for transforming commerce is largely unrealized (Begin et
al. 2002; Maes et al. 1999). Further, He and Jennings noted that a
new model of software agent is needed to achieve the necessary degree
of automation and move to second generation e-commerce applications
(He et al. 2003). This is due to the predicament that electronic
purchases are still largely unautomated. Maes et al. (1999) also
noted that, even though information is more easily accessible and
orders and payments are dealt with electronically, humans are still
in the loop in all stages of the buying process, which inevitably
increases transaction costs. A human buyer is still responsible for
collecting and interpreting information on merchants and products,
making decisions about merchants and products, and ultimately
entering purchase and payment information. Additionally, Jennings et
al. (1998) confirmed that commerce is almost entirely driven by human
interactions and further argued that there is no reason why some
commerce cannot be automated.

This unautomated loop requires a lot of time and 
energy and results in inefficiency and high cost for both 
buyers and sellers. To automate time-consuming tasks, 
intelligent software agent (ISA) technology can play 
an important role in online transaction and negotiation 
due to its capability of delivering unprecedented levels 
of autonomy, customization, and general sophistica- 
tion in the way e-commerce is conducted (Sierra et al. 
2003). Systems containing ISAs have been developed 
to automate the complex process of negotiating a deal 
between a buyer and a seller. An increasing number 
of e-commerce agent systems are being developed 
to support online transactions that have a number of 
variables to consider and to aim for a win-win result 
for sellers and buyers. 

In today's e-commerce arena, systems equipped with ISAs may allow
buyers and sellers to find the best deal, taking into account the
relative importance of each factor. Advanced e-commerce systems that
embody ISA technologies are able to perform a large number of queries
and to process very large volumes of information. ISAs reduce
transaction costs by collecting information about services and
commodities from many firms and presenting only those results with
high relevance to the user. ISA technologies help businesses automate
information transaction activity, largely eliminate human
intervention in negotiation, lower transaction and information search
costs, and further cultivate competitive advantage for companies.
Therefore, ISAs can free people to concentrate on the issues
requiring true human intelligence and intervention. By implementing
personalized, social, continuously running, and semi-autonomous ISA
technologies in business information systems, an online business can
become more user-friendly, semi-intelligent, and human-like (Pivk
2003).



LITERATURE REVIEW 

A number of scholars have defined the term intelligent software
agent. Bradshaw (1997) proposed that one person's intelligent agent
is another person's smart object. Jennings and Wooldridge (1995)
defined an agent as a computer system situated in some environment
that is capable of autonomous action in this environment in order to
meet its design objectives. Shoham (1997) further described an ISA as
a software entity which functions continuously and autonomously in a
particular environment, often inhabited by other agents and
processes. In general, an ISA is a software agent that uses
Artificial Intelligence (AI) in the pursuit of the goals of its
clients (Croft 2002). It can perform tasks independently on behalf of
a user in a network and help users cope with information overload. It
differs from current programs in being proactive, adaptive, and
personalized (Guttman et al. 1998b): it can actively initiate actions
for its users according to the configurations set by the users; it
can read and understand users' preferences and habits to better cater
to their needs; and it can provide users with relevant information
according to the patterns it learns from them.

ISA is a cutting-edge technology in computational sciences and holds
considerable potential to develop new avenues in information and
communication technology (Shih et al. 2003). It is used to perform
multi-task operations in decentralized information systems, such as
the Internet, to conduct complicated and wide-scale search and
retrieval activities, and to assist in shopping decision-making and
product information search (Cowan et al. 2002). An ISA's ability to
perform continuously and autonomously stems from the human desire for
an agent that is capable of carrying out certain activities in a
flexible and intelligent manner, responsive to changes in the
environment, without constant human supervision. Over a long period
of time, an agent is capable of adapting from its previous experience
and would be able to inhabit an environment with other agents,
communicating and cooperating with them to achieve tasks for humans.

Intelligent Agent Taxonomy and 
Typology 

Franklin and Graesser (1996) proposed a general taxonomy of agents
(see Figure 1).

This taxonomy is based on the fact that ISA technologies are
implemented in a variety of areas, including biotechnology, economic
simulation and data-mining, as well as in hostile applications
(malicious codes), machine learning and cryptography algorithms. In
addition, Nwana (1996b) proposed an agent typology (see Figure 2) in
which four types of agents are distinguished: collaborative agents,
collaborative learning agents, interface agents and smart agents.
These four types combine learning, autonomy, and cooperation to
different degrees and therefore tend to address different sides of
this typology in terms of functionality.

According to Nwana (1996b), collaborative agents emphasize autonomy
and cooperation more than learning. They collaborate with other
agents in multi-agent environments and may have to negotiate with
other agents in order to reach mutually acceptable agreements for
users. Unlike collaborative agents, interface agents emphasize
autonomy and learning. They support users and provide proactive
assistance: they can observe the user's actions in the interface and
suggest better ways of completing a task. Also, interface agents'
cooperation with other agents is typically limited to asking for
advice (Ndumu et al. 1997). The benefits of interface agents include
reducing the user's effort in repetitive work and adapting to the
user's preferences and habits. Smart agents are agents that are
intelligent, adaptive, and computational (Carley 1998). They are
advanced intelligent agents combining the best capabilities and
properties of all the categories presented.

Figure 1. Franklin and Graesser's agent taxonomy (Source: Franklin &
Graesser, 1996)

Figure 2. A Part View of Agent Typology (Source: Nwana, 1996b)

This proposed typology highlights the key contexts in which the term
agent is used in the AI literature. Yet Nwana (1996b) argued that
agents ideally should exhibit all three characteristics (learning,
autonomy, and cooperation) equally well, but this is the aspiration
rather than the reality. Furthermore, according to Nwana (1996b) and
Jennings and Wooldridge (1998), five more agent types can be derived
from the typology, from a panoramic perspective (see Figure 3).

In this proposed typology, mobile agents are autonomous and
cooperative software processes capable of roaming wide area networks,
interacting with foreign hosts, and performing tasks on behalf of
their owners (Houmb 2002). Information agents can help us manage the
explosive growth of information we are experiencing. They perform the
role of managing, manipulating, or collating information from many
distributed sources (Nwana 1996b). Reactive agents choose actions by
using the current world state as an index into a table of actions,
where the purpose of the indexing function is to map known situations
to appropriate actions. These types of agents are sufficient for
limited environments where every possible situation can be mapped to
an action or set of actions (Chelberg 2003). Hybrid agents adopt the
strengths of both the reactive and deliberative paradigms. They aim
to have the quick response time of reactive agents for well known
situations, yet also have the ability to generate new plans for
unforeseen situations (Chelberg 2003). Heterogeneous agent systems
refer to an integrated set-up of two or more agents which belong to
two or more different agent classes (Nwana 1996b).
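
The difference between reactive and hybrid agents described above can be sketched in a few lines. The class names, the example states and the trivial planner are hypothetical and serve only to illustrate the table lookup and the fallback to deliberation.

```python
from typing import Callable, Dict, Optional

class ReactiveAgent:
    """Chooses an action by using the current state as an index into a table."""

    def __init__(self, action_table: Dict[str, str]):
        self.action_table = dict(action_table)

    def act(self, state: str) -> Optional[str]:
        # Unknown situations yield no action: the table must cover the environment.
        return self.action_table.get(state)

class HybridAgent(ReactiveAgent):
    """Reacts from the table when possible, otherwise falls back to a planner."""

    def __init__(self, action_table: Dict[str, str], planner: Callable[[str], str]):
        super().__init__(action_table)
        self.planner = planner

    def act(self, state: str) -> Optional[str]:
        action = super().act(state)
        return action if action is not None else self.planner(state)

# Hypothetical shopping states and actions.
table = {"price_dropped": "buy", "out_of_stock": "wait"}
reactive = ReactiveAgent(table)
hybrid = HybridAgent(table, planner=lambda state: "plan_for_" + state)
```

Here `reactive.act("new_seller")` returns `None`, whereas the hybrid agent hands the same unforeseen state to its planner.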



Figure 3. A panoramic overview of the different agent types (Source:
Jennings & Wooldridge, 1998)



CONCLUSION AND FUTURE WORK 

This paper explores how ISAs can automate and add value to e-commerce
transactions and negotiations. By leveraging ISA-based e-commerce
systems, companies can make decisions more efficiently because they
have more accurate information and can identify consumers' tastes and
habits. Opportunities and limitations for ISA development are also
discussed. Future ISA technologies will be able to evaluate basic
characteristics of online transactions in terms of price and product
description as well as other properties, such as warranty, method of
payment, and after-sales service. They should also better manage
ambiguous content, personalized preferences, complex goals, changing
environments, and disconnected parties (Guttman et al. 1998a).
Additionally, regarding the future trend of ISA technology
deployment, Nwana (1996a) states that "Agents are here to stay, not
least because of their diversity, their wide range of applicability
and the broad spectrum of companies investing in them. As we move
further and further into the information age, any information-based
organization which does not invest in agent technology may be
committing commercial hara-kiri."



ISA OPPORTUNITIES AND LIMITATIONS 
IN E-COMMERCE 

Cowan et al. (2002) argued that the human cognitive ability to search
for information and to evaluate its usefulness is extremely limited
in comparison to that of computers. In detail, it is cumbersome and
time-consuming for a person to search for information from limited
resources and to evaluate the information's usefulness. They further
indicated that while people are able to perform several queries in
parallel and are good at drawing parallels and analogies between
pieces of information, advanced systems that embody ISA architecture
are far more effective in terms of calculation power and parallel
processing abilities, particularly in the quantities of material they
can process (Cowan et al. 2002). According to Bradshaw (1997),
information complexity will continue to increase dramatically in the
coming decades. He further contended that the dynamic and distributed
nature of both data and applications requires that software not
merely respond to requests for information but intelligently
anticipate, adapt, and actively seek ways to support users.

E-commerce applications based on agent-oriented e-commerce systems
have great potential. Agents can be designed using the latest
web-based technologies, such as Java, XML, and HTTP, and can
dynamically discover and compose e-services and mediate interactions
to handle routine tasks, monitor activities, set up contracts,
execute business processes, and find the best services (Shih et al.,
2003). The main advantages of using these technologies are their
simplicity of use, ubiquitous nature, heterogeneity and platform
independence (Begin and Boisvert, 2002). XML will likely become the
standard language for agent-oriented e-commerce interactions to
encode exchanged messages, documents, invoices, orders, service
descriptions, and other information. HTTP, the dominant WWW protocol,
can be used to provide many services, such as robust and scalable web
servers, firewall access, and levels of security for these e-commerce
applications.

Agents can be made to work individually, as well 
as in a collaborative manner to perform more complex 
tasks (Franklin and Graesser, 1996). For example, to 
purchase a product on the Internet, a group of agents 
can exchange messages in a conversation to find the 
best deal, can bid in an auction for the product, can 
arrange financing, can select a shipper, and can also 
track the order. Multi-agent systems (groups of agents 
collaborating to achieve some purpose) are critical for 
large-scale e-commerce applications, especially B2B
interactions such as service provisioning, supply chain,
negotiation, and fulfillment. The grouping of agents
can be static or dynamic depending on the specific need
(Guttman et al., 1998b). Close coordination should
be established for the interactions between the agents to
achieve a higher-level task, such as requesting, offering
and accepting a contract for some services (Guttman
et al., 1998a).

There are several publicly available agent toolkits
that can be used to satisfy customer requirements;
ideally, they should adhere to standards that define
multi-party agent interoperability. For example,
fuzzy-logic-based intelligent negotiation agents can
interact autonomously and thereby save human labor
in negotiations. The aim of modeling a negotiation
agent is to reach mutual agreement efficiently and
intelligently. Negotiation agents should be able to
negotiate with other such agents over various sets of
issues on behalf of the real-world parties they
represent, i.e., they should be able to handle
multi-issue negotiations at any given time.
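
To make the idea of multi-issue negotiation more concrete, the sketch below lets a buyer agent and a seller agent exchange offers over two issues. It is deliberately simplified and is not a fuzzy-logic agent: it assumes a weighted additive utility and a linear concession strategy, and all names and numbers (`Negotiator`, `negotiate`, the price and delivery ranges) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Negotiator:
    name: str
    best: dict      # most preferred value per issue
    worst: dict     # least acceptable value per issue (must differ from best)
    weights: dict   # importance of each issue, summing to 1

    def utility(self, offer):
        """Weighted sum of per-issue scores, each clipped to [0, 1]."""
        total = 0.0
        for issue, weight in self.weights.items():
            span = self.worst[issue] - self.best[issue]
            score = (self.worst[issue] - offer[issue]) / span
            total += weight * max(0.0, min(1.0, score))
        return total

    def propose(self, concession):
        """Offer that moves `concession` (0..1) of the way from best to worst."""
        return {issue: self.best[issue]
                + concession * (self.worst[issue] - self.best[issue])
                for issue in self.weights}


def negotiate(buyer, seller, rounds=10):
    """Alternating offers with linear concession.

    Returns (offer, round_index) on agreement, or None if no deal is reached.
    """
    for t in range(rounds):
        concession = t / (rounds - 1)
        for proposer, responder in ((buyer, seller), (seller, buyer)):
            offer = proposer.propose(concession)
            # Accept if the offer is at least as good for the responder as
            # the offer the responder itself would make in this round.
            own = responder.propose(concession)
            if responder.utility(offer) >= responder.utility(own):
                return offer, t
    return None


buyer = Negotiator("buyer", best={"price": 80, "delivery_days": 2},
                   worst={"price": 120, "delivery_days": 10},
                   weights={"price": 0.7, "delivery_days": 0.3})
seller = Negotiator("seller", best={"price": 130, "delivery_days": 14},
                    worst={"price": 90, "delivery_days": 4},
                    weights={"price": 0.6, "delivery_days": 0.4})
print(negotiate(buyer, seller))
```

Each agent only evaluates offers with its own utility function, so neither side needs to reveal its preferences to the other.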

The boom in e-commerce has created the need
for ISAs that can handle complicated online
transactions and negotiations for both sellers and buyers.
In general, buyers want to find sellers that have the
desired products and services, and they want product
information and expert advice before and after
the purchase. Sellers, in turn, want to find
buyers and provide expert advice about their product
or service as well as customer service and support.
Both buyers and sellers therefore have an opportunity
to automate these potential transactions by adopting
ISA technology. The use of ISAs
will be essential to handling many tasks of creating,
maintaining, and delivering information on the Web.
By implementing ISA technology in e-commerce,
agents can shop around for their users; they can
communicate with other agents about product
specifications, such as price, features, quantity, and
service package, compare offers against the user's
objectives and requirements, and return with purchase
recommendations that meet those specifications; they
can also act for sellers by providing product or service
sales advice and help troubleshoot customer problems
by automatically offering solutions or suggestions;
and they can automatically pay bills and keep track of
payments.

From an international standpoint, the state of the
Internet in developed regions, such as the USA, Canada,
Western Europe, Japan, and Australia, and the consequent
evolution of e-commerce as a new business model provide
exciting opportunities and challenges for ISA-based
developments. Opportunities include wider and more
timely market reach, higher earnings, a broader
spectrum of target and potential customers, and
collaboration among vendors. This ISA-powered
e-commerce arena differs from traditional commerce
because the traditional form of competition can give
way to collaborative efforts across industries that add
value to business processes. Agents of different vendors
can establish a cooperative relationship and communicate
with each other via XML in order to set up
and complete transactions online.

For instance, if an information agent finds that
the vendor needs more airplane tickets, it notifies a
collaborative agent to search other sources over the
Internet for relevant information about the tickets,
such as availability, price, and quantity. The
collaborative agent then works with mobile agents,
negotiates with agents working for different vendors,
and obtains ticket information for its user. It can
provide the user with the result of the search and,
if certain requirements are met, purchase the tickets
for the user. In the meantime, interface agents can
monitor the user's reactions and decision behavior and
provide the user with advice, recommendations, and
suggestions for any related and similar transactions.

This kind of intelligent electronic communication and
transaction, however, is relatively inapplicable in
traditional commerce, where competing vendors are not
willing to share information with each other (Maes et
al., 1999). Even in ISA-based e-commerce, the level of
willingness is somewhat limited by sociological and
ethical factors, which will be discussed later in this
paper. In addition, designing and implementing ISA
technology is costly, which prevents companies from
adopting this emerging tool: companies need to invest
a lot of money to get the ISA engine started.
Notwithstanding the theoretical benefits discussed
above, many companies are still not sure how much ISA
technology can benefit them in terms of revenue, ROI,
and business influence in a market where other players
have yet to adopt this technology and cooperate with
each other. Small and medium-sized companies in
particular are reluctant to enter this arena, mainly
because of the cost.

Additionally, the lack of consistent architectures in
terms of standards and laws obstructs the further
development of ISA technology (He et al., 2003). The
IT industry has not yet finalized ISA standards, and a
number of proprietary standards set by various
companies coexist. This makes it difficult for ISAs to
communicate freely with each other. Relatedly, laws
have not yet emerged to regulate how ISAs can legally
cooperate with each other and represent their human
users in the cyber world.

Furthermore, ISA development and deployment have
not yet become global (Jennings et al. 1998). Although
ISA technology is a hot topic in developed countries,
developing countries are not fully aware of the
benefits of ISAs and have therefore not deployed
ISA-based systems on the Web, because their e-commerce
development levels and skills are not as advanced as
those of the developed countries. This gap between
developed and developing countries unfortunately hinders
agents from freely communicating with each other over
the globally connected Internet.



SOCIOLOGICAL AND ETHICAL 
CHALLENGES 

The preceding sections of this paper addressed the
technical issues involved in agent development. In
addition to these issues, there is also a range of
social and cyber-ethical problems, such as trust and
delegation, privacy, responsibility, and legal issues,
which will become increasingly important in the
field of agent technology (Bradshaw 1997; Jennings et
al. 1998; Nwana 1996b).

• Trust and delegation: Users who want to depend on
ISA technology to obtain desired information must
trust the agents to which they delegate the job and
which act autonomously on their behalf. It takes
time for users to get used to their agents and
gain confidence in the agents that work for them.
Users also have to strike a balance between agents
that continually seek guidance and agents that never
seek guidance. Users might need to set proper limits
for their agents; otherwise, agents might exceed
their authority.

• Privacy: In an information society that is growing
explosively, security is becoming more and more
important. Users must therefore make sure that their
agents always maintain their privacy in the course of
transactions. Electronic agent security policies may
be needed to counter this potential threat.

• Responsibility: Users need to seriously consider
how much responsibility their agents should carry
when a transaction goes wrong. To some extent,
agents are given the responsibility of obtaining the
desired product or service for their users. If the users
are not satisfied with the transaction result, they may
need to redesign or reprogram the agent rather than
simply blame the electronic agent.

• Legal issues: In addition to responsibility, users
should also think about any potential legal issues
triggered by their agents, which might, for instance,
offer inappropriate advice to other agents and thereby
create liabilities toward other people. This is very
challenging for ISA technology development, and the
scenario is complicated because current law does not
specify which party (the company that wrote the agent,
the company that customized and used the agent, or
both) should be responsible for the legal issues.

• Cyber-ethical issues: Eichmann (1994) and Etzioni
& Weld (1994) proposed the following etiquette
for ISAs that gather information on the Web:
  • Agents must identify themselves;
  • They must moderate the pace and frequency
    of their requests to a given server;
  • They must limit their searches to appropriate
    servers;
  • They must share information with others;
  • They must respect the authority placed on
    them by server operators;
  • Their services must be accurate and up to date;
  • Safety: they should not destructively alter
    the world;
  • Tidiness: they should leave the world as
    they found it;
  • Thrift: they should limit their consumption
    of scarce resources;
  • Vigilance: they should not allow client
    actions with unanticipated results.
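
Several of these rules can be enforced mechanically by the agent itself. The following Python sketch shows one possible way to cover three of them: identifying the agent, pacing its requests, and respecting the limits set by server operators. It assumes the server publishes its limits in robots.txt form; the class name `PoliteFetchPolicy`, the agent name, and the one-second default interval are hypothetical choices for this example.

```python
import time
from urllib.robotparser import RobotFileParser

AGENT_NAME = "ExampleShoppingAgent/0.1"  # hypothetical identifier


class PoliteFetchPolicy:
    """Decides whether and when an agent may request a URL from one server."""

    def __init__(self, robots_lines, min_interval_s=1.0, clock=time.monotonic):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)   # rules published by the server operator
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last_request = None

    def allowed(self, url):
        """Respect the authority of the server operator."""
        return self.robots.can_fetch(AGENT_NAME, url)

    def seconds_to_wait(self):
        """Moderate the pace of requests to the server."""
        if self.last_request is None:
            return 0.0
        elapsed = self.clock() - self.last_request
        return max(0.0, self.min_interval_s - elapsed)

    def record_request(self):
        """Remember when the latest request was sent."""
        self.last_request = self.clock()

    def headers(self):
        """Identify the agent in every request."""
        return {"User-Agent": AGENT_NAME}


# Example rules as a server operator might publish them (assumed content).
policy = PoliteFetchPolicy(["User-agent: *", "Disallow: /private/"])
print(policy.allowed("http://example.com/catalog"))
print(policy.allowed("http://example.com/private/orders"))
```

The policy object performs no network access itself, so it can be checked before every request regardless of which HTTP client the agent uses.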



CONCLUSION AND FUTURE WORK 

ISA technology has to confront the increasing
complexity of modern information environments. Research
and development of ISAs on the Internet is crucial for
the next generation of open information
environments. Sociological and cyber-ethical issues
need to be considered for the next generation of agents
in e-commerce systems, which will explore new types
of transactions in the form of dynamic relationships
among previously unknown parties (Guttman et al.
1998b). According to Nwana (1996a), the ultimate
success of ISAs will be their acceptance and mass usage
by users, once issues such as privacy, trust, legal
liability, and responsibility are addressed when
users design and implement ISA technologies in
e-commerce and emerging forms of commerce, such as
mobile commerce (M-commerce) and ubiquitous commerce
(U-commerce). Future research is expected to
explore further how ISAs are leveraged in these two
newly emerged avenues.



KEY TERMS

Agent: A computer system situated in some
environment that is capable of autonomous action in this
environment to meet its design objectives.

Business-to-Business E-Commerce: Electronic 
transaction of goods or services between businesses 
as opposed to that between businesses and other 
groups. 

Business-to-Customer E-Commerce: Electronic 
or online activities of commercial organizations serving 
the end consumer with products and/or services. It is 
usually applied exclusively to e-commerce. 

Customer-to-Customer E-Commerce: Electronically
facilitated online transactions between consumers
through some third party.

Electronic Commerce (E-Commerce): Consists of 
the buying and selling of products or services over elec- 
tronic systems such as the Internet and other computer 
networks. A wide variety of commerce is conducted 
in this way, including electronic funds transfer, supply 
chain management, e-marketing, online transaction 
processing, and automated data collection systems. 

Intelligent Software Agent: A software agent that 
uses Artificial Intelligence (AI) in the pursuit of the 
goals of its clients. 

Ubiquitous Commerce (U-Commerce): The
ultimate form of e-commerce and m-commerce in an
'anytime, anywhere' fashion. It involves the use of
ubiquitous networks to support personalized and
uninterrupted communications and transactions at a level
of value that far exceeds traditional commerce.



INTRODUCTION

Most people are familiar with the concept of agents in 
real life. There are stock-market agents, sports agents, 
real-estate agents, etc. Agents are used to filter and 
present information to consumers. Likewise, during
the last couple of decades, people have developed
software agents that have a similar role. They behave
intelligently, run on computers, and are autonomous,
but are not human beings.

Basically, an agent is a computer program that is
capable of performing flexible and independent actions
in typically dynamic and unpredictable domains (Luck,
McBurney, Shehory, & Willmott, 2005). Agents are
capable of performing actions and making decisions
without the guidance of a human. Software agents
emerged in IT because of the ever-growing need for
information processing and the problems of
working with large quantities of data.

Especially important is how agents interact with other
agents in the same environment, and the connections
they form to find, refine, and present information
in the best way. Agents can often perform tasks better
when they work together, which is why multi-agent
systems were developed.

The concept of an agent has become important
in a diverse range of sub-disciplines of IT, including
software engineering, networking, mobile systems,
control systems, decision support, information
recovery and management, e-commerce, and many others.
Agents are now used in an increasingly wide range
of applications, from comparatively small
systems such as web or e-mail filters to large, complex
systems such as air-traffic control, which depend
heavily on fast and precise decision making.



Undoubtedly, the main contribution to the field
of intelligent software agents came from the field of
artificial intelligence (AI). The main focus of AI is to
build intelligent entities, and if these entities sense and
act in some environment, then they can be considered
agents (Russell & Norvig, 1995). Object-oriented
programming (Booch, 2004), concurrent object-based
systems (Agha, Wegner, and Yonezawa, 1993), and
human-computer interaction (Maes, 1994) are other fields
that constantly drive forward the development of agents.



BACKGROUND 

Although the term 'agent' is widely used by many
people working in closely related areas, it defies attempts
to produce a single universally accepted definition.
One of the most broadly used definitions states that 
"an agent is an encapsulated computer system that 
is situated in some environment, and that is capable 
of flexible, autonomous action in that environment in 
order to meet its design objectives" (Wooldridge and 
Jennings, 1995). 

There are three main concepts in this definition:
situatedness, autonomy, and flexibility:

• Situatedness means that an agent is situated in
some environment and that it receives sensory
input and performs actions which change that
environment in some way.

• Autonomy is the ability of an agent to act without
the direct intervention of humans. The agent has
control over its own actions and over its internal state.
Autonomy also implies the capability of
learning from experience.

• Flexibility means that the agent is able to
perceive its environment and respond to changes
in a timely fashion; it should be able to exhibit
opportunistic, goal-directed behaviour and take
the initiative whenever appropriate. In addition,
an agent should be able to interact with other
agents and humans, and thus to be 'social'.

For some researchers, particularly those interested
in AI, the term 'agent' has a stronger and more specific
meaning than that sketched out above. These researchers 
generally mean an agent to be a computer system that, 
in addition to having the properties identified above, is 
either conceptualized or implemented using concepts 
that are more usually applied to humans. For example, 
it is quite common in AI to characterize an agent using 
mentalistic notions, such as knowledge, belief, inten- 
tion, and obligation (Wooldridge & Jennings, 1995). 



INTELLIGENT SOFTWARE AGENTS 
Agents and Environments 

An agent collects its percepts through its sensors, and
acts upon the environment through its actuators. Thus,
the agent is proactive. Its action at any moment can
depend on the whole sequence of percepts received up to
that moment. A decision tree for every possible percept
sequence of an agent would completely define the
agent's behavior. This would define the function that
maps any sequence of percepts to a concrete action:
the agent function. The program that implements the agent
function is called the agent program. So, the agent
function is a formal description of the agent's behavior,
and the agent program is a concrete implementation
of that formalism (Krcadinac, Stankovic, Kovanovic
& Jovanovic, 2007).

Figure 1. Agent and environment

To implement all this, we need to have a comput- 
ing device with appropriate sensors and actuators on 
which the agent program will run. This is called agent 
architecture. So, an agent is essentially made of two 
components: the agent architecture and the agent 
program. 
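
The distinction between the agent function and the agent program can be illustrated with a table-driven agent. In the sketch below, the dictionary plays the role of the agent function (a mapping from percept sequences to actions), and the class is an agent program that implements it. The two-square vacuum world, the percept format, and the action names are assumptions chosen only for illustration.

```python
class TableDrivenAgent:
    """Agent program that implements an agent function given as a table."""

    def __init__(self, table, default_action="no_op"):
        self.table = table            # maps percept sequences to actions
        self.default_action = default_action
        self.percepts = []            # the percept history so far

    def step(self, percept):
        """Record the new percept and look up the action for the history."""
        self.percepts.append(percept)
        return self.table.get(tuple(self.percepts), self.default_action)


# Hypothetical two-square vacuum world: percepts are (location, status).
table = {
    (("A", "dirty"),): "suck",
    (("A", "clean"),): "move_right",
    (("A", "dirty"), ("A", "clean")): "move_right",
    (("A", "clean"), ("B", "dirty")): "suck",
}

agent = TableDrivenAgent(table)
print(agent.step(("A", "dirty")))
print(agent.step(("A", "clean")))
```

The table grows with every possible percept sequence, which is why practical agent programs compute the same function in a more compact way instead of storing it explicitly.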

As Russell and Norvig (1995) specify, one of
the most sought-after characteristics of an agent is its
rationality. An agent is rational if it always chooses the
action that is expected to lead to the most successful
outcome. The rationality of an agent depends on (a) the
performance measure that defines what is a good action
and what is a bad action, (b) the agent's knowledge about
the environment, (c) the agent's available actions, and
(d) the agent's percept history.

The Types of Agents 

There are several basic types of agents with respect to 
their structure (Russell & Norvig, 1995): 

1. The simplest kind of agents are the simple reflex 
agents. Such an agent only reacts to its current 
percept, completely ignoring its percept history. 
When a new percept is received, a rule that maps 
that percept to an action is activated. Such rules 
are known as condition-action rules. 

2. Model-based reflex agents are more powerful 
agents, because they maintain some sort of in- 
ternal state of the environment that depends on 
the percept history. For maintaining this sort of 
information, an agent must know how the envi- 
ronment evolves, and how its actions affect the 
environment. 

3. Goal-based agents have some sort of goal
information that describes desirable states of the
world. Such an agent's decision-making process
is fundamentally different, because when a
goal-based agent is considering performing an action
it is asking itself "would this action make me
happy?" along with the standard "what will this
action have as a result?".

4. Utility-based agents use a utility function that 
maps each state to a number that represents the 
degree of happiness. They are able to perform 
rationally even in the situations when there are 
conflicting goals, as well as when there are sev- 
eral goals that can be achieved, but none with 
certainty. 

5. Learning agents do not have a priori knowledge
of the environment, but learn about it. This is
beneficial because these agents can operate in
unknown environments, and it eases the job of
developers to a certain degree because they
do not need to specify the whole knowledge
base.
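
The difference between the first and the fourth agent type can be sketched in a few lines of Python. The first function applies condition-action rules to the current percept only; the second chooses the action with the highest expected utility. The thermostat scenario, the probabilities, and the target temperature of 21 degrees are assumptions made for this example.

```python
def simple_reflex_agent(percept, rules):
    """Return the action of the first condition-action rule that matches."""
    for condition, action in rules:
        if condition(percept):
            return action
    return "no_op"


def utility_based_agent(state, actions, outcomes, utility):
    """Pick the action with the highest expected utility.

    `outcomes(state, action)` returns a list of (probability, next_state).
    """
    def expected_utility(action):
        return sum(p * utility(s) for p, s in outcomes(state, action))
    return max(actions, key=expected_utility)


# Hypothetical thermostat: the percept and the state are the temperature.
rules = [
    (lambda temp: temp < 18, "heat"),
    (lambda temp: temp > 24, "cool"),
]


def outcomes(temp, action):
    """Heating or cooling succeeds with probability 0.8 (assumed)."""
    if action == "heat":
        return [(0.8, temp + 2), (0.2, temp)]
    if action == "cool":
        return [(0.8, temp - 2), (0.2, temp)]
    return [(1.0, temp)]


def utility(temp):
    """Degree of 'happiness': closeness to the target of 21 degrees."""
    return -abs(temp - 21)


print(simple_reflex_agent(15, rules))
print(utility_based_agent(18, ["heat", "cool", "idle"], outcomes, utility))
```

The reflex agent does nothing at 18 degrees because no rule fires, whereas the utility-based agent can still prefer heating because it compares the expected results of all available actions.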

Multi-Agent Systems 

Multi-Agent Systems (MAS) are systems composed 
of multiple autonomous components (agents). They 
historically belong to Distributed Artificial Intelligence 
(Bond & Gasser, 1998). MAS can be defined as a 
loosely coupled network of problem solvers that work 
together to solve problems that are beyond the individual 
capabilities or knowledge of a single problem solver 
(Durfee and Lesser, 1989). In a MAS, each agent has 
incomplete information or capabilities for solving the 
problem and thus has a limited viewpoint. There is no 
global system control, the data is decentralized and the 
computation is asynchronous. 

In addition to MAS, there is also the concept of
a multi-agent environment, which can be seen as an
environment that includes more than one agent. Such
an environment can be cooperative, competitive, or a
combination of both, and it creates a setting where
agents need to interact (socialize) with each other,
either to achieve their individual objectives or to
manage the dependencies that follow from being situated
in a common environment. These interactions range from
simple semantic interoperation (exchanging
comprehensible communications), through client-server
interactions (the ability to request that a particular
action is performed), to rich social interactions (the
ability to cooperate, coordinate, and negotiate about a
course of action).

Because of issues that arise from the heterogeneous
nature of the agents involved in communication (e.g.,
finding one another), there is also a need for
middle-agents, which support cooperation among agents
and connect service providers with service requesters
in the agent world. These agents are useful in various
roles, such as matchmakers or yellow page agents that
collect and process service offers ("advertisements"),
blackboard agents that collect requests, and brokers
that process both (Sycara, Decker, & Williamson, 1997).
There are several alternatives to middle-agents, such as
Electronic Institutions, a framework for agent
negotiation which seeks to incorporate organizational
concepts into multi-agent systems (Rocha and Oliveira,
2001).

Communication among agents is achieved by
exchanging messages expressed in a mutually
understandable language (syntax) and containing mutually
understandable semantics. In order to find a common
ground for communication, an agent communication
language (ACL) should be used to provide mechanisms
for agents to negotiate, query, and inform each other.
The most important such languages today are KQML
(Knowledge Query and Manipulation Language)
(ARPA Knowledge Sharing Initiative, 1993) and FIPA
ACL (FIPA, 1997).
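
A minimal sketch of how a matchmaker middle-agent and message-based communication fit together is given below. The message layout is only KQML-like: it borrows performative names such as `advertise`, `ask-all`, `tell`, and `sorry`, but it is not a complete or conforming KQML implementation, and the class names, agent names, and the service name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    performative: str          # e.g. "advertise", "ask-all", "tell"
    sender: str
    receiver: str
    content: str

    def to_kqml(self):
        """Render the message in a KQML-like s-expression form."""
        return (f"({self.performative} :sender {self.sender} "
                f":receiver {self.receiver} :content ({self.content}))")


@dataclass
class Matchmaker:
    """Middle-agent that stores advertisements and answers provider queries."""
    name: str = "matchmaker"
    adverts: dict = field(default_factory=dict)   # service -> set of providers

    def handle(self, msg):
        if msg.performative == "advertise":
            # A provider announces that it offers a service.
            self.adverts.setdefault(msg.content, set()).add(msg.sender)
            return None
        if msg.performative == "ask-all":
            # A requester asks for all providers of a service.
            providers = sorted(self.adverts.get(msg.content, ()))
            return Message("tell", self.name, msg.sender, " ".join(providers))
        # Anything else is not understood by this simple matchmaker.
        return Message("sorry", self.name, msg.sender, msg.content)


mm = Matchmaker()
mm.handle(Message("advertise", "seller-b", "matchmaker", "sell-books"))
mm.handle(Message("advertise", "seller-a", "matchmaker", "sell-books"))
reply = mm.handle(Message("ask-all", "buyer-1", "matchmaker", "sell-books"))
print(reply.to_kqml())
```

After receiving the reply, the requester contacts the providers directly; a broker, by contrast, would forward the request and relay the answer itself.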



AGENT APPLICABILITY 

There are great possibilities for applying multi-agent 
systems to solving different kinds of practical prob- 
lems. 

• The auction negotiation model, as a form of
communication, enables a group of agents to find good
solutions by reaching agreement and making
mutual compromises in case of conflicting goals.
Such an approach is applicable to trading systems,
where agents act on behalf of buyers and sellers.
Financial markets, as well as scheduling, travel
arrangement, and fault diagnosis, are also
applicable fields for agents.

• Another very important field is information
gathering, where agents are used to search through
diverse and vastly different information sources
(e.g., the World Wide Web) and acquire relevant
information for their users. One of the most
common domains is Web browsing and search,
where agents are used to adapt the content (e.g.,
search results) to the users' preferences and offer
relevant help in browsing.

• Process control software systems require various
kinds of automatic (autonomous) control and
reaction for their processes (e.g., a production process).
Being reactive and responsive, agents fit the
needs of such a task well. Example domains in this
field include production process control, climate
monitoring, spacecraft control, and monitoring of
nuclear power plants.

• Artificial life studies the evolution of agents, or
populations of computer-simulated life forms,
in artificial environments. The goal is to study
phenomena found in real-life evolution in a
controlled manner, ideally eliminating some of the
inherent limitations and cruelty of evolutionary
studies using live animals.

• Finally, intelligent tutoring systems often include
pedagogical agents, which are software
entities constructed to present the learning content
in a user-friendly fashion and monitor the user's
progress through the learning process. These
agents are responsible for guiding the user and
suggesting additional learning topics related to
the user's needs (Devedzic, 2006).

Some of the more specific examples of intelligent
agent applications include the Talaria System, military
training, and Mobility Agents. The Talaria System (The
Autonomous Lookup And Report Internet Agent System)
is a multi-agent system developed for academic
purposes at the University of Belgrade, Serbia. It was
built as a solution to the common problem of gathering
information from diverse Web sites that do not provide
RSS feeds for news tracking. The system was
implemented using the JADE modeling framework in Java
(Stankovic, Krcadinac, Kovanovic & Jovanovic, 2007).
The Talaria System uses the advantages of the human-agent
communication model to improve the usability of web
sites and to relieve users of annoying and repetitive
work. The system provides each user with a personal
agent, which periodically monitors the Web sites in which
the user has expressed interest. The agent informs its
user about relevant changes, filtered by assumed user
preferences and default relevance factors. Human-agent
communication is implemented via email, so that a user
can converse with her/his agent in natural language,
whereas the agent heuristically interprets concrete
instructions from the mail text (e.g., "monitor this site"
or "kill yourself").

Simulation and modelling are extensively used in
a wide range of military applications, from
development, testing and acquisition of new systems and
technologies, to operation, analysis and provision of
training, and mission rehearsal for combat situations.
The Human Variability in Computer Generated Forces
(HV-CGF) project, undertaken on behalf of the UK's
Ministry of Defence, developed a framework for
simulating behavioral changes of individuals and groups
of military personnel when subjected to moderating
influences such as caffeine and fatigue. The project
was built with the JACK Intelligent Agents toolkit, a
commercial Java-based environment for developing and
running multi-agent applications. Each team member
is a rational agent able to execute actions such as
doctrinal and non-doctrinal behaviour tactics, which are
encoded as JACK agent graphical plans (Belecheanu
et al., 2005).

Mobility Agents is an agent-based architecture that
helps a person with cognitive disabilities to travel
using public transportation. Agents are used to represent
transportation participants (buses and travelers) and
to provide notification of bus approach and arrival.
Information is passed to the traveler using a multimedia
interface, via a handheld device. Customizable user
profiles determine the most appropriate modality of
interaction (voice, text, and pictures) based on the user's
abilities (Repenning & Sullivan, 2003). This requires
a personal agent to ensure that abstract goals, such as
"go home", are translated into concrete directions. To
achieve this, an agent needs to collect information about
user-specific locations and must be able to suggest the
right bus for the particular user's current location and
destination.



FUTURE TRENDS 

The future looks bright for this technology, as its
development is taking place within a context of broader
visions and trends in IT. The whole growing field of IT
is about to drive forward the R&D of intelligent agents.
We especially emphasize the Semantic Web, ambient
intelligence, service-oriented computing, peer-to-peer
computing, and Grid computing.

The Semantic Web is the vision of the future Web
based on the idea that the data on the Web can be defined
and linked in such a way that it can be used by machines
for automatic processing and integration (Berners-Lee,
Hendler, & Lassila, 2001). The key to achieving
this is augmenting Web pages with descriptions
of their content in such a way that it is possible for
machines to reason automatically about that content.
The common opinion is that the Semantic Web itself
will be a form of intelligent infrastructure for agents,
allowing them to "understand" the meaning of the data
on the Web (Luck et al., 2005).

The concept of ambient intelligence describes a 
shift away from PCs to a variety of devices which are 
embedded in our environment and which are accessed 
via intelligent interfaces. It requires agent-like tech- 
nologies in order to achieve autonomy, distribution, 
adaptation, and responsiveness. 

Service-oriented computing is an area where MAS could
become very useful. In particular, this might involve
web services, where quality of service (QoS) demands
are important. Each web service could be modeled as
an agent, with dependencies, and then simulated to
observe failure rates.
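
One simple way to read this idea is as a Monte Carlo simulation, sketched below. Each service is treated as an agent that fails either on its own or because one of the services it depends on has failed. The service names, the failure probabilities, and the assumption that failures are independent and the dependency graph is acyclic are all illustrative choices, not part of any specific system.

```python
import random


def simulate_failure_rate(services, target, trials=10_000, seed=0):
    """Estimate how often `target` fails, counting failed dependencies.

    `services` maps a name to (own_failure_probability, [dependencies]).
    The dependency graph is assumed to be acyclic.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Sample once per trial whether each service fails on its own.
        own_failure = {name: rng.random() < p
                       for name, (p, _) in services.items()}

        def fails(name):
            _, deps = services[name]
            return own_failure[name] or any(fails(d) for d in deps)

        if fails(target):
            failures += 1
    return failures / trials


# Hypothetical checkout service that depends on two other services.
services = {
    "checkout": (0.01, ["payment", "inventory"]),
    "payment": (0.02, []),
    "inventory": (0.05, []),
}
print(simulate_failure_rate(services, "checkout"))
```

Under the independence assumption, the estimate can be compared with the analytical value 1 - 0.99 × 0.98 × 0.95, which is about 0.078.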

Peer-to-peer (P2P) computing, which covers
networked applications in which every node is in some
sense equivalent to all others, is likely to become more
complex in the future. Auction mechanism design,
agent negotiation techniques, increasingly advanced
approaches to trust and reputation, and the application
of social norms, rules and structures are some
of the agent technologies that are about to become
relevant in the context of P2P computing.

Grid Computing is the high-performance agent- 
based computing infrastructure for supporting large- 
scale distributed scientific endeavour. The Grid provides 
a means of developing eScience applications, yet it 
also provides a computing infrastructure for support- 
ing more general applications that involve large-scale 
information handling, knowledge management and 
service provision. The key benefit of Grid computing 
is flexibility - the distributed system and network can 
be reconfigured on demand in different ways as busi- 
ness needs change. 

Considerable challenges still remain
in the agent-based world, such as the lack of
sophisticated software tools, techniques and methodologies
that would support the specification, development,
integration and management of agent systems.



CONCLUSION 

Today, research and development in the field of
intelligent agents is rapidly expanding. At its core is the
concept of autonomous agents interacting with one
another for their individual and/or collective benefit. A
number of significant advances have been made over
the past two decades in the design and implementation of
individual autonomous agents, and in the way in which
they interact with one another. These concepts and
technologies are now finding their way into commercial
products and real-world software solutions. Future IT
visions share a common need for agent technologies,
which suggests that agent technologies will continue to
be of vital importance. It is foreseeable that agents will
become an integral part of information technologies
and artificial intelligence in the near future, which is
why their development deserves close attention.



KEY TERMS

Actuators: Software components that are part of the
agent and are used as a means of performing actions
in the agent environment.

Agent Autonomy: Agent's active use of its capa- 
bilities to pursue some goal, without intervention by 
any other agent in the decision-making process used 
to determine how that goal should be pursued (Barber 
& Martin, 1999). 

Agent Percepts: All the information that an agent
receives through its sensors about the state of the
environment or any part of the environment.

Intelligent Software Agent: An encapsulated 
computer system that is situated in some environment 
and that is capable of flexible, autonomous action in 
that environment in order to meet its design objectives 
(Wooldridge & Jennings, 1995). 

Middle-Agents: Agents that facilitate coopera- 
tion among other agents and typically connect service 
providers with service requesters. 

Multi-Agent System (MAS): A software system
composed of several agents that interact in order to
find solutions to complex problems.

Sensors: Software component and part of the agent
used as a means of acquiring information about the current
state of the agent environment (i.e., agent percepts).

INTRODUCTION

Artificial Neural Networks (ANNs) are inspired by the
behavior of the brain, and in that sense they can be considered
intelligent systems. ANNs are built from the same basic element
as a brain: the neuron. The neurons are connected so that they
interact with each other to produce the intended intelligent
behavior. Finally, like any brain, an ANN needs memory, which
in this model is provided by its weights.

From this point of view, ANNs are able to learn difficult
tasks. In this article, the task to learn is distinguishing
between different kinds of traffic signs. Moreover, the ANN
must learn from traffic signs that are not in perfect condition,
so the learning must be robust against problems such as
rotation, translation or even vandalism. To achieve this
objective, relevant information is extracted intelligently from
the images. This stage is very important because it improves
the performance of the ANN in this task.



BACKGROUND 

The Traffic Sign Classification (TSC) problem has been
studied many times in the literature. It is solved in (Perez,
2002; Escalera, 2004) using the correlation between the traffic
sign and each element of a database, which involves a large
computational cost. In (Hsu, 2001), Matching Pursuit (MP) is
applied in two stages: training and testing. The training stage
finds a set of the best MP filters for each traffic sign, while
the testing stage projects the unknown traffic sign onto the
different MP filters to find the best match. This method also
implies a large computational cost, especially when the number
of elements grows. In more recent works (Escalera, 2003; Vicen,
2005a; Vicen, 2005b), the use of ANNs is studied. The first one
studies the combination of the Adaptive Resonance Theory with
ANNs. It is applied to the whole image, where many traffic signs
can exist, which means that the ANN complexity must be very high
to recognize all the possible signs. In the last two works, the
TSC is built with a preprocessing stage before the ANN, which
reduces the computational cost of the classifier.

TSC systems are usually composed of two specific stages: the
detection of traffic signs in a video sequence or image, and
their classification. In this work we pay special attention to
the classification stage. The performance of these stages
depends strongly on the lighting conditions of the scene and on
the state of the traffic sign, which may be affected by
deterioration, vandalism, rotation, translation or inclination.
Moreover, the ideal position of a sign is perpendicular to the
trajectory of the vehicle, but in many cases this does not hold.
Problems related to the traffic sign size are of special
interest too. Although the size of the signs is normalized, they
appear at different sizes in the image because the distance
between the camera and the sign is variable. So, the
classification of a traffic sign in this environment is not
easy.

The objective of this work is to study different
classification techniques combined with different
preprocessings in order to implement an intelligent TSC system.
The preprocessings considered are described below; they are used
to reduce the classifier complexity and to improve its
performance. The classifiers studied are the k-Nearest Neighbor
(k-NN) and an ANN-based method using Multilayer Perceptrons
(MLPs). This work therefore tries to find the best
preprocessings, the best classifiers, and the combination of
them that minimizes the error rate.



INTELLIGENT TRAFFIC SIGN 
CLASSIFICATION 

Intelligent traffic sign classification can be achieved by
taking into account two important aspects. The first one
focuses on the extraction of the relevant information from
the input traffic signs, which can be done in an adaptive or a
fixed way. The second one is related to the classification
core. Here ANNs can play an important role, because they are
able to learn from different environments. An intelligent
combination of both aspects can therefore lead to success in
the classification of traffic signs.

Traffic Sign Classification System 
Overview 

The TSC system and the blocks that compose it are shown in
figure 1. Once the Video Camera block takes a video sequence,
the Image Extraction block makes the video sequence easy to
read and is responsible for obtaining the images. The Sign
Detection and Extraction Stage extracts all the traffic signs
contained in each image and generates small images called
blobs, one per possible sign. Figure 1 also shows an example of
the way this block works. The Color Recognition Stage is
responsible for discerning the predominant color of the traffic
sign: blue, red or others. Once the blob is classified
according to its predominant color, the TSC Stage is
responsible for recognizing the exact type of sign, which is
the aim of this work. This stage is divided into two parts: the
traffic sign preprocessing stage and the TSC core.

Database Description 

The database of blobs used to obtain the results presented
in this work is composed of blobs with only noise and nine
different types of blue traffic signs, which belong to the
international traffic code. Figure 2.a (Normal traffic signs)
shows the different classes of traffic signs considered in this
work, which have been collected by the TSC system presented
above. They therefore present distortions due to the problems
described in previous sections, which are shown in figure 2.b
(Traffic signs with problems). The problems caused by vandalism
are shown in the example of class S. The problems related to
the blob extraction in the Sign Detection and Extraction Stage
(an incorrect fit in the square image) are shown in the
examples of classes $S_2$, $S_4$ and $S_9$. Examples of signs
with problems of rotation, translation or inclination are those
of classes $S_4$, $S_6$ and $S_9$. Finally, the difference of
brightness is observed in both parts of figure 2. For example,
when the lighting of the blob is high, the vertical row of the
example of class $S_3$ is greater than the horizontal row of the
example of class S.

Figure 1. Traffic sign classification system

Figure 2. Noise and nine classes of international traffic
signs: (a) Normal traffic signs and (b) Traffic signs with
problems



Traffic Sign Preprocessing Stage 

Each blob presented at the input of the TSC stage contains
information of the three color components: red, green and
blue. Each blob is composed of 31x31 pixels, so the memory
required for each blob is 2883 bytes (3x31x31, with one byte
per color component). Due to the high quantity of data, the
purpose of this stage is to reduce it and to limit the
redundancy of information, in order to improve the TSC
performance and to reduce the computational cost of the TSC
core.

The first preprocessing made in this stage is the
transformation of the color blob (3x31x31) to a gray scale blob
(31x31) (Paulus, 2003). For the following explanation, consider
that M is a general two-dimensional matrix that contains either
the gray scale blob or the output of one of the following
preprocessings:

• Median filter (MF) (Abdel, 2004). It is applied to each
pixel of M. A block of n x n elements that surrounds a pixel of
M is taken and sorted into a linear vector. The median value of
this vector is selected as the value of the processed pixel.
This preprocessing is usually used to reduce the noise in an
image.

• Histogram equalization (HE). It tries to enhance the
contrast of M. The pixels are transformed according to a
specified image histogram (Paulus, 2003). This equalization is
usually used to improve the dynamic range of M.

• Vertical (VH) and horizontal (HH) histograms (Vicen, 2005a;
Vicen, 2005b). They are computed with

$$vh_j = \frac{1}{31} \sum_{i=1}^{31} (m_{i,j} > T), \quad j = 1, 2, \ldots, 31 \qquad (1)$$

$$hh_i = \frac{1}{31} \sum_{j=1}^{31} (m_{i,j} > T), \quad i = 1, 2, \ldots, 31 \qquad (2)$$

respectively, where $m_{i,j}$ is the element of the i-th row and
j-th column of the matrix M and T is the fixed or adaptive
threshold of this preprocessing. If T is fixed, it is
established at the beginning of the preprocessing; if T is
adaptive, it can be calculated with the Otsu method (Ng, 2004)
or with the mean value of the blob, so both adaptive methods
are M-dependent. $vh_j$ corresponds to the ratio of values of
the j-th column that are greater than T and $hh_i$ corresponds
to the ratio of values of the i-th row that are greater than T.
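
The vertical and horizontal histograms can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (to_gray, vh_hh) are hypothetical, and the gray scale conversion is assumed to be a plain average of the three color planes, since the text does not give the conversion rule. When no threshold is passed, the adaptive threshold is taken as the mean value of the blob.

```python
from statistics import mean


def to_gray(blob_rgb):
    """Collapse a 3 x N x N colour blob into an N x N grey-scale matrix.

    Assumption: a plain average of the red, green and blue planes.
    """
    red, green, blue = blob_rgb
    return [[(r + g + b) / 3.0 for r, g, b in zip(row_r, row_g, row_b)]
            for row_r, row_g, row_b in zip(red, green, blue)]


def vh_hh(m, threshold=None):
    """Return (vh, hh): per-column and per-row ratios of values above T."""
    n = len(m)
    if threshold is None:
        # Adaptive threshold: mean value of the blob.
        threshold = mean(value for row in m for value in row)
    above = [[1 if value > threshold else 0 for value in row] for row in m]
    vh = [sum(above[i][j] for i in range(n)) / n for j in range(n)]  # columns
    hh = [sum(above[i][j] for j in range(n)) / n for i in range(n)]  # rows
    return vh, hh


# Example: a 31 x 31 blob with one bright vertical bar in column 15.
plane = [[255 if j == 15 else 0 for j in range(31)] for i in range(31)]
gray = to_gray([plane, plane, plane])
vh, hh = vh_hh(gray, threshold=185)
features = vh + hh  # 62-element vector presented to the classifier
```

Concatenating the two histograms gives 31 + 31 = 62 values per blob, far fewer than the 2883 bytes of the original color blob.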

Traffic Sign Classification Core

TSC can be formulated as a multiple hypothesis test.
Consider that $P(D_i|S_j)$ is the probability of deciding in
favor of $S_i$ (decision $D_i$) when the true hypothesis is
$S_j$, $C_{i,j}$ is the cost associated with this decision and
$P(S_j)$ is the prior probability of hypothesis $S_j$. Then the
objective is to minimize a risk function that is given as the
average cost $\bar{C}$, which is defined in (3) for L
hypotheses.

$$\bar{C} = \sum_{i=1}^{L} \sum_{j=1}^{L} C_{i,j} \, P(D_i|S_j) \, P(S_j) \qquad (3)$$

The classifier performance can be given as the total error
rate ($P_e$) and the total correct rate ($P_c = 1 - P_e$) for
all the hypotheses (classes).
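
As an illustration of the average cost, the following Python sketch evaluates the double sum for a toy two-class problem. The function name average_cost and all the numbers are illustrative assumptions, not values from this work. With a cost of 0 for correct decisions and 1 for wrong ones, the average cost coincides with the total error rate.

```python
def average_cost(cost, p_decide, prior):
    """Average cost over all (decision, hypothesis) pairs.

    cost[i][j]     -- cost of deciding S_i when S_j is true
    p_decide[i][j] -- probability of deciding S_i when S_j is true
    prior[j]       -- prior probability of S_j
    """
    n = len(prior)
    return sum(cost[i][j] * p_decide[i][j] * prior[j]
               for i in range(n) for j in range(n))


# Toy two-class example with 0/1 costs (no cost for a correct decision).
cost = [[0, 1],
        [1, 0]]
p_decide = [[0.9, 0.2],   # decide class 0 given class 0 / class 1
            [0.1, 0.8]]   # decide class 1 given class 0 / class 1
prior = [0.5, 0.5]

p_e = average_cost(cost, p_decide, prior)  # total error rate
p_c = 1.0 - p_e                            # total correct rate
```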

Traffic Sign Classification Core Based
on Statistical Methods: The k-NN

The k-NN approach is a widely used statistical method
(Kisienski, 1975) applied in classification tasks. It assumes
that the training set contains $M_i$ points of class $S_i$ and
M points in total, so

$$\sum_{i} M_i = M$$

Then a hypersphere around the observation point x is taken,
which encompasses k points irrespective of their class label.
Suppose this sphere, of volume V, contains $k_i$ points of
class $S_i$; then

$$p(x|S_i) \approx \frac{k_i}{M_i V} \qquad (4)$$



provides an approximation to this class-conditional
density. The unconditional density can be estimated
using

$$p(x) \approx \frac{k}{M V} \qquad (5)$$

while the priors can be estimated using

$$P(S_i) \approx \frac{M_i}{M} \qquad (6)$$




Then, applying Bayes' theorem (Bishop, 1995), we obtain:

$$P(S_i|x) = \frac{p(x|S_i) \, P(S_i)}{p(x)} \approx \frac{k_i}{k} \qquad (7)$$

Thus, to minimize the probability of misclassifying x, it
should be assigned to the class $S_i$ for which the ratio
$k_i/k$ is highest. In practice, each x of the test set is
compared with all the training set patterns and the most
appropriate class $S_i$ is decided. k denotes the number of
patterns that take part in the final decision of classifying x
in class $S_i$. When a draw exists in the majority voting, the
decision is taken using the class of the nearest pattern. So,
the results for k=1 and k=2 are the same.
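
The decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name knn_classify and the toy data are hypothetical, the Euclidean distance is assumed, and a draw is resolved in favor of the tied class that owns the nearest pattern, which is one possible reading of the tie-breaking rule described in the text.

```python
from collections import Counter
from math import dist


def knn_classify(x, train_points, train_labels, k):
    """Classify x by majority vote among its k nearest training patterns."""
    order = sorted(range(len(train_points)),
                   key=lambda n: dist(x, train_points[n]))
    nearest = [train_labels[n] for n in order[:k]]  # closest pattern first
    votes = Counter(nearest)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Draw: keep the tied class whose pattern is nearest to x (assumption).
    return next(label for label in nearest if label in tied)


train_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["S1", "S2", "S2", "S2"]
x = (0.1, 0.0)
predictions = {k: knn_classify(x, train_points, train_labels, k)
               for k in (1, 2, 3)}
```

With k=2 the two votes either agree or produce a draw that the nearest pattern resolves, so the decision always matches the one obtained with k=1.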

Traffic Sign Classification Core Based
on Neural Networks: The MLP

The Perceptron was developed by F. Rosenblatt (Rosenblatt,
1962) in the 1960s for optical character recognition. The
Perceptron has multiple inputs fully connected to an output
layer with multiple outputs. Each output $y_i$ is the result of
applying a non-linear function, called the activation function,
to a linear combination of the inputs. MLPs (Haykin, 1999)
extend the Perceptron by cascading one or more extra layers of
processing elements. These layers are called hidden layers,
since their elements are not connected directly to the external
world. The expression $I/H_1/\ldots/H_h/O$ denotes an MLP with I
inputs (size of the observation vector x), h hidden layers with
$H_1, \ldots, H_h$ neurons, respectively, and O outputs (size of
the classification vector y).

Cybenko's theorem (Cybenko, 1989) states that any continuous
function $f: \mathbb{R}^n \to \mathbb{R}$ can be approximated
with any degree of precision by log-sigmoidal functions.
Therefore, MLPs using the log-sigmoidal activation function for
each neuron are selected.

The MLPs are trained with the gradient descent
backpropagation algorithm with momentum and adaptive learning
rate, where the Mean Square Error (MSE) criterion is minimized.
Moreover, cross-validation is used in order to reduce
generalization problems.
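
The structure of such an MLP can be illustrated with a short Python sketch of the forward pass and the MSE criterion. It is a sketch under stated assumptions rather than the trained networks of this work: the function names, the uniform weight initialization in [-0.5, 0.5], the zero biases and the one-hot target are all illustrative, and the training algorithm itself is not shown.

```python
import math
import random


def logsig(a):
    """Log-sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-a))


def init_mlp(sizes, seed=0):
    """Random weights and zero biases for layer sizes such as [62, 70, 20, 10]."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
              for _ in range(n_out)], [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    """Propagate the observation vector x through every layer."""
    for weights, biases in layers:
        x = [logsig(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


def mse(y, target):
    """Mean square error between the outputs and the desired outputs."""
    return sum((a - b) ** 2 for a, b in zip(y, target)) / len(y)


layers = init_mlp([62, 70, 20, 10])   # a 62/70/20/10 MLP
y = forward(layers, [0.5] * 62)       # 10 outputs, one per class
target = [1.0] + [0.0] * 9            # assumed one-hot desired output
error = mse(y, target)
```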



RESULTS 

The database considered for the experiments is composed of
235 blobs of ten different classes: noise ($S_0$) and nine
classes of traffic signs ($S_1$-$S_9$). The database has been
divided into three sets: training, validation and test, which
are composed of 93, 52 and 78 blobs, respectively, and are
preprocessed before they are presented to the TSC core. The
first one is used as the training set for the k-NN and the
MLPs. The second one is used to stop the MLP training algorithm
(Bishop, 1995; Haykin, 1999) according to the cross-validation
applied during the training. The last one is used to evaluate
the performance of the k-NN and the MLPs. Experimental
environments characterized by a high-dimensional space and a
small data set pose generalization problems. For this reason,
the MLP training is repeated 10 times with a different weight
initialization each time, and the best MLP in terms of the
$P_e$ estimated with the validation set is selected.

Once the color blobs are transformed to gray scale, three
different combinations of preprocessings (CPPs) are applied, so
each CPP output has 62 elements:

• The first combination (CPP1) applies the VH and HH with an
adaptive threshold T calculated with the mean of the blob.

• The second combination (CPP2) applies, in this order, the HE
and the VH and HH with an adaptive threshold T calculated with
the Otsu method.

• The third combination (CPP3) applies, in this order, the MF,
the HE and the VH and HH with a fixed threshold (T=185).
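
The third combination can be sketched as a small Python pipeline. Several details are assumptions made only for illustration, because the text does not specify them: the function names, the 3x3 median window, the handling of border pixels (only existing neighbors are used) and the equalization toward a flat histogram over 256 gray levels.

```python
from statistics import median


def median_filter(m, n=3):
    """n x n median filter; border pixels use only the neighbours that exist."""
    size, r = len(m), n // 2
    return [[median(m[a][b]
                    for a in range(max(0, i - r), min(size, i + r + 1))
                    for b in range(max(0, j - r), min(size, j + r + 1)))
             for j in range(size)]
            for i in range(size)]


def equalize(m, levels=256):
    """Histogram equalization of grey levels toward a flat histogram."""
    flat = [int(v) for row in m for v in row]
    hist = [0] * levels
    for v in flat:
        hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    span = len(flat) - cdf_min
    if span == 0:  # constant image: nothing to stretch
        return [[int(v) for v in row] for row in m]
    return [[round((cdf[int(v)] - cdf_min) / span * (levels - 1))
             for v in row] for row in m]


def histograms(m, t):
    """Vertical histogram followed by horizontal histogram, threshold t."""
    n = len(m)
    vh = [sum(m[i][j] > t for i in range(n)) / n for j in range(n)]
    hh = [sum(m[i][j] > t for j in range(n)) / n for i in range(n)]
    return vh + hh


def cpp3(gray_blob, t=185):
    """MF, then HE, then VH and HH with a fixed threshold."""
    return histograms(equalize(median_filter(gray_blob)), t)


# Example: dark left part, bright right part, plus one isolated noisy pixel.
blob = [[255 if j >= 16 else 0 for j in range(31)] for i in range(31)]
blob[5][5] = 255
features = cpp3(blob)
```

In this example the median filter removes the isolated bright pixel before the histograms are computed, so it leaves no trace in the 62-element output.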

For the TSC core based on the k-NN, a study of the k
parameter is made for the different CPPs considered in the
experiments (table 1). The lowest error rate is achieved with
CPP3 and k=1, whose performance is $P_e$=6.4% ($P_c$=93.6%).

For the TSC core based on MLPs, a study of the number of
hidden layers (h) and the number of neurons in each one
($H_h$) is done.

For the case of one hidden layer (h=1), table 2 shows the
results for the different CPPs. In this case, the best
performance is obtained with CPP3 and an MLP of size 62/62/10,
whose error rate is $P_e$=2.6% ($P_c$=97.4%). CPP2 achieves
good performance, but it is always worse than that obtained
with CPP3. The use of CPP1 with MLPs achieves the worst results
of the three cases under study.

The study of the TSC core based on an MLP with two hidden
layers (h=2) (table 3) shows that the best combination of the
CPPs and [$H_1$, $H_2$] for the MLP is CPP3 and [$H_1$=70,
$H_2$=20], respectively. In this case, the best performance
achieved is $P_e$=1.3% ($P_c$=98.7%). As occurs for MLPs with
one hidden layer, the best CPP is the third one and the worst
one is the first one.
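
Because the test set contains 78 blobs, every reported error rate corresponds to a whole number of misclassified blobs. The short Python check below makes this arithmetic explicit; the function names are hypothetical and the counts are derived here from the reported percentages, they are not stated in the text.

```python
TEST_BLOBS = 78  # size of the test set reported in the text


def error_rate(errors, total=TEST_BLOBS):
    """Error rate in percent, rounded to one decimal."""
    return round(100.0 * errors / total, 1)


def misclassified_count(rate_percent, total=TEST_BLOBS):
    """Number of misclassified blobs implied by a rate in percent."""
    return round(rate_percent * total / 100.0)


reported = {"k-NN, k=1": 6.4, "MLP 62/62/10": 2.6, "MLP 62/70/20/10": 1.3}
counts = {name: misclassified_count(rate) for name, rate in reported.items()}
```

Under this reading, the three best designs are consistent with 5, 2 and 1 misclassified test blobs, respectively, so the difference between the one-hidden-layer and two-hidden-layer MLPs amounts to a single blob.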



FUTURE TRENDS 

New innovations can be achieved in this research area. The
new trends try to improve the preprocessing techniques; in this
case, advanced signal processing can be applied to TSC. On the
other hand, other TSC cores can be used. For instance,
classifiers based on Radial Basis Functions or Support Vector
Machines (Maldonado, 2007) can be applied. Finally,
optimization techniques, such as Genetic Algorithms, have an
important role in this research area for finding the best
selection of preprocessings from a bank of them.



CONCLUSION 

The performance of all the TSC designs is quite good, even
when problems of deterioration, vandalism, rotation,
translation, inclination, incorrect fit in the 31x31 blob and
variation in size exist in the blobs.

Several combinations of preprocessings are used. 
The best one applies, in this order, the median filter, the 
histogram equalization and the vertical and horizontal 
histograms with a fixed threshold (T=185). 

Concerning the type of classifier, the best TSCs are always
achieved with MLPs. Moreover, the best results are achieved by
MLPs with two hidden layers. The $P_e$ of the TSC core based on
a 62/70/20/10 MLP ($P_e$=1.3%) is 1.3 percentage points lower
than the best one achieved with a one-hidden-layer MLP
(62/62/10) and 5.1 percentage points lower than the best one
achieved with the k-NN (k=1).

Table 1. $P_e$ (%) versus k parameter for each TSC based on different CPPs and k-NN

k       1     3     4     5     6     7     8     9     10    11    12
CPP1    29.5  30.8  29.5  29.5  32.0  30.8  30.8  28.2  25.6  25.6  25.6
CPP2    19.2  17.9  14.1  16.7  14.1  15.4  19.2  16.7  17.9  19.2  19.2
CPP3    6.4   9.0   9.0   11.5  12.8  12.8  12.8  12.8  12.8  12.8  10.3




Table 2. $P_e$ (%) versus $H_1$ parameter for each TSC based on different CPPs and MLPs of sizes (62/$H_1$/10)

H_1     6     14    22    30    38    46    54    62    70    78    86
CPP1    24.4  17.9  17.9  15.4  18.9  16.7  14.1  17.9  19.2  17.9  15.4
CPP2    21.8  14.1  14.1  14.1  12.8  10.3  11.5  10.3  12.8  9.0   11.5
CPP3    12.8  3.8   5.1   3.8   3.8   5.1   3.8   2.6   5.1   5.1   3.8



Table 3. $P_e$ (%) versus [$H_1$, $H_2$] parameters for each TSC based on different CPPs and MLPs of sizes (62/$H_1$/$H_2$/10)

H_1     10    10    15    15    25    25    40    40    60    60    70    70
H_2     6     8     5     7     8     10    15    20    18    25    20    30
CPP1    28.2  24.4  23.1  25.6  19.2  19.2  19.2  19.2  17.9  17.9  15.4  15.4
CPP2    25.6  25.6  26.9  23.1  17.9  20.5  16.7  11.5  12.8  11.5  12.8  9.0
CPP3    15.4  10.3  15.4  12.8  7.7   5.1   6.4   5.1   5.1   5.1   1.3   5.1

KEY TERMS

Artificial Neural Networks (ANNs): A network 
of many simple processors ("units" or "neurons") that 
imitates a biological neural network. The units are 
connected by unidirectional communication channels, 
which carry numeric data. Neural networks can be 
trained to find nonlinear relationships in data, and are 
used in applications such as robotics, speech recogni- 
tion, signal processing or medical diagnosis. 

Backpropagation Algorithm: Learning algorithm 
of ANNs, based on minimizing the error obtained from 
the comparison between the ANN outputs after the 
application of a set of network inputs and the desired 
outputs. The update of the weights is done according 
to the gradient of the error function evaluated in the 
point of the input space that indicates the input to the 
ANN. 

Classification: The act of distributing things into 
classes or categories of the same type. 

Detection: The perception that something has oc- 
curred or some state exists. 

Information Extraction: Extraction of the relevant
aspects contained in data. It is commonly used to reduce
the input space of a classifier.

Pattern: Observation vector that, because of its
relevance, is considered an important example of the input
space.

Preprocessing: Operation or set of operations 
applied to a signal in order to improve some aspects 
of it. 

INTRODUCTION

Today's e-commerce environment requires that interactive
systems exhibit abilities such as autonomy, adaptive and
collaborative behavior, and inferential capability. Such
abilities are based on knowledge about users and the tasks
they perform (Raisinghani, Klassen and Schkade, 2001). To adapt
to users' input and tasks, an interactive system must be able to
establish a set of assumptions about users' profiles and task
characteristics, which is often referred to as a user model.
However, to develop a user model an interactive system needs to
analyze users' input and recognize the tasks and the ultimate
goals users are trying to achieve, which may involve a great
deal of uncertainty.

Uncertainty refers to a set of values of an assumption that
cannot be determined during a dialog session. In fact, the
problem of uncertainty in reasoning processes is a complex and
difficult one. Information available for user model
construction and reasoning is often uncertain, incomplete, and
even vague. The propagation of such data through an inference
model is also difficult to predict and control. Therefore, the
capacity to deal with uncertainty is crucial to the success of
any knowledge management system.

Currently, a vigorous debate is in progress concerning how
best to represent and process uncertainties in knowledge-based
systems. This debate carries great importance because it is not
only related to the construction of knowledge-based systems but
also focuses on human thinking, in which most decisions are made
under conditions of uncertainty. This chapter presents and
discusses uncertainties in the context of user modeling in
interactive systems. Some elementary distinctions between
different kinds of uncertainties are introduced. The purpose is
to provide an analytical overview and perspective concerning
how and where uncertainties arise and the major methods that
have been proposed to cope with them.

Sources of Uncertainties 

User-model-based interactive systems face problems of
uncertainty in the inference rules, the facts, and the
representation languages. There is no widely accepted
definition of the presence of uncertainty in user modeling.
However, the nature of uncertainty in a user model can be
investigated through its origin. Uncertainty can arise from a
variety of sources, and several authors have emphasized the
need for differentiating among the types and sources of
uncertainty. Some of the major sources are as follows:

(1) The imprecise and incomplete information obtained from
the user's input. This type of source is related to the
reliability of information, which involves the following
aspects:

• Uncertain or imprecise information exists in the factual
knowledge (Dutta, 2005). The contents of a user model involve
uncertain factors. For instance, the system might want to
assert "It is not likely that this user is a novice
programmer." This kind of assertion might be treated as a piece
of knowledge. But it is uncertain, and it seems difficult to
find a numerical description for the uncertainty in this
statement (i.e., there is no appropriate sample space in which
to give this statement statistical meaning, if a statistical
method is considered for capturing the uncertainty).

• The default information often brings uncertain factors to
inference processes (Reiter, 1980). For example, the stereotype
system carries extensive default assumptions about a user. Some
assumptions may be subject to change as interaction progresses.

• Uncertainty occurs as a result of ill-defined concepts in the
observations or due to inaccuracy and poor reliability of the
measurement (Kahneman and Tversky, 1982). For example, a user's
typing speed could be considered as a measurement of a user's
file editing skill, but for some applications this may be
questionable.

• The uncertain exception to general assumptions for performing
some actions under some circumstances can cause conflicts in
reasoning processes.

(2) Inexact language by which the information is conveyed.
The second source of uncertainty is caused by the inherent
imprecision or inexactness of the representation languages. The
imprecision appears in both natural languages and knowledge
representation languages. Three kinds of inexactness in natural
language have been distinguished (Zwick, 1999). The first is
generality, in which a word applies to a multiplicity of
objects in the field of reference. For example, the word
"table" can apply to objects differing in size, shape,
materials, and functions. The second kind of linguistic
inexactness is ambiguity, which appears when a limited number
of alternative meanings have the same phonetic form (e.g.,
bank). The third is vagueness, in which there are no precise
boundaries to the meaning of the word (e.g., old, rich).

In knowledge representation languages employed in user
modeling systems, if rules are not expressed in a formal
language, their meaning usually cannot be interpreted exactly.
This problem has been partially addressed by the theory of
approximate reasoning. Generally, a proposition (e.g., fact,
event) is uncertain if it involves a continuous variable. Note
that an exact assumption may be uncertain (e.g., the user is
able to learn this concept), and an assumption that is
absolutely certain may be linguistically inexact (e.g., the
user is familiar with this concept).

(3) Aggregation or summarization of information. The third
type of uncertainty source arises from the aggregation of
information from different knowledge sources or expertise
(Bonissone and Tong, 2005). Aggregating information brings
several potential problems that are discussed in (Chen and
Norcio, 1997).



(4) Deformation while transferring knowledge. There might
be no semantic correspondence between one representation
language and another. It is even possible that there is no
appropriate representation for certain expertise, for example,
the measurement of a user's mental workload. This makes
deformation during transformation inevitable. In addition,
human factors greatly affect the procedure of information
translation. Several tools that use cognitive models for
knowledge acquisition have been presented (Jacobson and
Freiling, 1988).



CONCLUSION 

Generally, uncertainty affects the performance of an
adaptive interface in the following aspects, and the management
of uncertainty must address all of them (Chen and Norcio,
2001):

• How to determine the degree to which the premise of a given
rule has been satisfied.
• How to verify the extent to which external constraints have
been met.
• How to propagate the amount of uncertain information through
the triggering of a given rule.
• How to summarize and evaluate the findings provided by
various rules or domain expertise.
• How to detect possible inconsistencies among the various
sources.
• How to rank different alternatives or different goals.

KEY TERMS

Interactive System: A system that allows dialogs 
between the computer and the user. 

Knowledge Based Systems: Computer systems that are
programmed to imitate human problem-solving by means of
artificial intelligence and reference to a database of
knowledge on a particular subject.

Knowledge Representation: The notation or for- 
malism used for coding the knowledge to be stored in 
a knowledge-based system. 

Stereotype: A set of assumptions about a user, based on
conventional, formulaic, and simplified conceptions and
opinions, which is created by an interactive system.

Uncertainties: Potential deficiencies in any phase or
activity of the modeling process that are due to the lack of
knowledge.

User Model: A set of information an interactive 
system infers or collects, which is used to characterize a 
user's tasks, goals, domain knowledge and preferences, 
etc. to facilitate human computer interaction. 

INTRODUCTION

Since its genesis, fuzzy sets (FSs) theory (Zadeh, 1965) has
provided a flexible framework for handling the indeterminacy
characterizing real-world systems, arising mainly from the
imprecise and/or imperfect nature of information. Moreover,
fuzzy logic set the foundations for dealing with reasoning
under imprecision and offered the means for developing a
context that reflects aspects of human decision-making. Images,
on the other hand, are susceptible to ambiguities, mostly
associated with pixel values. This observation was made early
by Prewitt (1970), who stated that "a pictorial object is a
fuzzy set which is specified by some membership function
defined on all picture points", thus acknowledging the fact
that "some of its uncertainty is due to degradation, but some
of it is inherent". A decade later, Pal & King (1980) (1981)
(1982) introduced a systematic approach to fuzzy image
processing, by modelling image pixels using FSs expressing
their corresponding degrees of brightness. A detailed study of
fuzzy techniques for image processing and pattern recognition
can be found in Bezdek et al. and Chi et al. (Bezdek, Keller,
Krisnapuram, & Pal, 1999) (Chi, Yan, & Pham, 1996).

However, FSs themselves suffer from the requirement of
precisely assigning degrees of membership to the elements of a
set. This constraint removes some of the flexibility of FSs
theory to cope with data characterized by uncertainty. This
observation led researchers to seek more efficient ways to
express and model imprecision, thus giving birth to
higher-order extensions of FSs theory.

This article aims at outlining an alternative approach to
digital image processing using the apparatus of Atanassov's
intuitionistic fuzzy sets (A-IFSs), a simple, yet efficient,
generalization of FSs. We describe heuristic and analytic
methods for analyzing/synthesizing images to/from their
intuitionistic fuzzy components and discuss the particular
properties of each stage of the process. Finally, we describe
various applications of the intuitionistic fuzzy image
processing (IFIP) framework from diverse imaging domains and
provide the reader with open issues to be resolved and future
lines of research to be followed.



BACKGROUND 

From the very beginning of their development, FSs have
attracted researchers to apply the flexible fuzzy framework in
different domains. In contrast with ordinary (crisp) sets, FSs
are defined using a characteristic function, namely the
membership function, which maps elements of a universe to the
unit interval, thereby attributing values expressing the degree
of belongingness with respect to the set under consideration.
This particular property of FSs theory was exploited in the
context of digital image processing and soon turned out to be a
powerful tool for handling the inherent uncertainty carried by
image pixels. The importance of fuzzy image processing was
rapidly acknowledged by both theoreticians and practitioners,
who exploited its potential to perform various image-related
tasks, such as contrast enhancement, thresholding and
segmentation, de-noising, edge-detection, and image
compression.
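
As a minimal illustration of a membership function on image pixels, the Python sketch below maps gray levels to the unit interval. The linear mapping and the function names are assumptions chosen only for illustration; they are not the specific membership functions used in the works cited above.

```python
def brightness_membership(gray, levels=256):
    """Degree to which a grey level belongs to the fuzzy set 'bright'.

    Assumption: a linear membership function over the grey-level range.
    """
    return gray / (levels - 1)


def fuzzify(image, levels=256):
    """Map every pixel of an image to its membership value in [0, 1]."""
    return [[brightness_membership(g, levels) for g in row] for row in image]


image = [[0, 255],
         [51, 204]]
memberships = fuzzify(image)
```

A black pixel receives membership 0, a white pixel membership 1, and intermediate gray levels receive intermediate degrees of brightness.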

However, and despite their vast impact to the design 
of algorithms and systems for real-world applications, 
FSs are not always able to directly model uncertainties 
associated with imprecise and/or imperfect information. 
This is due to the fact that their membership functions 
are themselves crisp. These limitations and drawbacks 
characterizing most ordinary fuzzy logic systems 
(FLSs) were identified and described by Mendel & 
Bob John (2002), who traced their sources back to the 
uncertainties that are present in FLSs and arise from 
various factors. The very meaning of words that are 
used in the antecedents and consequents of FLSs can
be uncertain, since words often mean different
things to different people. Moreover, extracting
knowledge from a group of experts who do not all
agree leads to consequents having a histogram of values
associated with them. Additionally, data presented as 
inputs to an FLS, as well as data used for its tuning, 
are often noisy, thus bearing an amount of uncertainty. 
As a result, these uncertainties translate into additional 
uncertainties about FS membership functions. Finally, 
Atanassov et al. (Atanassov, Koshelev, Kreinovich, 
Rachamreddy & Yasemis, 1998) proved that there ex- 
ists a fundamental justification for applying methods 
based on higher-order FSs to deal with everyday-life 
situations. Therefore, it comes as a natural consequence
that such an extension should also be carried over to the
field of digital image processing.



THE IFIP FRAMEWORK 

In the quest for new theories treating imprecision, various
higher-order extensions of FSs were proposed by differ- 
ent scholars. Among them, A-IFSs (Atanassov, 1986) 
provide a simple and flexible, yet solid, mathematical 
framework for coping with the intrinsic uncertainties 
characterizing real-world systems. A-IFSs are defined
using two characteristic functions, namely the membership
and non-membership functions, which do not necessarily
sum to unity. These functions assign to elements
of the universe corresponding degrees of belongingness
and non-belongingness with respect to a set. The
membership and non-membership values induce an 
indeterminacy index, which models the hesitancy of 
deciding the degree to which an element satisfies a 
particular property. In fact, it is this additional degree of 
freedom that provides us with the ability to efficiently 
model and minimize the effects of uncertainty due to the 
imperfect and/or imprecise nature of information. 
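
In symbols, the description above corresponds to the standard definition of an A-IFS (Atanassov, 1986). The symbol names used here (X for the universe, \mu_A, \nu_A and \pi_A for the membership, non-membership and indeterminacy index) are a notational assumption, since the text itself introduces no symbols:

```latex
A = \{ \langle x, \mu_A(x), \nu_A(x) \rangle \mid x \in X \}, \qquad
0 \le \mu_A(x) + \nu_A(x) \le 1 ,
\\[4pt]
\pi_A(x) = 1 - \mu_A(x) - \nu_A(x) .
```

An ordinary FS is recovered as the special case in which \pi_A(x) = 0 for every x, that is, when the non-membership is fully determined by the membership.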

Hesitancy in images originates from various factors,
most of which are caused by inherent weaknesses
of the acquisition and imaging mechanisms.
Distortions that occur as a result of the limitations of the
acquisition chain, such as the quantization noise, the sup- 
pression of the dynamic range, or the nonlinear behavior 
of the mapping system, affect our certainty regarding 
the "brightness" or "edginess" of a pixel and therefore 
introduce a degree of hesitancy associated with the
corresponding pixel. Moreover, dealing with "qualitative"
rather than "quantitative" properties of images is one
of the sound advantages of fuzzy-based techniques.
Qualitative properties describe in a more natural and 
human-centric manner image attributes, such as the 
"contrast" and the "homogeneity" of an image region, 
or the "edginess" of a boundary. However, as already 
pointed out, these terms are themselves imprecise and 
thus they additionally increase the uncertainty of image 
pixels. It is therefore a necessity, rather than a luxury, 
to employ A-IFSs theory to cope with the uncertainty 
present in real-world images.

In order to apply the IFIP framework, images should 
first be expressed in terms of elements of A-IFSs theory. 
Analyzing and synthesizing digital images to and from 
their corresponding intuitionistic fuzzy components is 
not a trivial task and can be carried out using either 
heuristic or analytic approaches. 

Heuristic Modelling 

As already stated, the factors introducing hesitancy in 
real-world images can be traced back to the acquisition 
stage of imaging systems and involve pixel degradation, 
mainly triggered by the presence of quantization noise 
generated by the A/D converters, as well as the sup- 
pression of the dynamic range caused by the imaging 
sensor. A main effect of quantization noise in images 
is that there exist a number of gray levels with zero, 
or almost zero, frequency of occurrence, while gray 
levels in their vicinity possess high frequencies. This 
is due to the fact that a gray level g in a digital image 
can be either (g+1) or (g-1) without any appreciable
change in the visual perception. 

An intuitive and heuristic approach to the modelling
of the aforementioned sources of uncertainty in
the context of A-IFSs was proposed by Vlachos &
Sergiadis (2005, 2007d) for gray-scale images, while an
extension to color images was presented in Vlachos &
Sergiadis (2006). The underlying
idea involves the application of the concept of the fuzzy 
histogram of an image, which models the notion of the 
gray level "approximately g". The fuzzy histogram 
takes into account the frequency of neighboring gray 
levels to assess the frequency of occurrence of the gray 
level under consideration. Consequently, a quantitative 
measure of the quantization noise can be calculated as 
the normalized absolute difference between the ordinary 
(crisp) and fuzzy histograms.

Finally, to further incorporate the additional distortion
factors into the calculation of hesitancy, parameters
are employed that model the influence of the dynamic
range suppression and the fact that lower gray levels
are more prone to noise than higher ones.
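
The fuzzy-histogram idea can be illustrated with a minimal sketch. The triangular shape of the membership function for "approximately g", its half-width `spread`, and the peak normalization of both histograms are assumptions made for illustration only; they are not taken from the cited works, and the additional parameters for dynamic range suppression are left out.

```python
def crisp_histogram(pixels, levels=256):
    """Count how often each gray level occurs."""
    hist = [0] * levels
    for g in pixels:
        hist[g] += 1
    return hist


def fuzzy_histogram(hist, spread=2):
    """Frequency of 'approximately g': neighbouring levels contribute with a
    triangular weight of half-width `spread` (an assumed shape)."""
    levels = len(hist)
    fuzzy = [0.0] * levels
    for g in range(levels):
        for k in range(max(0, g - spread), min(levels, g + spread + 1)):
            weight = max(0.0, 1.0 - abs(k - g) / (spread + 1))
            fuzzy[g] += weight * hist[k]
    return fuzzy


def quantization_noise(hist, fuzzy):
    """Absolute difference of the two histograms, each scaled to a peak of 1
    (the choice of normalization is an assumption)."""
    h_max = max(hist) or 1
    f_max = max(fuzzy) or 1
    return [abs(h / h_max - f / f_max) for h, f in zip(hist, fuzzy)]


# Toy 8-level image in which levels 2 and 4 never occur although their
# neighbours are frequent, the signature of quantization noise described above.
pixels = [1] * 10 + [3] * 12 + [5] * 9
hist = crisp_histogram(pixels, levels=8)
noise = quantization_noise(hist, fuzzy_histogram(hist, spread=1))
```

In this toy example the measure is large at the empty levels that sit between frequent ones and vanishes at the histogram peak, which is the behaviour the heuristic model relies on.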

Analytic Modelling 

The analytic approach offers a more generic treatment
of hesitancy modelling of digital images, since
it does not require a priori knowledge of the system
characteristics, nor a particular pre-defined image
acquisition model. Generally, it consists of sequential
operations that primarily aim to optimally transfer the
image from the pixel domain (PD) to the intuitionistic
fuzzy domain (IFD), where the appropriate actions
will be performed, using the fuzzy domain (FD) as
an intermediate step. After the modification of the
membership and non-membership components of the
image in the IFD, an inverse procedure is carried out
for transferring the image back to the PD. A block
diagram illustrating the analytic modelling is given in
Figure 1. Details on each of the aforementioned stages
of IFIP are provided below.

Figure 1. Overview of the analytic IFIP framework

Figure 2. The process of fuzzification (from image properties to membership functions)

Fuzzification 

Fuzzification constitutes the first stage of the IFIP
framework, which assigns degrees of membership to image pixels
with respect to an image property, such as "brightness",
"homogeneity", or "edginess". These properties are
application dependent and also determine the opera- 
tions to be carried out in the following stages of the 
IFIP framework. For the task of contrast enhancement 
one may consider the "brightness" of gray levels and 
construct the corresponding FS "Bright pixel" or "Dark 
pixel" using different schemes that range from simple 
intensity normalization to more complex approaches 
involving knowledge extracted from a group of human 
experts (Figure 2). 
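
As a minimal sketch of the simplest scheme mentioned above, intensity normalization, the following function maps gray levels to degrees of membership in the FS "Bright pixel". The min-max form of the normalization and the value returned for a flat image are assumptions of this example; expert-driven schemes are not covered.

```python
def fuzzify_brightness(pixels):
    """Membership of each pixel in the FS 'Bright pixel' by min-max normalization."""
    g_min, g_max = min(pixels), max(pixels)
    if g_max == g_min:
        # Flat image: brightness is undecidable, 0.5 is an arbitrary convention.
        return [0.5 for _ in pixels]
    span = g_max - g_min
    return [(g - g_min) / span for g in pixels]


def fuzzify_darkness(pixels):
    """'Dark pixel' taken as the standard fuzzy complement of 'Bright pixel'."""
    return [1.0 - mu for mu in fuzzify_brightness(pixels)]


bright = fuzzify_brightness([50, 100, 150])
dark = fuzzify_darkness([50, 100, 150])
```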

Intuitionistic Fuzzification 

Intuitionistic fuzzification is one of the most important 
stages of the IFIP architecture, since it involves the 
construction of the A-IFS that represents the image 
properties in the IFD. The analytic approach allows 
for an automated modelling of the hesitancy carried by 
image pixels, by rendering image properties directly 
from the FS obtained in the fuzzification stage through 
the use of intuitionistic fuzzy generators (Bustince, 
Kacprzyk & Mohedano, 2001). In order to construct 
an A-IFS that efficiently models a particular image 
property, tunable parametric intuitionistic fuzzy gen- 
erators are utilized. 

The underlying statistics of images are closely
related to, and strongly affect, the process of hesitancy
modelling. Different parameter values of the intuitionistic
fuzzy generators produce different A-IFSs and
therefore alternative representations of the image in
the IFD are possible. Consequently, an optimization
criterion should be employed in order to select, from the
multitude of possible representations, the parameter
set that yields the A-IFS that optimally models
the hesitancy of pixels. Such a criterion, which also
encapsulates the image statistics, is the intuitionistic fuzzy entropy
(Burillo & Bustince, 1996; Szmidt & Kacprzyk, 2001)
of the image under consideration. Therefore, the set of
parameters that produces the A-IFS with the maximum
intuitionistic fuzzy entropy is considered optimal.
We refer to this process of selection as the maximum
intuitionistic fuzzy entropy principle (Vlachos &
Sergiadis, 2007d). The optimal parameter set is then used to
construct membership and non-membership functions 
corresponding to the intuitionistic fuzzy components of 
the image in the IFD. This procedure is schematically 
illustrated in Figure 3. 
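
The selection procedure can be sketched as a simple search over candidate parameter values. The example below is illustrative only: it assumes a Sugeno-type generator with a single parameter `lam`, keeps the membership from the fuzzification stage unchanged, uses an entropy of the ratio form associated with Szmidt & Kacprzyk, and searches a hypothetical grid. None of these choices is claimed to be the one made in the cited works.

```python
def sugeno_generator(mu, lam):
    """Non-membership derived from membership; nu <= 1 - mu for every lam >= 0."""
    return (1.0 - mu) / (1.0 + lam * mu)


def if_entropy(mus, nus):
    """Mean of (min(mu, nu) + pi) / (max(mu, nu) + pi) over all elements."""
    total = 0.0
    for mu, nu in zip(mus, nus):
        pi = 1.0 - mu - nu                      # hesitancy of this element
        total += (min(mu, nu) + pi) / (max(mu, nu) + pi)
    return total / len(mus)


def intuitionistic_fuzzify(mus, lambdas):
    """Return (lam, nus, pis) for the candidate lam with maximum entropy."""
    best = None
    for lam in lambdas:
        nus = [sugeno_generator(mu, lam) for mu in mus]
        entropy = if_entropy(mus, nus)
        if best is None or entropy > best[0]:
            best = (entropy, lam, nus)
    _, lam, nus = best
    pis = [1.0 - mu - nu for mu, nu in zip(mus, nus)]
    return lam, nus, pis


mus = [0.1, 0.2, 0.3, 0.8]                    # memberships from the fuzzification stage
candidates = [0.25 * k for k in range(21)]    # hypothetical grid, lam in [0, 5]
lam_opt, nus, pis = intuitionistic_fuzzify(mus, candidates)
```

Because the entropy is averaged over all pixels, the selected parameter depends on how the memberships are distributed, which is one way of seeing how the image statistics enter the criterion.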

Modification of Intuitionistic Fuzzy 
Components 

This stage involves the actual processing of the intuitionistic
fuzzy components of the image with respect to a par- 
ticular property. Depending on the desired image task 
one is about to perform, suitable intuitionistic fuzzy 
operators are applied to both membership and non- 
membership functions. 

Intuitionistic Defuzzification 

After obtaining the modified intuitionistic fuzzy compo- 
nents of the image, it is required that these components 
should be combined to produce the processed image 
in the FD. This procedure involves the embedding of 
hesitancy into the membership function. To carry out 
this task, we utilize suitable parametric intuitionistic 
fuzzy operators that de-construct an A-IFS into an FS. 
It should be stressed that the final result strongly
depends on the selected parameters of the aforementioned
operators. Therefore, optimization criteria, such as the
maximization of the index of fuzziness of the image, are
employed to select the overall optimal parameters with
respect to the considered image operation. 
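
A minimal sketch of this stage is given below. It assumes that hesitancy is embedded with Atanassov's D_alpha operator, which adds a fraction `alpha` of the hesitancy to the membership, and that the linear index of fuzziness is the criterion; the grid of candidate values is hypothetical. Whether these coincide with the operators and criteria of the cited works is an assumption.

```python
def embed_hesitancy(mus, pis, alpha):
    """D_alpha operator: a share alpha of the hesitancy is added to the membership."""
    return [mu + alpha * pi for mu, pi in zip(mus, pis)]


def index_of_fuzziness(mus):
    """Linear index of fuzziness: 0 for a crisp set, 1 when every membership is 0.5."""
    return 2.0 * sum(min(mu, 1.0 - mu) for mu in mus) / len(mus)


def intuitionistic_defuzzify(mus, pis, alphas):
    """Choose the alpha whose resulting FS has the largest index of fuzziness."""
    best_alpha = max(
        alphas, key=lambda a: index_of_fuzziness(embed_hesitancy(mus, pis, a))
    )
    return best_alpha, embed_hesitancy(mus, pis, best_alpha)


mus = [0.10, 0.30, 0.60]                 # modified membership values
pis = [0.20, 0.10, 0.05]                 # corresponding hesitancy values
alphas = [k / 10 for k in range(11)]     # hypothetical grid over [0, 1]
alpha_opt, fs = intuitionistic_defuzzify(mus, pis, alphas)
```

Since the remaining share of the hesitancy goes to the non-membership, the result is an ordinary FS whose membership and non-membership sum to unity.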

Defuzzification 

The final stage of the IFIP framework involves the 
transfer of the processed fuzzy image into the PD. 
Depending on the desired image operation, various 
functions may be applied to carry out this task. 
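
For illustration, the simplest such function is a linear rescaling of the membership values to the available gray levels. This is only one possible choice and is an assumption of the example, not the function prescribed by any particular algorithm.

```python
def defuzzify(mus, levels=256):
    """Map memberships in [0, 1] back to integer gray levels 0 .. levels-1."""
    top = levels - 1
    return [min(top, max(0, round(top * mu))) for mu in mus]


gray = defuzzify([0.0, 0.2, 1.0])
```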

Applications 

The IFIP architecture has been successfully applied to
many image processing problems. Vlachos & Sergiadis
(2007d) exploited the potential of the framework in
order to perform contrast enhancement of low-contrast
images. Different approaches were introduced,
namely the intuitionistic fuzzy contrast intensification
and the intuitionistic fuzzy histogram hyperbolization
(IFHH). An extension of the IFHH technique to color
images was proposed in Vlachos & Sergiadis (2007b).
Additionally, the effects of employing different
intuitionistic fuzzification and intuitionistic
defuzzification schemes on the performance
of contrast enhancement algorithms were thoroughly
studied in Vlachos & Sergiadis (2007)
(2007d) and (2006b), respectively. Application of
A-IFSs theory to edge detection was also demonstrated
in Vlachos & Sergiadis (2007d),
based on intuitionistic fuzzy similarity measures.
The problem of image thresholding and segmentation
in the context of IFIP was also addressed (Vlachos
& Sergiadis, 2006a) using novel intuitionistic fuzzy
information measures. Under the general framework
of IFIP, the notions of the intuitionistic fuzzy
histograms of a digital image were introduced (Vlachos
& Sergiadis, 2007c) and their application to contrast
enhancement was demonstrated (Vlachos & Sergiadis,
2007a). Finally, the IFIP architecture was successfully
applied in mammographic image processing (Vlachos
& Sergiadis, 2007d). Figure 4 illustrates the stages of
IFIP in the case of the IFHH approach.

Figure 3. The process of intuitionistic fuzzification



FUTURE TRENDS 

Even though higher-order FSs have been widely 
applied to decision-making and pattern recognition 
problems, it seems that their application in the field of 
digital image processing is just beginning to develop. 
As a newly-introduced approach, the IFIP architecture
remains a suggestive and challenging open field for
future research. Therefore, it is expected that the IFIP
framework will attract the interest of theoreticians and 
practitioners in the near future. 

The proposed IFIP context bases its efficiency on the
ability of A-IFSs to capture and render the hesitancy 
associated with image properties. Consequently, the 
analysis and synthesis of images in terms of elements 
of A-IFSs theory plays a key role in the performance 
of the framework itself. Therefore, the stages of intu- 
itionistic fuzzification and defuzzification need to be 
further studied from an application point of view, to 
provide meaningful ways of extracting and embedding 
hesitancy from and to images. Finally, the IFIP
architecture should be extended to image processing tasks
handled today by FS theory, in order to investigate and
evaluate its advantages and particular merits. 



CONCLUSION 

This article describes an intuitionistic fuzzy architecture 
for the processing of digital images. The IFIP framework 
exploits the potential of A-IFSs to efficiently model the 
uncertainties associated with image pixels, as well as 
with the definitions of their properties. The proposed 
methodology provides alternative approaches for ana- 
lyzing/synthesizing images to/from their intuitionistic 
fuzzy components. Application of the IFIP framework 
to diverse imaging domains demonstrates its efficiency 
compared to traditional image processing techniques. 
It is expected that the proposed context will provide
theoreticians and practitioners with an alternative and
challenging way to perceive and deal with real-world
image processing problems.



KEY TERMS

Crisp Set: A set defined using a characteristic function
that assigns a value of either 0 or 1 to each element of
the universe, thereby discriminating between members 
and non-members of the crisp set under consideration. 
In the context of fuzzy sets theory, we often refer to 
crisp sets as "classical" or "ordinary" sets. 

Defuzzification: The inverse process of fuzzifica- 
tion. It refers to the transformation of fuzzy sets into 
crisp numbers. 

Fuzzification: The process of transforming crisp 
values into grades of membership corresponding to 
fuzzy sets expressing linguistic terms. 

Fuzzy Logic: Fuzzy logic is an extension of tradi- 
tional Boolean logic. It is derived from fuzzy set theory 
and deals with concepts of partial truth and reasoning 
that is approximate rather than precise. 

Fuzzy Set: A generalization of the definition of 
the classical set. A fuzzy set is characterized by a 
membership function, which maps the members of 
the universe into the unit interval, thus assigning to 
elements of the universe degrees of belongingness 
with respect to a set. 

Image Processing: Image processing encompasses 
any form of information processing for which the input 
is an image and the output an image or a corresponding 
set of features. 

Intuitionistic Fuzzy Index: Also referred to as 
"hesitancy margin" or "indeterminacy index". It rep- 
resents the degree of indeterminacy regarding the as- 
signment of an element of the universe to a particular 
set. It is calculated as the difference between unity 
and the sum of the corresponding membership and 
non-membership values. 

Intuitionistic Fuzzy Set: An extension of the fuzzy
set. It is defined using two characteristic functions,
the membership and the non-membership, that do not
necessarily sum up to unity. They attribute to each
individual of the universe corresponding degrees of
belongingness and non-belongingness with respect to
the set under consideration.

Membership Function: The membership function
of a fuzzy set is a generalization of the characteristic
function of crisp sets. In fuzzy logic, it represents the
degree of truth as an extension of valuation.

Non-Membership Function: In the context of
Atanassov's intuitionistic fuzzy sets, it represents the
degree to which an element of the universe does not
belong to a set.



INTRODUCTION

The success of organisations increasingly depends
on the knowledge they hold, to the detriment of
other traditionally decisive factors such as labour or
capital (Tissen, 2000). This situation has led
organisations to pay special attention to this new intangible
asset, and numerous efforts are being made to
conserve and institutionalise it.

Knowledge Management (KM) is a recent
discipline that responds to this increasing interest; however,
despite its importance, the discipline is still
at an immature stage, as none of the many existing
proposals for the development of Knowledge Management
Systems (KMS) achieves enough detail to perform
such a complex task.

In order to alleviate this situation, this work
presents a methodological framework for the explicit
management of knowledge. The study has a formal
basis for achieving an increased level of detail, as all
the conceptual elements needed for understanding and
representing the knowledge of any domain are identified.
The required descriptive character is achieved
by basing the process on these elements and, in this
way, the development of such systems can be guided
more effectively.



BACKGROUND 

In recent years, numerous methodological frameworks
for the development of KMS have arisen, the
most important of which are those of Junnarkar
(1997), Wiig et al. (1997), Daniel et al. (1997), Holsapple
and Joshi (1997), Liebowitz and Beckman (Liebowitz,
1998; Beckman, 1997), Stabb and Schnurr (1999),
Tiwana (2000) and Mate et al. (2002). Nevertheless, the
existing proposals do not adequately satisfy the knowledge
needs of organisations (Rubenstein-Montano,
2001; Andrade, 2003) due to their immaturity, which
stems mainly from the following aspects:

1. Research efforts have mainly focused
on defining a process for KMS development,
while ignoring the study of the object to
be managed: the knowledge.

2. In most cases, the definition of such a process has
neglected the human factor and has been
restricted to the technological viewpoint of
KM.

The first aspect concerns the necessary study of
knowledge as the basis for defining the Corporate
Memory structure; this study should identify (i) the types
of knowledge that have to be included in that repository
and (ii) their descriptive properties, so that the Corporate
Memory includes all the features of the knowledge
items that it stores. Defining that structure would
also enable the definition of a descriptive process for
creating KMS by using the different characteristics and
types of knowledge.

However, despite the influence that the object to
be managed has on the management process, only the
Wiig (1997) proposal pays attention to its study. That
proposal identifies a small set of descriptors that support
the formalisation (making explicit) of knowledge,
although (i) their identification does not result from an
exhaustive study and (ii) they do not enable a complete
formalisation, as they are restricted to some generic
properties.

The second aspect suggests that the whole process of
KMS development should consider the technological
as well as the human vision. The first focuses on
how to obtain, store and share the relevant knowledge
that exists within an organisation, by creating the
Corporate Memory and the computer support system.
The second involves not only creating
a collaborative atmosphere within the organisation in
order to involve the workers in
the KM program, but also fostering their tendency to
share their knowledge and to use that already provided
by other members.

Despite this, the vast majority of the
analysed approaches are solely focused on the techno- 
logical KM viewpoint, which jeopardises the success of 
a KMS (Andrade, 2003). In fact, among the previously 
mentioned proposals, only the Tiwana (2000) proposal 
explicitly considers the human viewpoint by including 
a specific phase for it. 

As a result of both aspects, the current proposals are 
restricted to a set of generic guides for performing KM, 
which is quite different from the formal and detailed 
vision that is being demanded. In other words, the cur- 
rent approaches indicate what to do but not how to do it 
(prescriptive viewpoint against descriptive/procedural 
viewpoint). In this scenario the developers of this type 
of systems have to elaborate their own ad hoc approach, 
achieving results that only depend on the experience 
and the capabilities of the development team. 



DEVELOPMENT FOR KNOWLEDGE 
MANAGEMENT SYSTEMS 

This section presents a methodological framework for
explicit KM that solves the previously mentioned
problems. A study of the object to be managed has been
performed to obtain a knowledge formalisation
schema, i.e., to determine the relevant knowledge items
and the characteristics/properties that should be made
explicit. Using the results of this study, a
methodological framework for KMS creation has been
defined. Both aspects are discussed below.



Proposed Formalisation Schema 

Natural language is the language par excellence
for sharing knowledge. Because of this, the
elements needed for conceptualising
(understanding) the knowledge of any domain (and
therefore those for which the respective formalisation
mechanisms must be provided) can be identified by
analysing the different grammatical categories of
natural language: nouns, adjectives, verbs, adverbs,
locutions and other linguistic expressions. This study,
whose detailed description and applications have been
presented in several works (Andrade, 2006; Andrade,
2008), reveals that all the identified conceptual elements
can be placed into the following knowledge levels
according to their function within the domain:

Static. It concerns the structural or operative knowledge
of the domain, meaning domain facts that are true
and that can be used in some operations, such as concepts,
properties, relationships and constraints.

Dynamic. It is related to the performance of the
domain, that is, functionality, action, process
or control: inferences, calculations and step
sequences. This level can be divided into two
sublevels:

    Strategic. It includes what to do, when and in
    what order (i.e., step factorisation).

    Tactical. It specifies how and when to obtain
    new operative knowledge (i.e., the description
    of a given step).

Each of these levels addresses a different
fragment of the organisation's knowledge, although they
are all obviously interrelated; in fact, the strategic level
controls the tactical one, since for every last-level (elemental)
step (strategic knowledge) the inferences and
calculations must be indicated (tactical knowledge). The
operative level is also tied to the
other two, as it specifies not only how the bifurcation
points or execution alternatives are decided (strategic
knowledge), but also how inferences and calculations are
performed (tactical knowledge).

Therefore, a KMS must provide support for all these
levels. As can be observed in Table 1, the main
formalisation schema has been divided, on the one hand, into
several individual schemas corresponding to each
of the identified knowledge levels and, on the other, into
a common one for the three levels, providing a global
vision of the organisation's knowledge. Therefore,
knowledge formalisation involves a dynamic schema
including the strategic and tactical individual schemas,
a static schema including an operative schema, and
a common schema for describing the common aspects
regardless of the level. Every individual schema is in turn
made up of some components.

Table 1. Components defined for every identified schema

Schemas                Components
Common                 Catalogue of terms
Dynamic / Strategic    Catalogue of non-terminal steps
                       Catalogue of terminal steps
Dynamic / Tactical     Catalogue of tactical steps
Static / Operative     Catalogue of concepts
                       Catalogue of relationships
                       Catalogue of properties

The catalogue of terms is a common component for
the schemas, providing synonyms and abbreviations for
identifying every knowledge asset within the organisation.
The strategic schema describes the functional
splitting of every KMS operation and also each identified
step. As the description varies depending on whether the
step is terminal (an elemental step) or not, two different
components are needed to include all the characteristics of this
level. The approach (procedural or algorithmic, for
instance) should be described in detail for every asset
included in the catalogue of terminal steps. All this
information is included in the catalogue of tactical steps.
Lastly, the static schema is made up of the catalogue
of concepts (including the identified concepts and their
description), the catalogue of relationships (describing
the identified relationships and their meaning) and the
catalogue of properties (covering the properties of the
previously mentioned concepts and relationships).

The detailed description of this study, together with 
the descriptors of every component, can be found in 
(Andrade, 2008). 
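
As an illustration of how the schemas and components of Table 1 might be held in a corporate memory, the sketch below uses one catalogue per component. All class, field and asset names are hypothetical, and the actual descriptors of each component are those given in Andrade (2008), not the free-text descriptions used here.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Entry of the common catalogue of terms."""
    name: str
    level: str                                   # "strategic", "tactical" or "operative"
    synonyms: list = field(default_factory=list)
    abbreviations: list = field(default_factory=list)


@dataclass
class FormalisationSchema:
    """One catalogue per component of Table 1, keyed by asset name."""
    terms: dict = field(default_factory=dict)               # common
    non_terminal_steps: dict = field(default_factory=dict)  # dynamic / strategic
    terminal_steps: dict = field(default_factory=dict)      # dynamic / strategic
    tactical_steps: dict = field(default_factory=dict)      # dynamic / tactical
    concepts: dict = field(default_factory=dict)            # static / operative
    relationships: dict = field(default_factory=dict)       # static / operative
    properties: dict = field(default_factory=dict)          # static / operative

    def add(self, catalogue, name, description, level, **term_info):
        """Store an asset in a catalogue and register it in the catalogue of terms."""
        getattr(self, catalogue)[name] = description
        self.terms[name] = Term(name=name, level=level, **term_info)


schema = FormalisationSchema()
schema.add("concepts", "Invoice", "Document requesting payment", "operative",
           abbreviations=["INV"])
schema.add("terminal_steps", "Compute total", "Elemental step of billing", "strategic")
schema.add("tactical_steps", "Sum invoice lines", "total = sum of line amounts", "tactical")
```

Registering every asset in the catalogue of terms at insertion time is a design choice of this sketch; it keeps the common schema consistent with the individual ones by construction.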



PROPOSED METHODOLOGICAL 
FRAMEWORK 

The proposed process, whose basic structure is shown
in Figure 1, has been elaborated bearing in mind the
problems detected in the KM discipline and already
mentioned throughout the present work.

As can be seen in the figure, this
process includes the following phases:

1. Setting-up of the KM commission. The management
appoints a KM commission for tracking and
performing the KM project.

2. Scope identification. The problem to be
addressed is specified by determining
where the present cycle of the KM
project must have a bearing. To achieve
this, the framework proposes the use of SWOT
analysis (Strengths, Weaknesses, Opportunities,
Threats), together with the proposal of Zack
(1999).

3. Knowledge acquisition, including: 

3.1. Identification of knowledge domains. The
knowledge needs regarding the subject
area addressed are determined by means of meetings
involving the development team, the KM commission
and the people responsible for every operation to be
performed.

3.2. Capture of the relevant knowledge. All
the available knowledge related to the
operation addressed is obtained on the basis of the
identified domains. This is done by means of:

(a) Identifying where the knowledge lies. The
KM commission is in charge of identifying and
providing the human and non-human knowledge
sources that are going to be analysed.

(b) Determining the knowledge that has to be captured.
As in the previous section, the strategic, tactical
and operative knowledge should be borne in mind.

(c) Obtaining the knowledge.

Obviously, when some of the knowledge that is needed
does not exist in the organisation, it should be generated
or imported.



4. Knowledge assimilation, comprising: 

4.1. Knowledge conceptualisation. Its goal is the
comprehension of the captured knowledge. It is
recommended to start with the strategic knowledge and
subsequently focus on the tactical knowledge. As
the strategic and tactical elements are understood, the
elements of the operative level that arise must also be
assimilated.

4.2. Knowledge representation. The relevant
knowledge has to be made explicit and formalised
according to the components (Andrade, 2008)
summarised in Table 1. This is one of the main distinguishing
points of the proposal presented here, as the proposed
formalisation schema indicates the specific descriptors
needed for a correct and complete formalisation of the
knowledge.

5. Knowledge consolidation, including: 

5.1. Knowledge verification. In order to detect
failures and omissions in the represented
knowledge, the following should be considered:

(a) Generic aspects. It has to be checked that every
knowledge element (strategic, tactical and operative)
is included in the catalogue of terms, that
every term included there has been made explicit
according to its type of knowledge, and that all
the fields are completed.

(b) Strategic aspects. It should be verified that (i)
any decision regarding an execution is made
according to the existing operative knowledge,
(ii) every last-level step is associated with existing
tactical knowledge and (iii) every non-terminal step
is correctly split. All of this is
achieved by checking the agreement between
the split tree and the content of the formalisation
schema of the terminal strategic knowledge.

(c) Tactical aspects. It should be verified that (i) the
whole of the tactical knowledge is used in some
of the last-level steps of the strategic knowledge
and (ii) any operative knowledge related to the
tactical knowledge is available. In order to achieve
this, the operative knowledge items are represented
as nodes within a knowledge map. This
type of map enables the graphic visualisation of
how new elements are obtained from the existing
ones. Once the map has been drawn, it should be
traversed to check that the whole of the operative
knowledge has been included.

(d) Operative aspects. It should be confirmed that
(i) there are no isolated concepts, (ii) there
are no attributes unrelated to a concept or to a
relationship, (iii) there are no relationships
associating non-existent concepts or relationships
and (iv) the whole of the operative knowledge
is used in some of the tactical knowledge and/or
in the decision making of the flow control of
the strategic knowledge. In order to perform the
first three verifications, a relationship diagram
is elaborated to show graphically the
existing relationships among the different elements
of the operative knowledge. The syntax
of this type of diagram is analogous to that
of the class diagrams used in
object-oriented software development methodologies. The
last point is verified by
using a knowledge map; the execution structures
included in the content of the formalisation
schema for the last-level strategic knowledge
related to every process (the remaining lower
levels are included in the higher one) are
also used in this verification. With these two
graphic representations it can be
verified that every operative element is included
in at least one of the representations.
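
Several of the checks above are mechanical and could be automated once the catalogues are stored. The following sketch covers the four operative checks of (d) and check (ii) of (b). The data layout (plain sets and dictionaries), the function names and the sample assets are all hypothetical; the framework itself performs these checks on the relationship diagram, the knowledge map and the split tree.

```python
def verify_operative(concepts, relationships, properties, used):
    """Return the problems found in the operative knowledge.

    concepts:      set of concept names
    relationships: dict name -> tuple of the element names it links
    properties:    dict name -> concept or relationship it describes
    used:          operative items referenced by tactical steps or by
                   strategic flow-control decisions
    """
    problems = []
    linked = {end for ends in relationships.values() for end in ends}
    for concept in sorted(concepts - linked):                      # check (i)
        problems.append(f"isolated concept: {concept}")
    for prop, owner in sorted(properties.items()):                 # check (ii)
        if owner not in concepts and owner not in relationships:
            problems.append(f"property without owner: {prop}")
    for rel, ends in sorted(relationships.items()):                # check (iii)
        for end in ends:
            if end not in concepts and end not in relationships:
                problems.append(f"relationship {rel} refers to unknown element: {end}")
    everything = concepts | set(relationships) | set(properties)
    for item in sorted(everything - used):                         # check (iv)
        problems.append(f"unused operative knowledge: {item}")
    return problems


def verify_terminal_steps(split_tree, tactical_steps):
    """Every last-level step must be associated with existing tactical knowledge.

    split_tree:     dict step -> list of substeps (empty for a terminal step)
    tactical_steps: dict terminal step -> name of its tactical description
    """
    return [
        f"terminal step without tactical knowledge: {step}"
        for step, substeps in sorted(split_tree.items())
        if not substeps and step not in tactical_steps
    ]


concepts = {"Customer", "Order", "Invoice"}
relationships = {"places": ("Customer", "Order"), "ships_to": ("Order", "Warehouse")}
properties = {"order_date": "Order", "colour": "Product"}
used = {"Customer", "Order", "places", "ships_to", "order_date"}
operative_problems = verify_operative(concepts, relationships, properties, used)

split_tree = {"Bill customer": ["Compute total", "Send invoice"],
              "Compute total": [], "Send invoice": []}
strategic_problems = verify_terminal_steps(split_tree, {"Compute total": "Sum invoice lines"})
```

An empty list from both functions would mean that these particular checks pass; the remaining verifications, such as the correctness of the split itself, still call for review by the people involved.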

5.2. Knowledge validation. In order to validate the
knowledge represented and verified, the development
team, the KM commission and the parties involved will
review:

(a) The knowledge splitting tree 

(b) The knowledge map 

(c) The relationship diagram 

(d) The functional splitting tree 

(e) The content of the formalisation schema 

6. Creation of the support system, which is divided 
into: 



6.1. Definition of the incorporation mechanisms. 
The KM commission and the development team deter- 
mine the adequacy of the incorporation type (passive, 
active or their combination) according to criteria such 
as financial considerations or stored knowledge. 

6.2. Definition of the notification mechanisms. 
The KM commission and the development team will 
establish the most suitable method for notifying the 
newly included knowledge. The notification can be 
passive or active; even the absence of notification could 
be considered. 

6.3. Definition of the mechanisms for knowledge
localisation. Several alternatives, such as the need to
include intelligent searches or meta-searches, are
evaluated.

6.4. Development of the KM support system. It will 
be necessary to define and to implement the corporate 
memory, the communication mechanisms and the ap- 
plications for collaboration and team work. 

6.5. Population of the corporate memory. Once
the KM system has been developed, the knowledge
captured, assimilated and consolidated is included
in the corporate memory.

7. Creation of the collaboration environment. The
main goal of this phase is to promote and improve the
contribution of knowledge and its subsequent use by the
organisation. The risk posed by an unsuitable
organisational culture or by inadequate tools for
promotion and reward should be borne in mind. The
following strategies should therefore be followed:

• Valuing employees according to their knowledge
  contribution to the organisation
• Supporting and rewarding the use of the existing
  organisational knowledge
• Promoting relaxed dialogue among employees from
  different domains
• Promoting a good atmosphere among the employees
• Committing all the employees



FUTURE TRENDS 

As has been indicated, the KM discipline remains at
an immature stage due to an inadequate viewpoint: the
absence of a strict study for determining the relevant
knowledge and the characteristics that should be
supported. This situation has led to a significant lack
of detail in the existing proposals for KMS development,
which currently depends solely on the individual good
work of the developers.

The present proposal offers a new viewpoint for
developing this type of system. However, much remains
to be done. As the authors are aware of the high degree
of bureaucracy that might be needed to follow the present
proposal strictly, it should be streamlined and tailored
to specific domains. Nevertheless, this viewpoint could be
considered the key to achieving specific ontologies for
KM in every domain.



CONCLUSION 

This article has presented a methodological framework
for the development of KMS that, unlike the existing
proposals, is based on the strict study of the knowledge
to be managed. This characteristic has provided the
system with a higher procedural level of detail than
the current proposals, as the elements conceptually needed
for understanding and representing the knowledge of any
domain have been identified and formalised.

KEY TERMS

Commission of Knowledge Management: Team 
in charge of the Knowledge Management project. 

Corporate Memory: Physical and persistent stor- 
age of the knowledge in an organisation. Its structure is 
determined by the knowledge formalisation schema. 

Knowledge: Pragmatic level of information result- 
ing from the combination of the information received 
with the individual experience. 

Knowledge Formalisation Schema: Set of attri- 
butes for describing and formalising the knowledge. 



Knowledge Management: Discipline that tries to
provide the right information and knowledge to the
right people, whenever and however they need them.
In this way these people will have all the
necessary elements for best performing their tasks.

Knowledge Management System: System for 
managing knowledge in organizations, supporting 
the addition, storage, notification and localization of 
expertise and knowledge. 

Methodological Framework: Approach for
making explicit and structuring how a given task is
performed.

INTRODUCTION

Knowledge Management (KM) is a recent discipline
that arose from the idea of explicitly managing
the whole existing knowledge of a given organisation
(Wiig, 1995) (Wiig et al., 1997). More specifically,
KM involves providing the people concerned with the
right information and knowledge at the most suitable
level for them, when and how it best suits them; in this
way, these people will have all the necessary ingredients
for choosing the best option when faced with a specific
problem (Rodriguez, 2002).

As knowledge, together with the ability to manage it
well, has become the key factor for organisations to
stand out, it is desirable to determine and develop the
support instruments for the generation of such value
within organisations. This view is shared by several
authors, among them Brooking (1996), Davenport & Prusak
(2000), Huang et al. (1999), Liebowitz & Beckman (1998),
Nonaka & Takeuchi (1995) and Wiig (1993). Technological
tools should be available for reducing the communication
distance and for providing a common environment where
knowledge is accessible for being stored or shared.

As KM is a very recent discipline, there are few
commercial software tools that deal with the aspects
necessary for its approach. Most of the tools classified
as KM-related are mere document management tools, which
are unsuitable for the correct management of an
organisation's knowledge. Bearing this problem in mind,
the present work approaches the establishment of a KM
support software tool based on the definition of KM itself
and on the existing tools. To achieve this, the following
section presents the market analysis that was performed
for studying the existing KM tools, where not only their
characteristics were analysed, but also the future needs
of knowledge workers. Following this study, the
functionality that a KM support tool should have and a
proposal for the best approach to that functionality were
identified.



BACKGROUND 

The first step in developing a complete KM support
tool according to present and future market needs is
a study of the existing market. After the initial
identification of the characteristics that a KM support
tool should have, a subsequent analysis reveals how the
studied tools support each of the previously identified
characteristics. Lastly, the results obtained are
evaluated.

Characteristics to be Considered 

The previously mentioned definition of KM was the 
basis for the identification of the characteristics to be 
considered, bearing in mind the different aspects that 
should be supported by the tool. 

A KM tool should give support to the following
aspects (Andrade et al., 2003a):

• Corporate Memory
• Yellow Pages
• Collaboration and Communication mechanisms

1. Corporate Memory

The Corporate Memory compiles the knowledge
that exists within an organisation and places it at its
workers' disposal (Stein, 1995) (Van Heijst et al., 1997).
For this reason, compiling the relevant knowledge and
making it explicit is as important as providing suitable
mechanisms for its correct and easy location and
retrieval.

2. Yellow Pages

A KM program should not make the mistake of
trying to capture and represent the whole existing
knowledge of the organisation, as this attempt
would not be feasible; only the knowledge relevant to the
performance of the organisation should be included.
However, not making all the knowledge explicit does not
mean that it has to be ignored; for that reason, it is
important to determine which knowledge every individual
at the organisation has by drawing up the Yellow Pages.
These identify and publish additional knowledge sources,
human and non-human, that are at the organisation's
disposal (Davenport & Prusak, 2000).



3. Collaboration and Communication Mechanisms

In organisations, knowledge is shared and distributed
regardless of whether the process is automated or not.
A knowledge transfer occurs every time an employee asks
a colleague in the adjoining office how to perform a
given task. These daily knowledge transfers make up the
routine of the organisation but, as they are local and
fragmentary, systems for user collaboration and
communication should be established. An adequate KM
support tool should include mechanisms that guarantee the
efficiency of collaboration and communication, regardless
of the physical or temporal location of the interlocutors.

Analysed Tools 

Once the aspects that a KM support tool should
consider have been identified, the following step involves
analysing how the current tools address them.

With this purpose, the main so-called KM support
tools that currently exist were analysed, discarding
certain tools such as information search engines or
simple document management applications, as
they merely offer partial solutions.




Table 1. Tools analysed

The analysis included thirteen tools (Table 1), all
of them addressing at least two of the previously
mentioned aspects. It should be highlighted that all
the tools implement the Corporate Memory as a document
warehouse, while the Yellow Pages appear as a
telephone directory.

Results Evaluation 



After the tools were analysed it was noticed that, for
every aspect considered, there are some common elements.
Bearing in mind these elements and the current needs,
Table 2 shows the desirable characteristics that a KM
support tool should have.

The conclusions drawn from a deeper study of how
the analysed tools approach the desirable characteristics
are presented below.

Firstly, it was observed that none of the tools classified
as KM tools has the structure necessary for best
identifying, formalising and sharing the relevant
knowledge, as they solely perform document management
complemented, in the best of cases, by some descriptive
fields, association with a contents tree or links to other
related documents. This creates many problems, especially,
given the great volume of data, the difficulty of selecting
the adequate knowledge that the user might need at a
given moment. Therefore, as has been pointed out
previously, knowledge should be structured in some way if
it is to be put to the best use. Communication support is
also quite important.

The characteristics of a KM support tool should
therefore be defined, together with a guide for
approaching them.



RECOMMENDED FEATURES 

Once the functionality that a support tool for the
explicit management of corporate knowledge should have
has been determined, the approach to each of the detected
characteristics can be addressed.

1. Corporate Memory: the organisation's knowledge
has to be physically stored in a Corporate Memory
in order to be adequately shared. A
Corporate Memory is an explicit, independent and
persistent knowledge representation (Stein, 1995)
(Van Heijst et al., 1997) that can be considered
a repository of the knowledge of the individuals
that work at a given organisation. The Corporate
Memory should include the following aspects:



Table 2. Desirable characteristics of a KM support tool

1.1. Knowledge formalisation. Before being
included in the Corporate Memory, the
knowledge has to be formalised by determining
not only the relevant knowledge, but also the
attributes that describe it. When performing this
formalisation it should be borne in mind that there
are two types of knowledge. On the one hand,
the Corporate Memory must include the
knowledge needed to describe the operations
for performing an organisational task.
On the other hand, it is necessary to capture
the knowledge that individuals have acquired
through their experience over time. This markedly
heuristic knowledge is known as Learned Lessons:
positive as well as negative experiences that can be
used for improving the future performance
of the organisation (Van Heijst, 1997), and
therefore for refining its current knowledge.

a. Organisational knowledge (Andrade
et al., 2003b). A KM system should
consider different types of knowledge
when structuring the relevant knowledge
associated with the operations that
exist at the organisation:

• Strategic or control knowledge:
  it indicates not only what to do,
  but also why, where and when.
  For that reason, the constituents
  of the functional decomposition
  of every operation should be
  identified.

• Tactical: it specifies how and
  under what circumstances the
  tasks are done. This type of
  knowledge is associated with
  the execution process of every
  last-level strategic step.

b. Learned lessons. These relate to the
experience and the knowledge that
individuals have with regard to
their task. They give the person who
possesses them the ability to refine
both the processes followed at work
and the already existing knowledge,
in order to be more efficient. It is therefore
appropriate to create learned lessons systems
(Weber, 2001) in order to preserve this type
of knowledge.



1.2. Incorporation mechanisms. Knowledge
can be incorporated in an active or passive
way (Andrade et al., 2003c). Active
incorporation is based on the existence of
a KM group in charge of looking after the
quality of the knowledge that is going to be
incorporated. This guarantees the quality of
the knowledge included in the Corporate
Memory but it also takes up human resources.
In passive incorporation, by contrast, there is no
group for quality evaluation, as the individual
willing to share knowledge and experience is
responsible for checking that the proposal fulfils
the minimum requirements of quality and relevance.
The main advantage of the second alternative
is that it does not take up additional resources.
Bearing in mind the previous considerations,
active knowledge incorporation is preferred
whenever possible, as in this way the quality
and the relevance of the knowledge are guaranteed.

1.3. Notification mechanisms. All the members
of the organisation should be informed when
new knowledge is incorporated, as this
enables them to refine their knowledge.
The step prior to notification is the
definition of the group of people that will
be informed of the appearance of a new
knowledge item. There are two alternatives
(Garcia et al., 2003): subscription, where
every individual at the organisation may
subscribe to certain specific issues of interest,
and spreading, where the
notification messages reach the workers
without previous request. With spreading,
the messages can be sent to all the members
of the organisation, but this is not advisable,
as the recipient would not be able to discern
which of the vast number of messages
received might be of interest to him/her.
Another spreading possibility relies on
an individual or a group in charge of
determining the addressees of every given
message; this last option is quite convenient
for the members of the organisation but it
takes up a vast amount of resources, which
must themselves hold a lot of information
regarding the interests of every one of the
members.

1.4. Localisation mechanisms. The tool should
be provided with some search mechanism
in order to obtain the maximum possible
benefit from the captured and incorporated
knowledge (Tiwana, 2000). It is necessary
to strike a balance between efficiency
and functionality, as enough search options
should be available without increasing the
system complexity. For this reason, the
following search mechanisms are suggested:

• Hierarchy search: this search catalogues
  the knowledge into a fixed
  hierarchy, in such a way that the user
  can move through a group of links
  to refine the search performed.
• Attribute search: it is based on the
  specification of terms in which the
  user is interested, returning the
  knowledge elements that contain
  those terms. This type of search
  provides more general results than the
  previous one.

2. Yellow Pages: a KM system should not try to
capture and assimilate the whole of the knowledge
that exists at the organisation, as this would not be
feasible. Therefore, the Yellow Pages are used to
record not only the systems that store knowledge,
but also the individuals that have additional
knowledge. They are drawn up after
determining the knowledge possessed by every
individual at the organisation and by any
non-human agents.

3. Collaboration and communication mechanisms:
in organisations, knowledge is shared
and distributed regardless of whether the process
is automated or not. Technology helps the
interchange of knowledge and ideas among the
members of the organisation, as it brings
the best possible knowledge within reach of
the individual who requires it. The collaboration
and communication mechanisms identified are the
following:

3.1 Asynchronous communication. It does not
require both ends of the communication to be
connected at the same time.

• E-mail. Electronic mail enables the
  interchange of text and/or any other type
  of document between two or more users.
• Forum. It consists of a Web page where
  the participants leave questions that do
  not have to be answered at that very
  moment. Other participants leave the
  answers which, together with the
  questions, can be seen by anyone entering
  the forum at any moment.
• Suggestion box. It enables sending
  suggestions or comments on any relevant
  aspect of the organisation to the
  appropriate person or department.
• Notice board. It is a common space
  where the members of the organisation
  can publish announcements of general
  interest.
3.2 Synchronous communication. This type
of interactive technology is based on real-time
communications. Some of the most
important systems are the following:

• Chat. It involves communication
  among several people through the
  computer, as all the people connected
  can follow the conversation, express
  an opinion, contribute ideas, and
  ask or answer questions whenever they
  decide.
• Electronic board. It provides the members
  of the organisation with a shared
  space, where everybody draws or
  writes, for improving the interchange
  of ideas.
• Audio conference. Two or more users
  can use real-time voice communication.
• Video conference. Two or more users
  can use real-time image communication.
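
The two localisation mechanisms suggested under item 1.4 can be contrasted with a small sketch. It is illustrative only: the KnowledgeItem class, its fields and the sample entries are assumptions made for this example, and a real Corporate Memory would follow its own formalisation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeItem:
    """Hypothetical Corporate Memory entry (field names are assumptions)."""
    title: str
    category: tuple[str, ...]   # fixed position in the hierarchy, root first
    text: str


def hierarchy_search(items, path):
    """Return the items filed under `path` or any of its sub-categories."""
    path = tuple(path)
    return [i for i in items if i.category[:len(path)] == path]


def attribute_search(items, terms):
    """Return the items whose title or text contains every given term."""
    wanted = [t.lower() for t in terms]
    return [
        i for i in items
        if all(t in f"{i.title} {i.text}".lower() for t in wanted)
    ]


# Illustrative data only.
memory = [
    KnowledgeItem("Invoice approval", ("Finance", "Invoicing"),
                  "Steps to approve a supplier invoice."),
    KnowledgeItem("Budget review", ("Finance", "Planning"),
                  "Quarterly review of the department budget."),
    KnowledgeItem("Supplier onboarding", ("Purchasing",),
                  "How to register a new supplier and its invoice data."),
]

by_branch = hierarchy_search(memory, ["Finance"])
by_terms = attribute_search(memory, ["supplier", "invoice"])
```

In this sample the hierarchy search stays inside the "Finance" branch, whereas the attribute search returns items from both "Finance" and "Purchasing", which illustrates why its results are more general.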



FUTURE TRENDS 

As has been mentioned before, there is currently no
KM tool that adequately covers organisational
needs. This problem has been approached in the present
work by trying to determine the functionality that any
of these tools should incorporate. This is a first step that
should be complemented by subsequent work, as it
is necessary to go deeper and determine more precisely how
to approach and implement the specified aspects.



CONCLUSION 

Knowledge, whether managed or not,
is transmitted within organisations, although its
existence does not imply its adequate use. There is a
vast amount of knowledge that is extremely
difficult to access; this means that there are items from
which no return is being obtained and that are lost within
the organisation. KM represents the effort to capture
and benefit from the collective experience
of the organisation by making it accessible
to any of its members. However, it could be stated
that no current tool is able to perform
this task efficiently as, although so-called KM
tools exist, they merely store documents and none of them
structures the relevant knowledge
for its best use.

In order to alleviate such problems, the present work
proposes an approach based on market research and
on the definition of KM. It indicates how
to approach, and defines, the characteristics that a tool
should have in order to work as a facilitator of adequate
and explicit Knowledge Management.

KEY TERMS

Communication & Collaboration Tool: Systems
that enable collaboration and communication among
members of an organisation (e.g. chat applications,
whiteboards).

Document Management: It is the computerised 
management of electronic, as well as paper-based 
documents. 

Institutional Memory: It is the physical storage of 
the knowledge entered in an organization. 

Knowledge: Pragmatic level of information that 
provides the capability of dealing with a problem or 
making a decision. 



Knowledge Management: Discipline that intends
to provide, at its most suitable level, the accurate
information and knowledge for the right people, whenever
they may need them and at their best convenience.

Knowledge Management Tool: Organisational 
system that connects people with the information and 
communication technologies, with the purpose of 
improving the share and distribution processes of the 
organisational knowledge. 

Lesson Learned: Specific experience, positive 
or negative, of a certain domain. It is obtained into a 
practical context and can be used during future activi- 
ties of similar contexts. 

Yellow Page: It stores information about a human
or non-human source that has additional and/or
specialized knowledge about a particular subject.

INTRODUCTION

The tools of artificial intelligence (AI) can be divided 
into two broad types: knowledge-based systems (KBSs) 
and computational intelligence (CI). KBSs use explicit 
representations of knowledge in the form of words 
and symbols. This explicit representation makes the 
knowledge more easily read and understood by a hu- 
man than the numerically derived implicit models in 
computational intelligence. 

KBSs include techniques such as rule-based, model- 
based, and case-based reasoning. They were among the 
first forms of investigation into AI and remain a major 
theme. Early research focused on specialist applications 
in areas such as chemistry, medicine, and computer 
hardware. These early successes generated great opti- 
mism in AI, but more broad-based representations of 
human intelligence have remained difficult to achieve 
(Hopgood, 2003; Hopgood, 2005). 



BACKGROUND 

The principal difference between a knowledge-based
system and a conventional program lies in its structure.
In a conventional program, domain knowledge is
intimately intertwined with software for controlling the
application of that knowledge. In a knowledge-based
system, the two roles are explicitly separated. In the
simplest case there are two modules: the knowledge
module is called the knowledge base and the control
module is called the inference engine. Some interface
capabilities are also required for a practical system, as
shown in Figure 1.

Within the knowledge base, the programmer ex- 
presses information about the problem to be solved. 
Often this information is declarative, i.e. the program- 
mer states some facts, rules, or relationships without 
having to be concerned with the detail of how and 
when that information should be applied. These latter 
details are determined by the inference engine, which 
uses the knowledge base as a conventional program 
uses a data file. A KBS is analogous to the human 
brain, whose control processes are approximately 
unchanging in their nature, like the inference engine, 
even though individual behavior is continually modi- 
fied by new knowledge and experience, like updating 
the knowledge base. 

As the knowledge is represented explicitly in the
knowledge base, rather than implicitly within the
structure of a program, it can be entered and updated
with relative ease by domain experts who may not have
any programming expertise. A knowledge engineer is
someone who provides a bridge between the domain
expertise and the computer implementation. The
knowledge engineer may make use of meta-knowledge, i.e.
knowledge about knowledge, to ensure an efficient
implementation.

Figure 1. The main components of a knowledge-based system

Traditional knowledge engineering is based on 
models of human concepts. However, it has recently 
been argued that animals and pre-linguistic children 
operate effectively in a complex world without neces- 
sarily using concepts. Moss (2007) has demonstrated 
that agents using non-conceptual reasoning can outper- 
form stimulus-response agents in a grid-world test bed. 
These results may justify the building of non-conceptual 
models before moving on to conceptual ones. 



TYPES OF KNOWLEDGE-BASED
SYSTEM

Expert Systems

Expert systems are a type of knowledge-based system
designed to embody expertise in a particular specialized
domain such as diagnosing faulty equipment (Yanga,
2005). An expert system is intended to act like a human
expert who can be consulted on a range of problems
within his or her domain of expertise. Typically, the
user of an expert system will enter into a dialogue in
which he or she describes the problem - such as the
symptoms of a fault - and the expert system offers
advice, suggestions, or recommendations. It is often
proposed that an expert system must offer certain
capabilities that mirror those of a human consultant. In
particular, it is often stated that an expert system must
be capable of justifying its current line of inquiry and
explaining its reasoning in arriving at a conclusion.
This functionality can be integrated into the inference
engine (Figure 1).

Rule-Based Systems

Rules are one of the most straightforward means of
representing knowledge in a KBS. The simplest type
of rule is called a production rule and takes the form:

if <condition> then <conclusion>

An example production rule concerning a boiler
system might be:

/* rule1 */
if valve is open and flow is high then steam is escaping

Part of the attraction of using production rules is
that they can often be written in a form that closely
resembles natural language, as opposed to a computer
language. The facts in a KBS for boiler monitoring
might include:

/* fact1 */
valve is open

/* fact2 */
flow is high

One or more given facts may satisfy the condition of
a rule, resulting in the generation of a new fact, known
as a derived fact. For example, by applying rule1 to
fact1 and fact2, fact3 can be derived:

/*fact3 */ 

steam is escaping 

The derived fact may satisfy the condition of another 
rule, such as: 

/* rule2 */ 

if steam is escaping or valve is stuck then outlet is blocked 

This, in turn, may lead to the generation of a new
derived fact or an action. Rule1 and rule2 are
inter-dependent, since the conclusion of one can satisfy the
condition of the other. The inter-dependencies amongst
the rules define a network, as shown in Figure 2, known
as an inference network.

It is the job of the inference engine to traverse the 
inference network to reach a conclusion. Two important 
types of inference engine can be distinguished: forward- 
chaining and backward-chaining, also known as data- 
driven and goal-driven, respectively. A KBS working 
in data-driven mode takes the available information, 
i.e. the given facts, and generates as many derived facts 
as it can. In goal-driven mode, evidence is sought to 
support a particular goal or proposition. 

The data-driven (forward chaining) approach might
typically be used for problems of interpretation, where
the aim is to find out whatever the system can infer
about some data. The goal-driven (backward chaining)
approach is appropriate when a more tightly focused
solution is required, such as the generation of a plan for
a particular goal. In the example of a boiler monitoring
system, forward chaining would lead to the reporting
of any recognised problems. In contrast, backward
chaining might be used to diagnose a specific mode
of failure by linking a logical sequence of inferences,
disregarding unrelated observations.

Figure 2. An inference network for a boiler system
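
The contrast between the two modes can be made concrete with a short sketch built on rule1 and rule2 from the boiler example. It is a minimal illustration rather than a description of any particular inference engine: the rule encoding and the function names are assumptions, and the backward chainer does not guard against cyclic rule sets.

```python
# Each rule: (name, mode, conditions, conclusion); mode is "and" or "or".
RULES = [
    ("rule1", "and", ["valve is open", "flow is high"], "steam is escaping"),
    ("rule2", "or", ["steam is escaping", "valve is stuck"], "outlet is blocked"),
]


def satisfied(mode, conditions, test):
    """Apply `test` to each condition and combine the results with and/or."""
    combine = all if mode == "and" else any
    return combine(test(c) for c in conditions)


def forward_chain(facts):
    """Data-driven: keep firing rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for _name, mode, conditions, conclusion in RULES:
            if conclusion not in known and satisfied(
                mode, conditions, lambda c: c in known
            ):
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts):
    """Goal-driven: seek only the evidence that supports `goal`.

    Assumes the rules contain no cycles; a cyclic rule set would need a
    record of goals already being pursued.
    """
    if goal in facts:
        return True
    return any(
        satisfied(mode, conditions, lambda c: backward_chain(c, facts))
        for _name, mode, conditions, conclusion in RULES
        if conclusion == goal
    )
```

Calling forward_chain with fact1 and fact2 derives every reachable fact, whereas backward_chain("outlet is blocked", ...) examines only the conditions that bear on that one goal.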

The rules that make up the inference network in
Figure 2 are used to link cause and effect:

if <cause> then <effect>

Using the inference network, an inference can be 
drawn that if the valve is open and the flow rate is high 
(the causes) then steam is escaping (the effect). This 
is the process of deduction. Many problems, such as 
diagnosis, involve reasoning in the reverse direction, 
i.e. the user wants to ascertain a cause, given an effect. 
This is abduction. Given the observation that steam is 
escaping, abduction can be used to infer that the valve is
open and the flow rate is high. However, this is only a 
valid conclusion if the inference network shows all of 
the circumstances in which steam may escape. This is 
the closed-world assumption. 

If many examples of cause and effect are available, 
the rule (or inference network) that links them can be 
inferred. For instance, if every boiler blockage ever 
seen was accompanied by steam escaping and a stuck 
valve, then rule2 above might be inferred from those 
examples. Inferring a rule from a set of example cases 
of cause and effect is termed induction. 
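
Deduction and abduction over the boiler example can be sketched in a few lines. The table below is an assumption made for illustration: it encodes only rule1, and under the closed-world assumption it is taken to list every way in which the effect can arise.

```python
# Cause-and-effect table: effect -> list of alternative cause sets.
# Only rule1 from the text is encoded (an assumption of this sketch).
CAUSES_OF = {
    "steam is escaping": [{"valve is open", "flow is high"}],
}


def deduce(causes):
    """Deduction: from the observed causes and the rules, derive the effects."""
    observed = set(causes)
    return {
        effect
        for effect, alternatives in CAUSES_OF.items()
        if any(alt <= observed for alt in alternatives)
    }


def abduce(effect):
    """Abduction: from an observed effect and the rules, list candidate causes.

    The explanation is only certain when exactly one alternative is listed
    and the closed-world assumption holds.
    """
    return CAUSES_OF.get(effect, [])
```

If a second way for steam to escape were added to the table, abduce would return two candidate cause sets and the observation alone could no longer settle which one applies.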



Hopgood (2001) summarizes deduction, abduction,
and induction as follows:

deduction: cause + rule => effect
abduction: effect + rule => cause
induction: cause + effect => rule



Logic Programming 

Logic programming describes the use of logic to es- 
tablish the truth, or otherwise, of a proposition. It is, 
therefore, an underlying principle for rule-based sys- 
tems. Although various forms of logic programming 
have been explored, the most commonly used one is 
the Prolog language (Bramer, 2005), which embodies 
the features of backward chaining, pattern matching, 
and list manipulation. 

The Prolog language can be programmed declara- 
tively, although an appreciation of the procedural be- 
havior of the language is needed in order to program 
it effectively. Prolog is suited to symbolic problems, 
particularly logical problems involving relationships 
between items. It is also suitable for tasks that involve 
data lookup and retrieval, as pattern-matching is fun- 
damental to the functionality of the language. 

Symbolic Computation 

A knowledge base may contain a mixture of numbers, 
letters, words, punctuation, and complete sentences. 
These symbols need to be recognised and processed 
by the inference engine. Lists are a particularly useful 
data structure for symbolic computation, and they are 
integral to the AI languages Lisp and Prolog. Lists al- 
low words, numbers, and symbols to be combined in 
a wide variety of ways. A list in the Prolog language 
might look like this: 

[animal, [cat, dog], vegetable, mineral] 

where this example includes a nested list, i.e. a list within 
a list. In order to process lists or similar structures, the 
technique of pattern matching is used. For example, the 
above list in Prolog could match to the list 

[animal, [_,X], vegetable, Y] 

where the variables X and Y would be assigned values 
of dog and mineral respectively. This pattern matching 
capability is the basis of an inference engine's ability 
to process rules, facts and evolving knowledge. 
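The matching described above can be imitated in Python as a rough sketch. It assumes Prolog-like conventions chosen for this illustration: strings starting with an upper-case letter are variables, `_` is the anonymous variable, and other strings are constants. It performs one-way matching of a pattern against fully known data, which is simpler than the full unification that Prolog implements.

```python
def is_variable(term):
    """Assumed convention: a string starting with an upper-case letter is a variable."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, data, bindings=None):
    """Return a dict of variable bindings if pattern matches data, else None."""
    bindings = dict(bindings or {})
    if pattern == "_":                      # anonymous variable matches anything
        return bindings
    if is_variable(pattern):
        if pattern in bindings:             # already bound: values must agree
            return bindings if bindings[pattern] == data else None
        bindings[pattern] = data
        return bindings
    if isinstance(pattern, list):           # lists match element by element
        if not isinstance(data, list) or len(pattern) != len(data):
            return None
        for sub_pattern, sub_data in zip(pattern, data):
            bindings = match(sub_pattern, sub_data, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == data else None   # constants must be equal


facts = ["animal", ["cat", "dog"], "vegetable", "mineral"]
result = match(["animal", ["_", "X"], "vegetable", "Y"], facts)
# result == {"X": "dog", "Y": "mineral"}
```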

Uncertainty 

The examples considered so far have all dealt with 
unambiguous facts and rules, leading to clear conclu- 
sions. In real life, the situation can be complicated by 
three forms of uncertainty: 

Uncertainty in the Rule Itself 

For example, rule1 (above) stated that an open valve and 
high flow rate lead to an escape of steam. However, if 
the boiler has entered an unforeseen mode, it may be 
that these conditions do not lead to an escape of steam. 
The rule ought really to state that an open valve and high 
flow rate will probably lead to an escape of steam. 

Uncertainty in the Evidence 

There are two possible reasons why the evidence upon 
which the rule is based may be uncertain. First, the 
evidence may come from a source that is not totally reliable. 
For example, in rule1 there may be an element of 
doubt whether the flow rate is high, as this information 
relies upon a meter of unspecified reliability. Second, 
the evidence itself may have been derived by a rule 
whose conclusion was probable rather than certain. 



Use of Vague Language 

Rule1, above, is based around the notion of a "high" 
flow rate. There is uncertainty over whether "high" 
means a flow rate of the order of 1 cm³ s⁻¹ or 1 m³ s⁻¹. 

Two popular techniques for handling the first two 
sources of uncertainty are Bayesian updating and certainty 
theory (Hopgood, 2001). Bayesian updating has 
a rigorous derivation based upon probability theory, but 
its underlying assumptions, e.g., the statistical indepen- 
dence of multiple pieces of evidence, may not be true 
in practical situations. Certainty theory does not have 
a rigorous mathematical basis, but has been devised as 
a practical and pragmatic way of overcoming some of 
the limitations of Bayesian updating. It was first used 
in the classic MYCIN system for diagnosing infectious 
diseases (Buchanan, 1984). Other approaches are re- 
viewed in (Hopgood, 2001), where it is also proposed 
that a practical non-mathematical approach is to treat 
rule conclusions as hypotheses that can be confirmed or 
refuted by the actions of other rules. Possibility theory, 
or fuzzy logic, allows the third form of uncertainty, i.e. 
vague language, to be used in a precise manner. 
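To make the contrast between the two techniques concrete, the following sketch shows one common formulation of each. Neither formula is given in the text, so both are assumptions here: Bayesian updating is written in its odds form, O(H|E) = A × O(H) with A = P(E|H) / P(E|not H), and the certainty-factor combination is the function usually associated with MYCIN-style systems. The numerical values are invented for illustration.

```python
def odds(probability):
    return probability / (1.0 - probability)


def probability(odds_value):
    return odds_value / (1.0 + odds_value)


def bayesian_update(prior, p_e_given_h, p_e_given_not_h):
    """Odds-form Bayesian updating on one piece of evidence.

    Applying this repeatedly for several pieces of evidence assumes
    that they are statistically independent.
    """
    affirms = p_e_given_h / p_e_given_not_h
    return probability(affirms * odds(prior))


def combine_certainty(cf1, cf2):
    """Combine two certainty factors in [-1, 1] for the same hypothesis.

    Commonly cited MYCIN-style combination (an assumption here).
    Undefined when one factor is +1 and the other is -1.
    """
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1.0 - cf1)
    if cf1 <= 0 and cf2 <= 0:
        return cf1 + cf2 * (1.0 + cf1)
    return (cf1 + cf2) / (1.0 - min(abs(cf1), abs(cf2)))


posterior = bayesian_update(prior=0.2, p_e_given_h=0.8, p_e_given_not_h=0.2)
combined = combine_certainty(0.6, 0.5)
```

With these illustrative numbers, evidence that is four times as likely under the hypothesis raises a prior of 0.2 to 0.5, and two supporting certainty factors of 0.6 and 0.5 combine to 0.8.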

Decision Support and Analysis 

Decision support and analysis (DSA) and decision 
support systems (DSSs) describe a broad category 
of systems that involve generating alternatives and 
selecting among them. Web-based DSA, which uses 
external information sources, is becoming increasingly 
important. Decision support systems that use artificial 
intelligence techniques are sometimes referred to as 
intelligent DSSs. 

One clearly identifiable family of intelligent DSS is 
expert systems, described above. An expert system may 
contain a mixture of simple rules based on experience 
and observation, known as heuristic or shallow rules, 
and more fundamental or deep rules. For example, 
an expert system for diagnosing car breakdowns may 
contain a heuristic that suggests checking the battery 
if the car will not start. In contrast, the expert system 
might also contain deep rules, such as Kirchhoff's laws, 
which apply to any electrical circuit and could be used 
in association with other rules and observations to 
diagnose any electrical circuit. Heuristics can often 
provide a useful shortcut to a solution, but lack the 
adaptability of deep knowledge. 

Building and maintaining a reliable set of cause-ef- 
fect pairs in the form of rules can be a huge task. The 
principle of model-based reasoning (MBR) is that, rather 
than storing a huge collection of symptom-cause pairs 
in the form of rules, these pairs can be generated by 
applying underlying principles to the model. The model 
may describe any kind of system, including systems that 
are physical (Fenton, 2001), software-based (Mateis, 
2000), medical (Montani, 2003), legal (Bruninghaus, 
2003), and behavioral (De Koning, 2000). Models of 
physical systems are made up of fundamental compo- 
nents such as tubes, wires, batteries, and valves. As each 
of these components performs a fairly simple role, it 
also has a simple failure mode. Given a model of how 
these components operate and interact to form a device, 
faults can be diagnosed by determining the effects of 
local malfunctions on the overall device. 

Case-based reasoning (CBR) also has a major 
role in DSA. A characteristic of human intelligence 
is the ability to recall previous experience whenever 
a similar problem arises. This is the essence of case- 
based reasoning (CBR), in which new problems are 
solved by adapting previous solutions to old problems 
(Bergmann, 2003). 

Consider the example of diagnosing a broken- 
down car. If an expert system has made a successful 
diagnosis of the breakdown, given a set of symptoms, 
it can file away this information for future use. If the 
expert system is subsequently presented with details 
of another broken-down car of exactly the same type, 
displaying exactly the same symptoms in exactly 
the same circumstances, then the diagnosis can be 
completed simply by recalling the previous solution. 
However, a full description of the symptoms and the 
environment would need to be very detailed, and it is 
unlikely to be reproduced exactly. What is needed is 
the ability to identify a previous case, the solution of 
which can be reused or modified to reflect the slightly 
altered circumstances, and then saved for future use. 
Such an approach is a good model of human reasoning. 
Indeed case-based reasoning is often used in a semi- 
automated manner, where a human can intervene at 
any stage in the cycle. 
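The retrieve, reuse, and retain steps described above can be sketched as follows. The case structure, the Jaccard similarity measure, and the example symptoms are assumptions chosen for illustration; a practical system would use a richer case description and a domain-specific adaptation step.

```python
def similarity(a, b):
    """Jaccard similarity between two sets of symptoms (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def retrieve(case_base, symptoms):
    """Find the stored case whose symptoms are most similar to the new ones."""
    return max(case_base, key=lambda case: similarity(case["symptoms"], symptoms))


def solve(case_base, symptoms, adapt=lambda diagnosis, old, new: diagnosis):
    """Retrieve the closest case, adapt its solution, and retain the new case.

    `adapt` is a placeholder hook; by default the old diagnosis is reused as is.
    """
    best = retrieve(case_base, symptoms)
    diagnosis = adapt(best["diagnosis"], best["symptoms"], symptoms)
    case_base.append({"symptoms": set(symptoms), "diagnosis": diagnosis})
    return diagnosis


# Invented example cases for a broken-down car.
case_base = [
    {"symptoms": {"engine will not turn over", "lights dim"}, "diagnosis": "flat battery"},
    {"symptoms": {"engine turns over", "no fuel smell"}, "diagnosis": "fuel pump failure"},
]
answer = solve(case_base, {"engine will not turn over", "lights dim", "cold morning"})
```

The new problem does not reproduce any stored case exactly, yet the closest case still supplies a usable diagnosis, and the solved problem is added to the case base for future use.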



FUTURE TRENDS 

While large corporate knowledge-based systems re- 
main important, small embedded intelligent systems 
have also started to appear in the home and workplace. 
Examples include washing machines that incorporate 
knowledge-based control and wizards for personal 
computer management. By being embedded in their 
environment, such systems are less reliant on human 
data input than traditional expert systems, and often 
make decisions entirely based on sensor data. 

If AI is to become more widely situated into everyday 
environments, it needs to become smaller, cheaper, and 
more reliable. The next key stage in the development 
of AI is likely to be a move towards embedded AI, i.e. 
intelligent systems that are embedded in machines, 
devices, and appliances. The work of Choy (2003) is 
significant in this respect, as it demonstrates that the 
DARBS blackboard system can be ported to a compact 
platform of parallel low-cost processors. 

In addition to being distributed in their applications, 
intelligent systems are also becoming distributed in 
their method of implementation. Complex problems 
can be divided into subtasks that can be allocated to 
specialized collaborative agents, bringing together the 
best features of knowledge-based and computational 
intelligence approaches (Li, 2003). As the collaborating 
agents need not necessarily reside on the same 
computer, an intelligent system can be both distributed 
and hybridized (Choy, 2004). Paradoxically, there is 
also a sense in which intelligent systems are becoming 
more integrated, as software agents share access to a 
single definitive copy of data or knowledge, accessible 
via the web. 



CONCLUSION 

As with any technique, knowledge-based systems are 
not suitable for all types of problems. Each problem 
calls for the most appropriate tool, but knowledge-based 
systems can be used for many problems that would 
be impracticable by other means. They have been 
particularly successful in narrow specialist domains. 
Building an intelligent system that can make sensible 
decisions about unfamiliar situations in everyday, 
non-specialist domains remains a severe challenge. 







This development will require progress in simulating 
behaviors that humans take for granted - specifically 
perception, recognition, language, common sense, and 
adaptability. To build an intelligent system that spans 
the breadth of human capabilities is likely to require 
a hybrid approach using a combination of artificial 
intelligence techniques. 
KEY TERMS 

Backward Chaining: Rules are applied through 
depth-first search of the rule base to establish a goal. 
If a line of reasoning fails, the inference engine must 
backtrack and search a new branch of the search tree. 
This process is repeated until the goal is established 
or all branches have been explored. 
Case-Based Reasoning: Solving new problems by 
adapting solutions that were previously used to solve 
old problems. 

Closed-World Assumption: The assumption that 
all knowledge about a domain is contained in the 
knowledge base. Anything that is not true according 
to the knowledge base is assumed to be false. 

Deep Knowledge: Fundamental knowledge with 
general applicability, such as the laws of physics, which 
can be used in conjunction with other deep knowledge 
to link evidence and conclusions. 

Forward Chaining: Rules are applied iteratively 
whenever their conditions are satisfied, subject to a 
selection mechanism known as conflict resolution when 
the conditions of multiple rules are satisfied. 



Heuristic or Shallow Knowledge: Knowledge, 
usually in the form of a rule, that links evidence and 
conclusions in a limited domain. Heuristics are based 
on observation and experience, without an underlying 
derivation or understanding. 

Inference Network: The linkages between a set of 
conditions and conclusions. 

Knowledge-Based System: System in which the 
knowledge base is explicitly separated from the infer- 
ence engine that applies the knowledge. 

Model-Based Reasoning: The knowledge base 
comprises a model of the problem area, constructed from 
component parts. The inference engine reasons about 
the real world by exploring behaviors of the model. 

Production Rule: A rule of the form if <condition> 
then <conclusion>. 
INTRODUCTION 

In the analysis of a temporal process, Kohonen maps 
may be used together with time-series (TS) algorithms. 
Previous research aimed at combining Kohonen al- 
gorithms and Markov switching models in order to 
suggest a periodization of the international bimetallism 
in the 19th century (Boyer-Xambeu, Deleplace, 
Gaubert, Gillard and Olteanu, 2006). This research 
was based on an economic study of the international 
monetary system ruling at this time in Europe, which 
combined three monetary zones: a gold-standard one, 
centred in London, a bimetallic one, centred in Paris, 
and a silver-standard one, centred in Hamburg (Boyer- 
Xambeu, Deleplace and Gillard, 2006). The three major 
financial centres of that system (London, Paris, and 
Hamburg, hence the label LPH used hereafter) were 
linked through arbitrage operations between markets 
for gold and silver and markets for foreign exchange 
located in those centres. Since two metals, gold and 
silver, acted as monetary standards in that system, it 
worked as an international bimetallism. Its growing 
integration during half a century (from 1821 to 1873) 
was reflected in the convergence of the observed levels 
of the relative price of gold to silver in London, Paris, 
and Hamburg. However, this integration process was 
subject to various changes, which can be understood 
as exogenous shocks disturbing that process. 



One such shock is widely documented in the literature: 
the discovery of new gold mines in the United 
States and Australia, which led to a sudden decline 
in 1850 of the gold-silver price over all the markets 
in the world. This decline was not of the same mag- 
nitude everywhere, and therefore the spread between 
the London, Paris, and Hamburg gold-silver prices 
increased, stopping for a time the integration process of 
the system. This is what we will call a breaking in that 
process. The present paper aims at locating the major 
breakings occurring during the period of international 
bimetallism; a historical study could link them to special 
events, which operated as exogenous shocks on that 
system. The indicator of integration used is the spread 
between the highest and the lowest among the London, 
Paris, and Hamburg gold-silver prices. 
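As a small illustration of this indicator, the spread at one quotation date is the highest of the three gold-silver prices minus the lowest. The function names and the sample quotations below are invented for the example and are not data from the study.

```python
def spread(london, paris, hamburg):
    """Spread between the highest and the lowest of the three gold-silver prices."""
    prices = (london, paris, hamburg)
    return max(prices) - min(prices)


def spread_series(quotes):
    """Apply the indicator to a sequence of (London, Paris, Hamburg) quotations."""
    return [spread(*quote) for quote in quotes]


# Invented quotations, one tuple per quotation date.
quotes = [(15.70, 15.62, 15.85), (15.68, 15.66, 15.71)]
spreads = spread_series(quotes)
```

A narrowing spread over time indicates convergence of the three markets, which is how the integration of the system is read in what follows.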

Three algorithms are combined to study this integration: 
a periodization obtained with the SOM algorithm 
is compared with the estimation of a two-regime Markov 
switching model, in order to give an interpretation 
of the changes of regime; at the same time, change-points 
are identified over the whole period, providing 
a more precise interpretation of these varying types 
of regulation. 

Section 2 summarizes the results obtained with the 
SOM algorithm to differentiate the sub-periods obtained 
using the whole available data. 

Section 3 presents the kind of model used and the 
results of its estimation using the new indicator, the 
spread computed at each period of quotation between 
the three relative prices of gold in silver. The sub-periods 
are compared with the two regimes obtained, and 
some evidence of a relation between the regime and 
the volatility of the spread is presented. 

Section 4 presents the technique used to identify 
change-points in the temporal process and some strong 
results of breaks in mean and in variance of the spread 
are obtained. They are interpreted in terms of monetary 
history as, for some of them, they are quite new in the 
literature of this domain. 

Some further directions of research are indicated 
in conclusion. 



THE SUB-PERIODS OBTAINED WITH A 
SOM ALGORITHM (endnote 1) 

The Data 

The relative prices of gold in silver are computed from 
the price of each metal observed, twice a week, in each 
of the three financial places, Paris, London and Hamburg 
(respectively, poa, Igs, and hod), from the beginning 
of 1821 until the end of 1860. The same type of data 
is available for the exchange rates (Pound in Francs, 
Pound in Marks, Mark in Francs: respectively, Ipv, 
hlv, and phv). 

An observation is a set of twelve values, two 
quotations (Tuesday and Friday) for each of the six 
variables. 

A computed variable has been added to emphasize 
the relation between the relative price of metals in 
Hamburg and the average level in Paris and London 
of this value (hpl). 

Most of the time the quotations show rather small 
differences within a given week, but periods with im- 
portant troubles, Paris in the late 1840s for instance, 
may be well separated from the more classical ones. 

After the Kohonen classification using a grid of 25 
nodes, a hierarchical ascending classification is used to 
produce a small number of macro classes, in this case 6 
macro classes, corresponding to the main sub-periods. 
This latter classification is constructed with the code 
vectors obtained from the first process (endnote 2). 
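The two-step procedure (a Kohonen map with 25 nodes, followed by a hierarchical ascending classification of the code vectors into 6 macro classes) can be sketched in plain Python. Only the grid size and the number of macro classes come from the text. The learning-rate schedule, the Gaussian neighbourhood, the centroid-distance merging criterion, and all function names are assumptions of this sketch and need not match the authors' implementation.

```python
import math
import random


def best_matching_unit(codes, x):
    """Index of the code vector closest to observation x."""
    return min(range(len(codes)),
               key=lambda k: sum((w - xi) ** 2 for w, xi in zip(codes[k], x)))


def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Train a rows x cols Kohonen map and return its code vectors."""
    rng = random.Random(seed)
    codes = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                      # decreasing learning rate
            radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighbourhood
            br, bc = coords[best_matching_unit(codes, x)]
            for k, (r, c) in enumerate(coords):
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
                codes[k] = [w + lr * h * (xi - w) for w, xi in zip(codes[k], x)]
            step += 1
    return codes


def merge_to_macro_classes(codes, n_classes=6):
    """Agglomerative grouping of code vectors (closest centroids merged first)."""
    dim = len(codes[0])
    clusters = [[k] for k in range(len(codes))]

    def centroid(cluster):
        return [sum(codes[k][j] for k in cluster) / len(cluster) for j in range(dim)]

    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            ca = centroid(clusters[a])
            for b in range(a + 1, len(clusters)):
                cb = centroid(clusters[b])
                d = sum((u - v) ** 2 for u, v in zip(ca, cb))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]                                   # b > a, so index a is unaffected
    return clusters


# Synthetic observations stand in for the weekly quotations.
rng = random.Random(1)
data = [[rng.random() for _ in range(3)] for _ in range(60)]
codes = train_som(data)
macro_classes = merge_to_macro_classes(codes)
```

Each observation is then assigned to the macro class that contains its best-matching node, which yields the kind of periodization discussed in the next subsection.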



Characteristics of the Macro-Classes 

Large sequences of contiguous weeks are grouped in 
the macro-classes; however, a few years are fragmented 
into short periods situated in different classes. 

Class 1 consists of 3 groups of years, 1829-30, 1834-38 
and 1848-49, and many fragments of other years. 

Class 2 is simpler to describe, with 3 intervals, 
1832-33, 1842-43 and 1846-47, and some sparse 
weeks from the 1830s. 

They represent a central position contrasting with the 
other, well-identified classes: 

Class 3: 2 sets consisting of the years 1824-25 and 
1827-28, with almost no missing weeks in these 
intervals, indicating that this sub-period is very 
homogeneous. 

Class 4: the end of year 1853 and the whole period 
1854-60; again only a small number of weeks are 
missing for this continuous sub-period of more 
than seven years. 

Class 5: 1821-24 and 1826 to the beginning of 1827, plus 
small parts of 1830 and 1832. 

Class 6: two sets, 1839-41 and 1851-53. 

The means of the variables used to obtain the clas- 
sification can be represented to illustrate the great dif- 
ferences appearing between the sub-periods. Changing 
hierarchies between the relative prices are the charac- 
teristic identifying the four last macro-classes. 

Rearranging the various classes according to 
calendar time makes it possible to distinguish three 
sub-periods: a) the 1820s (classes 5 and 3, covering 
1821 to 1828); b) the 1830s and 1840s (classes 1 and 
2, covering 1829 to 1849); c) the 1850s (classes 6 and 
4, covering 1851 to 1860). 

Only the years 1839-41 resist that rearrangement, 
since they belong to class 6, while they should appear 
in classes 1 and 2 relative to the 1830s and 1840s; some 
explanation will be suggested in the last section. 

Fig. 1 exhibits two contrasted situations, where 
the gold-silver price is respectively low (class 4) and 
high (class 5) in all the three financial centres. Fig. 2 
confirms that opposition, since the two classes are also 
sharply contrasted by the levels of the exchange rates. 
Years 1821-23 and 1826 (class 5) are marked by a low 
mark/franc exchange rate and high gold-silver prices, 
the Hamburg one being higher than the Paris one; years 
1854-60 (class 4) are marked by a high mark/franc 
exchange rate and low gold-silver prices, the Hamburg 
one being below the Paris one. 

Figure 1. Gold-silver price and the 6 macro-classes 

Figure 2. Exchange rates and the 6 macro-classes 



These remarks, which also apply respectively to the 
rest of the 1820s (class 3) and to the rest of the 1850s 
(class 6), are consistent with historical analysis: while 
the Hamburg mark was always anchored to silver, the 
French franc was anchored to gold during the 1820s and 
1850s (in contrast with the 1830s and 1840s, when it 
was anchored to silver); it is then normal that the mark 
depreciated against the franc when silver depreciated 
against gold, and more in Hamburg than in Paris (as in 
classes 5 and 3), and that the mark appreciated against the 
franc when silver appreciated against gold, and more 
in Hamburg than in Paris (as in classes 4 and 6). 



A MODEL FOR THE SPREAD 
BETWEEN THE HIGHEST AND THE 
LOWEST GOLD-SILVER PRICE 

An Autoregressive Markov Switching 
Model 

The key assumption is that the time series to be modeled 
follow a different pattern or a different model according 
to some unobserved, finite valued process. Usually, the 
unobserved process is a Markov chain whose states are 
called "regimes", while the observed series follows a 
linear autoregressive model whose coefficients depend 
on the current regime. 

Let us put this in mathematical language. Suppose 
that $(y_t)_{t \in \mathbb{Z}}$ is the observed time series and that 
the unobserved process $(x_t)_{t \in \mathbb{Z}}$ is a two-state Markov 
chain with transition matrix 

$$\begin{pmatrix} p & 1-q \\ 1-p & q \end{pmatrix}, \quad \text{where } p, q \in \, ]0,1[ \qquad (1)$$

Then, assuming that $y_t$ depends on the first $l$ lags of 
time, we have the following equation of the model: 

$$y_t = a_0^{x_t} + a_1^{x_t} y_{t-1} + \dots + a_l^{x_t} y_{t-l} + \sigma^{x_t} \varepsilon_t \qquad (2)$$

where $a_i^{x_t} \in \{a_i^1, a_i^2\} \subset \mathbb{R}$ for every $i \in \{0, 1, \dots, l\}$, 
$\sigma^{x_t} \in \{\sigma^1, \sigma^2\} \subset \mathbb{R}_+^*$ and $\varepsilon_t$ is a standard Gaussian 
noise. 

The parameters of the model are then 

$$\{a_0^1, a_1^1, \dots, a_l^1, a_0^2, a_1^2, \dots, a_l^2, \sigma^1, \sigma^2, p, q\}$$

and they are usually estimated by maximizing the log-likelihood 
function via an EM (Expectation-Maximization) 
algorithm (endnote 3). 

Our characteristic of interest will be the "a posteriori" 
computed conditional probabilities of belonging to the 
first or to the second regime. Indeed, as our goal is to 
derive a periodization of the international bimetallism, 
the "a posteriori" computed states of the unobserved 
Markov chain will provide a natural one. 

Although the results obtained with a switching 
Markov model are usually satisfying in terms of prediction, 
and the periodizations are interesting and easily 
interpretable, a difficulty remains: how does one choose 
the number of regimes? In the absence of a complete 
theoretical answer, the criteria for selecting the "right" 
number of regimes are quite subjective from a statistical 
point of view (endnote 4). 

The Results 

In this paper we use a two-regime model to represent the 
spread computed with the gold-silver prices observed 
at each period on the three places. The transition matrix 
indicates good properties of stability: 

$$\begin{pmatrix} 0.844298 & 0.253357 \\ 0.155702 & 0.746643 \end{pmatrix}$$

and no three-regime model was found with an acceptable 
stability. 
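To give a feel for what the estimated transition matrix implies, the sketch below derives two standard quantities of a two-state Markov chain from the staying probabilities p = 0.844298 and q = 0.746643: the long-run share of time spent in regime 1 and the expected length of a stay in each regime. It also simulates a switching series. The simulation uses a simple AR(1) in both regimes with invented coefficients, which is an assumption made for illustration and not the fitted specification.

```python
import random

P, Q = 0.844298, 0.746643   # probabilities of staying in regime 1 and regime 2


def stationary_share_regime1(p, q):
    """Long-run proportion of periods spent in regime 1."""
    return (1.0 - q) / ((1.0 - p) + (1.0 - q))


def expected_duration(stay_probability):
    """Mean number of consecutive periods spent in a regime once entered."""
    return 1.0 / (1.0 - stay_probability)


def simulate(n, p, q, params, seed=0):
    """Simulate n periods; params[r] = (a0, a1, sigma) for regime index r (0 or 1)."""
    rng = random.Random(seed)
    regime, y = 0, 0.0
    regimes, series = [], []
    for _ in range(n):
        stay = p if regime == 0 else q
        if rng.random() >= stay:          # leave the current regime
            regime = 1 - regime
        a0, a1, sigma = params[regime]
        y = a0 + a1 * y + sigma * rng.gauss(0.0, 1.0)
        regimes.append(regime)
        series.append(y)
    return regimes, series


share = stationary_share_regime1(P, Q)
stay_1, stay_2 = expected_duration(P), expected_duration(Q)
# Invented AR(1) parameters: a calm regime and a more volatile one.
regimes, series = simulate(20000, P, Q, params=[(0.02, 0.8, 0.04), (0.05, 0.5, 0.08)])
```

Both diagonal entries are well above 0.5, so each regime tends to persist for several consecutive quotations once entered, which is one way to read the stability mentioned above.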

The first regime is a multilayer perceptron with one 
hidden layer; the second is a simple linear model 
with one lag. Using the probabilities computed for each 
regime at each period, it may be interesting to study 
the six sub-periods obtained and to observe the switch 
between the regimes along these periods of time. 

Table 1. Regime 1 and volatility of spread 

Sub-period   Number of obs.   % regime 1 (proportion)   Standard deviation of spread 
1            483              0.733                     0.053 
2            335              0.627                     0.061 
3            191              0.445                     0.075 
4            376              0.816                     0.044 
5            303              0.625                     0.050 
6            390              0.723                     0.049 

Most of the time regime 1 explains the spread 
(about 70% of the whole period), but important differences 
are to be noted between the sub-periods. 

Classes 3 and 4 clearly contrast: they show, respectively, 
the highest and the lowest volatility of spread, and 
they are ruled mainly by, respectively, the regime 2 and 
regime 1 models. 

As will be explained later, further investigations 
have to be made with a more complex model and using 
a better-suited indicator of the arbitrages ruling 
the markets. 



IDENTIFICATION OF CHANGE-POINTS: 
A GLOBAL VISION OF THE 
BIMETALLIST SYSTEM OF PAYMENTS 

Elements About the Technique (endnote 5) 

A different approach to modelling changes of regime in a 
time series is to detect change-points or breaks. Here, 
the main assumption is that the whole series is observed 
and change-points are computed "a posteriori". 
Thus, this approach does not have a predictive goal; it is 
rather aimed at explaining the series by a piecewise 
stationary process, which seems to be well adapted to 
our problem. 

Mathematically, the model can be written as follows: 
let us consider the observed m-dimensional series 
$Y_t = (Y_{1,t}, \dots, Y_{m,t})^{\top}$, $t = 1, \dots, T$, and suppose that it is abruptly 
changed. The changes, whose number and configuration 
are unknown, occur in the marginal distribution 
and may be in mean, in variance or in both mean and 
variance. We assume that there exists an integer $K^*$ 
and a sequence of change-points $\tau^* = \{\tau_1^*, \dots, \tau_{K^*-1}^*\}$ with 
$\tau_0^* = 0 < \tau_1^* < \dots < \tau_{K^*-1}^* < \tau_{K^*}^* = T$ such that 
$(\mu_k, \Sigma_k) \neq (\mu_{k+1}, \Sigma_{k+1})$, where $\mu_k = E(Y_t)$ and 
$\Sigma_k = \mathrm{Cov}(Y_t) = E\left[(Y_t - E(Y_t))(Y_t - E(Y_t))^{\top}\right]$ for $\tau_{k-1}^* + 1 \le t \le \tau_k^*$. 

The number of changes as well as their configuration 
are computed by minimizing a penalized contrast 
function. Details on the algorithms for computing the 
change-point configuration $\tau^*$ can be found in Lavielle 
and Teyssiere (2006) (endnote 6). 



Some Results and Interpretation 

Applying this technique to the spread gave 7 change- 
points in mean and 4 in mean and variance. 
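The idea of minimizing a penalized contrast can be illustrated with a much simplified sketch that looks for changes in mean only, in a one-dimensional series. The contrast is the sum of squared deviations from each segment mean, and a fixed penalty `beta` is paid for every segment. The fixed penalty, the exhaustive dynamic programming search, and the synthetic series are assumptions of this sketch; they do not reproduce the adaptive procedure of Lavielle and Teyssiere (2006).

```python
def detect_mean_changes(y, beta):
    """Return the change-points minimizing (within-segment squared error + beta per segment).

    A change-point s means that a new segment starts at index s.
    """
    n = len(y)
    s1 = [0.0] * (n + 1)   # prefix sums of y
    s2 = [0.0] * (n + 1)   # prefix sums of y squared
    for i, value in enumerate(y):
        s1[i + 1] = s1[i] + value
        s2[i + 1] = s2[i] + value * value

    def cost(a, b):
        """Sum of squared deviations from the mean over y[a:b], with b > a."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] + [float("inf")] * n   # best[t]: optimal penalized cost of y[:t]
    previous = [0] * (n + 1)            # start index of the last segment of y[:t]
    for t in range(1, n + 1):
        for s in range(t):
            candidate = best[s] + cost(s, t) + beta
            if candidate < best[t]:
                best[t], previous[t] = candidate, s

    change_points, t = [], n
    while t > 0:                        # walk back through the optimal segmentation
        s = previous[t]
        if s > 0:
            change_points.append(s)
        t = s
    return sorted(change_points)


# Synthetic series: small oscillations around 0, then around 5 from index 50 onwards.
wiggle = [0.1 if i % 2 == 0 else -0.1 for i in range(50)]
series = wiggle + [5.0 + w for w in wiggle]
breaks = detect_mean_changes(series, beta=1.0)
```

A larger `beta` yields fewer change-points and a smaller one yields more, so the choice of penalty plays the role that the selection of the number of changes plays in the full method.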

Fig. 3 shows the spread, the four change-points 
obtained in mean and variance (the first 4 green lines 
in chronological order), and the last 2 change-points in 
mean, which correspond to a major break in the level 
of the gold-silver price, observed simultaneously in 
the three places and linked to the great change in 
the production of gold in the United States. 

A closer look at the spread between the highest 
and the lowest among the London, Hamburg and Paris 
gold-silver prices draws attention to three episodes, 
each of them beginning with a break which sharply 
increases the spread and ending with another breaking 
which sharply narrows it (green vertical lines on Fig. 
3). These episodes have in common that they are linked to 
shocks affecting the integration process of the LPH 
system, although the shocks may have been asymmetrical 
(only one or two of the financial centres being 
initially hit) or symmetrical (the three of them being 
simultaneously hit). 

Figure 3. Spread, change-points and probability of regime 1 
[Plot: SPREAD (Max-Min); series shown: spread, prob1] 

The first episode runs from the 21st week of 1824 till 
the 41st week of 1825. The sharp initial increase in the 
spread may be explained by two opposite movements in 
London and Hamburg: on one side, heavy speculation 
in South-American bonds and Indian cotton fuelled 
in London the demand for foreign payments in silver, 
which resulted in a great increase in the price of silver 
and a corresponding decline in the gold-silver price; 
on the other side, the price of gold rose in Hamburg 
while the price of silver remained constant, sparking 
the huge spread between the highest (Hamburg) and 
the lowest (London) gold-silver prices. More than one 
year later, the opposite movements took place: the price 
of gold plunged in Hamburg, while the price of silver 
remained at its height in London, under the influence 
of continuing speculation (which would end up in the 
famous banking crisis of December 1825); consequently 
the spread abruptly narrowed, this event being reflected 
by the breaking of the 41st week of 1825. 



The second episode runs from the 45th week of 1839 
till the 13th week of 1843. It started with the attempt 
of Prussia to unify the numerous German-speaking 
independent states in a common monetary zone, on a 
silver standard. Since the Bank of Hamburg maintained 
the price of silver fixed, that pressure on silver led to a 
drop in the Hamburg price of gold, and consequently 
in its gold-silver price, at a time when it was more or 
less stabilized in Paris. The spread between the highest 
(Paris) and the lowest (Hamburg) gold-silver price 
suddenly widened, and for more than three 
years remained at a level significantly higher than 
during the 14 preceding years. This episode ended 
with the breaking of the 13th week of 1843, when, this 
shock having been absorbed, the gold-silver price in 
Hamburg went back in line with the price in the two 
other financial centres. 

The third episode runs from the 46th week of 1850 
till the 41st week of 1854. The shock was then symmetrical: 
London, Paris and Hamburg were hit by the 
influx of gold following the discovery of Californian 
mines, and the sudden downward pressure on the world 
price of that metal. It took four years to absorb this 
enormous shock, as reflected by the breaking of the 
41st week of 1854. 



CONCLUSION 

In the three cases, the integration process of the LPH 
system, shown by the downward trend of the spread 
over half a century, was jeopardized by a shock: a 
speculative one in 1824, an institutional one in 1839, 
a technological one in 1850. But the effects of these 
shocks were absorbed after some time, thanks to ac- 
tive arbitrage operations between the three financial 
centres of the system. Generally, that arbitrage did not 
imply the barter of gold for silver but the coupling of 
a foreign exchange operation (on bills of exchange) 
with the transport of one metal only. 

As a consequence, it would be appropriate in a 
further study to locate the breakings of another indica- 
tor of integration: the spread between a representative 
"national" gold-silver price and an arbitrated interna- 
tional gold-silver price taking into account the foreign 
exchange rates. At the same time it would be interesting 
to go further with the Markov switching model, trying 
more complete specifications. 
KEY TERMS 

Change-Point: Instant of time at which the basic 
parameters of a time series change (in mean and/or in 
variance); the series may be considered as a piecewise 
stationary process between two change-points. 

Gold-Silver Price: Ratio of the market price of gold 
to the market price of silver in one place. The stability 
of that ratio through time and the convergence of its 
levels in the various places constituting the interna- 
tional bimetallism (see that definition) are tests of the 
integration of that system. 

International Arbitrage: Activity of traders in gold 
and silver and in foreign exchange, which consisted in 
comparing their prices in different places, and in moving 
the precious metals and the bills of exchange accord- 
ingly, in order to make a profit. Arbitrage and monetary 
rules were the two factors explaining the working of 
international bimetallism (see that definition). 

International Bimetallism: An international monetary 
system (see that definition) which worked from 
1821 to 1873. It was based on gold and silver acting as 
monetary standards, either together in the same country 
(like France) or separately in different countries (gold 
in England, silver in German and Northern states). The 
integration of that system was reflected in the stabil- 
ity and the convergence of the observed levels of the 
relative price of gold to silver (see that definition) in 
London, Paris, and Hamburg. 

International Monetary System: A system linking 
the currencies of various countries, which ensures 
the stability of the exchange rates between them. Its 
working depends on the monetary rules adopted in each 
country and on international arbitrage (see that definition) 
between the foreign exchange markets. Historical 
examples are the gold-standard system (1873-1914) and 
the Bretton Woods system (1944-1976). The paper studies 
some characteristics of another historical example: 
international bimetallism (see that definition). 

Markov Switching Model: An autoregressive 
model where the process linking a present value to its 
lags is a hidden Markov chain defined by its transition 
matrix. 

SOM Algorithm: An unsupervised technique of 
classification (Kohonen, 1984) combining adaptive 
learning and neighbourhood to construct a very stable 
classification, with a simpler interpretation ('Kohonen 
maps') than other techniques. 



ENDNOTES 



Details may be found in Boyer-Xambeu, ..., 

Olteanu, 2006. 
INTRODUCTION 

The view of artificial neural networks as adaptive 
systems has led to the development of ad-hoc generic 
procedures known as learning rules. The first of these is 
the Perceptron Rule (Rosenblatt, 1962), useful for single 
layer feed-forward networks and linearly separable 
problems. Its simplicity and beauty, and the existence 
of a convergence theorem made it a basic departure 
point in neural learning algorithms. This algorithm 
is a particular case of the Widrow-Hoff or delta rule 
(Widrow & Hoff, 1960), applicable to continuous 
networks with no hidden layers with an error function 
that is quadratic in the parameters. 



BACKGROUND 

The first truly useful algorithm for feed-forward
multilayer networks is the backpropagation algorithm
(Rumelhart, Hinton & Williams, 1986), reportedly
proposed first by Werbos (1974) and Parker (1982).
Many efforts have been devoted to enhancing it in a
number of ways, especially concerning speed and
reliability of convergence (Haykin, 1994; Hecht-Nielsen,
1990). The backpropagation algorithm serves in general
to compute the gradient vector in all the first-order
methods, reviewed below.

Neural networks are trained by setting values for
the network parameters $w$ to minimize an error function
$E(w)$. If this function is quadratic in $w$, then the
solution can be found by solving a linear system of
equations (e.g. with Singular Value Decomposition
(Press, Teukolsky, Vetterling & Flannery, 1992)) or
iteratively with the delta rule. In general, the minimization
is realized by a variant of a gradient descent procedure,
whose ultimate outcome is a local minimum: a $w$ from
which any infinitesimal change makes $E(w)$ increase,
and which may not correspond to one of the global minima.
Different solutions are found by starting at different
initial states. The process is also perturbed by round-off
errors. Given $E(w)$ to be minimized and an initial
state $w^0$, these methods perform at each iteration the
updating step:

$$w^{i+1} = w^i + \alpha_i u^i \tag{1}$$

where $u^i$ is the minimization direction (the direction in
which to move) and $\alpha_i \in \mathbb{R}$ is the step size
(how far to move along $u^i$), also known as the learning rate in
earlier contexts. For convenience, define $\Delta w^i = w^{i+1} - w^i$.
Common stopping criteria are:

1. A maximum number of presentations of D (epochs)
is reached.

2. A maximum amount of computing time has been
exceeded.

3. The error function has been minimized below a certain
tolerance.

4. The gradient norm has fallen below a certain
tolerance.
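
The update step (1) and the four stopping criteria can be sketched together as follows. This is a minimal illustration under stated assumptions: the error surface is a hypothetical two-parameter quadratic, the step size is constant, and the function and parameter names (`gradient_descent`, `error_tol`, `grad_tol`, `max_seconds`) are ours.

```python
import math
import time


def gradient_descent(error, grad, w0, alpha=0.1, max_epochs=1000,
                     max_seconds=10.0, error_tol=1e-10, grad_tol=1e-8):
    """Iterate w <- w + alpha * u, with u = -grad(w), until a criterion fires."""
    w = list(w0)
    start = time.monotonic()
    for _ in range(max_epochs):
        g = grad(w)
        if math.sqrt(sum(gi * gi for gi in g)) < grad_tol:
            return w, "gradient norm below tolerance"      # criterion 4
        if error(w) < error_tol:
            return w, "error below tolerance"              # criterion 3
        if time.monotonic() - start > max_seconds:
            return w, "computing time exceeded"            # criterion 2
        w = [wi - alpha * gi for wi, gi in zip(w, g)]       # update step (1)
    return w, "maximum number of epochs reached"           # criterion 1


# Hypothetical error surface E(w) = (w_0 - 1)^2 + 2 (w_1 + 3)^2.
def example_error(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 3.0) ** 2


def example_grad(w):
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 3.0)]


w_final, reason = gradient_descent(example_error, example_grad, [0.0, 0.0])
print(w_final, reason)
```

Which criterion fires first depends on the tolerances chosen; in practice several of them are usually combined, as here.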



LEARNING ALGORITHMS 

Training algorithms may require information from 
the objective function only, the gradient vector of the 
objective function or the Hessian matrix of the objec- 
tive function: 

Zero-order training algorithms make use of the
objective function only. The most significant
algorithms are evolutionary algorithms, which are
global optimization methods (Goldberg, 1989).

First-order training algorithms use the objective
function and its gradient vector. Examples are
Gradient Descent, Conjugate Gradient or Quasi-Newton
methods, which are all local optimization
methods (Luenberger, 1984).

Second-order training algorithms make use of
the objective function, its gradient vector and its
Hessian matrix. Examples are Newton's method
and the Levenberg-Marquardt algorithm, which
are local optimization methods (Luenberger,
1984).

First-order methods. The gradient $\nabla E_w$ of an
$s$-dimensional function is the vector field of first
derivatives of $E(w)$ w.r.t. $w$,

$$\nabla E_w = \left( \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_s} \right) \tag{2}$$

Here $s = \dim(w)$. A linear approximation to $E(w)$ in
an infinitesimal neighbourhood of an arbitrary point
$w^i$ is given by:

$$E(w) \approx E(w^i) + \nabla E_w(w^i) \cdot (w - w^i) \tag{3}$$

We write $\nabla E_w(w^i)$ for the gradient $\nabla E_w$
evaluated at $w^i$. These are the first two terms of the Taylor
expansion of $E(w)$ around $w^i$. In steepest or gradient
descent methods, this local gradient alone determines
the minimization direction $u^i$. Since, at any point $w^i$,
the gradient $\nabla E_w(w^i)$ points in the direction of fastest
increase of $E(w)$, an adjustment of $w^i$ in the negative
direction of the local gradient leads to its maximum
decrease. In consequence the direction $u^i = -\nabla E_w(w^i)$
is taken.

In conventional steepest descent, the step size $\alpha_i$
is obtained by a line search in the direction of $u^i$: how
far to go along $u^i$ before a new direction is chosen.
To this end, evaluations of $E(w)$ and its derivatives
are made to locate some nearby local minimum. Line
search is a move in the chosen direction $u^i$ to find the
minimum of $E(w)$ along it. For this one-dimensional
problem, the simplest approach is to proceed along $u^i$
in small steps, evaluating $E(w)$ at each sampled point,
until it starts to increase. One often used method is a
divide-and-conquer strategy, also called Brent's method
(Fletcher, 1980):

1. Bracket the search by setting three points $a<b<c$
along $u^i$ such that $E(a u^i) > E(b u^i) < E(c u^i)$. Since
$E$ is continuous, there is a local minimum in the
line joining $a$ to $c$.

2. Fit a parabola (a quadratic polynomial) to $a,b,c$.

3. Compute the minimum $\mu$ of the parabola in the
line joining $a$ to $c$. This value is an approximation
of the minimum of $E$ in this interval.

4. Set three new points $a,b,c$ out of $\mu$ and the two
points among the old $a,b,c$ having the lowest $E$.
Repeat from 2.
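
The four steps above can be sketched as a short parabolic-interpolation routine. This is a simplified illustration of the listed steps only, not the full Brent's method with its safeguards; the function names, the tolerance and the quadratic test surface are assumptions of ours.

```python
def parabola_minimum(a, b, c, fa, fb, fc):
    """Abscissa of the vertex of the parabola through (a,fa), (b,fb), (c,fc)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if den == 0.0:          # degenerate (collinear or repeated points)
        return b
    return b - 0.5 * num / den


def line_search(f, a, b, c, tol=1e-8, max_iter=50):
    """Steps 2-4: fit, take the parabola minimum mu, rebuild the triple, repeat."""
    mu = b
    for _ in range(max_iter):
        fa, fb, fc = f(a), f(b), f(c)
        new_mu = parabola_minimum(a, b, c, fa, fb, fc)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
        best_two = sorted([(fa, a), (fb, b), (fc, c)])[:2]
        a, b, c = sorted([mu] + [t for _, t in best_two])
    return mu


# Hypothetical error surface and a search along u = -gradient from w = (0, 0).
def example_error(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 3.0) ** 2


w = [0.0, 0.0]
u = [2.0, -12.0]


def along_line(t):
    return example_error([wi + t * ui for wi, ui in zip(w, u)])


# Step 1: the bracket 0 < 0.2 < 1 satisfies E(a) > E(b) < E(c) here.
step = line_search(along_line, 0.0, 0.2, 1.0)
print(step)
```

Because the example surface is quadratic, the first fitted parabola is already exact; on a general error function several refits are needed.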

Although a line search can locate the nearby
minimum, its cost can become prohibitively high.
The line search can be replaced by a fixed step size $\alpha$,
which has to be carefully chosen. A sufficiently small
$\alpha$ is required such that $\alpha \nabla E_w(w^i)$ is effectively very
small and the expansion (3) can be applied. Too large a
value may overshoot the minimum or lead to divergent
oscillations and a complete breakdown of the algorithm.
On the other hand, very small values translate into a
painfully slow minimization. In practice, a trial-and-error
process is carried out.

A popular heuristic is a historic average of previous
changes to exploit tendencies and add inertia
to the descent, accomplished by adding a so-called
momentum term $\beta_i \Delta w^{i-1}$, where $\Delta w^{i-1}$ is the previous
weight update (Rumelhart, Hinton & Williams, 1986).
This term helps to avoid or smooth out oscillations in
the motion towards a minimum. In practice, it is set
to a constant value $\beta \in (0.5, 1)$. Altogether, for steepest
descent, the update equation (1) reads:

$$w^{i+1} = w^i + \alpha_i u^i + \beta \Delta w^{i-1} \tag{4}$$

where $u^i = -\nabla E_w(w^i)$ and $\Delta w^{i-1} = w^i - w^{i-1}$. This method
is very sensitive to the chosen values for $\alpha_i$ and $\beta$, to
the point that different values are required for different
problems and even for different stages in the learning
process (Toolenaere, 1990). The inefficiency of the
steepest descent method stems from the fact that both
$u^i$ and $\alpha_i$ are somewhat poorly chosen. Unless the first
step is chosen leading straight to a minimum, the iterative
procedure is very likely to wander with many small
steps in zig-zag. Therefore, these methods are rarely
used nowadays. A method in which both parameters
are properly chosen is the conjugate gradient.
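
The effect of the momentum term can be illustrated with a small sketch. The quadratic surface, the step size (deliberately chosen close to the oscillation limit of plain steepest descent) and the momentum value 0.6 are assumptions made for the example, not values taken from the text.

```python
def descend(grad, w0, alpha, beta, steps):
    """Apply w <- w - alpha * grad(w) + beta * (previous update), `steps` times."""
    w = list(w0)
    previous_update = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        update = [-alpha * gi + beta * pi
                  for gi, pi in zip(g, previous_update)]
        w = [wi + ui for wi, ui in zip(w, update)]
        previous_update = update
    return w


# Hypothetical ill-conditioned quadratic E(w) = (w_0^2 + 25 w_1^2) / 2.
def example_error(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def example_grad(w):
    return [w[0], 25.0 * w[1]]


plain = descend(example_grad, [1.0, 1.0], alpha=0.078, beta=0.0, steps=60)
with_momentum = descend(example_grad, [1.0, 1.0], alpha=0.078, beta=0.6, steps=60)
print(example_error(plain), example_error(with_momentum))
```

With these settings the plain iteration oscillates along the steep coordinate and decays slowly, while the momentum variant damps the oscillation; other values of the two parameters can reverse the picture, which is the sensitivity noted above.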

Conjugate Gradient. This minimization technique
(explained at length in Shewchuck, 1994) is based on
the idea that a new direction $u^{i+1}$ should not spoil
previous minimizations in the directions $u^i, u^{i-1}, \ldots, u^1$.
Previous minimizations are spoiled if we simply choose
$u^i = -g^i$, where $g^i = \nabla E_w(w^i)$, as was found above for
steepest descent. At most points on $E(w)$, the gradient does
not point directly towards the minimum. After a line
minimization, the new gradient $g^{i+1}$ is orthogonal to the line
search direction, that is, $g^{i+1} \cdot u^i = 0$. Thus, successive
search directions will also be orthogonal, and the error
function minimization will proceed in zig-zag in an
extremely slow advance to a minimum.

The solution to this problem lies in determining
consecutive search directions $u^{i+1}$ in such a way that
the component of the gradient parallel to $u^i$ (which has
just been made zero, because we minimized in that
direction) remains zero, so consecutive search directions
complement each other, avoiding the possibility
of spoiling the progress made in previous iterations.

Let us assume a line minimization has just been
made along $u^i$, starting from the current weights $w^i$;
we have thus found a new point $w^{i+1}$ for which

$$\nabla E_w(w^{i+1}) \cdot u^i = 0 \tag{5}$$

holds. The next search direction $u^{i+1}$ is chosen to retain
the property that the component of the gradient parallel
to $u^i$ remains zero:

$$\nabla E_w(w^{i+1} + \alpha_{i+1} u^{i+1}) \cdot u^i = 0 \tag{6}$$

Expanding (6) to first order in $\alpha_{i+1}$, and applying (5),
we obtain the condition (Bishop, 1995):

$$u^{i+1} \cdot H_w(w^{i+1}) \cdot u^i = 0 \tag{7}$$

If the error surface is quadratic, (7) holds regardless
of the value of $\alpha_{i+1}$, because the Hessian is constant, and
higher-order terms in the previous expansion vanish.
Search directions $u^{i+1}, u^i$ fulfilling (7) are said to be
conjugate. It can be proven that, in these conditions,
it is possible to construct a sequence $u^1, \ldots, u^s$ such that
each $u^i$ is conjugate to all previous directions, so that the
minimum can be located in at most $s = \dim(w)$ steps.
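
The claim that a set of conjugate directions locates the minimum of a quadratic surface in at most $s$ steps can be checked numerically. The sketch below assumes a hypothetical two-dimensional surface $E(w) = \frac{1}{2} w \cdot H \cdot w - b \cdot w$ with constant Hessian $H$; the second direction is made conjugate to the first by removing its $H$-projection, a construction chosen here for illustration only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


# Hypothetical quadratic surface with constant, positive definite Hessian.
H = [[4.0, 1.0], [1.0, 3.0]]
b_vec = [1.0, 2.0]


def gradient(w):
    return [hw - bi for hw, bi in zip(matvec(H, w), b_vec)]


# Build two conjugate directions: u2 . H . u1 = 0, as in condition (7).
u1 = [1.0, 0.0]
e2 = [0.0, 1.0]
coef = dot(e2, matvec(H, u1)) / dot(u1, matvec(H, u1))
u2 = [e - coef * u for e, u in zip(e2, u1)]

# One exact line minimization along each direction (s = 2 steps in total).
w = [0.0, 0.0]
for u in (u1, u2):
    g = gradient(w)
    alpha = -dot(g, u) / dot(u, matvec(H, u))   # exact minimizer along u
    w = [wi + alpha * ui for wi, ui in zip(w, u)]

print(w, gradient(w))
```

After the second line minimization the gradient vanishes, so the first minimization has not been spoiled by the second.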

The conjugate gradient technique departs from (1)
but sets $u^{i+1} = -g^{i+1} + \beta_i u^i$, setting $u^1 = -g^1$. It turns out
that the $\beta_i$ can be found without explicit knowledge
of the Hessian and the various versions of conjugate
gradient are distinguished by the manner in which the
parameter $\beta_i$ is set. For the Polak-Ribiere updating:

$$\beta_i = \frac{g^{i+1} \cdot (g^{i+1} - g^i)}{g^i \cdot g^i} \tag{8}$$

This is the inner product of the previous change
in the gradient with the current gradient, divided by
the squared norm of the previous gradient. For the
Fletcher-Reeves updating:

$$\beta_i = \frac{g^{i+1} \cdot g^{i+1}}{g^i \cdot g^i} \tag{9}$$

This is the ratio of the squared norm of the current
gradient to the squared norm of the previous gradient.
The $\alpha_i$ can be found by performing a line search (at
each iteration step) to determine the optimal distance
to move along the current train direction; that is, a line
minimization of $E(w^i + \alpha_i u^i)$ w.r.t. $\alpha_i$. This results in
a particular case of steepest descent with momentum,
where the parameters $\alpha_i, \beta_i$ are determined at each
iteration. For a quadratic error surface $E(w)$, the method
finds the minimum after at most $s = \dim(w)$ steps, without
calculating the Hessian. In practice $E(w)$ may be far
from being quadratic so the technique needs to be run
for many iterations and augmented with a criterion to
reset the search vector to the negative gradient direction
$u^i = -g^i$ after every $s$ steps (Press, Teukolsky,
Vetterling & Flannery, 1992). The scaled version (Møller,
1993) takes some account of the non-quadratic nature
of the error function. With these enhancements the
method is generally believed to be fast and reliable.
Contrary to steepest descent, it is relatively insensitive
to its parameters (the line search for $\alpha_i$ and the variants
of computing $\beta_i$) if they are set within a reasonable
tolerance. There is some evidence that the Polak-Ribiere
formula accomplishes the transition to further
iterations more efficiently: when it runs out of steam,
it tends to reset the train direction $u^i$ to be down the
local gradient, which is equivalent to beginning the
conjugate-gradient procedure again.

The Newton training algorithm. First-order
approximations ignore the curvature of $E(w)$. This can
be fixed by considering the second-order term of the
Taylor expansion around some $w^i$ in weight space:

$$E(w) \approx E(w^i) + \nabla E_w(w^i) \cdot (w - w^i) + \frac{1}{2} (w - w^i) \cdot H_w(w^i) \cdot (w - w^i) \tag{10}$$

where $H_w = \nabla \nabla E_w$ is the Hessian $s \times s$ matrix of second
derivatives, with components

$$H_w = (h_{ij}), \quad h_{ij} = \frac{\partial^2 E(w)}{\partial w_i \partial w_j}$$
Again, $H_w(w^i)$ indicates the evaluation of $H_w$ in $w^i$.
The Hessian traces the curvature of the error function in
weight space, portraying information about how $\nabla E_w$
changes in different directions. Given a direction $u^i$ from
$w^i$, the product $H_w(w^i) \cdot u^i$ is the rate of change of the
gradient along $u^i$ from $w^i$. Differentiating (10) w.r.t. $w$,
a local approximation of the gradient around $w^i$ is:

$$\nabla E_w \approx \nabla E_w(w^i) + H_w(w^i) \cdot (w - w^i) \tag{11}$$

For points close to $w^i$, (10) and (11) give reasonable
approximations to the error function $E(w)$ and to
its gradient. Setting (11) to zero, and solving for $w$,
one gets:

$$w^* = w^i - H_w^{-1}(w^i) \cdot \nabla E_w(w^i) \tag{12}$$

Newton's method uses the second partial derivatives
of the objective function and hence is a second order
method, finding the minimum of a quadratic function
in just one iteration. The vector $H_w^{-1}(w^i) \cdot \nabla E_w(w^i)$ is
known as the Newton train direction. Since higher-order
terms have been neglected, the update formula (12) is
used iteratively to find the optimal solution. An exact
evaluation of the Hessian is computationally demanding
if done at each stage of an iterative algorithm; the Hessian
matrix must also be inverted. Further, the Newton
train direction may move towards a maximum or a
saddle point rather than a minimum (if the Hessian is
not positive definite) and the error would not be guaranteed
to be reduced. This motivates the development
of alternative approximation methods.

The quasi-Newton training algorithm. As shown
above, the iterative formula used in Newton's method
is

$$w^{i+1} = w^i - H_w^{-1}(w^i) \cdot \nabla E_w(w^i) \tag{13}$$

The basic idea behind the quasi-Newton method
is to approximate $H_w^{-1}$ by another matrix $G$, using
only the first partial derivatives of the error function.
If $H_w^{-1}$ is approximated by $G$, Equation (13) can be
expressed as

$$w^{i+1} = w^i - \alpha^{*(i)} \left( G(w^i) \cdot \nabla E_w(w^i) \right) \tag{14}$$

where $\alpha^{*(i)}$ can be considered as the optimal train rate
along the train direction $G(w^i) \cdot \nabla E_w(w^i)$. The gradient
descent direction method can be obtained by setting
$G = I$.

In order to implement Equation (14), an approximate
inverse $G$ of the Hessian matrix is needed. Two commonly
used algorithms are the Davidon-Fletcher-Powell
(DFP) algorithm and the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) algorithm. The DFP algorithm is
given by

$$G^{(i+1)} = G^{(i)} + \frac{(w^{i+1} - w^i) \otimes (w^{i+1} - w^i)}{(w^{i+1} - w^i) \cdot (g^{i+1} - g^i)} - \frac{[G^{(i)} \cdot (g^{i+1} - g^i)] \otimes [G^{(i)} \cdot (g^{i+1} - g^i)]}{(g^{i+1} - g^i) \cdot G^{(i)} \cdot (g^{i+1} - g^i)} \tag{15}$$

where $\otimes$ denotes outer product, which is a matrix: the
$i,j$ component of $u \otimes v$ is $u_i v_j$. The BFGS algorithm is
exactly the same, but with one additional term:

$$G^{(i+1)} = G^{(i)} + \frac{(w^{i+1} - w^i) \otimes (w^{i+1} - w^i)}{(w^{i+1} - w^i) \cdot (g^{i+1} - g^i)} - \frac{[G^{(i)} \cdot (g^{i+1} - g^i)] \otimes [G^{(i)} \cdot (g^{i+1} - g^i)]}{(g^{i+1} - g^i) \cdot G^{(i)} \cdot (g^{i+1} - g^i)} + \left( (g^{i+1} - g^i) \cdot G^{(i)} \cdot (g^{i+1} - g^i) \right) v \otimes v \tag{16}$$

where the vector $v$ is given by

$$v = \frac{w^{i+1} - w^i}{(w^{i+1} - w^i) \cdot (g^{i+1} - g^i)} - \frac{G^{(i)} \cdot (g^{i+1} - g^i)}{(g^{i+1} - g^i) \cdot G^{(i)} \cdot (g^{i+1} - g^i)} \tag{17}$$

It is generally recognized that the BFGS scheme
is empirically superior to the DFP scheme (Press,
Teukolsky, Vetterling & Flannery, 1992).
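
The DFP and BFGS updates of the approximate inverse Hessian can be sketched in a few lines. The example below is an illustration under assumptions of ours: a hypothetical two-dimensional quadratic surface, $G$ initialized to the identity, and an exact line minimization that uses the known Hessian only because the example surface is quadratic (a general line search would be used otherwise).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


def update_inverse_hessian(G, dw, dg, bfgs=True):
    """DFP update of G; with bfgs=True the additional v (x) v term is added."""
    G_dg = matvec(G, dg)
    dw_dg = dot(dw, dg)
    dg_G_dg = dot(dg, G_dg)
    n = len(dw)
    new_G = [[G[r][c] + dw[r] * dw[c] / dw_dg - G_dg[r] * G_dg[c] / dg_G_dg
              for c in range(n)] for r in range(n)]
    if bfgs:
        v = [dw[r] / dw_dg - G_dg[r] / dg_G_dg for r in range(n)]
        for r in range(n):
            for c in range(n):
                new_G[r][c] += dg_G_dg * v[r] * v[c]
    return new_G


# Hypothetical quadratic surface E(w) = w.H.w / 2 - b.w with constant Hessian.
H = [[4.0, 1.0], [1.0, 3.0]]
b_vec = [1.0, 2.0]


def gradient(w):
    return [hw - bi for hw, bi in zip(matvec(H, w), b_vec)]


def quasi_newton(bfgs):
    w = [0.0, 0.0]
    G = [[1.0, 0.0], [0.0, 1.0]]            # G = I: first step is steepest descent
    for _ in range(2):                      # s = dim(w) = 2 iterations
        g = gradient(w)
        d = [-x for x in matvec(G, g)]      # train direction -G.g
        alpha = -dot(g, d) / dot(d, matvec(H, d))   # exact line minimization
        new_w = [wi + alpha * di for wi, di in zip(w, d)]
        dw = [a - c for a, c in zip(new_w, w)]
        dg = [a - c for a, c in zip(gradient(new_w), g)]
        G = update_inverse_hessian(G, dw, dg, bfgs)
        w = new_w
    return w, G


w_bfgs, G_bfgs = quasi_newton(bfgs=True)
w_dfp, G_dfp = quasi_newton(bfgs=False)
print(w_bfgs, G_bfgs)
```

On this quadratic example both schemes reach the minimum in two iterations and $G$ ends up equal to the true inverse Hessian; the practical differences between them appear on non-quadratic surfaces and with inexact line searches.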

The Levenberg-Marquardt method, like the
quasi-Newton methods, was designed to approach
second-order training speed without computing the Hessian.
When the objective function is a sum of squares
(a very common case in neural networks), the Hessian
matrix can be approximated as $H = J^T J$ and the gradient
computed as $g = J^T e$, where $J$ is the Jacobian matrix,
containing first derivatives of the network errors with
respect to the weights, and $e$ is the vector of network
errors. The Jacobian can be computed via standard
backpropagation, a much less complex process than
computing the Hessian.
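
The two approximations can be checked on a small sum-of-squares problem. The sketch assumes a hypothetical two-parameter model $w_0 (1 - e^{-w_1 x})$ with made-up data and an analytic Jacobian in place of backpropagation; it compares $J^T e$ with a finite-difference gradient of $\frac{1}{2} \sum_k e_k^2$.

```python
import math

# Hypothetical inputs and targets.
XS = [0.5, 1.0, 1.5, 2.0]
YS = [0.4, 0.6, 0.8, 0.85]


def errors(w):
    """e_k = model(x_k; w) - y_k for the model w0 * (1 - exp(-w1 * x))."""
    return [w[0] * (1.0 - math.exp(-w[1] * x)) - y for x, y in zip(XS, YS)]


def jacobian(w):
    """J[k][j] = derivative of e_k with respect to w_j."""
    return [[1.0 - math.exp(-w[1] * x), w[0] * x * math.exp(-w[1] * x)]
            for x in XS]


def objective(w):
    return 0.5 * sum(e * e for e in errors(w))


w = [1.0, 1.0]
J = jacobian(w)
e = errors(w)
n_err, n_par = len(e), len(w)

# Gradient g = J^T e and approximate Hessian H ~ J^T J.
g = [sum(J[k][j] * e[k] for k in range(n_err)) for j in range(n_par)]
approx_hessian = [[sum(J[k][r] * J[k][c] for k in range(n_err))
                   for c in range(n_par)] for r in range(n_par)]

# Central finite differences of the objective, for comparison with g.
eps = 1e-6
numeric = []
for j in range(n_par):
    up, down = list(w), list(w)
    up[j] += eps
    down[j] -= eps
    numeric.append((objective(up) - objective(down)) / (2.0 * eps))

print(g, numeric, approx_hessian)
```

The gradient identity is exact for a sum of squares, whereas $J^T J$ drops the term involving second derivatives of the errors, which is small when the errors themselves are small.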

The Levenberg-Marquardt algorithm uses this
approximation to the Hessian in the following Newton-like
update:

$$w^{i+1} = w^i - (J^T J + \mu I)^{-1} J^T e \tag{18}$$

When the scalar $\mu$ is zero, this is Newton's method
using the approximate Hessian matrix. When $\mu$ is large,
this becomes gradient descent with a small step size.
Newton's method is faster and more accurate near an
error minimum, so the aim is to shift towards Newton's
method as quickly as possible. Thus, $\mu$ is decreased after
each successful step (reduction in objective function)
and is increased only when a tentative step would
increase it. When computation of the Jacobian becomes
prohibitively expensive for big networks on a large number
of training examples, the quasi-Newton method is
preferred.



THE BACK-PROPAGATION ALGORITHM

This algorithm represented a breakthrough in connectionist
learning due to the possibility to compute the
gradient of the error function $\nabla E_w(w^i)$ recursively and
efficiently. It is so called because the components of
the gradient concerning weights belonging to output
units are computed first, and propagated backwards
(toward the inputs) to compute the rest, in the order
marked by the layers. Intuitively, the algorithm finds the
extent to which the adjustment of one connection will
reduce the error in the training examples (the partial
derivative of the error function $E(w)$ with respect to
the connection) and therefore the algorithm computes
the full gradient vector $\nabla E_w$.

To derive the algorithm, we introduce some notation.
Given $f: \mathbb{R}^n \to \mathbb{R}^m$ the function to be approximated, we
depart from a finite training set $D$ of $p$ samples of $f$,

$$D = \{ \langle x_1, y_1 \rangle, \ldots, \langle x_p, y_p \rangle \}, \quad f(x_\mu) + \varepsilon = y_\mu \tag{19}$$

For simplicity, we assume that the loss is the square
error and define:

$$E(w) = \frac{1}{2} \sum_{\mu=1}^{p} \sum_{k=1}^{m} \left( y_{\mu,k} - \zeta_k^{\mu,c+1} \right)^2 \tag{20}$$

where $\zeta_k^{\mu,c+1} = F_w(x_\mu)_k$ is the $k$-th component of the
network's response to input pattern $x_\mu$ (the network
has $c+1$ layers, of which $c$ are hidden). For a given
input pattern $x_\mu$, define:

$$E^\mu(w) = \frac{1}{2} \sum_{k=1}^{m} \left( y_{\mu,k} - \zeta_k^{\mu,c+1} \right)^2 \tag{21}$$

so that $E(w) = \sum_{\mu=1}^{p} E^\mu(w)$.

The computation of a single unit $i$ in layer $l$, $1 \le l \le c+1$,
upon presentation of pattern $x_\mu$ to the network may be
expressed $\zeta_i^{\mu,l} = g(\varsigma_i^{\mu,l})$, with $g$ a smooth function
(such as the sigmoidals) and

$$\varsigma_i^{\mu,l} = \sum_j w_{ij}^l \zeta_j^{\mu,l-1}$$

The first outputs are then defined $\zeta_i^{\mu,0} = x_{\mu,i}$. A single
weight $w_{ij}^l$ denotes the connection strength from neuron
$j$ in layer $l-1$ to neuron $i$ in layer $l$, $1 \le l \le c+1$.

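If the gradient-descent rule (1) with constant $\alpha$ is
followed, then $u^i = -\nabla E_w(w^i)$. Together with (2), the
increment $\Delta w_{ij}^l$ in a single weight $w_{ij}^l$ of $w$ is:

$$\Delta w_{ij}^l = -\alpha \frac{\partial E(w)}{\partial w_{ij}^l} = -\alpha \sum_{\mu=1}^{p} \frac{\partial E^\mu(w)}{\partial w_{ij}^l} = \sum_{\mu=1}^{p} \Delta^\mu w_{ij}^l \tag{22}$$

We have:

$$\frac{\partial E^\mu(w)}{\partial w_{ij}^l} = \frac{\partial E^\mu(w)}{\partial \zeta_i^{\mu,l}} \, \frac{\partial \zeta_i^{\mu,l}}{\partial \varsigma_i^{\mu,l}} \, \frac{\partial \varsigma_i^{\mu,l}}{\partial w_{ij}^l} \tag{23}$$

Proceeding from right to left in (23):

$$\frac{\partial \varsigma_i^{\mu,l}}{\partial w_{ij}^l} = \zeta_j^{\mu,l-1} \tag{24}$$

$$\frac{\partial \zeta_i^{\mu,l}}{\partial \varsigma_i^{\mu,l}} = \frac{d g(\varsigma_i^{\mu,l})}{d \varsigma_i^{\mu,l}} = g'(\varsigma_i^{\mu,l}) \tag{25}$$

Assuming $g$ to be the logistic function with slope $\beta$:

$$g'(\varsigma_i^{\mu,l}) = \beta g(\varsigma_i^{\mu,l}) \left( 1 - g(\varsigma_i^{\mu,l}) \right) = \beta \zeta_i^{\mu,l} \left( 1 - \zeta_i^{\mu,l} \right) \tag{26}$$

The remaining expression $\partial E^\mu(w) / \partial \zeta_i^{\mu,l}$
is more delicate and constitutes the core of the algorithm.
We develop two separate cases: $l = c+1$ and $l < c+1$. The
first case corresponds to output neurons, for which an
expression of the derivative is immediate from (21):

$$\frac{\partial E^\mu(w)}{\partial \zeta_i^{\mu,c+1}} = -\left( y_{\mu,i} - \zeta_i^{\mu,c+1} \right) \tag{27}$$

Incidentally, collecting all the results so far and
assuming $c = 0$ (no hidden layers), the delta rule for
non-linear single-layer networks is obtained:

$$\Delta w_{ij} = \alpha \sum_{\mu=1}^{p} \left( y_{\mu,i} - \zeta_i^{\mu} \right) g'(\varsigma_i^{\mu}) \, x_{\mu,j} = \alpha \sum_{\mu=1}^{p} \delta_i^{\mu} x_{\mu,j} \tag{28}$$

where the superscript $l$ is dropped (since $c = 0$), and
$\delta_i^{\mu} = (y_{\mu,i} - \zeta_i^{\mu}) g'(\varsigma_i^{\mu})$. Another name for the
back-propagation algorithm is generalized delta rule.
Consequently, for $l \ge 1$, the deltas (errors local to a unit)
are defined as:

$$\delta_i^{\mu,l} = -\frac{\partial E^\mu(w)}{\partial \varsigma_i^{\mu,l}} \tag{29}$$

For the output layer $l = c+1$, $\delta_i^{\mu,c+1} = (y_{\mu,i} - \zeta_i^{\mu,c+1}) g'(\varsigma_i^{\mu,c+1})$.
In case $g$ is the logistic, we finally have:

$$\delta_i^{\mu,c+1} = \left( y_{\mu,i} - \zeta_i^{\mu,c+1} \right) \beta \zeta_i^{\mu,c+1} \left( 1 - \zeta_i^{\mu,c+1} \right) \tag{30}$$

For the general case $l < c+1$, we proceed as follows:

$$\frac{\partial E^\mu(w)}{\partial \zeta_i^{\mu,l}} = \sum_k \frac{\partial E^\mu(w)}{\partial \varsigma_k^{\mu,l+1}} \, \frac{\partial \varsigma_k^{\mu,l+1}}{\partial \zeta_i^{\mu,l}} \tag{31}$$

Now $l < c+1$ means there is at least one more layer
$l+1$ to the right of layer $l$ which is posterior in the
feed-forward computation, but that precedes $l$ in the opposite
direction. Therefore $E^\mu(w)$ functionally depends on
layer $l+1$ before layer $l$, and the derivative can be
split in two. The summation over $k$ is due to the fact
that $E^\mu(w)$ depends on every neuron in layer $l+1$.
Rewriting (31):

$$\frac{\partial E^\mu(w)}{\partial \zeta_i^{\mu,l}} = \sum_k \frac{\partial E^\mu(w)}{\partial \varsigma_k^{\mu,l+1}} \, w_{ki}^{l+1} = -\sum_k \delta_k^{\mu,l+1} w_{ki}^{l+1} \tag{32}$$

since $\varsigma_k^{\mu,l+1} = \sum_j w_{kj}^{l+1} \zeta_j^{\mu,l}$ and thus
$\partial \varsigma_k^{\mu,l+1} / \partial \zeta_i^{\mu,l} = w_{ki}^{l+1}$.

Putting together (24), (25), (27) and (32) in (23):

$$\frac{\partial E^\mu(w)}{\partial w_{ij}^l} = \frac{\partial E^\mu(w)}{\partial \zeta_i^{\mu,l}} \, \frac{\partial \zeta_i^{\mu,l}}{\partial \varsigma_i^{\mu,l}} \, \frac{\partial \varsigma_i^{\mu,l}}{\partial w_{ij}^l} = -\delta_i^{\mu,l} \zeta_j^{\mu,l-1} \tag{33}$$

where

$$\delta_i^{\mu,l} = g'(\varsigma_i^{\mu,l}) \sum_k \delta_k^{\mu,l+1} w_{ki}^{l+1}$$

are the deltas for the hidden units. We show the method
(Fig. 1) in algorithmic form, for an epoch or presentation
of the training set.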
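Figure 1. Back-propagation algorithm pseudo-code

for all $\mu$ in $1 \le \mu \le p$

    1. Forward pass. Present $x_\mu$ and compute
       the outputs $\zeta_i^{\mu,l}$ of all the units.

    2. Backward pass. Compute the deltas $\delta_i^{\mu,l}$
       of all the units (the local gradients), as follows:

       $l = c+1$:  $\delta_i^{\mu,l} = g'(\varsigma_i^{\mu,l}) \, (y_{\mu,i} - \zeta_i^{\mu,l})$

       $l < c+1$:  $\delta_i^{\mu,l} = g'(\varsigma_i^{\mu,l}) \sum_k \delta_k^{\mu,l+1} w_{ki}^{l+1}$

    3. $\Delta^\mu w_{ij}^l = \alpha \, \delta_i^{\mu,l} \, \zeta_j^{\mu,l-1}$

End

Update weights as $\Delta w_{ij}^l = \sum_{\mu=1}^{p} \Delta^\mu w_{ij}^l$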

CONCLUSION 

The strong points of ANNs are the capacity to learn from
examples, distributed computation, tolerance to partial
failures and the possibility to use them as black-box
models. The procedure used to carry out the training
process in a neural network is called the training or
learning algorithm. Since the objective function is a
non-linear function of the free parameters, it is not
possible to find closed-form training algorithms for the minima.
Preferred algorithms for the multilayer perceptron are
the quasi-Newton and Levenberg-Marquardt methods,
together with back-propagation to compute the gradient
vector efficiently.

KEY TERMS

Artificial Neural Network: Information processing 
structure without global or shared memory that takes the 
form of a directed graph where each of the computing 
elements ("neurons") is a simple processor with internal 
and adjustable parameters, that operates only when all 
its incoming information is available. 



Back-Propagation: Algorithm for feed-forward 
multilayer networks that can be used to efficiently 
compute the gradient vector in all the first-order 
methods. 

Feed-Forward Artificial Neural Network: Artifi- 
cial Neural Network whose graph has no cycles. 

First-Order Method: A training algorithm using 
the objective function and its gradient vector. 

Learning Algorithm: Method or algorithm by vir- 
tue of which an Artificial Neural Network develops a 
representation of the information present in the learning 
examples, by modification of the weights. 

Second-Order Method: A training algorithm 
using the objective function, its gradient vector and 
Hessian matrix. 

Weight: A free parameter of an Artificial Neural 
Network, modified through the action of a Learning 
Algorithm to obtain desired responses to certain input 
stimuli. 

INTRODUCTION

Supervised Artificial Neural Networks (ANN) are
information processing systems that adapt their functionality
as a result of exposure to input-output examples. To
this end, there exist generic procedures and techniques,
known as learning rules. The most widely used in the
neural network context rely on derivative information,
and are typically associated with the Multilayer
Perceptron (MLP). Other kinds of supervised ANN
have developed their own techniques. Such is the case
of Radial Basis Function (RBF) networks (Poggio &
Girosi, 1989). There has also been considerable work
on the development of ad hoc learning methods based
on evolutionary algorithms.



BACKGROUND 

The problem of learning an input/output relation from a
set of examples can be regarded as the task of approximating
an unknown function from a set of data points,
which are possibly sparse. Concerning approximation
by classical feed-forward ANN, these networks implement
a parametric approximating function and have
been shown to be capable of representing generic classes
of functions (such as the continuous or integrable functions)
to an arbitrary degree of accuracy. In general, there
are three questions that arise when defining one such
parameterized family of functions:

1. What is the most adequate parametric form for a
given problem?

2. How to find the best parameters for the chosen
form?

3. What classes of functions can be represented and
how well?

The most typical problems in an ANN supervised
learning process, besides the determination of the learning
parameters themselves, include (Hertz, Krogh &
Palmer, 1991), (Hinton, 1989), (Bishop, 1995):

1. The possibility of getting stuck in local optima of
the cost function, in which conventional non-linear
optimization techniques will stay forever. The
incorporation of a global scheme (like multiple
restarts or an annealing schedule) is sure to
increase the chance of finding a better solution,
although the cost can become prohibitively high.
A feed-forward network has multiple equivalent
solutions, created by weight permutations and
sign flips. Every local minimum in a network with
a single hidden layer of $h_1$ units has $s(h_1) = h_1! \, 2^{h_1}$
equivalent solutions, so the chances of getting in
the basin of attraction of one of them are reasonably
high. The complexity of the error surface
(especially in very high dimensions) makes the
possibility of getting trapped a real one.

2. Long training times, oscillations and network
paralysis. These are features highly related to
the specific learning algorithm, and relate to bad
or too general choices for the parameters of the
optimization technique (such as the learning rate).
The presence of saddle points (regions where
the error surface is very flat) also provokes an
extremely slow advance for extensive periods
of time. The use of more advanced methods that
dynamically set these and other parameters can
alleviate the problem.

3. Non-cumulative learning. It is hard to take an 
already trained network and re-train it with ad- 
ditional data without losing previously learned 
knowledge. 

4. The curse of dimensionality, roughly stated as 
the fact that the number of examples needed to 
represent a given function grows exponentially 
with the number of dimensions. 

5. Difficulty of finding a structure in the training 
data, possibly caused by a very high dimension 
or a distorting pre-processing scheme. 

6. Bad generalization, which can be due to several
causes: the use of poor training data or attempts
to extrapolate beyond them, an excessive number
of hidden units, too long training processes or a
badly chosen regularization. All of them can lead
to an overfitting of the training data, in which
the ANN adjusts the training set merely as an
interpolation task.

7. Not amenable to inspection. It is generally arduous
to interpret the knowledge learned, especially in
large networks or with a high number of model
inputs.



LEARNING IN RBF NETWORKS 

A Radial Basis Function network is a type of ANN that 
can be viewed as the solution of a high-dimensional 
curve-fitting problem. Learning is equivalent to finding 
a surface providing the best fit to the data. The RBF 
network is a two-layered feed forward network using a 
linear transfer function for the output units and a radi- 
ally symmetric transfer function for the hidden units. 
The computation of a hidden unit is expressed as the 
composition of two functions, as: 



$$F_i(x) = g(h(x, w_i)), \quad w_i \in \mathbb{R}^n, \; x \in \mathbb{R}^n \tag{1}$$

with the choice $h(x, w_i) = \|x - w_i\| / \theta$ (or other distance
measure), with $\theta > 0$ a smoothing term, plus an activation
$g$ which very often is a monotonically decreasing
response from the origin. These units are localized, in
the sense that they give a significant response only in
a neighbourhood of their centre $w_i$. For the activation
function a Gaussian $g(z) = \exp(-z^2/2)$ is a preferred
choice.
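
The computation of one hidden unit can be sketched directly from these definitions, assuming the Euclidean distance and the Gaussian activation; the function name, the centre and the sample points are hypothetical.

```python
import math


def rbf_unit(x, centre, theta):
    """Gaussian response g(h) with h = ||x - centre|| / theta."""
    distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centre)))
    z = distance / theta
    return math.exp(-z * z / 2.0)


# Responses at increasing distance from a hypothetical centre (1, 2).
centre = [1.0, 2.0]
responses = [rbf_unit([1.0 + r, 2.0], centre, theta=0.5)
             for r in (0.0, 0.5, 1.0, 2.0)]
print(responses)
```

The response is maximal at the centre and decays quickly with distance, which is the localized behaviour described above; a larger smoothing term widens the region of significant response.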

Learning in RBF networks is characterized by the 
separation of the process in two consecutive stages 
(Haykin, 1994), (Bishop, 1995): 

1. Optimize the free parameters of the hidden layer
(including the smoothing term) using only the
$\{x_\mu\}$ in D. This is an unsupervised method that
depends on the input sample distribution.

2. With these parameters found and frozen, optimize
the $\{c_i\}$, the hidden-to-output weights, using the
full information in D. This is a supervised method
that depends on the given task.



There are many ways of optimizing the hidden-layer
parameters. When the number of hidden neurons
equals the number of patterns, each pattern may be
taken to be a center of a particular neuron. However,
the aim is to form a representation of the probability
density function of the data, by placing the centres in
only those regions of the input space where significant
data are present. One commonly used method is the
k-means algorithm (McQueen, 1967), which in turn
is an approximate version of the maximum-likelihood
(ML) solution for determining the location of the means
of a mixture density of component densities (that is,
maximizing the likelihood of the parameters with respect
to the data). The Expectation-Maximization (EM)
algorithm (Duda & Hart, 1973) can be used to find the
exact ML solution for the means and covariances of
the density. It seems that EM is superior to k-means
(Nowlan, 1990). The set of centres can also be selected
randomly from the set of data points.

The value of the smoothing term can be obtained
from the clustering method itself, or else estimated a
posteriori. One popular heuristic is:

$$\theta = \frac{d}{\sqrt{2M}} \tag{2}$$

where $d$ is the maximum distance between the chosen
centers and $M$ is the number of centers (hidden units).
Alternatively, the method of Distance Averaging
(Moody and Darken, 1989) can be used, which is the
global average over all Euclidean distances between the
center of each unit and that of its nearest neighbor.

Once these parameters are chosen and kept constant,
assuming the output units are linear, the (square) error
function is quadratic, and thus the hidden-to-output
weights can be found quickly and reliably, either iteratively by
simple gradient descent over the quadratic surface of
the error function or directly by solving for the minimum
norm solution to the overdetermined least-squares data
fitting problem (Orr, 1995).
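
The two training stages can be sketched end to end on a one-dimensional toy problem. Everything specific in this example is an assumption made for illustration: the data (samples of a sine), the choice of every fourth training point as a centre (a reproducible stand-in for random selection), the width heuristic $d / \sqrt{2M}$, and a direct solution of the normal equations by Gaussian elimination for the output weights.

```python
import math


def rbf(x, centre, theta):
    return math.exp(-((x - centre) / theta) ** 2 / 2.0)


def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    M_aug = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M_aug[r][col]))
        M_aug[col], M_aug[pivot] = M_aug[pivot], M_aug[col]
        for r in range(col + 1, n):
            factor = M_aug[r][col] / M_aug[col][col]
            for c in range(col, n + 1):
                M_aug[r][c] -= factor * M_aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M_aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M_aug[r][n] - s) / M_aug[r][r]
    return sol


# Hypothetical training set: 25 samples of sin(x) on [0, 6].
xs = [0.25 * k for k in range(25)]
ys = [math.sin(x) for x in xs]

# Stage 1 (uses the inputs only): centres and smoothing term.
centres = xs[::4]
M = len(centres)
d = max(centres) - min(centres)        # maximum distance between centres
theta = d / math.sqrt(2 * M)

# Stage 2 (uses the targets): linear least squares for the output weights.
Phi = [[rbf(x, c, theta) for c in centres] for x in xs]
A = [[sum(Phi[k][r] * Phi[k][c] for k in range(len(xs))) for c in range(M)]
     for r in range(M)]
rhs = [sum(Phi[k][r] * ys[k] for k in range(len(xs))) for r in range(M)]
out_weights = solve(A, rhs)


def network(x):
    return sum(w * rbf(x, c, theta) for w, c in zip(out_weights, centres))


fit_error = 0.5 * sum((network(x) - y) ** 2 for x, y in zip(xs, ys))
zero_error = 0.5 * sum(y * y for y in ys)
print(theta, fit_error, zero_error)
```

Because the second stage is a linear problem, no learning rate or iteration count has to be tuned; the normal equations can be ill-conditioned when the basis functions overlap strongly, which is why decomposition-based solvers are often preferred in practice.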

The whole set of parameters of an RBF network
can also be optimized with a global gradient descent
procedure on all the free parameters at once (Bishop,
1995), (Haykin, 1994). This brings back the problems
of local minima, slow training, etc., already discussed.
However, better solutions can in principle be found,
because the unsupervised solution focuses on estimating
the input probability density function, but the
resulting disposition may not be the one minimizing
the square error.



EVOLUTIONARY LEARNING 
ALGORITHMS 

The alternative to derivative-based learning algorithms 
(DBLA) are Evolutionary Algorithms (EA) (Back, 
1996). Although the number of successful specific 
applications of EA is counted by hundreds (see (Back, 
Fogel & Michalewicz, 1 997) for a review), only Genetic 
Algorithms or GA (Goldberg, 1989) and, to a lesser 
extent, Evolutionary Programming (Fogel, 1992), 
have been broadly used for ANN optimization, since 
the earlier works using genetic algorithms (Montana 
& Davis, 1989). Evolutionary algorithms operate on 
a population of individuals applying the principle of 
survival of the fittest to produce better approximations 
to a solution. At each generation, a new population is 
created by selecting individuals according to their level 
of fitness in the problem domain and recombining them 
using operators borrowed from natural genetics. The 
offspring also undergo mutation. This process leads to 
the evolution of populations of individuals that are bet- 
ter suited to their environment than the individuals that 
they were created from, just as in natural adaptation. 
There are comprehensive review papers and guides 
to the extensive literature on this subject: see (Schaffer, 
Whitley & Eshelman, 1992), (Yao, 1993), (Kuşçu & 
Thornton, 1994) and (Balakrishnan & Honavar, 1995). 
One of their main advantages over methods based on 
derivatives is the global search mechanism. A global 
method does not guarantee that the solution found is the 
global optimum; rather, it reduces the risk of getting 
caught in local optima. Another appealing issue is the 
possibility of performing the traditionally separated 
steps of determining the best architecture and its weights 
at the same time, in a search over the joint space of 
structures and weights. Another advantage is the use 
of potentially any cost measure to assess the goodness 
of fit or include structural information. Still another 
possibility is to embody a DBLA into a GA, using the 
latter to search among the space of structures and the 
DBLA to optimize the weights; this hybridization leads 
to extremely high computational costs. Finally, there 
is the use of EA solely for the numerical optimization 
problem. In the neural context, this is arguably the task 
for which continuous EA are most naturally suited. 
However, it is difficult to find applications in which 
GA (or other EA, for that matter) have clearly outper- 
formed DBLA for supervised training of feed-forward 
neural networks (Whitley, 1995). It has been pointed 
out that this task is inherently hard for algorithms that 
rely heavily on the recombination of potential solutions 
(Radcliffe, 1991). In addition, training times can 
become prohibitive, even longer than those of DBLA. 
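The generational loop described above (selection by fitness, recombination, mutation) can be sketched as follows for the purely numerical task of weight optimization. This is a minimal illustration under assumptions of our own: truncation selection with surviving parents, uniform crossover, Gaussian mutation, and a toy cost function (`squared_error`, fitting a single linear unit). None of these choices is prescribed by the text.

```python
import random

# Hypothetical toy task: fit y = 2x + 1 with a single linear unit (w[0]*x + w[1]).
DATA = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def squared_error(w):
    """Cost of a candidate weight vector: sum of squared errors on DATA."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in DATA)

def evolve(cost, dim, pop_size=30, generations=100, sigma=0.3, seed=0):
    """Evolve real-valued vectors; returns the best individual and the
    best cost recorded at each generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)                      # fittest (lowest cost) first
        history.append(cost(pop[0]))
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # uniform crossover followed by Gaussian mutation
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            children.append([g + rng.gauss(0.0, sigma) for g in child])
        pop = parents + children                # parents survive unchanged
    pop.sort(key=cost)
    history.append(cost(pop[0]))
    return pop[0], history

best, history = evolve(squared_error, dim=2)
```

Only cost evaluations are used, never derivatives, which is why any cost measure can be plugged in. Since the parents survive unchanged, the best cost never gets worse from one generation to the next.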

In general, Evolutionary Algorithms, particularly 
the continuous ones, need specific research devoted 
to ascertaining their general validity as alternatives 
to DBLA in neural network optimization. Theoretical 
as well as practical work oriented towards tailoring 
specific EA parameters for this task, together with 
specialized operator design, should pave the way to a 
fruitful assessment of validity. 



FUTURE TRENDS 

Research in ANN currently concerns the development 
of learning algorithms for weight adaptation or, more 
often, the enhancement of existing ones. New archi- 
tectures (ways of arranging the units in the network) 
are also introduced from time to time. Classical neuron 
models, although useful and effective, are limited to a 
few generic function classes, of which only a handful 
of instances are used in practice. 

One of the most attractive enhancements is the 
extension of neuron models to modern data mining 
situations, such as data heterogeneity. Although a feed- 
forward neural network can in principle approximate 
an arbitrary function to any desired degree of accuracy, 
in practice a pre-processing scheme is often applied to 
the data samples to ease the task. In many important 
domains from the real world, objects are described by 
a mixture of continuous and discrete variables, usually 
containing missing information and characterized by 
an underlying vagueness, uncertainty or imprecision. 
For example, in the well-known UCI repository (Mur- 
phy & Aha, 1991) over half of the problems contain 
explicitly declared nominal attributes, let alone other 
discrete types or fuzzy information, usually unreported. 
This heterogeneous information should not be treated 
in general as real-valued quantities. Conventional 
ways of encoding non-standard information in ANN 
include (Prechelt, 1994), (Bishop, 1995), (Fiesler & 
Beale, 1997): 

Ordinal variables. These variables correspond 
to discrete (finite) sets of values wherein an ordering 
has been defined (possibly only partial). They are 
more often than not treated as real-valued, and mapped 
equidistantly on an arbitrary real interval. A second 
possibility is to encode them using a thermometer. To 
this end, let $k$ be the number of ordered values; $k$ new 
binary inputs are then created. To represent value $i$, 
for $1 \le i \le k$, the leftmost units $1, \ldots, i$ will be on, 
and the remaining units $i+1, \ldots, k$ off. 

The interest in these variables lies in the fact that 
they appear frequently in real domains, either as symbolic 
information or from processes that are discrete in nature. 
Note that an ordinal variable need not be numerical. 

Nominal variables. Nominal variables are almost 
universally encoded using a 1-out-of-k representation, 
$k$ being the number of values, which are then encoded 
as the rows of the $k \times k$ identity matrix $I_{k \times k}$. 

Missing values. Missing information is an old issue 
in statistical analysis (Little & Rubin, 1987). There are 
several causes for the absence of a value. Missing values 
are very common in Medicine and Engineering, where many 
variables come from on-line sensors or device 
measurements. Missing information is difficult to handle, 
especially when the lost parts are of significant size. 
The affected case can be removed entirely, or the missing 
value can be "filled in" with the mean, median or nearest 
neighbour, or encoded by adding another input equal to 
one only if the value is absent and zero otherwise. 
Statistical approaches need to make assumptions about, 
or model, the input distribution itself. The main problem 
with missing data is that we never know whether all the 
effort devoted to their estimation will result, in 
practice, in better-behaved data. This is also the reason 
why we elaborate on the treatment of missing values as 
part of the general discussion on data characteristics. 
The reviewed methods pre-process the data to make it 
acceptable to models that otherwise would not accept it. 
In the case of missing values, the data is completed 
because the available neural methods only admit complete 
data sets. 

Uncertainty. Vagueness, imprecision and other 
sources of uncertainty are considerations usually put 
aside in the ANN paradigm. Nonetheless, many vari- 
ables in learning processes are likely to bear some form 
of uncertainty. In Engineering, for example, on-line 
sensors are likely to get old with time and continuous 
use, and this may be reflected in the quality of their 
measurements. On many occasions, the data at hand 
are imprecise for a variety of reasons: technical 
limitations, a genuinely qualitative origin, or even a 
deliberate interest in introducing imprecision for the 
purpose of augmenting the capacity for abstraction 
or generalization (Esteva, Godo & Garcia), possibly 
because the underlying process is believed to be less 
precise than the available measures. 

In Fuzzy Systems theory there are explicit formalisms 
for representing and manipulating uncertainty, 
which is precisely what such systems best model and 
manage. It is perplexing that, when supplying this 
kind of input/output data, we require the network to 
approximate the desired output in a very precise way. 
Sometimes the known value takes an interval form: 
"between 5.1 and 5.5", so that any transformation 
to a real value will result in a loss of information. A 
more common situation is the absence of numerical 
knowledge. For example, consider the value "fairly 
tall" for the variable height. Again, Fuzzy Systems 
are comfortable, but for an ANN this is real trouble. 
The integration of symbolic and continuous informa- 
tion is also important because numeric methods bring 
higher concretion, whereas symbolic methods bring 
higher abstraction. Their combined use is likely to 
increase the flexibility of hybrid systems. For numeric 
data, an added flexibility is obtained by considering 
imprecision in their values, leading to fuzzy numbers 
(Zimmermann, 1992). 
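The conventional encodings listed above for ordinal, nominal and missing values can be sketched in a few lines. The function names and the choice of a caller-supplied fill value are assumptions made for this illustration only.

```python
def thermometer(i, k):
    """Thermometer code for the i-th of k ordered values (1 <= i <= k):
    the leftmost i units are on, the remaining k - i are off."""
    if not 1 <= i <= k:
        raise ValueError("value index must satisfy 1 <= i <= k")
    return [1] * i + [0] * (k - i)

def one_out_of_k(i, k):
    """1-out-of-k code for the i-th of k nominal values:
    the i-th row of the k x k identity matrix."""
    if not 1 <= i <= k:
        raise ValueError("value index must satisfy 1 <= i <= k")
    return [1 if j == i else 0 for j in range(1, k + 1)]

def with_missing_flag(value, fill):
    """Encode a possibly absent value as [filled value, flag], where the
    extra input is 1 only if the value is absent and 0 otherwise."""
    if value is None:
        return [fill, 1]
    return [value, 0]
```

For example, with five ordered values `thermometer(3, 5)` gives `[1, 1, 1, 0, 0]`, while `one_out_of_k(3, 5)` gives `[0, 0, 1, 0, 0]`. The thermometer code keeps the ordering visible to the network; the 1-out-of-k code deliberately does not.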



CONCLUSION 

As explained at length in other chapters, derivative- 
based learning algorithms make a number of assump- 
tions about the local error surface and its differentiabil- 
ity. In addition, the existence of local minima is often 
neglected or overlooked entirely. In fact, the possibility 
of getting caught in these minima is more often than not 
circumvented by multiple runs of the algorithm (that 
is, multiple restarts from different initial points in 
weight space). This "sampling" procedure is actually 
an implementation of a very naive stochastic process. 
A global training algorithm for neural networks is the 
evolutionary algorithm, a stochastic search training 
algorithm based on the mechanics of natural genetics 
and biological evolution. It requires information from 
the objective function, but not from the gradient vector 
or the Hessian matrix and thus it is a zero-order method. 
On the other hand, there is an emerging need to devise 
neuron models that properly handle different data types, 
as is done in support vector machines (Shawe-Taylor 
& Cristianini, 2004), where kernel design is a current 
research topic. 



KEY TERMS 

Architecture: The number of artificial neurons, their 
arrangement and connectivity. 

Artificial Neural Network: Information processing 
structure without global or shared memory that takes the 
form of a directed graph where each of the computing 
elements ("neurons") is a simple processor with internal 
and adjustable parameters, that operates only when all 
its incoming information is available. 



Evolutionary Algorithm: A computer simulation in 
which a population of individuals (abstract representa- 
tions of candidate solutions to an optimization problem) 
are stochastically selected, recombined, mutated, and 
then removed or kept, based on their relative fitness 
to the problem. 

Feed-Forward Artificial Neural Network: Artifi- 
cial Neural Network whose graph has no cycles. 

Learning Algorithm: Method or algorithm by vir- 
tue of which an Artificial Neural Network develops a 
representation of the information present in the learning 
examples, by modification of the weights. 

Neuron Model: The computation of an artificial 
neuron, expressed as a function of its input and its 
weight vector and other local information. 

Weight: A free parameter of an Artificial Neural 
Network, that can be modified through the action of 
a Learning Algorithm to obtain desired responses to 
certain input stimuli. 



INTRODUCTION 

Game Theory (Von Neumann & Morgenstern, 1944) 
is a branch of applied mathematics and economics that 
studies situations (games) where self-interested, 
interacting players act to maximize their returns; 
therefore, the return of each player depends on his 
behaviour and on the behaviours of the other players. 
Game Theory, which plays an important role in the social 
and political sciences, has recently drawn attention in 
new academic fields ranging from algorithmic mechanism 
design to cybernetics. However, a fundamental problem to 
solve in order to apply Game Theory effectively in 
real-world applications is the definition of well-founded 
solution concepts of a game and the design of efficient 
algorithms for their computation. 

A widely accepted solution concept of a game in 
which any cooperation among the players must be self- 
enforcing (non-cooperative game) is represented by the 
Nash Equilibrium. In particular, a Nash Equilibrium 
is a set of strategies, one for each player of the game, 
such that no player can benefit by changing his strat- 
egy unilaterally, i.e. while the other players keep their 
strategies unchanged (Nash, 1951). The problem of 
computing Nash Equilibria in non-cooperative games 
is considered one of the most important open problems 
in Complexity Theory (Papadimitriou, 2001). Daskalakis, 
Goldberg, and Papadimitriou (2005) showed 
that the problem of computing a Nash equilibrium in 
a game with four or more players is complete for the 
complexity class PPAD (Polynomial Parity Argument, 
Directed version) (Papadimitriou, 1991); moreover, 
Chen and Deng extended this result to 2-player games 
(Chen & Deng, 2005). However, even in the two-player 
case, the best algorithm known has an exponential 
worst-case running time (Savani & von Stengel, 2004); 
furthermore, if the computation of equilibria with 
simple additional properties is required, the problem 
immediately becomes NP-hard (Bonifaci, Di Iorio, & 
Laura, 2005) (Conitzer & Sandholm, 2003) (Gilboa & 
Zemel, 1989) (Gottlob, Greco, & Scarcello, 2003). 



Motivated by these results, recent studies have dealt 
with the problem of efficiently computing Nash Equi- 
libria by exploiting approaches based on the concepts 
of learning and evolution (Fudenberg & Levine, 1998) 
(Maynard Smith, 1982). In these approaches the Nash 
Equilibria of a game are not statically computed but 
are the result of the evolution of a system composed 
of agents playing the game. In particular, each agent, 
after a number of rounds, will learn to play a strategy 
that, under the hypothesis of the agents' rationality, 
will be one of the Nash equilibria of the game (Benaim 
& Hirsch, 1999) (Carmel & Markovitch, 1996). 

This article presents SALENE, a Multi-Agent 
System (MAS) for learning Nash Equilibria in non- 
cooperative games, which is based on the 
above-mentioned concepts. 



BACKGROUND 

An n-person strategic game $G$ can be defined as a tuple 
$G = (N; (A^i)_{i \in N}; (r^i)_{i \in N})$, where $N = \{1, 2, \ldots, n\}$ 
is the set of players, $A^i$ is a finite set of actions for 
player $i \in N$, and $r^i : A^1 \times \ldots \times A^n \to \mathbb{R}$ is the 
payoff function of player $i$. The set $A^i$ is also called 
the set of pure strategies of player $i$. The Cartesian 
product $\times_{i \in N} A^i = A^1 \times \ldots \times A^n$ can be denoted by $A$, 
and $r : A \to \mathbb{R}^n$ can denote the vector-valued function 
whose $i$-th component is $r^i$, i.e., 
$r(a) = (r^1(a), \ldots, r^n(a))$, so it is possible to write 
$(N, A, r)$ for short for $(N; (A^i)_{i \in N}; (r^i)_{i \in N})$. 

For any finite set $A^i$ the set of all probability 
distributions on $A^i$ can be denoted by $\Delta(A^i)$. An element 
$\sigma^i \in \Delta(A^i)$ is a mixed strategy for player $i$. 

A (Nash) equilibrium of a strategic game $G = (N, A, r)$ 
is an $N$-tuple of (mixed) strategies 
$\sigma = (\sigma^i)_{i \in N}$, $\sigma^i \in \Delta(A^i)$, such that for every 
$i \in N$ and any other strategy of player $i$, 
$\tau^i \in \Delta(A^i)$, $r^i(\tau^i, \sigma^{-i}) \le r^i(\sigma^i, \sigma^{-i})$, where 
$r^i$ denotes also the expected payoff to player $i$ in the 
mixed extension of the game and $\sigma^{-i}$ represents the 
mixed strategies in $\sigma$ of all the other players. Basically, 
supposing that all the other players do not change their 
strategies, it is not possible for any player $i$ to play a 
different strategy $\tau^i$ able to gain a better payoff than 
that gained by playing $\sigma^i$. $\sigma^i$ is called a Nash 
equilibrium strategy for player $i$. 

In 1951 J. F. Nash proved that a strategic 
(non-cooperative) game $G = (N, A, r)$ has at least one (Nash) 
equilibrium $\sigma$ (Nash, 1951); in his honour, the 
computational problem of finding such equilibria is known 
as NASH (Papadimitriou, 1994). 
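The equilibrium condition above can be checked mechanically for a two-player game. The sketch below is an illustration only: the matrix representation (`r1[a][b]` and `r2[a][b]` are the payoffs when player 1 plays `a` and player 2 plays `b`) and the function names are assumptions. It relies on the fact that the expected payoff is linear in a player's own mixed strategy, so it is enough to test deviations to pure strategies.

```python
def expected_payoff(payoff, sigma1, sigma2):
    """Expected payoff under the mixed strategies sigma1 and sigma2."""
    return sum(sigma1[a] * sigma2[b] * payoff[a][b]
               for a in range(len(sigma1)) for b in range(len(sigma2)))

def pure(index, size):
    """Mixed-strategy vector that plays one pure strategy with probability 1."""
    return [1.0 if s == index else 0.0 for s in range(size)]

def is_nash_equilibrium(r1, r2, sigma1, sigma2, tol=1e-9):
    """True if no player gains by deviating unilaterally."""
    v1 = expected_payoff(r1, sigma1, sigma2)
    v2 = expected_payoff(r2, sigma1, sigma2)
    m1, m2 = len(sigma1), len(sigma2)
    if any(expected_payoff(r1, pure(a, m1), sigma2) > v1 + tol for a in range(m1)):
        return False
    if any(expected_payoff(r2, sigma1, pure(b, m2)) > v2 + tol for b in range(m2)):
        return False
    return True

# Matching pennies: player 1 wins when the choices match, player 2 otherwise.
R1 = [[1, -1], [-1, 1]]
R2 = [[-1, 1], [1, -1]]
```

In matching pennies the profile in which both players mix uniformly passes the check, while any pure profile fails it, because the losing player can always switch sides.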



SOFTWARE AGENTS FOR LEARNING 
NASH EQUILIBRIA 

SALENE was conceived as a system for learning at 
least one Nash Equilibrium of a non-cooperative game 
given in the form $G = (N; (A^i)_{i \in N}; (r^i)_{i \in N})$. In 
particular, the system asks the user for: 

• the number $n$ of the players, which defines the set 
  of players $N = \{1, 2, \ldots, n\}$; 

• for each player $i \in N$, the related finite set of pure 
  strategies $A^i$ and his payoff function 
  $r^i : A^1 \times \ldots \times A^n \to \mathbb{R}$; 

• the number $k$ of times the players will play the 
  game. 



Figure 1. The class diagram of SALENE 

[Class diagram; classes shown: FIPAAgent, ManagerAgent, 
ManagerBehaviour, RefereeAgent, RefereeBehaviour, 
GameDefinition, PlayerAgent, PlayerBehaviour] 



Then, the system creates n agents, one associated 
with each player, and a referee. The agents will play 
the game G k times; after each match, each agent 
will decide the strategy to play in the next match to 
maximise his expected utility on the basis of his beliefs 
about the strategies that the other agents are adopting. 
By analyzing the behaviour of each agent in all the k 
matches of the game, SALENE presents to the user an 
estimate of a Nash Equilibrium of the game. The Agent 
paradigm has represented a "natural" way of modelling 
and implementing the proposed solution, as it is 
characterized by several interacting autonomous entities 
(players) which try to achieve their goals (consisting 
in maximising their returns). 

The class diagram of SALENE is shown in Figure 
1. 

The ManagerAgent interacts with the user and is 
responsible for the global behaviour of the system. In 
particular, after having obtained from the user the input 
parameters G and k, the ManagerAgent creates both n 
PlayerAgents and a RefereeAgent that coordinates and 
monitors the behaviours of the players. The ManagerAgent 
sends the definition G of the game to all the agents; 
then it asks the RefereeAgent to orchestrate k matches 
of the game G. In each match, the RefereeAgent asks 
each PlayerAgent which pure strategy he has decided 
to play; then, after having acquired the strategies from 
all players, the RefereeAgent communicates to each 
PlayerAgent both the strategies played and the payoffs 
gained by all players. After k matches of the game G 
have been played, the RefereeAgent communicates all the 
data about the played matches to the ManagerAgent, which 
analyses it and presents the obtained results 
to the user. 

A PlayerAgent is a rational player that, given the 
game definition G, acts to maximise his expected 
utility in each single match of G without considering the 
overall utility that he could obtain in a set of matches. 
In particular, the behaviour of the PlayerAgent $i$ can 
be described by the following main steps: 

1. In the first match the PlayerAgent $i$ chooses to 
play a pure strategy generated at random, considering 
all the playable pure strategies with the same 
probability: if $|A^i| = m$, the probability of choosing 
a pure strategy $s \in A^i$ is $1/m$; 

2. The PlayerAgent $i$ waits for the RefereeAgent to 
ask him which strategy he wants to play, then he 
communicates to the RefereeAgent the chosen 
pure strategy, as computed in step 1 if he is playing 
his first match or in step 4 otherwise; 

3. The PlayerAgent waits for the RefereeAgent to 
communicate to him both the pure strategies played 
and the payoffs gained by all players; 

4. The PlayerAgent decides the mixed strategy to 
play in the next match. In particular, the PlayerAgent 
updates the beliefs about the mixed strategies 
currently adopted by the other players and 
consequently recalculates the strategy able to 
maximise his expected utility. Basically, the PlayerAgent 
$i$ tries to find the strategy $\sigma^i \in \Delta(A^i)$ such 
that for any other strategy $\tau^i \in \Delta(A^i)$, 
$r^i(\tau^i, \sigma^{-i}) \le r^i(\sigma^i, \sigma^{-i})$, where $r^i$ denotes his 
expected payoff and $\sigma^{-i}$ represents his beliefs about 
the mixed strategies currently adopted by all the other 
players, i.e. $\sigma^{-i} = (\sigma^j)_{j \in N, j \ne i}$, $\sigma^j \in \Delta(A^j)$. In 
order to evaluate $\sigma^j$ for each other player $j \ne i$, the 
PlayerAgent $i$ considers the pure strategies played by 
the player $j$ in all the previous matches and computes 
the frequency of each pure strategy; this frequency 
distribution will be the estimate for $\sigma^j$. If there is 
at least one element in the currently computed set 
$\sigma^{-i} = (\sigma^j)_{j \in N, j \ne i}$ that differs from the set $\sigma^{-i}$ 
as computed in the previous match, the PlayerAgent $i$ 
solves the inequality $r^i(\tau^i, \sigma^{-i}) \le r^i(\sigma^i, \sigma^{-i})$, which is 
equivalent to solving the optimization problem 
$P = \{\max(r^i(\sigma^i, \sigma^{-i})),\ \sigma^i \in \Delta(A^i)\}$. It is worth 
noting that P is a linear optimization problem: given 
the set $\sigma^{-i}$, $r^i(\sigma^i, \sigma^{-i})$ is a linear objective function 
in $\sigma^i$ (see the game definition reported in the 
Background Section), and with $|A^i| = m$, $\sigma^i \in \Delta(A^i)$ is a 
vector $\xi \in \mathbb{R}^m$ such that $\sum_{1 \le s \le m} \xi_s = 1$ and 
$\xi_s \ge 0$ for every $1 \le s \le m$, so the constraint 
$\sigma^i \in \Delta(A^i)$ is a set of $m+1$ linear constraints (one 
equality and $m$ inequalities). P is solved by the 
PlayerAgent by using an efficient method for solving 
problems in linear programming, in particular the 
predictor-corrector method of Mehrotra (1992), whose 
complexity is polynomial for both average and 
worst case. The obtained solution for $\sigma^i$ is a pure 
strategy because it is one of the vertices of the 
polytope which defines the feasible region for 
P. The obtained strategy $\sigma^i$ will be played by the 
PlayerAgent $i$ in the next match; $r^i(\sigma^i, \sigma^{-i})$ 
represents the expected payoff to player $i$ in the next 
match; 

5. Back to step 2. 



It is worth noting that a PlayerAgent, for choosing 
the mixed strategy to play in each match of G, does 
not need to know the payoff functions of the other 
players; in fact, for solving the optimization problem 
P it only needs to consider the strategies which have 
been played by the other players in all the previous 
matches. 
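Step 4 above can be sketched for the two-player case as follows. This is not SALENE's implementation: instead of calling a linear-programming solver, the sketch exploits the fact that a linear objective over the simplex $\Delta(A^i)$ attains its maximum at a vertex, and simply compares the expected payoffs of the pure strategies. The function names and the payoff layout (`payoff[s][t]` is the player's own payoff for playing `s` when the opponent plays `t`) are assumptions.

```python
from collections import Counter

def estimate_belief(history, m):
    """Empirical frequency of each of the opponent's m pure strategies,
    computed from the pure strategies observed in previous matches."""
    counts = Counter(history)
    total = len(history)
    return [counts[s] / total for s in range(m)]

def best_pure_response(payoff, belief):
    """Pure strategy maximising the expected payoff against the belief.
    Ties are broken in favour of the lowest strategy index."""
    expected = [sum(p * row[t] for t, p in enumerate(belief)) for row in payoff]
    best = max(range(len(payoff)), key=lambda s: expected[s])
    return best, expected[best]

# Hypothetical example: the "matcher" in matching pennies has seen the
# opponent play strategy 0 twice and strategy 1 once.
matcher_payoff = [[1, -1], [-1, 1]]
belief = estimate_belief([0, 0, 1], m=2)
strategy, value = best_pure_response(matcher_payoff, belief)
```

Note that only the player's own payoff table and the observed history of the opponent enter the computation, in line with the remark above.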

The ManagerAgent receives from the RefereeAgent 
all the data about the k matches of the game G 
and computes an estimate of a Nash Equilibrium of 
G, i.e. an $N$-tuple $\sigma = (\sigma^i)_{i \in N}$, $\sigma^i \in \Delta(A^i)$. In 
particular, in order to estimate $\sigma^i$ (the Nash equilibrium 
strategy of the player $i$), the ManagerAgent computes, on 
the basis of the pure strategies played by the player $i$ 
in each of the k matches, the frequency of each pure 
strategy: this frequency distribution will be the 
estimate for $\sigma^i$. The set $\sigma = (\sigma^i)_{i \in N}$, $\sigma^i \in \Delta(A^i)$ 
computed in this way is then proposed to the user 
together with the data exploited for its estimation. 

SALENE has been implemented using JADE 
(Bellifemine, Poggi, & Rimassa, 2001), a software 
framework allowing for the development of multi- 
agent systems and applications conforming to FIPA 
standards (FIPA, 2006), and tested on different games 
that differ from each other both in the number and in 
the kind of Nash Equilibria. The experiments have 
demonstrated that: 

• if the game has $p \ge 1$ Pure Nash Equilibria and 
  $s \ge 0$ Mixed Nash Equilibria, the agents converge 
  in playing one of the $p$ Pure Nash Equilibria; 
  in these cases, as the behaviour of each 
  PlayerAgent converges with probability one to a Nash 
  Equilibrium of the game, the learning process 
  converges in behaviours to equilibrium (Foster 
  & Young, 2003); 

• if the game has only Mixed Nash Equilibria, 
  while the behaviour of the PlayerAgents does 
  not converge to an equilibrium, the time-average 
  behaviour, i.e. the empirical frequency with which 
  each player chooses his strategy, may converge 
  to one of the mixed Nash Equilibria of the game; 
  that is, the learning process may converge in 
  time average to equilibrium (Foster & Young, 
  2003). 
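Both kinds of convergence can be observed in a small two-player simulation. The sketch below is a simplified stand-in for the system described above, not its actual code: a single function plays the roles of referee, players and manager, best replies are found by comparing pure strategies rather than by linear programming, and the seed, tie-breaking rule and function name are assumptions. Here `r1[s][t]` and `r2[s][t]` are the payoffs of players 1 and 2 when player 1 plays `s` and player 2 plays `t`.

```python
import random

def fictitious_play(r1, r2, rounds, seed=0):
    """Play `rounds` matches; return the list of joint plays and the
    empirical frequency of each player's pure strategies."""
    rng = random.Random(seed)
    m1, m2 = len(r1), len(r1[0])
    counts1, counts2 = [0] * m1, [0] * m2
    a, b = rng.randrange(m1), rng.randrange(m2)   # uniform first move
    plays = []
    for _ in range(rounds):
        plays.append((a, b))
        counts1[a] += 1
        counts2[b] += 1
        # Best replies to the observed counts (proportional to frequencies);
        # ties go to the lowest strategy index.
        a = max(range(m1), key=lambda s: sum(counts2[t] * r1[s][t] for t in range(m2)))
        b = max(range(m2), key=lambda t: sum(counts1[s] * r2[s][t] for s in range(m1)))
    freq1 = [c / rounds for c in counts1]
    freq2 = [c / rounds for c in counts2]
    return plays, freq1, freq2

# Coordination game: two Pure Nash Equilibria, (0, 0) and (1, 1).
COORD = [[2, 0], [0, 1]]
# Matching pennies: a single, mixed Nash Equilibrium (uniform for both).
MP1 = [[1, -1], [-1, 1]]
MP2 = [[-1, 1], [1, -1]]

coord_plays, _, _ = fictitious_play(COORD, COORD, rounds=50)
mp_plays, mp_freq1, mp_freq2 = fictitious_play(MP1, MP2, rounds=10000)
```

In the coordination game the players settle on one pure equilibrium and keep playing it. In matching pennies the plays keep cycling, but the frequencies `mp_freq1` and `mp_freq2` approach the uniform distribution as the number of rounds grows.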

In the next Section the main aspects related to the 
convergence properties of the approach/algorithm 
exploited by the SALENE agents for learning Nash 
Equilibria are discussed within a more general discussion 
of current and future research efforts. 



FUTURE TRENDS 

Innovative approaches such as SALENE, based on the 
concepts of learning and evolution, have shown great 
potential for modelling and efficiently solving 
non-cooperative games. However, as the solutions of the 
games (e.g. Nash Equilibria) are not statically computed 
but are the result of the evolution of a system composed 
of interacting agents, there are several open problems, 
mainly related to the accuracy of the provided solution, 
that need to be tackled to allow these approaches to be 
widely exploited in concrete business applications. 

The approach exploited in SALENE, which derives 
from the Fictitious Play (Robinson, 1951) approach, 
efficiently solves the problem of learning a Nash 
Equilibrium in non-cooperative games which have at 
least one Pure Nash Equilibrium: in such a case the 
behaviour of the players exactly converges to one of 
the Pure Nash Equilibria of the game (convergence in 
behaviours to equilibrium). On the contrary, if the game 
has only Mixed Nash Equilibria, the convergence of the 
learning algorithm is not ensured. Computing ex ante 
when this case happens is quite costly, as it requires 
solving the following problem: "Determining whether a 
strategic game has only Mixed Nash Equilibria", which 
is equivalent to: "Determining whether a strategic game 
does not have any Pure Nash Equilibria". This problem 
is Co-NP complete, as its complement "Determining 
whether a strategic game has a Pure Nash Equilibrium" 
is NP complete (Gottlob, Greco, & Scarcello, 2003). 
As witnessed by the conducted experiments, when a 
game has only Mixed Nash Equilibria there are still 
some cases in which, while the behaviour of the players 
does not converge to an equilibrium, the time-average 
behaviour, i.e. the empirical frequency with which 
each player chooses his strategy, converges to one of 
the Mixed Nash Equilibria of the game (convergence 
in time average to equilibrium). 
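For a game given explicitly as a payoff table, the question of whether a Pure Nash Equilibrium exists can be answered by brute-force enumeration, as in the sketch below. The representation (a dictionary mapping each pure strategy profile to the tuple of payoffs) and the function name are assumptions made for this illustration. The number of profiles to inspect grows exponentially with the number of players, which gives an intuition of why the check is costly.

```python
from itertools import product

def pure_nash_equilibria(game, sizes):
    """Return every pure strategy profile from which no single player
    can profitably deviate. `sizes[i]` is the number of pure strategies
    of player i; `game[profile][i]` is the payoff of player i."""
    equilibria = []
    for profile in product(*(range(m) for m in sizes)):
        stable = True
        for i, m in enumerate(sizes):
            current = game[profile][i]
            for s in range(m):
                deviation = profile[:i] + (s,) + profile[i + 1:]
                if game[deviation][i] > current:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(profile)
    return equilibria

# Matching pennies has only a Mixed Nash Equilibrium ...
PENNIES = {(0, 0): (1, -1), (0, 1): (-1, 1), (1, 0): (-1, 1), (1, 1): (1, -1)}
# ... whereas this coordination game has two Pure Nash Equilibria.
COORDINATION = {(0, 0): (2, 2), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
```

An empty result for `PENNIES` identifies it as a game with only Mixed Nash Equilibria, the case in which convergence of the learning algorithm is not ensured.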

Nevertheless, there are some cases in which there is 
neither convergence in behaviours nor convergence 
in time average to equilibrium; an example of such a 
case is the fashion game of Shapley (1964). An important 
open problem is then represented by the characterization 
of the classes of games for which the learning algorithm 
adopted in SALENE converges; more specifically, the 
classes of games for which the algorithm: (a) converges 
in behaviours to equilibrium (which implies 
the convergence in time average to equilibrium); (b) 
only converges in time average to equilibrium; (c) 
converges neither in behaviours nor in time 
average. Currently, it has been demonstrated that the 
algorithm converges in behaviours or in time average 
to equilibrium for the following classes of games: 

• zero-sum games (Robinson, 1951); 

• games which are solvable by iterated elimination 
  of strictly dominated strategies (Nachbar, 
  1990); 

• potential games (Monderer & Shapley, 1996); 

• 2xN games, i.e. games with 2 players, 2 strategies 
  for one player and N strategies for the other 
  player (Berger, 2005). 

Future efforts will be geared towards: (i) completing 
the characterization of the classes of games for which 
the learning algorithm adopted in SALENE converges 
and evaluating the complexity of solving the membership 
problem for such classes; (ii) evaluating different 
learning algorithms and their convergence properties; 
(iii) letting the user ask for the computation of Nash 
Equilibria with simple additional properties. 

More generally, a wide adoption of the emerging 
agent-based approaches for solving games which 
model concrete business applications will depend on 
the accuracy and the convergence properties of the 
provided solutions; both aspects still need to be fully 
investigated. 



CONCLUSION 

The complexity of NASH, the problem of 
computing Nash Equilibria in non-cooperative games, 
is still debated, but even in the two-player case, the 
best known algorithm has an exponential worst-case 
running time. SALENE, the proposed MAS for learning 
Nash Equilibria in non-cooperative games, can 
be conceived as a heuristic and efficient method for 
computing at least one Nash Equilibrium in a 
non-cooperative game represented in its normal form; indeed, 
the learning algorithm adopted by the PlayerAgents 
has a polynomial running time for both average and 
worst case. SALENE can then be fruitfully exploited for 
efficiently solving non-cooperative games which model 
interesting concrete problems ranging from classical 
economic and finance problems to the emerging ones 
related to the economic aspects of the Internet such as 
TCP/IP congestion, selfish routing, and algorithmic 
mechanism design. 



KEY TERMS 

Computational Complexity Theory: A branch of 
the theory of computation in computer science which 
studies how the running time and the memory require- 
ments of an algorithm increase as the size of the input 
to the algorithm increases. 

Game Theory: A branch of applied mathematics 
and economics that studies situations (games) where 
self-interested interacting players act to maximize 
their returns. 

Heuristic: In computer science, a technique de- 
signed to solve a problem which allows for gaining 
computational performance or conceptual simplicity 
potentially at the cost of accuracy and/or precision of 
the provided solutions to the problem itself. 

Nash Equilibrium: A solution concept of a game 
where no player can benefit by changing his strategy 
unilaterally, i.e. while the other players keep theirs 
unchanged; this set of strategies and the corresponding 
payoffs constitute a Nash Equilibrium of the game. 

NP-Hard Problems: Problems that are at least as 
hard as any problem that can be solved by a 
nondeterministic Turing machine in polynomial time. 

Non-Cooperative Game: A game in which any 
cooperation among the players must be self-enforcing. 

Payoffs: Numeric representations of the utility 
obtainable by a player in the different outcomes of a 
game. 



INTRODUCTION 

Automated Planning (AP) studies the generation of 
action sequences for problem solving. A problem in AP 
is defined by a state-transition function describing the 
dynamics of the world, the initial state of the world and 
the goals to be achieved. According to this definition, 
AP problems seem to be easily tackled by searching 
for a path in a graph, which is a well-studied problem. 
However, the graphs resulting from AP problems are 
so large that explicitly specifying them is not feasible. 
Thus, different approaches have been tried to address 
AP problems. Since the mid '90s, new planning 
algorithms have enabled the solution of practical-size 
AP problems. Nevertheless, domain-independent 
planners still fail to solve complex AP problems, as 
solving planning tasks is a PSPACE-Complete problem 
(Bylander, 1994). 
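The view of planning as path search in a graph that is never written down explicitly can be illustrated with a breadth-first search that generates successor states on demand. This is a generic sketch, not a planner from the literature; the toy domain (reach 10 from 1 using "inc" and "double") and all names are assumptions.

```python
from collections import deque

def breadth_first_plan(initial, is_goal, successors):
    """Return a shortest list of action names leading from `initial` to a
    goal state, or None if no goal state is reachable. `successors(state)`
    yields (action_name, next_state) pairs, so the graph is built lazily."""
    frontier = deque([initial])
    parent = {initial: None}          # state -> (previous state, action)
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            plan = []
            while parent[state] is not None:
                state, action = parent[state]
                plan.append(action)
            return plan[::-1]
        for action, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, action)
                frontier.append(nxt)
    return None

def toy_successors(state, limit=20):
    """Hypothetical state-transition function: two actions on integers."""
    if state + 1 <= limit:
        yield "inc", state + 1
    if state * 2 <= limit:
        yield "double", state * 2

plan = breadth_first_plan(1, lambda s: s == 10, toy_successors)
```

The search only ever stores the states it has visited; in realistic planning domains even that set becomes far too large, which motivates the heuristics and learning techniques discussed in this article.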

How do humans cope with this planning-inherent 
complexity? One answer is that our experience allows 
us to solve problems more quickly; we are endowed 
with learning skills that help us plan when problems 
are selected from a stable population. Inspired by this 
idea, the field of learning-based planning studies the 
development of AP systems able to modify their 
performance according to previous experiences. 

Since the first days, Artificial Intelligence (AI) has 
been concerned with the problem of Machine Learning 
(ML). As early as 1959, Arthur L. Samuel developed 
a prominent program that learned to improve its play 
in the game of checkers (Samuel, 1959). It is hardly 
surprising that ML has often been used to make changes 
in systems that perform tasks associated with AI, such 
as perception, robot control or AP. This article analy- 
ses the diverse ways ML can be used to improve AP 
processes. First, we review the major AP concepts and 
summarize the main research done in learning-based 
planning. Second, we describe current trends in applying 
ML to AP. Finally, we comment on the next avenues 
for combining AP and ML and conclude. 



BACKGROUND 

The languages for representing AP tasks are typically 
based on extensions of first-order logic. They encode 
tasks using a set of actions that represents the state- 
transition function of the world (the planning domain) 
and a set of first-order predicates that represent the 
initial state together with the goals of the AP task (the 
planning problem). In the early days of AP, STRIPS 
was the most popular representation language. In 1998 
the Planning Domain Definition Language (PDDL) 
was developed for the first International Planning 
Competition (IPC) and since that date it has become 
the standard language for the AP community. In PDDL 
(Fox & Long, 2003), an action in the planning domain 
is represented by: (1) the action preconditions, a list 
of predicates indicating the facts that must be true so 
the action becomes applicable and (2) the action post- 
conditions, typically separated in add and delete lists, 
which are lists of predicates indicating the changes in 
the state after the action is applied. 
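As a minimal sketch of this representation, the following Python fragment models a ground STRIPS-style action with a precondition set, an add list and a delete list, and applies it to a state. The class name `Action`, the helper functions and the blocks-world facts are illustrative assumptions; they are not part of PDDL or of any particular planner.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Action:
    """Ground action: preconditions plus add and delete lists (hypothetical type)."""
    name: str
    preconditions: FrozenSet[str]
    add_list: FrozenSet[str]
    delete_list: FrozenSet[str]


def applicable(state: FrozenSet[str], action: Action) -> bool:
    # An action is applicable when every precondition holds in the state.
    return action.preconditions <= state


def apply_action(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    # Remove the delete list, then add the add list.
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable in this state")
    return (state - action.delete_list) | action.add_list


# Illustrative blocks-world action and state (assumed facts).
pick_up_a = Action(
    name="pick-up(a)",
    preconditions=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    add_list=frozenset({"holding(a)"}),
    delete_list=frozenset({"clear(a)", "ontable(a)", "handempty"}),
)
initial_state = frozenset({"clear(a)", "ontable(a)", "handempty"})
next_state = apply_action(initial_state, pick_up_a)
```

A planning problem in this encoding would simply pair an initial state with a set of goal facts; a plan is any action sequence whose successive application reaches a state containing all goals.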

Before the mid-1990s, automated planners could only
synthesize plans of no more than 10 actions in an ac- 
ceptable amount of time. During those years, planners 
strongly depended on speedup techniques for solving 
AP problems. Therefore, the application of search 
control became a very popular solution to accelerate 
planning algorithms. In the late 1990s, a significant scale-up
in planning took place due to the appearance of the
reachability planning graphs (Blum & Furst, 1995)
and the development of powerful domain-independent
heuristics (Hoffmann & Nebel, 2001; Bonet & Geffner,
2001). Planners using these approaches could often
synthesize 100-action plans in just seconds.

At present, there is no such dependence
on ML for solving AP problems, but there is a renewed
interest in applying ML to AP, motivated by three factors:
(1) IPC-2000 showed that knowledge-based planners 
significantly outperform domain-independent planners. 
The development of ML techniques that automatically 
define the kind of knowledge that humans put in these 
planners would bring great advances to the field. (2) 
Domain-independent planners are still not able to cope 
with real-world complex problems. Instead,
these problems are often solved by defining ad hoc planning
strategies by hand. ML promises a way of
automatically defining these strategies. Finally, (3) there is
a need for tools that assist in the definition, validation 
and maintenance of planning-domain models. At the 
moment, these processes are still done by hand. 



LEARNING-BASED PLANNING 

This section describes the current ML techniques 
for improving the performance of planning systems. 
These techniques are grouped according to the target 
of learning: search control, domain-specific planners,
or domain models. 

Learning Search Control 

Domain-independent planners require high search ef- 
fort, so search-control knowledge is frequently used 
to reduce this effort. Hand-coded control knowledge
has proved to be useful in many domains; however, it
is difficult for humans to formalize, as it requires
specific knowledge of the planning domains and the 
planner structure. Since AP's early days, diverse 
ML techniques have been developed with the aim of 
automatically learning search-control knowledge. A 
few examples of these techniques are macro-actions 
(Fikes, Hart & Nilsson, 1972), control-rules (Borrajo 
& Veloso, 1997), and case-based and analogical plan- 
ning (Veloso, 1994). 

At present, most of the state-of-the-art planners
are based on heuristic search over the state space (12 
of the 20 participants in IPC-2006 used this approach). 
These planners achieve impressive performance in 
many domains and problems, but their performance 
strongly depends on the definition of a good domain- 
independent heuristic function. These heuristics are 
computed by solving a simplified version of the planning
task, which ignores the delete list of actions. The solu-
tion to the simplified task is taken as the estimated cost
for reaching the task goals. These kinds of heuristics
provide good guidance across a wide range of different
domains. However, they have some faults: (1) in many
domains, these heuristic functions vastly underestimate
the distance to the goal, leading to poor guidance; (2)
the computation of the heuristic values of the search
nodes is too expensive; and (3) these heuristics are
non-admissible, so heuristic planners do not find good
solutions in terms of plan quality.
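The idea of ignoring delete lists can be illustrated with a deliberately simple estimate: count how many layers of relaxed action application are needed before all goals become reachable. This is only a sketch under assumed names (`relaxed_layers`, actions as `(preconditions, add list, delete list)` triples); it is a simplification for illustration and not the heuristic of any specific planner cited above.

```python
def relaxed_layers(state, goals, actions):
    """Number of relaxed layers needed to reach all goals, or None if unreachable.

    Each action is a (preconditions, add_list, delete_list) triple of sets.
    Delete lists are ignored, so the set of reached facts only grows.
    """
    reached = set(state)
    layers = 0
    while not goals <= reached:
        new_facts = set()
        for preconditions, add_list, _delete_list in actions:
            if preconditions <= reached:
                new_facts |= add_list
        if new_facts <= reached:
            return None  # fixpoint reached without satisfying the goals
        reached |= new_facts
        layers += 1
    return layers


# Hypothetical two-step navigation domain.
toy_actions = [
    ({"at(a)"}, {"at(b)"}, {"at(a)"}),
    ({"at(b)"}, {"at(c)"}, {"at(b)"}),
]
estimate = relaxed_layers({"at(a)"}, {"at(c)"}, toy_actions)
unreachable = relaxed_layers({"at(a)"}, {"at(d)"}, toy_actions)
```

Because every state evaluation repeats this kind of fixpoint computation, the cost per search node is noticeable, which is the expense referred to in point (2).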

Since evaluating a search node in heuristic planning
is so time consuming, De la Rosa, Garcia-Olaya and
Borrajo (2007) proposed using Case-Based Reasoning
(CBR) to reduce the number of explored nodes. Their
approach stores sequences of abstracted state transi-
tions related to each particular object in a problem
instance. Then, given a new problem, these sequences
are retrieved and re-instantiated to support a forward
heuristic search, deciding the order in which nodes are
selected for heuristic evaluation.

In recent years, other approaches have been devel-
oped to minimize the negative effects of the heuristic
through ML. Botea, Enzenberger, Müller and Schaeffer
(2005) learned macro-actions off-line to reduce the
number of evaluated nodes by decreasing the depth of
the search tree. Coles and Smith (2007) learned macro-actions
on-line to escape from plateaus in the search tree
without any exploration. Yoon, Fern and Givan (2006)
proposed an inductive approach that corrects the
domain-independent heuristic in a given domain by
learning a supplement to the heuristic from observa-
tions of solved problems in that domain.
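To make the notion of a macro-action concrete, the sketch below composes two ground STRIPS-style actions into a single action whose effect equals executing them in sequence. The composition rule is the standard set-based one; the `Action` type, the `compose` function and the blocks-world facts are assumptions introduced for illustration, and the code says nothing about how the cited systems select which macros to learn.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def compose(first: Action, second: Action) -> Action:
    """Build a macro-action equivalent to applying `first` and then `second`."""
    # The second action must not need a fact that the first one removes.
    if (second.pre - first.add) & first.delete:
        raise ValueError("actions cannot be executed in sequence")
    return Action(
        name=f"{first.name};{second.name}",
        pre=first.pre | (second.pre - first.add),
        add=(first.add - second.delete) | second.add,
        delete=(first.delete - second.add) | second.delete,
    )


# Hypothetical blocks-world actions.
pick_up_a = Action(
    "pick-up(a)",
    frozenset({"clear(a)", "ontable(a)", "handempty"}),
    frozenset({"holding(a)"}),
    frozenset({"clear(a)", "ontable(a)", "handempty"}),
)
stack_a_b = Action(
    "stack(a,b)",
    frozenset({"holding(a)", "clear(b)"}),
    frozenset({"on(a,b)", "clear(a)", "handempty"}),
    frozenset({"holding(a)", "clear(b)"}),
)
macro = compose(pick_up_a, stack_a_b)
```

Adding `macro` to the action set lets a planner reach the stacked configuration in one search step instead of two, at the price of a larger branching factor.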

All these methods for learning search-control knowl- 
edge suffer from the utility problem. Learning too much 
control knowledge can actually be counterproductive 
because the difficulty of storing and managing the 
information and the difficulty of determining which 
information to use when solving a particular problem 
can interfere with efficiency. 

Learning Domain-Specific Planners 

An alternative approach to learning search control con- 
sists of learning domain-specific planning programs. 
These programs receive as input a planning problem 
of a fixed domain and return a plan that solves the
problem.

The first approaches to learning domain-specific plan-
ners were based on supervised inductive learning; they
used genetic programming (Spector, 1994) and deci-
sion-list learning (Khardon, 1999), but they were not
able to reliably produce good results. More recently, Winner
and Veloso (2003) presented a different approach based
on generalizing an example plan into a domain-specific
planning program and merging the resulting source
code with the previous ones.

Domain-specific planners are also represented as
policies, i.e., pairs consisting of a state and the preferred action
to be executed in that state. Relational Reinforcement
Learning (RRL) (Dzeroski, De Raedt & Blockeel, 1998)
has aroused interest as an efficient approach for learning
policies for relational domains. RRL includes a set of
learning techniques for computing the optimal policy
for reaching the given goals by exploring the state
space through trial and error. The major benefit of these
techniques is that they can be used to solve problems
whether the action model is known or not. On the other
hand, since RRL does not explicitly include the task
goals in the policies, new policies have to be learned
every time a new goal has to be achieved, even if the
dynamics of the environment have not changed.

In general, domain-specific planners have to deal 
with the problem of generalization. These techniques 
build planning programs from a given set of solved
problems, so they cannot theoretically guarantee solving
subsequent problems.

Learning Domain Models 

No matter how efficient a planner is, if it is fed with a 
defective domain model, it will return defective plans. 
Designing, encoding and maintaining a domain model 
is very laborious. At present, planners are the
only tool available to assist in the development of an AP 
domain model, but planners are not designed specifi- 
cally for this purpose. Domain model learning studies 
ML mechanisms to automatically acquire the planning 
action schemas (the action preconditions and post-con- 
ditions) from observations of action executions. 
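A very naive version of this idea, for a single ground action in a deterministic and fully observable setting, can be sketched as follows: the preconditions are hypothesized as the facts common to all states in which the action was executed, and the effects as the observed state differences. The function name `induce_schema` and the example observations are assumptions; this is not the algorithm of any of the works cited in this article.

```python
def induce_schema(observations):
    """Induce (preconditions, add_list, delete_list) from (before, after) state pairs.

    Assumes a deterministic, fully observable domain and a single ground action.
    """
    before_states = [before for before, _after in observations]
    # Facts that held every time the action was executed.
    preconditions = frozenset.intersection(*before_states)
    # Facts that became true or false in at least one execution.
    add_list = frozenset().union(*(after - before for before, after in observations))
    delete_list = frozenset().union(*(before - after for before, after in observations))
    return preconditions, add_list, delete_list


# Two hypothetical observations of the action pick-up(a).
observations = [
    (frozenset({"clear(a)", "ontable(a)", "handempty", "ontable(b)"}),
     frozenset({"holding(a)", "ontable(b)"})),
    (frozenset({"clear(a)", "ontable(a)", "handempty", "on(b,c)"}),
     frozenset({"holding(a)", "on(b,c)"})),
]
preconditions, add_list, delete_list = induce_schema(observations)
```

With few observations the induced preconditions are usually too specific; each additional observation can only shrink the intersection, so the hypothesis generalizes as more executions are seen.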

Learning domain models in deterministic environ- 
ments is a well-studied problem; diverse inductive 
learning techniques have been successfully applied 
to automatically define the action schemas from ob-
servations (Shen & Simon, 1989; Benson, 1997;
Yang, Wu & Jiang, 2005; Shahaf & Amir, 2006). In
stochastic environments, this problem becomes more
complex. Actions may result in innumerable different
outcomes, so more elaborate approaches are required.
Pasula, Zettlemoyer and Kaelbling (2004) presented
the first specific algorithm to learn simple stochastic
actions without conditional effects. This algorithm is
based on three levels of learning: the first consists
of deterministic rule-learning techniques to induce
the action preconditions; the second relies on a
search for the set of action outcomes that best fits the
execution examples; and the third consists of
estimating the probability distributions over the set of
action outcomes. However, stochastic planning algorithms do
not need to consider all the possible action outcomes.
Jimenez and Cussens (2006) proposed learning complex
action-effect models (including conditions) for only
the relevant action outcomes. Thus, planners generate
robust plans by covering only the most likely execution
outcome while leaving others to be completed when
more information is available.

In deterministic environments, (Shahaf & Amir, 
2006) introduced an algorithm that exactly learns 
STRIPS action schemas even if the domain is only 
partially observable. In stochastic environments, however,
there is still no general efficient approach to learning ac-
tion models.



FUTURE TRENDS 

Since the appearance of the first PDDL version in IPC-1998,
the standard planning representation language has
evolved to bring together AP algorithms and real-world
planning problems. Nowadays, the PDDL 3.0 version
for the IPC-2006 includes numeric state variables to 
support quality metrics, durative actions that allow ex- 
plicit time representation, derived predicates to enrich 
the descriptions of the system states, and soft goals 
and trajectory constraints to express user preferences 
about the different possible plans without discarding 
valid plans. However, most of these new features are not
handled by the state-of-the-art planning algorithms.
Existing planners usually fail to solve problems that
define quality metrics. The issue of goal and trajectory
preferences has only been initially addressed. Time
and resources add such extra complexity to the search
process that a real-world problem becomes extremely
difficult to solve. New challenges for the AP community
are those related to developing new planning algorithms
and heuristics to deal with these kinds of problems. As
it is very difficult to find an efficient general solution,
ML must play an important role in addressing these
new challenges because it can be used to alleviate the
complexity of the search process by exploiting regular-
ity in the space of common problems.

In addition, state-of-the-art planning algorithms
need a detailed domain description to efficiently solve
the AP task, but new applications such as controlling un-
derwater autonomous vehicles or Mars rovers imply
planning in environments where the dynamics model
may not be easily accessible. There is a current need for
planning systems to be able to acquire information about
their execution environment. Future planning systems
have to include frameworks that allow the integration 
of the planning and execution processes together with 
domain modeling techniques. 

Traditionally, learning-based planners are evaluated 
only against the same planner but without learning, in 
order to prove their performance improvement. Addi- 
tionally, these systems are not exhaustively evaluated; 
typically the evaluation only focuses on a very small 
number of domains, so these planners are usually quite 
fragile when encountering new domains. Therefore, 
the community needs a formal methodology to validate 
the performance of the new learning-based planning 
systems, including mechanisms to compare different 
learning-based planners. 

Although ML techniques improve planning systems, 
existing research cannot theoretically demonstrate 
that they will be useful in new benchmark domains. 
Moreover, for the time being, it is not possible to formally
explain the underlying meaning of the learned knowl-
edge (i.e., does the acquired knowledge subsume a task
decomposition? a goal ordering? a solution path?).
This point suggests that future research in AP and ML
will also focus on theoretical aspects that address these
issues.



CONCLUSION 

Generic domain-independent planners are still not able 
to address the complexity of real planning problems. 
Thus, most planning systems implemented in applica- 
tions require additional knowledge to solve the real 
planning tasks. However, the extraction and compilation 
of this specific knowledge by hand is complicated. 

This article has described the main recent advances
in developing planners successfully assisted by ML
techniques. Automatically learned knowledge is useful
for AP in diverse ways: it helps planners in guiding
search processes, in completing domain theories or in
specifying particular solutions to a particular problem.
However, the learning-based planning community must
focus not only on developing new learning techniques
but also on defining formal mechanisms to validate their
performance against other generic planners and against
other learning-based planners.



KEY TERMS

Control Rule: IF-THEN rule to guide the planning 
search-tree exploration. 

Derived Predicate: Predicate used to enrich the 
description of the states that is not affected by any of 
the domain actions. Instead, the predicate truth values 
are derived by a set of rules of the form if formula(x) 
then predicate(x). 

Domain-Independent Planner: Planning system
that addresses problems without specific knowledge of 
the domain, as opposed to domain-dependent planners, 
which use domain-specific knowledge. 

Macro-Action: Planning action resulting from
combining the actions that are frequently used together 
in a given domain. Used as control knowledge to speed 
up plan generation. 

Online Learning: Knowledge acquisition during 
a problem-solving process with the aim of improving 
the rest of the process. 

Plateau: Portion of a planning search tree where 
the heuristic value of nodes is constant or does not 
improve. 

Policy: Mapping between the world states and the 
preferred action to be executed in order to achieve a 
given set of goals. 

Search Control Knowledge: Additional knowledge 
introduced to the planner with the aim of simplifying 
the search process, mainly by pruning unexplored 
portions of the search space or by ordering the nodes 
for exploration.



INTRODUCTION

The aim of this paper is to present a typology of career 
paths in France drawn up with the Kohonen algorithm 
and its extension to a clustering method of life history 
analysis based on the use of Self Organizing Maps 
(SOMs). Several methods have previously been pre- 
sented for transforming qualitative into quantitative in- 
formation so as to be able to apply clustering algorithms 
such as SOMs based on the Euclidean distance. Our 
approach consists in performing quantitative encod- 
ing on labor market situation proximities across time. 
Using SOMs, the preservation of the topology also 
makes it possible to check whether this new method of 
encoding preserves the particularities of the life history 
according to our economic approach to careers. Lastly, 
this quantitative encoding preprocessing, which can 
be easily applied to analysis methods of life history, 
completes the set of methods extending the use of SOM 
to qualitative data. 



BACKGROUND 

Several methods are generally used to study the dy- 
namic aspects of careers. The first method, which 
estimates some reduced-form transition models, has 
been extensively used in labor microeconometrics, 
using event-history models for continuous-time data 
or discrete-time panel data with Markov processes. 
Those of the second kind, which include the method 
presented here, are sequence analysis methods deal- 
ing with complex information about individual labor 
market histories, such as the various states undergone, 
the duration of the spells, multiple transitions between 
the states, etc. The idea was to empirically generate a
statistical typology of sequences by performing cluster
analysis (Lebart, 2006). This method thus makes it pos-
sible to define "cluster paths" constituting endogenous
variables and explained in terms of individual charac-
teristics such as gender, educational level or parental
socio-economic status. The optimal matching method,
which has been widely used in social science since the
pioneering paper by Abbott (Abbott & Hrycak, 1990),
is an attractive solution for analysing longitudinal data
of this kind. The basic idea underlying this method is
to take a pair of sequences and calculate the cost of
transforming them into each other by performing a
series of elementary operations (insertion, deletion and
substitution). However, this method has been heavily
criticized because it may be difficult to determine the
costs to assign to these elementary operations. Here we adopt
another strategy. First, in order to classify sequences
into groups, we have defined a measure of the distance
between trajectories, which is coherent with our data
and with some well-known theoretical hypotheses in
the field of labor economics. We then use Self Organiz-
ing Maps (the Kohonen algorithm) for classification
purposes.
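For comparison with the approach adopted here, the optimal matching idea mentioned above can be sketched as a standard dynamic-programming edit distance. The function name, the default costs and the one-letter state labels are assumptions made for illustration; choosing the insertion/deletion and substitution costs is exactly the arbitrary step for which the method has been criticized.

```python
def optimal_matching_cost(seq_a, seq_b, indel_cost=1.0, substitution_cost=2.0):
    """Minimal cost of turning seq_a into seq_b by insertions, deletions, substitutions."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * indel_cost  # delete every element of seq_a
    for j in range(1, cols):
        cost[0][j] = j * indel_cost  # insert every element of seq_b
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + indel_cost,
                cost[i][j - 1] + indel_cost,
                cost[i - 1][j - 1] + (0.0 if same else substitution_cost),
            )
    return cost[-1][-1]


# Hypothetical monthly careers: U = unemployed, F = fixed term, P = permanent.
career_a = "UUFFPP"
career_b = "UFFPPP"
distance = optimal_matching_cost(career_a, career_b)
```

Changing `indel_cost` or `substitution_cost` can change which careers are considered close, which is why a distance grounded in the data and in economic hypotheses is preferred in this study.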

Self Organizing Maps (see Kohonen, 2001; Fort,
2006) are known to be a powerful clustering and pro-
jection method. Since this method accounts efficiently
for changes occurring with time, SOMs yield accurate
predictions (see for example Cottrell, Girard & Rous-
set, 1998; Dablemont, Simon, Lendasse, Ruttiens,
Blayo & Verleysen, 2003; Souza, Barreto & Mota,
2005). Life histories can be considered as a qualita-
tive record of information, while SOMs are based on
the Euclidean distance. Many attempts have been made to
transform qualitative variables into quantitative ones,
for example using the Burt description (see the KACM
presentation in Cottrell & Letremy, 1995) or
multidimensional scaling (Miret, Garcia-Lagos, Joya,
Arazoza & Sandoval, 2005). In our approach, the
quantitative recoding focuses on the proximity between
items, considering the particularities of the data (a life his-
tory) according to our economic approach. Once the
recoding preprocessing has been performed, Self Organizing
Maps are a useful clustering tool, first because of their
aforementioned clustering and projection qualities and
also because of their ability to reveal the efficiency of our
new encoding.



CLUSTERING LIFE HISTORY WITH SOM

An Example of a Life History 

Career Paths 

Labor economists have generally assumed that the 
beginning of a career results from a matching process 
(Jovanovic, 1979). Employers and job seekers lack
information about each other: employers need to know 
how productive their potential employee is and job 
applicants want to know whether the characteristics of 
the job correspond to their expectations. Job turnover 
and temporary employment contracts can therefore 
be viewed as the consequences of this trial-and-error
process. However, individuals' first employment situ- 
ations may also act as a signal of employability to the 
labor market. For example, a long spell of unemploy- 
ment during the first years in a person's career may be 
interpreted by potential employers as a sign of low work
efficiency; whereas working at a temp agency may 
be regarded as a sign of motivation and adaptability. 
This is consistent with the following path dependency 
hypothesis: the influence of past job experience on the 
subsequent career depends on the "cost" associated 
with the change of occupational situation. However, 
empirical studies have shown that employers mainly 
recruit on the basis of recent work experience (Allaire, 
Cahuzac & Tahar, 2000). The effects of less recent 
employment situations on a person's career therefore 
decrease over time. 

Data 

The data used in this study were based on the "Genera-
tion 98" survey carried out by Céreq: 22,000 young
people who had left initial training in 1998 at all levels
and in all training specializations were interviewed in
spring 2001 and 2003 and autumn 2005. This sample
was representative of the 750,000 young people leav-
ing the education system for the first time that year in
France. This survey provided useful information about
the young people's characteristics (their family's socio-
economic status, age, highest grade completed, highest
grade attended, discipline, any jobs taken during their
studies, work placement) and the month-by-month
work history from 1998 to 2005. We therefore have
a complete and detailed record of the labor market
status of the respondents during the 88-month period
from July 1998 to November 2005. Employment spells
were coded as follows, depending on the nature of
the labor contract: 1 = permanent labor contract, 2 =
fixed term contract, 3 = apprenticeship contract, 4 =
public temporary labor contract, 5 = interim/temping.
Non-employment situations were coded as follows: 6
= unemployment, 7 = inactivity, 8 = military service,
9 = at school.

Preprocessing Phase: Life History Encoding

The encoding of the trajectories involved a two-step
preprocessing phase: defining a distance between
states that includes time dynamics, and the resulting
quantitative encoding of trajectories. These two steps
reflect the specific structure of life
history samples: the items of the variables (the states) are
qualitative information, while the order of the variables
records quantitative information (the timing and
the duration of events).

The Distance Between Situations 

Working with pairs (state, time), called situations, makes
it possible to include the time dynamics in the proximities
between occupational states. The proximity between
two situations is measured on the basis of their common
future, in line with our economic approach. A situation
is regarded as a potential for its own future, depending
on its influence on this future. The similarity between
two situations is deduced from a comparison of
their respective potentials. The potential future P_S of a
situation S among n monthly periods and p states is
defined as the p×n-dimensional vector given in (1). Its
components P_{S,S'} are the product of the terms φ and β. φ
measures the flow between situation S and any situation
S' as the empirical probability of reaching S' starting
from S. It is also the empirical probability of an indi-
vidual i being in a future situation S', conditional
on being in S at present. The coefficient of temporal
inertia β weights the influence of S' on P_S according
to the economic approach. It is a decreasing function
of the time delay (t'-t). In the career paths application,
the function chosen is the inverse of the delay and for
the past. Lastly, α ensures that potential futures P_S will
be profiles. The natural distance between situations is
therefore the χ² distance between their potential future
profiles.



$$P_{S=(s,t)} = \alpha \,\bigl(\varphi(S, S')\,\beta(t'-t)\bigr)_{S'=(s',t'),\; s'=1,\dots,p,\; t'=1,\dots,n} \tag{1}$$

The Trajectory Encoding

In the present case of equi-weighting, the inertia of the
space of situations results from the distances previ-
ously computed. The principal components of inertia,
called here principal events, can therefore be deduced.
The term "event" refers to a combination of the point
in time, the duration and the occupational status. The
quantitative encoding of trajectories proposed here
results from their description in the "events" space.

The process used here is in line with that of J.P. Benzécri
(Benzécri, 1973), which explains how, given
a set of situations {S_i}, its center of gravity
G and the matrix of squared distances
between elements S and S', one can deduce the
matrix of scalar products A between any vectors GS
and GS' with formula (2). Applied to the situations, the
principal components of inertia (the principal events)
are computed as the principal component vectors of
the matrix A.

Trajectories can then be described in the principal
events space: performing the traditional binary encod-
ing (3) of the trajectory T_i is equivalent to performing
a linear encoding through the situations (4) and then
also through the principal events E (5).
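The preprocessing chain described in this section can be sketched in plain Python on toy data: potential futures of situations, χ² distances between these profiles, and the scalar-product matrix obtained by double centring the squared distances (the classical identity ⟨GS, GS'⟩ = ½ [d²(G,S) + d²(G,S') − d²(S,S')]). Several points are assumptions made for this sketch and are not confirmed by the text: past situations receive a zero weight, the χ² weights are the mean profile under equal weighting of situations, and all function names and the toy trajectories are hypothetical.

```python
from itertools import product


def potential_future(trajectories, situation, n_states):
    """Profile of the weighted future of situation (s, t) over all (s', t')."""
    s, t = situation
    n_periods = len(trajectories[0])
    at_s = [traj for traj in trajectories if traj[t] == s]
    vector = []
    for s2, t2 in product(range(n_states), range(n_periods)):
        if t2 <= t or not at_s:
            vector.append(0.0)  # assumption: the past carries zero weight
            continue
        phi = sum(traj[t2] == s2 for traj in at_s) / len(at_s)  # empirical flow
        beta = 1.0 / (t2 - t)  # inverse of the delay
        vector.append(phi * beta)
    total = sum(vector)  # alpha = 1 / total turns the vector into a profile
    return [v / total for v in vector] if total > 0 else vector


def chi2_distance_sq(p, q, weights):
    """Squared chi-square distance between two profiles (assumed column weights)."""
    return sum((a - b) ** 2 / w for a, b, w in zip(p, q, weights) if w > 0)


def scalar_products(d2):
    """Double centring of a squared-distance matrix (equal weights)."""
    m = len(d2)
    row_mean = [sum(row) / m for row in d2]
    grand_mean = sum(row_mean) / m
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(m)] for i in range(m)]


# Toy data: 5 individuals, 4 months, states 0..2 (all hypothetical).
trajectories = [[2, 1, 0, 0], [2, 1, 1, 0], [2, 2, 1, 0], [1, 1, 0, 0], [2, 2, 2, 1]]
n_states, n_periods = 3, 4
# Observed situations that still have a future (the last month is excluded).
situations = sorted({(traj[t], t) for traj in trajectories for t in range(n_periods - 1)})
profiles = [potential_future(trajectories, S, n_states) for S in situations]
weights = [sum(column) / len(profiles) for column in zip(*profiles)]
d2 = [[chi2_distance_sq(p, q, weights) for q in profiles] for p in profiles]
gram = scalar_products(d2)
```

The principal events would then be obtained from an eigen-decomposition of `gram` (for instance with a numerical library), which is omitted here to keep the sketch dependency-free.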



CLASSIFICATION OF LIFE HISTORIES 
WITH SOM: A TYPOLOGY OF CAREER 
PATHS IN FRANCE AND ECONOMIC 
INTERPRETATIONS 

The result of the typology of career paths in France using 
Self Organizing Maps with a 10x10 grid is presented 
in Figure 1. In each unit of the map, a chronogram
describes the characteristics of the class (the career 
path). Chronograms show the evolution in time of the 
proportional contribution (in percentage) of each state 
to the classes. On the one hand, the SOM topology 
reflects the continuity of the evolution in time. On the 
other hand, similarities between situations give rise 
to the mixing of classes (see for example cluster 95) 
or proximities on the map between two clusters (for 
example, between clusters 71 and 72, in the eighth line, first
and second column), although few individuals are in
the same state at the same time. The map thus makes 
it possible to assess the efficiency of the encoding 
process. 
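For readers unfamiliar with the Kohonen algorithm, a minimal on-line SOM can be sketched as follows. It is a generic textbook version written for illustration: the function names, the linear decay of the learning rate and neighborhood radius, the 3x3 grid and the toy two-dimensional data are all assumptions, and the study itself uses a 10x10 grid on the encoded trajectories.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinates of the unit whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))


def train_som(data, rows, cols, iterations=2000, lr0=0.5, seed=0):
    """Train a rows x cols map on `data` with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        frac = step / iterations
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(0.5, sigma0 * (1.0 - frac))  # shrinking neighborhood radius
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for unit, w in weights.items():
            grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])   # move the unit toward the sample
    return weights


# Hypothetical encoded trajectories, reduced to two coordinates in [0, 1].
toy_data = [[0.05, 0.10], [0.10, 0.05], [0.90, 0.95], [0.95, 0.90], [0.50, 0.50]]
som = train_som(toy_data, rows=3, cols=3, iterations=500)
clusters = [best_matching_unit(som, x) for x in toy_data]
```

Because neighboring units are updated together, units that are close on the grid end up with similar weight vectors, which is the topology preservation used above to judge the encoding.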

The Kohonen map displays a concise vision of the
types of career paths occurring during the first seven
years of working life. In general, most of the clusters
describe a direct school-to-work transition process,
which can be characterized by an immediate access
to a permanent contract or an indirect access during
the first few years. In the upper-left-hand corner of
the map, mainly the five clusters of the first two lines,
career paths are characterized by a high level of ac-
cess to employment with permanent contracts. Young
people rapidly gained access to a permanent contract
and kept it until the end of the observation period. In
the upper-right-hand corner of the map, access to a per-
manent contract was less direct: during the first year,
young people first obtained a temporary contract or
spent ten months in compulsory military service before
obtaining a permanent contract. Compared with the upper part of the
map, the bottom part describes career paths with longer
periods of temporary contracts and/or unemployment.
In the lower-right-hand corner, access to a permanent
contract becomes rare during the first few years.
However, more than ninety per cent of young people
have obtained a final permanent position (in the last
column of the map). In the bottom lines, a five-year
public policy contract called "Emploi Jeunes" features
strongly instead of the classical fixed term contract. The
lower-left-hand corner of the map shows more unstable
trajectories which end in a temporary position: seven
years after leaving school, people have a temporary
contract (in the last two clusters in the first column)
or are unemployed (second and third cells on the last
line). The chronograms situated in the middle-left-
hand part of the map highlight how useful the longitudinal
approach is for understanding the complexity
of transition processes: the young people here were
directly recruited for a permanent job, but five or six
years after graduating, they lost this permanent job.
This turn of events can be explained by the strong
change in the economic environment which occurred
during this period; the years 2003 and 2004 correspond
to a dramatic growth of youth unemployment on the
French labor market.

Figure 1. Typology of career paths with SOM: each unit on the map gives a chronogram of the evolution in time
of the proportional contribution (as a percentage) of any occupational position. Two populations of close units
have similar chronograms.

What role does each individual characteristic play in
the development of these career paths? Several factors
may explain the labor market opportunities of school-
leavers: human capital factors, parents' social class,
and other factors responsible for inequalities on the
labor market, such as parents' nationality and gender.
The distribution of these characteristics was included
graphically in each cell on the map. Figure 2, which
gives the distribution in terms of educational level,
clearly shows that educational level strongly affected
the career path. Higher educational graduates feature
much more frequently in the upper-left-hand corner of
the map, whereas school leavers without any qualifica-
tions occur more frequently in the bottom part. Figure
3 shows a similar pattern as far as gender is concerned:
there are much higher percentages of females than males
in the most problematic career paths, which suggests
the occurrence of gender segregation or discriminatory
practices on the French labor market. The differences
are less conspicuous as far as the father's nationality
is concerned (Figure 4). However, the results obtained
here also suggest that children with French parents
have better chances than the others of finding "safe"
career paths.

A Longitudinal Analysis of Labour Market Data with SOM

Figure 2. Career path typology by educational levels
(dark = without qualifications; medium = secondary educational graduates; light = higher educational graduates).




Figure 3. Career path typology by gender 



Figure 4. Career path typology by father's nationality



FUTURE TRENDS

The relevance of the method presented here concerns
two aspects: the preprocessing and the SOM results. The
advantages of SOM depend on the distance chosen,
which must enable the associated algorithm to preserve
the proximities between situations. The relevance of
the preprocessing stage in the method will therefore
be confirmed if it enhances the reliability of the SOMs
(Debodt, Cottrell & Verleysen, 2002, and Rousset,
Guinot & Maillet, 2006, have presented a method of
measuring reliability and a method of increasing it, respec-
tively). On the other hand, the reliability also depends
on the choice of the future weighting function, function
β in formula (1). Consequently, function β could
be determined here with a view to the reliability of
the SOM results. Unfortunately, this
approach may be counter-productive in some cases. In
the case of career paths, for example, it would lead to
weighting the long term future, which would increase
the robustness but would not be suitable from the point
of view of the economic investigation. This problem
arises in many general contexts where the main ef-
fects of the present on the future are short term effects,
whereas the reliability increases in the long term. The
main criterion used to choose function β must therefore
be the topic of interest.

Further studies are now required to improve the
reliability of this method: first, function β needs to be
defined more closely and secondly, the validity of the
method needs to be tested after enhancing the reliability
(that of the SOM topology obtained after performing
the preprocessing step described above). It might also
be worth investigating the use of Markovian models
to define function β in particular, as well as to study
career paths in general. This method will also have to
be applied in the future to other samples.



CONCLUSION 

The aim of this study was to analyze the early careers
of French school leavers using Self-Organizing Maps.
This empirical analysis showed that career paths are
strongly segmented. Although most of the "career
paths" studied were characterized by stabilization on
the labor market at some point or another, some of them
show the great difficulties encountered by labor market
entrants. Obtaining a permanent contract does not
actually guarantee life-long employment. In addition,
the econometric analysis carried out in the second part
of this study shows that the diversity of career paths
can be partly explained by the educational levels and
individual characteristics of school leavers.

In the present method of analyzing information on
individuals' trajectories in time through a finite number
of states, two important aspects are combined: the
encoding of the data and the analysis of the data presented
in the form of SOMs. The first aspect avoids the well
known problem of skew present with qualitative
encoding, including when it is linked to the evolution in
time. Self-organizing maps are a natural approach to the
data analysis, since this tool combines the advantages
of clustering and representation methods. The method
described here turns out to be an efficient means of
investigating changes with time and the proximities
between situations. In addition, the preservation of
the topology was found to be a useful property, which
makes it possible to assess the efficiency of the
recoding. In conclusion, the method presented here could
easily be used to analyze any life history.

KEY TERMS

Career Paths: Sequential monthly positions among
several pre-defined working categories.

Distributional Equivalency: Property of a distance
that allows two modalities of the same variable
having identical profiles to be grouped into a new
modality weighted with the sum of the two weights.

Markov Process: Stochastic process in which the
new state of a system depends on the previous state or
a finite set of previous states.

Optimal Matching: Statistical method originating
in biology that compares two sequences on the basis
of predefined substitution costs.

Preservation of Topology: After learning, observa- 
tions associated to the same class or to « close » classes 
according to the definition of the neighborhood and 
given by the network structure are « close » according 
to the distance in the input space. 

Self-Organizing Maps by Kohonen: An unsupervised
neural network method of vector quantization
widely used in classification. Self-Organizing Maps
are much appreciated for their topology preservation
property and their associated data representation
system. These two additive properties come from a
pre-defined organization of the network that is at the
same time a support for the topology learning and its
representation.

χ² Distance: Distance having certain specific
properties such as distributional equivalency.

INTRODUCTION

To adapt to users' input and tasks, an interactive system
must be able to establish a set of assumptions about
users' profiles and task characteristics, which is often
referred to as a user model. However, to develop a user
model, an interactive system needs to analyze users'
input and recognize the tasks and the ultimate goals
users are trying to achieve, which may involve a great
deal of uncertainty. In this chapter the approaches
for handling uncertainty are reviewed and analyzed.
The purpose is to provide an analytical overview and
perspective concerning the major methods that have
been proposed to cope with uncertainties.

Approaches for Handling Uncertainties 

For a long time, the Bayesian model has been the primary
numerical approach for representation and inference
with uncertainty. Several mathematical models that
depart from the probability perspective have also
been proposed. The main ones are Shafer-Dempster's
Evidence Theory (Belief Function) (Shafer, 1976;
Dempster, 1976) and Zadeh's Possibility Theory
(Zadeh, 1984). There have also been some attempts to
handle the problem of incomplete information using
classical logic. Many approaches to default reasoning
logic have been proposed, and the study of non-monotonic
logic has gained much attention. These approaches can
be classified into two categories: numerical approaches
and non-numerical approaches.

1. Probability and Bayesian Theory. There is support
for the theoretical necessity and justification
of using a probability framework for knowledge
representation, evidence combination and propagation,
learning ability, and clarity of explanation
(Buchanan and Smith, 1988). Bayesian processing
remains the fundamental idea underlying many
new proposals that claim to handle uncertainty
efficiently.

In all the practical developments to date, the Bayesian
formula and probability values have been used as
some kind of coefficients to augment deterministic
knowledge represented by production rules (Barr and
Feigenbaum, 1982). Some intuitive methods for
combination and propagation of these values have been
suggested and used. One such case is the use of Certainty
Factors (CF) in MYCIN (Shortliffe and Buchanan,
1976). Rich also uses a simplified CF approach in the user
modeling system GRUNDY (Rich, 1979).
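
As an illustration of such an intuitive combination method, the sketch below applies the commonly cited textbook rule for merging two certainty factors that bear on the same hypothesis. The function name combine_cf is hypothetical, and the formula is the form usually attributed to MYCIN in the literature, not something specified in this chapter.

```python
def combine_cf(a: float, b: float) -> float:
    """Combine two certainty factors in [-1, 1] that concern the same hypothesis.

    Assumption: the textbook MYCIN-style rule, used here only as an illustration.
    """
    if not (-1.0 <= a <= 1.0 and -1.0 <= b <= 1.0):
        raise ValueError("certainty factors must lie in [-1, 1]")
    if a >= 0 and b >= 0:
        # Two confirming pieces of evidence reinforce each other.
        return a + b * (1 - a)
    if a < 0 and b < 0:
        # Two disconfirming pieces of evidence reinforce each other.
        return a + b * (1 + a)
    # Mixed evidence: the weaker factor discounts the stronger one.
    denom = 1 - min(abs(a), abs(b))
    if denom == 0:
        raise ValueError("total conflict (+1 and -1) cannot be combined")
    return (a + b) / denom
```

The result is a single number, which is exactly what the objections in the following paragraph are aimed at: nothing in the combined value records how it was obtained.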

However, some objections against such probabilistic
methods of accounting for uncertainty have been raised
(Kanal and Lemmer, 1986). One of the main objections
is that these values lack any definite semantics because
of the way they have been used. Using a single number
to summarize uncertainty information has always been
a contested issue (Heckerman, 1986).

The Bayesian approach requires that each piece
of evidence be conditionally independent. It has been
concluded that the assumptions of conditional
independence of the evidence under the hypotheses are
inconsistent with the other assumptions of an exhaustive
and mutually exclusive space of hypotheses. Specifically,
Pednault et al. (1981) show that, under these
assumptions, no probabilistic update could take place if
there were more than two competing hypotheses. Pearl
(1985) suggests that the assumption of conditional
independence of the evidence under the negation of
the hypotheses is over-restrictive. For example, if the
inference process contains multiple paths linking the
evidence to the same hypothesis, the independence is
violated. Similarly, the required mutual exclusiveness
and exhaustiveness of the hypotheses are not very
realistic. This assumption would not hold if more than one
hypothesis occurred simultaneously and is as restrictive
as the single-fault assumption of the simplest
diagnosing systems. This assumption also requires that
every possible hypothesis is known a priori. It would
be violated if the problem domain were not suitable to
a closed-world assumption.
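
The assumptions discussed above can be made concrete with a minimal sketch of Bayesian updating over an exhaustive, mutually exclusive set of hypotheses, with each piece of evidence treated as conditionally independent given the hypothesis. The function name posterior, the hypothesis labels and the numbers in the usage note are all hypothetical.

```python
from math import prod

def posterior(priors, likelihoods):
    """Posterior over hypotheses after observing several pieces of evidence.

    priors:      dict mapping hypothesis -> P(h); assumed exhaustive and
                 mutually exclusive, so the values sum to 1.
    likelihoods: dict mapping hypothesis -> list of P(e_i | h), one entry per
                 observed piece of evidence; multiplying them is the
                 conditional independence assumption.
    """
    joint = {h: priors[h] * prod(likelihoods[h]) for h in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: p / total for h, p in joint.items()}
```

For example, posterior({"novice": 0.5, "expert": 0.5}, {"novice": [0.8, 0.5], "expert": [0.2, 0.5]}) assigns about 0.8 to "novice". A hypothesis that is missing from priors can never receive any belief, which is the closed-world limitation noted above.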

Perhaps the most restrictive limitation of the Bayesian
approach is its inability to represent ignorance.
The Bayesian view of probability does not allow one
to distinguish uncertainty from ignorance. One cannot
tell whether a degree of belief was directly calculated
from evidence or indirectly inferred from an absence
of evidence. In addition, this method requires a large
amount of data to determine the estimates for prior and
conditional probabilities. Such a requirement becomes
manageable only when the problem can be represented
as a sparse Bayesian network that is formed by a
hierarchy of small clusters of nodes. In this case, the
dependencies among variables (nodes in the network)
are known, and only the explicitly required conditional
probabilities must be obtained (Pearl, 1988).

2. The Dempster-Shafer Theory of Evidence. The 
Dempster-Shafer theory, proposed by Shafer 
(Shafer, 1976), was developed within the frame- 
work of Dempster's work on upper and lower 
probabilities induced by a multi-valued mapping 
(Dempster, 1967). Like Bayesian theory, this 
theory relies on degrees of belief to represent 
uncertainty. However, it allows one to assign a 
degree of belief to subsets of hypotheses. Accord- 
ing to the Dempster-Shafer theory, the feature of 
multi-valued mapping is the fundamental reason 
for the inability of applying the well-known theo- 
rem of probability that determines the probability 
density of the image of one-to-one mapping (Co- 
hen, 1983). In this context, the lower probability 
is associated with the degree of belief and the 
upper probability with a degree of plausibility. 
This formalism defines certainty as a function that
maps subsets of a proposition space on the [0,1]
scale. The sets of partial beliefs are represented
by mass distributions of a unit of belief across
the space of propositions. These distributions are
called the basic probability assignment (BPA). The total
certainty over the space is 1. A non-zero BPA can
be given to the entire proposition space to represent
the degree of ignorance. The certainty of
any proposition is then represented by the interval
characterized by upper and lower probabilities.
Dempster's rule of combination normalizes the
intersection of the bodies of evidence from the
two sources by the amount of non-conflictive
evidence between the sources.
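
The quantities described in this item can be sketched in a few lines. The code below is a minimal illustration, assuming a BPA is stored as a dictionary from frozenset (a subset of the proposition space) to mass; the names dempster_combine, belief and plausibility are hypothetical.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two BPAs (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + x * y
        else:
            conflict += x * y          # mass falling on the empty set
    agreement = 1.0 - conflict         # amount of non-conflicting evidence
    if agreement <= 1e-12:
        raise ValueError("the two sources are in total conflict")
    return {s: v / agreement for s, v in combined.items()}

def belief(m, s):
    """Lower probability: total mass of the focal elements contained in s."""
    return sum(v for a, v in m.items() if a <= s)

def plausibility(m, s):
    """Upper probability: total mass of the focal elements that intersect s."""
    return sum(v for a, v in m.items() if a & s)
```

Mass assigned to the whole proposition space, for example m[frozenset({"a", "b"})] = 0.4, expresses ignorance: it raises the plausibility of every proposition without raising the belief in any single one.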

This theory is attractive for several reasons. First, 
it builds on classical probability theory, thus inheriting 
much of its theoretical foundations. Second, it seems 
not to over-commit by not forcing precise statements 
of probabilities: its probabilities do not seem to pro- 
vide more information than is really available. Third, 
it reflects the degree of ignorance of the probability 
estimate. Fourth, the Dempster-Shafer theory provides 
rules for combining probabilities and thus for propa- 
gating measures through the system. This also is one 
of the most controversial points since the propagation 
method is an extension of the multiplication rule for 
independent events. Because many applications involve 
dependent events, the rule might be inapplicable by 
classical statistical criteria. The tendency to assume 
that events are independent unless proven otherwise 
has stimulated a large proportion of the criticism of 
probability approaches. Dempster-Shafer theory suffers from
the same problem (Bhatnagar and Kanal, 1986).

In addition, there are two problems with the
Dempster-Shafer approach. The first problem is
computational complexity. In the general case, the evaluation
of the degree of belief and upper probability requires
exponential time in the cardinality of the hypothesis set.
This complexity is caused by the need for enumerating
all the subsets of a given set. The second problem in
this approach results from the normalization process
presented in both Dempster's and Shafer's work. Zadeh
has argued that this normalization process can lead to
incorrect and counter-intuitive results (Zadeh, 1984).
By removing the conflicting parts of the evidence and
normalizing the remaining parts, important information
may be discarded rather than utilized adequately.
Dubois and Prade (1985) have also shown that the
normalization process in the rule of evidence combination
creates a sensitivity problem, where assigning a
zero value or a very small value to a basic probability
assignment causes very different results.
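
The normalization objection is usually illustrated with two sources that almost completely disagree. The sketch below reproduces an example of that kind under simplifying assumptions: every focal element is a single hypothesis, the hypothesis names and masses are illustrative, and combine_singletons is a hypothetical helper, not part of any library.

```python
def combine_singletons(m1, m2):
    """Dempster's rule restricted to BPAs whose focal elements are single hypotheses.

    Only hypotheses supported by both sources keep any mass; everything else
    is conflict and is removed by the normalization step.
    """
    agreement = {h: m1[h] * m2.get(h, 0.0) for h in m1}
    total = sum(agreement.values())    # non-conflicting mass
    if total == 0:
        raise ValueError("the two sources are in total conflict")
    return {h: v / total for h, v in agreement.items()}

# Two sources that each consider "tumour" very unlikely.
source_1 = {"meningitis": 0.99, "tumour": 0.01}
source_2 = {"concussion": 0.99, "tumour": 0.01}
combined = combine_singletons(source_1, source_2)
```

After combination, all of the belief is committed to "tumour", the hypothesis both sources rated at 0.01, because the conflicting 99.99% of the mass has been discarded. Changing either 0.01 to exactly 0 makes the combination undefined, which is the kind of sensitivity described above.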

Based on Dempster-Shafer theory, Garvey et al.
(1982) proposed an approach called Evidential Reasoning
that adopts the evidential interpretation of the degree
of belief and upper probabilities. This approach defines 
the likelihood of a proposition as a subinterval of the 







Managing Uncertainties in Interactive Systems 



unit interval [0,1]. The lower bound of this interval is 
the degree of support of the proposition and the upper 
bound is its degree of plausibility. When distinct bod- 
ies of evidence must be pooled, this approach uses the 
same Dempster-Shafer techniques, requiring the same 
normalization process that was criticized by Zadeh 
(Zadeh, 1984). 

3. Fuzzy Sets and Possibility Theory. The theory of 
possibility was proposed independently by Zadeh, 
as a development of fuzzy set theory, in order 
to handle vagueness inherent in some linguistic 
terms (Zadeh, 1978). For a given set of hypoth- 
eses, a possibility distribution may be defined in 
a way that is very similar to that of a probability 
distribution. However, there is a qualitative dif- 
ference between the probability and possibility 
of an event. The difference is that a high degree 
of possibility does not imply a high degree of 
probability, nor does a low degree of probability 
imply a low degree of possibility. However, an 
impossible event must also be improbable. More 
formally, Zadeh defined the concept of a possibil- 
ity distribution. 

The concept of possibility theory has been built upon 
fuzzy set theory and is well suited for representing the 
imprecision of vague linguistic predicates. The vague 
predicate induces a fuzzy set and the corresponding 
possibility distribution. From a semantic point of 
view, the values restricted by a possibility distribution 
are more or less all the eligible values for a linguistic 
variable. This theory is completely feasible for every 
element of the universe of discourse. 
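
The qualitative difference between possibility and probability can be seen in a small numerical sketch. The distributions below are illustrative numbers chosen for this example (how many eggs a person might eat for breakfast), and the function names possibility and necessity are hypothetical; necessity is the standard dual measure, defined as one minus the possibility of the complementary event.

```python
def possibility(dist, event):
    """Possibility of an event: the largest degree of possibility among its elements."""
    return max((dist.get(x, 0.0) for x in event), default=0.0)

def necessity(dist, event):
    """Necessity of an event: one minus the possibility of its complement."""
    complement = set(dist) - set(event)
    return 1.0 - possibility(dist, complement)

# Illustrative values only: eating up to four eggs is fully possible,
# yet the person almost always eats two.
eggs_possibility = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 0.8, 6: 0.6, 7: 0.4, 8: 0.2}
eggs_probability = {1: 0.1, 2: 0.8, 3: 0.1, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0}
```

Here eating four eggs has possibility 1.0 but probability 0.0, so a high degree of possibility does not imply a high degree of probability. Unlike probabilities, the possibility values are combined with max rather than addition and need not sum to 1.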

4. Theory of Endorsement. A different approach to 
uncertainty representation was proposed by Cohen 
(Cohen, 1983), which is based on a qualitative 
theory of "endorsement." According to Cohen, the 
records of the factors relating to one's certainty 
are called endorsements. Cohen's model of en- 
dorsement is based on the explicit recording of the 
justifications for a statement, normally requiring 
a complex data structure of information about the 
source. Therefore, this approach maintains the 
uncertainty. The justification is classified accord- 
ing to the type of evidence for a proposition, the 
possible actions required to solve the uncertainty 
of that evidence, and other related features. 



Endorsements can provide a good mechanism for 
the explanations of reasoning, since they create and 
maintain the entire history of justifications (i.e., reasons
for believing or disbelieving a proposition) and the 
relevance of any proposition with respect to a given 
goal. Endorsements are divided into five classes: rules, 
data, task, conclusion, and resolution. Cohen points 
out that the main difference between the numerical 
approaches and the endorsement-based approach, 
specifically with respect to chains of inferences, is that 
reasoning in the former approach is entirely automatic 
and non-reflective, while the latter approach provides 
more information for reasoning about uncertainty. 
Consequently, reasoning in the latter approach can be 
controlled and determined by the quality and avail- 
ability of evidence. 

Endorsements provide the information necessary to 
many aspects of reasoning about uncertainty. Endorse- 
ments are used to schedule sure tasks before unsure 
ones, to screen tasks before activating them, to deter- 
mine whether a proposition is certain enough for some 
purpose, and to suggest new tasks when old ones fail 
to cope with uncertainty. Endorsements distinguish dif- 
ferent kinds of uncertainty, and tailor reasoning to what 
is known about uncertainty. However, Bonissone and 
Tong (1995) argue that combinations of endorsements 
in a premise (i.e., proposition), propagation of endorse- 
ments to a conclusion, and ranking of endorsements 
must be explicitly specified for each particular context. 
This creates potential combinatorial problems. 

5. Assumption-based reasoning and non-monotonic
logic. In the reasoned-assumptions approach
proposed by Doyle (1979), the uncertainty embedded
in an implication rule is removed by listing all the
exceptions to that rule. When this is not possible,
assumptions are used to show the typicality of a
value (i.e., default values) and defeasibility of a
rule (i.e., liability to defeat of reason). In classical
logic, if a proposition C can be derived from
a set of propositions S, and if S is a subset of T,
then C can also be derived from T. As a system's
premises increase, its possible conclusions at
least remain constant and more likely increase.
Deductive systems with this property are called
monotonic. This kind of logic lacks tools for
describing how to revise a formal theory to deal
with inconsistencies caused by new information.
McDermott and Doyle proposed a non-monotonic
logic to cope with this problem (McDermott and
Doyle, 1980).
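
The contrast with monotonic deduction can be shown with a toy example. The sketch below hard-codes a single default rule (a bird is assumed to fly unless it is known to be a penguin); the rule, the fact names and the function conclusions are hypothetical and are meant only to show a conclusion being withdrawn when a premise is added.

```python
def conclusions(facts):
    """Close a set of facts under one default rule.

    Default: anything known to be a bird is assumed to fly,
    unless it is also known to be a penguin.
    """
    derived = set(facts)
    if "bird" in derived and "penguin" not in derived:
        derived.add("flies")           # default assumption, retractable
    return derived

before = conclusions({"bird"})              # contains "flies"
after = conclusions({"bird", "penguin"})    # "flies" is no longer derived
```

In a monotonic system the premises {"bird"} are a subset of {"bird", "penguin"}, so every conclusion in before would also have to appear in after. Here "flies" disappears once "penguin" is learned, which is the behaviour a non-monotonic logic is designed to describe.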

When an assumption used in the deductive pro- 
cess is found to be false, non-monotonic mechanisms 
must be used to keep the integrity of the statements 
(Doyle, 1979). However, this approach lacks facilities 
for computing degrees of belief. Bonissone and Tong 
(1995) suggest that assumption-based systems can 
cope with cases of incomplete information, but they are 
inadequate in handling the imprecise information. In 
particular, they cannot integrate probabilistic measures 
with reasoned assumptions. Furthermore, such systems 
rely on the precision of the defaulted values. On the 
other hand, when specific information is missing, the 
system should be able to use analogous or relevant 
information inherited from some higher-level concept. 
This surrogate for the missing information is generally 
fuzzy or imprecise and provides limited constraints on 
the value of the missing information. 

In the inference system employing non-monotonic 
logic, assumptions are made that may have to be revised 
in the light of new information. They have the property 
that at any given inference stage, more than one mutually 
consistent set of conclusions can be derived from the 
available data and possible assumptions. Such conclu- 
sions may be invalidated as new data is considered to 
be incompatible with some default assumptions. The 
inference system requires that justifications for any 
conclusion are recorded during the inference process 
and used for dependency-directed backtracking during 
the revision of beliefs. This is implemented by the Truth
Maintenance System (TMS) (Doyle, 1979).

The weakness of standard non-monotonic logic is that
the only message conveyed
by a contradiction is that a piece of information
previously believed true is actually false (for the time
being). However, the real contents of the inconsistency
being discovered may not be as reliable as was assumed,
or it may be that a subject is not in a well-ordered state,
or a mixture of both (Bhatnagar and Kanal, 1986). In
addition, since the TMS examines the new information
one piece at a time, it lacks the ability to detect
noise input that should be ignored. This weakness is
crucial to the task of pattern recognition (Chen and
Norcio, 1997).



CONCLUSION 

This chapter analyzes approaches to handling uncertainties in
an adaptive human-computer interface. Each approach
can deal effectively with only a particular type of
uncertainty problem. The interface system needs a more
comprehensive approach to uncertainty management
because of the various sources of uncertainty in
human-machine dialog. In particular, since human-machine
dialog tends to be context-dependent, the management
of uncertainty must provide a pattern-formatted view
for user modeling. In other words, a user modeling
system must examine the user input based on the context
of the dialog to obtain a complete and consistent user
profile. Some non-traditional approaches have been
proposed to handle uncertainties in interactive systems,
such as neural networks and genetic algorithms (Chen
and Norcio, 1997), because they have a strong ability for
pattern recognition and classification. However, the
conversion between non-numerical user input and
numerical input for neural network processing still
involves a great deal of uncertainty.

KEY TERMS

Bayesian Theory: Also known as Bayes' rule or 
Bayes' law. It is a result in probability theory, which 
relates the conditional and marginal probability distri- 
butions of random variables. In some interpretations of 
probability, Bayes' theory tells how to update or revise
beliefs in light of new evidence.

Default Reasoning: A non-monotonic logic
proposed by Raymond Reiter to formalize reasoning with
default assumptions. Default reasoning can express
facts like "by default, something is true"; by contrast,
standard logic can only express that something is true
or that something is false. This is a problem because
reasoning often involves facts that are true in the
majority of cases but not always. A classical example
is: "birds typically fly". This rule can be expressed in
standard logic either by "all birds fly", which is
inconsistent with the fact that penguins do not fly, or by
"all birds that are not penguins and not ostriches and
... fly", which requires all exceptions to the rule to be
specified. Default logic aims at formalizing inference
rules like this one without explicitly mentioning all
their exceptions.

Non-Monotonic Logic: A formal logic whose 
consequence relation is not monotonic. Most studied 
formal logics have a monotonic consequence relation, 
meaning that adding a formula to a theory never pro- 
duces a reduction of its set of consequences. Intuitively, 
monotonicity indicates that learning a new piece of 
knowledge cannot reduce the set of what is known. A 
monotonic logic cannot handle various reasoning tasks 
such as reasoning by default. 

Possibility Theory: A mathematical theory for dealing
with certain types of uncertainty; it is an alternative
to probability theory. Professor Lotfi Zadeh first
introduced possibility theory in 1978 as an extension
of his theory of fuzzy sets and fuzzy logic.

Shafer-Dempster's Evidence Theory: A mathematical
theory of evidence based on belief functions
and plausible reasoning, which is used to combine
separate pieces of information (evidence) to calculate
the probability of an event. The theory was developed
by Arthur P. Dempster and Glenn Shafer.



Theory of Endorsement: An approach to repre- 
sent uncertainty proposed by Cohen, which is based 
on a qualitative theory of "endorsement." According 
to Cohen, the records of the factors relating to one's 
certainty are called endorsements. Cohen's model of 
endorsement is based on the explicit recording of the 
justifications for a statement, normally requiring a 
complex data structure of information about the source. 
Therefore, this approach maintains the uncertainty. 
The justification is classified according to the type of 
evidence for a proposition, the possible actions required 
to solve the uncertainty of that evidence, and other 
related features. 

Truth Maintenance System: A knowledge rep- 
resentation method for representing both beliefs and 
their dependencies. The name truth maintenance is 
due to the ability of these systems to restore consis- 
tency. There are two major truth maintenance systems: 
single-context and multi-context truth maintenance. 
In single context systems, consistency is maintained 
among all facts in memory (database). Multi-context 
systems allow consistency to be relevant to a subset 
of facts in memory (a context) according to the history 
of logical inference. This is achieved by tagging each 
fact or deduction with its logical history. Multi-agent 
truth maintenance systems perform truth maintenance 
across multiple memories, often located on different 
machines.

INTRODUCTION

Many-objective evolutionary optimisation is a recent 
research area that is concerned with the optimisation of 
problems consisting of a large number of performance 
criteria using evolutionary algorithms. Despite the 
tremendous development that multi-objective evolu- 
tionary algorithms (MOEAs) have undergone over the 
last decade, studies addressing problems consisting of a 
large number of objectives are still rare. The main reason
is that these problems cause additional challenges with 
respect to low-dimensional ones. This chapter gives a 
detailed analysis of these challenges, provides a critical 
review of the traditional remedies and methods for the 
evolutionary optimisation of many-objective problems 
and presents the latest advances in this field. 



BACKGROUND 

There has been considerable recent interest in the
optimisation of problems consisting of more than three
performance criteria, a realm that was termed
many-objective optimisation by Farina and Amato (Farina,
& Amato, 2002). To date, the vast majority of the
literature has focused on two- and three-dimensional
problems (Deb, 2001). However, in recent years, the
incorporation of multiple indicators into the problem
formulation has clearly emerged as a prerequisite for
a sound approach in many engineering applications
(Coello Coello, Van Veldhuizen, & Lamont, 2002).
Despite the tremendous development that MOEAs
have undergone over the last decade, and their ample
success in disparate applications, studies addressing
high-dimensional real-life problems are still rare (Coello
Coello, & Aguirre, 2002). The main reason is that
many-objective problems cause additional challenges
with respect to low-dimensional ones:

If the dimensionality of the objective space increases, 
then in general, the dimensionality of the Pareto-optimal 
front also increases. 

The number of points required to characterise the 
Pareto-optimal front increases exponentially with the 
number of objectives considered. 

It is clear that these two features represent a
hindrance for most of the population-based methods,
including MOEAs. In fact, in order to provide a good
approximation of a high-dimensional Pareto-optimal
front, this class of algorithms must evolve populations
of solutions of considerable size. This has a profound
impact on their performance, since evaluating each
individual solution may be a time-consuming task. Using
smaller populations would not be a viable option, at least
for Pareto-based algorithms, given the progressive loss
of selective pressure they experience as the number of
objectives increases, with a consequent deterioration of
performance, as is theoretically shown in (Farina,
& Amato, 2004) and empirically evidenced in (Deb,
2001, pages 404-405). In contrast to Pareto-based
methods, traditional multi-objective optimisation
approaches, which work by reducing the multi-objective
problem into a series of parameterised single-objective
ones that are solved in succession, are not affected by
the curse of dimensionality. However, such strategies
cause each optimisation to be executed independently of
the others, thereby losing the implicit parallelism of
population-based multi-objective algorithms.
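
The loss of selective pressure can be observed with a simple experiment: in a population of random objective vectors, the share of mutually non-dominated points grows quickly with the number of objectives, so Pareto ranking discriminates less and less. The sketch below assumes minimisation and uniformly random objective values; the function names and the population size used in the usage note are hypothetical.

```python
import random

def dominates(a, b):
    """True if a Pareto-dominates b (minimisation): no worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_fraction(n_points, n_objectives, seed=0):
    """Fraction of a random population that no other member dominates."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(n_objectives)) for _ in range(n_points)]
    nondominated = sum(1 for p in pts if not any(dominates(q, p) for q in pts))
    return nondominated / n_points
```

Calling nondominated_fraction(200, 2) and nondominated_fraction(200, 10) shows the trend: with two objectives only a small part of the population is non-dominated, whereas with ten objectives most of it is, leaving a Pareto-based selection operator with little to choose between.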

The remainder of this chapter provides a detailed
review of the methods proposed to address the first two
issues affecting many-objective evolutionary optimisation
and discusses the latest advances in the field.



REMEDIAL MEASURES: 
STATE-OF-THE-ART 

The possible remedies that have been proposed to ad- 
dress the issues arising in evolutionary many-objective 
optimisation can be broadly classified as follows: 

aggregation, goals and priorities 
conditions of optimality 
dimensionality reduction 

In the next sub-sections we give an overview of 
each of these methods and review the approaches that 
have been so far proposed. 

Aggregation, Goals and Priorities 

This class of methods tries to overcome the
difficulties described in the previous section through the
decomposition of the original problem into a series of
parameterised single-objective ones, which can then be
solved by any classical or evolutionary algorithm.

Many aggregation-based methods have been pre- 
sented so far and they are usually based on modifications 
of the weighted sum approach, such as the augmented 
Tchebycheff function, that are able to identify exposed 
solutions, and explore non-convex regions of the 
trade-off surface. However, the problem of selecting 
an effective strategy to vary weights or goals so that 
a representative approximation of the trade-off curve 
can be achieved is still unresolved. 
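
The difference between a plain weighted sum and an augmented Tchebycheff function can be checked on a three-point trade-off set. The sketch below assumes minimisation and the usual textbook form of the augmented Tchebycheff function (a weighted maximum deviation from a reference point plus a small multiple of the summed deviations); the function names, the reference point z_star, the value of rho and the sample points are all illustrative.

```python
def weighted_sum(f, w):
    """Classical weighted-sum scalarisation of an objective vector f."""
    return sum(wi * fi for wi, fi in zip(w, f))

def augmented_tchebycheff(f, w, z_star, rho=1e-3):
    """Weighted max deviation from the reference point z_star, plus a small sum term."""
    gaps = [abs(fi - zi) for fi, zi in zip(f, z_star)]
    return max(wi * g for wi, g in zip(w, gaps)) + rho * sum(gaps)

# Three mutually non-dominated points; (0.6, 0.6) lies in a non-convex region.
front = [(0.0, 1.0), (1.0, 0.0), (0.6, 0.6)]
weights = [(k / 10, 1 - k / 10) for k in range(11)]

ws_winners = {min(front, key=lambda f: weighted_sum(f, w)) for w in weights}
tch_winner = min(front, key=lambda f: augmented_tchebycheff(f, (0.5, 0.5), (0.0, 0.0)))
```

No weight vector in the sweep makes the weighted sum select (0.6, 0.6), while the augmented Tchebycheff function with equal weights does. The sweep over weights also hints at the unresolved issue mentioned above: how the weights should be varied to obtain a representative set of solutions is left to the user.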

The ε-constraint approach (Chankong, & Haimes,
1983), which is based on minimisation of one (the
most preferred or primary) objective function while
considering the other objectives as constraints bound
by some allowable levels, was also used in the context
of evolutionary computing. The main limitation of this
approach is its computational cost and the lack of an
effective strategy to vary bound levels (ε). Recently,
Laumanns et al. (Laumanns, Thiele, & Zitzler, 2006)
proposed a variant of the original approach where they
developed a variation scheme based on the concept
of ε-Pareto dominance (efficiency) (White, 1986) that
adaptively generates constraint values, thus enabling
the exhaustive exploration of the Pareto front, provided
the scheme is coupled with an exact single-objective
optimiser. It must be pointed out, however, that none of
the methods described above has ever been thoroughly
tested in the context of many-objective optimisation.
The Multiple Single Objective Pareto Sampling
(MSOPS 1 & 2), an interesting hybridisation of the
aggregation method with goal specification, was presented
in (Hughes, 2003; Hughes, 2005). In MSOPS, the
selective pressure is not provided by Pareto ranking.
Instead, a set of user-defined target vectors is used in
turn, in conjunction with an aggregation method, to
evaluate the performance of each solution at every
generation of a MOEA. The greater the number
of targets that a solution nears, the better its rank.
The authors suggested two aggregation methods: the
weighted min-max approach (implemented in MSOPS)
and the Vector-Angle-Distance-Scaling (implemented
in MSOPS 2). The results indicated with statistical
significance that NSGA-II (Deb, Pratap, Agarwal, &
Meyarivan, 2002), the Pareto-based MOEA used for
comparative purposes, was outperformed on
many-objective problems. This was also recently confirmed
by Wagner et al. in (Wagner, Beume, & Naujoks, 2007),
where they benchmarked traditional MOEAs,
aggregation-based methods and indicator-based methods on
problems with up to six objectives and suggested a more
effective method to generate the target vectors.
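
The constraint-based approach described at the start of this passage (one primary objective is minimised while the others are bounded by allowable levels) can be sketched on a finite set of candidate solutions. The function epsilon_constraint, the candidate points and the bound values are hypothetical; a real application would replace the exhaustive search by a single-objective optimiser.

```python
def epsilon_constraint(points, primary, eps):
    """Minimise objective `primary` subject to f_j <= eps[j] for every other bounded j.

    points:  iterable of objective vectors (tuples), all objectives minimised.
    primary: index of the objective to minimise.
    eps:     dict mapping objective index -> allowable level.
    Returns the best feasible point, or None if no point satisfies the bounds.
    """
    feasible = [p for p in points
                if all(p[j] <= level for j, level in eps.items() if j != primary)]
    return min(feasible, key=lambda p: p[primary]) if feasible else None

candidates = [(0.0, 1.0), (1.0, 0.0), (0.6, 0.6)]
tight = epsilon_constraint(candidates, primary=0, eps={1: 0.7})   # second objective <= 0.7
loose = epsilon_constraint(candidates, primary=0, eps={1: 1.0})   # second objective <= 1.0
```

Varying the bound moves the solution along the trade-off surface, here from (0.0, 1.0) to (0.6, 0.6), including points in non-convex regions. Each bound level requires a separate optimisation run, and with many objectives the number of bound combinations to try grows quickly, which reflects the cost issue noted in the text.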

Conditions of Optimality 

Recently, great attention has been given to the role
that conditions of optimality may play in the context
of many-objective evolutionary optimisation when
used to rank trial solutions during the selection stage
of a MOEA as an alternative to, or in conjunction with,
Pareto efficiency. Farina et al. (Farina, & Amato, 2004)
proposed the use of a fuzzy optimality condition, but
did not provide a direct means to incorporate it into a
MOEA. Koppen et al. (Koppen, Vincente-Garcia, &
Nickolay, 2005) also suggested the fuzzification of the
Pareto dominance relation, which was exploited within
a generational elitist genetic algorithm on a synthetic
MOP. The concept of knee (Deb, 2003) has also been
exploited in the context of evolutionary many-objective
optimisation. Simply stated, a knee is a portion of
a Pareto surface where the marginal substitution rates
are particularly high, i.e. a small improvement in one
objective leads to a high deterioration of the others. A
graphical representation is given in Figure 1. The idea
is that, with no prior information about the preference
structures of the decision maker (DM), the knee is likely
to be the most interesting area. Branke et al. (Branke, Deb, Dierolf,
& Osswald, 2004) developed two methodologies to
detect solutions lying on knees, and incorporated
them into the second stage (crowding measure) of the
ranking procedure of NSGA-II. The first methodology
consists in evaluating, for each individual in a population,
the angle between itself and some neighbouring
solutions and using this value to favour solutions with
higher angles, i.e. closer to the knee. The methodology,
however, scales poorly with the number of objectives.
The second strategy resorts to the expected marginal
utility function to detect solutions close to the knee.
This approach extends easily with the number of
objectives; however, the sampling necessary to evaluate
the expectation of the marginal utility function with a
certain confidence may become expensive. Neither of
these approaches has been tested on many-objective
problems.

The concept of approximate optimal solutions has
also been investigated to some extent in the context
of evolutionary many-objective optimisation. In
particular, ε-efficiency was considered to be potentially
effective to ease some of the difficulties associated
with many-objective problems. A recent study by
Wagner et al. (Wagner, Beume, & Naujoks, 2007)
showed the excellent performance of ε-MOEA (Deb,
Mohan, & Mishra, 2003) on a 6-objective instance
of two synthetic test functions. A good review of the
application of approximate conditions of optimality
is given in (Burke, & Landa Silva, 2006), where the
authors also compared the effect of using two relaxed
forms of Pareto dominance as evaluation methods
within two MOEAs.
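
The chapter does not spell out the relaxed dominance relation behind ε-efficiency. As a minimal sketch, assuming the common additive form and that all objectives are minimised, a point a ε-dominates a point b when a is no worse than b by more than ε in every objective. The function name is hypothetical.

```python
from typing import Sequence


def epsilon_dominates(a: Sequence[float], b: Sequence[float], eps: float) -> bool:
    """True if a additively eps-dominates b (all objectives minimised).

    Assumption: the additive form a_i - eps <= b_i for every objective i.
    With eps = 0 this reduces to weak Pareto dominance.
    """
    return all(ai - eps <= bi for ai, bi in zip(a, b))


# (1.0, 2.05) is slightly worse than (1.0, 2.0) in the second objective,
# so it only dominates once a tolerance is allowed.
relaxed = epsilon_dominates((1.0, 2.05), (1.0, 2.0), 0.1)
strict = epsilon_dominates((1.0, 2.05), (1.0, 2.0), 0.0)
```

Larger values of ε make more solutions mutually dominating, which coarsens the approximation of the Pareto front and so reduces the number of solutions that have to be retained.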

Recently, di Pierro et al. (di Pierro, Khu, & Savic,
2007) proposed a ranking scheme based on Preference
Ordering (Das, 1999), a condition of optimality that
generalises Pareto efficiency but is more stringent,
and tested it using NSGA-II as the optimisation shell
on a suite of seven benchmark problems with up to
eight objectives. Results indicated that the proposed
methodology significantly enhanced the convergence
properties of the standard NSGA-II algorithm on all
the test problems. The strengths of this approach are
its absence of parameters to tune and the fact that it
showed very good performance across varying problem
features; its drawbacks are its computational runtime and
the fact that its combination with diversity preserving
mechanisms that favour extreme solutions may
generate too high a selective pressure.

Figure 1. Simple Pareto front with a knee

In (Sato, Aguirre, & Tanaka, 2007), Sato et al. in- 
troduced an approach to modify the Pareto dominance 
condition used within the selection stage of Pareto- 
based MOEAs, by which the area dominated by any 
given point is contracted or expanded according to a 
formula derived from the Sine theorem and the extent 
of the contraction/expansion is controlled by a constant 
factor. The results of a series of experiments performed 
using NSGA-II equipped with the contraction/expan- 
sion mechanism on 0/1 multi-objective knapsack 
problems showed substantially improved convergence 
and diversity performances compared to the standard 
NSGA-II algorithm. However, it was also shown that 
the optimal value for the contraction/expansion factor 
depends strongly on various problem features, and no 
indication was given to support a correct choice. 

Most modern MOEAs rely on a two-stage ranking 
during the selection process. At the first stage the ranks 
are assigned according to some form of Pareto-based 
dominance relation; if ties exist, these are resolved 
resorting to mechanisms that favour a good distribution 
along the Pareto front of the solutions examined. It has 
now been acknowledged that this second stage of the 
ranking procedure may in fact be detrimental in case 
of many objectives, as it was shown in (Purshouse, 
& Fleming, 2003b) for the case of NSGA-II. Recent 
efforts have therefore focused on replacing diver- 
sity preserving mechanisms at this second stage with 
more effective ones. Koppen and Yoshida (Koppen, 
& Yoshida, 2007) proposed four secondary ranking 
assignments and tested them by replacing the crowd- 
ing distance assignment within NSGA-II. The results 
indicated improved convergence in all cases compared 
to the standard NSGA-II. However, the authors did 
not report any result on the diversity performance of 
the algorithms. 

Dimensionality Reduction Methods 

The aim of this class of methods is usually to transform 
the objective space into a lower dimension represen- 
tation, either one-off (prior to the optimisation) or 
iteratively (as the search progresses). 

Deb and Saxena (Deb, & Saxena, 2006) developed
a procedure based on principal component analysis
(PCA) for reducing the dimension of the problem to
solve. The procedure consists in performing a series
of optimisations using a state-of-the-art MOEA, each
one focusing only on the objectives that PCA found
to explain most of the variance on the basis of the Pareto
front obtained with the previous optimisation. Recently,
Saxena and Deb (Saxena, & Deb, 2007) extended their
work and replaced PCA with two dimensionality
reduction techniques, correntropy PCA and a modified
maximum variance unfolding, which could also detect
non-linear interactions in the objective space. The
results indicated that the former method suffered to some
extent from a difficult choice of the best kernel function
to use, whereas for the latter, the authors performed
a significant number of experiments to suggest bound
values of the only free parameter of the procedure. It
must be highlighted that these two studies are the only
efforts that have challenged new algorithms on
high-dimensional test problems (up to 50 objectives).

In a recent study Brockhoff and Zitzler (Brockhoff, 
& Zitzler, 2006b) introduced the minimum objective 
subset problem (MOSS), which is concerned with the 
identification of the largest set of objectives that can be 
removed without altering the dominance structure of the 
problem (i.e. the set of Pareto optimal solutions obtained 
considering all the objectives or only the MOSS is the 
same), and developed an exact algorithm and a greedy 
heuristic to solve it. Subsequently (Brockhoff, & Zitzler, 
2006a), they proposed a measure of variation for the 
dominance structure and extended the MOSS to allow 
for dimensionality reductions involving predefined 
thresholds of problem structure changes. However, 
they did not propose a mechanism to incorporate these 
algorithms within a MOEA. 
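
The condition in parentheses above (the Pareto optimal set is the same with all objectives as with the reduced subset) can be checked directly on a finite sample of solutions. The sketch below is only such a brute-force check, assuming minimisation; it is not the exact or greedy algorithm of Brockhoff and Zitzler, and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set


def dominates(a: Sequence[float], b: Sequence[float], objs: Iterable[int]) -> bool:
    """Pareto dominance restricted to the objective indices in objs (minimisation)."""
    objs = list(objs)
    return (all(a[k] <= b[k] for k in objs)
            and any(a[k] < b[k] for k in objs))


def pareto_set(points: List[Sequence[float]], objs: Iterable[int]) -> Set[int]:
    """Indices of the points not dominated by any other point."""
    objs = list(objs)
    return {i for i, p in enumerate(points)
            if not any(dominates(q, p, objs) for q in points)}


def preserves_pareto_set(points: List[Sequence[float]], subset: Iterable[int]) -> bool:
    """True if keeping only the objectives in subset leaves the Pareto set unchanged."""
    all_objs = range(len(points[0]))
    return pareto_set(points, all_objs) == pareto_set(points, subset)


# The third objective duplicates the first, so it can be dropped safely,
# whereas dropping the second objective removes the conflict.
points = [(1, 4, 1), (2, 3, 2), (3, 2, 3), (4, 1, 4), (4, 4, 4)]
keep_first_two = preserves_pareto_set(points, [0, 1])
keep_first_and_third = preserves_pareto_set(points, [0, 2])
```

On a sampled population this check is only as reliable as the sample itself, which is one reason why a measure of how much the dominance structure changes, as in the follow-up work cited above, is useful.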

Recently, the analysis of the relationships of
interdependence between the objectives of an optimisation
problem has been successfully exploited to devise
effective reduction methods. Following the definitions
of conflict, support or harmony, and independence
proposed in (Carlsson, & Fuller, 1995), Purshouse and
Fleming (Purshouse, & Fleming, 2003a) discussed the
effects of these relationships in the context of
many-objective evolutionary optimisation. In a later study
Purshouse and Fleming (Purshouse, & Fleming, 2003c)
also suggested, for the case of independent objectives,
a divide-and-conquer algorithm based on objective
space decomposition.



FUTURE TRENDS






As it appears from the discussion above, there is an
increasing effort to develop strategies that are able
to overcome the limitations of Pareto-based methods
when solving problems with many objectives. Although
promising results have been generally reported, most
of the approaches presented are of an empirical nature,
which makes it difficult to draw conclusions that can
be generalised.

With the exception of dimensionality reduction 
techniques, the majority of the studies presented to 
date focus on mechanisms to improve the ranking of 
the solutions in the selection process. However, the 
analysis of these mechanisms is usually undertaken in 
isolation with respect to the other components of the 
algorithms. In our view, this is an important limitation 
that next generation algorithms will have to address, in 
particular, by undertaking the analysis of these mecha- 
nisms in relation with the variation operators. 

Moreover, little attention has been paid to
characterising the solutions that a given method
(belonging to the first or second category identified in the
previous section) favours, in relation to the properties
of the problem being solved. Theoretical frameworks
are therefore needed in order to analyse existing
methods and develop more focused approaches. As
pointed out by di Pierro in (di Pierro, 2006), where he
provided a theoretical framework to analyse the effect
of the Preference Ordering-based ranking procedure in
relation to the interdependence relationships of a problem,
this approach enables predicting the effect of applying a
given methodology to a particular problem with limited
prior knowledge, which is certainly an advantage since
the goal of developing powerful algorithms is to solve
(often for the first time) real life problems.



CONCLUSIONS 

In this chapter we have provided a comprehensive review
of the state of the art of evolutionary algorithms for the
optimisation of many-objective problems, discussing
limitations and strengths of the approaches described,
and we have suggested future trends of research for a
field that is gathering increasing momentum.

KEY TERMS

Evolutionary Algorithms: Solution methods
inspired by the natural evolution process that evolve
a population of solutions to an optimisation problem
through iterative application of randomised processes
of recombination and selection, until a termination
criterion is met.

Many-Objective Problem: Problem consisting 
of more than 3-4 objectives to be concurrently maxi- 
mised/minimised. 



Pareto Front: Image of the Pareto Set onto the 
performance (objective) space. 

Pareto Optimal Solution: Solution that is not 
dominated by any other feasible solution. 

Pareto Set: Set of Pareto Optimal solutions. 

Ranking Scheme: Scheme that assigns to each 
solution of a population a score that is a measure of 
its fitness relative to the other members of the same 
population. 

Selective Pressure: The ratio between the expected
number of selections of the best solution and that of
the mean-performing one.






Mapping Ontologies by Utilising Their 
Semantic Structure 




Yi Zhao

Fernuniversitaet in Hagen, Germany

Wolfgang A. Halang

Fernuniversitaet in Hagen, Germany



INTRODUCTION 

As a key factor to enable interoperability in the Semantic
Web (Berners-Lee, Hendler & Lassila, 2001), ontologies
are developed by different organisations at a large
scale, also in overlapping areas. Therefore, ontology
mapping has come to the fore to achieve knowledge
sharing and semantic integration in an environment
where knowledge and information are represented by
different underlying ontologies.

The ontology mapping problem can be defined as 
acquiring the relationships that hold between the enti- 
ties of two ontologies. Mapping results can be used for 
various purposes such as schema/ontology integration, 
information retrieval, query mediation, or web service 
mapping. 

In this article, a method to map concepts and prop- 
erties between ontologies is presented. First, syntactic 
analysis is applied based on token strings, and then 
semantic analysis is executed according to WordNet 
(Fellbaum, 1999) and tree-like graphs representing 
the structures of ontologies. The experimental results 
exemplify that our algorithm finds mappings with high 
precision. 



BACKGROUND 

Borrowed from philosophy, ontology refers to a
systematic account of what can exist or 'be' in the world.
In the fields of artificial intelligence and knowledge
representation, ontology refers to the construction of
knowledge models that specify a set of concepts, their
attributes, and the relationships between them.
Ontologies are defined as "explicit conceptualisation(s) of a
domain" (Gruber, 1993), and are seen as a key to realise
the vision of the Semantic Web.

Ontology, as an important technique to represent
knowledge and information, allows semantics to be
incorporated into data to drastically enhance information
exchange. The Semantic Web (Berners-Lee, Hendler
& Lassila, 2001) is envisioned as a universal medium for data,
information, and knowledge exchange. It suggests
annotating web resources with machine-processable
metadata. With the rapid development of the Semantic
Web, it is likely that the number of ontologies used will
strongly increase over the next few years. By themselves,
however, ontologies do not solve any interoperability
problem. Ontology mapping (Ehrig, 2004) is,
therefore, a key to exploit semantic interoperability of
information and, thus, has been drawing great attention
in the research community during recent years. This
section introduces the basic concepts of information
integration, ontologies, and ontology mapping.

Mismatches between ontologies are mainly caused 
by independent development of ontologies in different 
organisations. They become evident when trying to 
combine ontologies which describe partially overlap- 
ping domains. The mismatches between ontologies 
can broadly be distinguished into syntactic, semantic, 
and structural heterogeneity. Syntactic heterogeneity 
denotes differences in the language primitives used 
to specify ontologies, semantic heterogeneity denotes 
differences in the way domains are conceptualised 
and modelled, while structural heterogeneity denotes 
differences in information structuring. 

A number of approaches to ontology mapping have been
proposed so far (Shvaiko, 2005, Noy, 2004, Sabou, 2006,
Su, 2006). In (Madhavan, 2001), a hybrid similarity
mapping algorithm was introduced. The proposed measure
integrates linguistic and structural schema matching
techniques. The matching is based primarily on schema
element names, not considering their properties. LOM
(Li, 2004) is a semi-automatic lexicon-based
ontology-mapping tool that supports a human mapping
engineer with a first-cut comparison of ontological terms
between the ontologies to be mapped. It decomposes
multi-word terms into their word constituents, but it does
not perform direct mapping between the words. The
procedure associates the WordNet synset index numbers
of the constituent words with the ontological term. The
two terms which have the largest number of common
synsets are recorded and presented to the user.



MAIN FOCUS OF THE CHAPTER 

Our current work tries to overcome the limitations 
mentioned above, and to improve precision of ontology 
mapping. The research goal is to develop a method and 
to evaluate results of ontology mappings. 

In this article, we present a method to map ontologies
that combines token-based syntactic analysis with
semantic analysis employing the WordNet (Fellbaum,
1999) thesaurus and tree-structured graphs. The
algorithm is outlined and expressed in pseudo-code as
listed in Figure 1. The promising results obtained from
experiments indicate that our algorithm finds mappings
with high precision.

Syntax-Level Mapping Based on 
Tokenisation 

Before employing syntactic mapping, a pre-processing
step called tokenisation is required. Here,
ontologies are represented in the language OWL-DL (endnote 1).
Therefore, all ontology terms are represented with OWL
URIs. For example, in ontology "beer", an OWL Class
'Ingredient' is described by

"[OWLClassImpl] http://www.purl.org/net/ontology/beer#Ingredient",

where "[OWLClassImpl]" implies OWL class, URL
"http://www.purl.org/net/ontology/beer" addresses
the provenance of the ontology, and 'Ingredient' is the
class name. Tokenisation should first extract the valid
ontology entities from OWL descriptions, which, in
this example, is 'Ingredient'.

Figure 1. Pseudo-code of mapping algorithm

Input: OWL O1, OWL O2, threshold sigma;
Output: similarity between O1 and O2;
Begin
  build tree-structured graphs for O1 and O2, and get their edge sets E1 and E2;
  for each child node Ci ∈ E1 do
    for each child node Cj ∈ E2 do
      tokenise Ci and Cj into token sets tci and tcj;
      if (tci unequal to tcj) then
        calculate syntactic-level similarity Sim_syn between tci and tcj;
        if (Sim_syn < sigma) then // semantic mapping
          compute semantic-level similarity of tci, tcj based on WordNet;
          if (tci and tcj have no WordNet relationship) then
            determine similarity Sim_tsg with the specific properties and
            relationships between their parent/child nodes in ts-graphs
          fi
        fi
      fi
    od
  od
end

Moreover, the labels of ontology entities (classes
and properties) are quite often defined with different
representations by different organisations. For instance,
representations may be with or without connector
symbols, with upper or lower cases, etc., which makes it
complicated to identify terms. Tokenisation
means to parse names into tokens based on specific
rules or symbols by customisable tokenisers using
punctuation, upper case, special symbols, digits, etc.
In this way, a class or property name can be tokenised
into one or several token strings. For example, the term
'Social_%26_Science' can be tokenised as 'Social',
'26', and 'Science'. Note that terms can sometimes
contain digits, such as dates, which cannot be neglected.
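
A minimal sketch of such a tokeniser is given below. The splitting rules (break at any non-alphanumeric run, then at case changes and digit runs) are assumptions chosen to reproduce the example above; the authors' customisable tokenisers are not specified in the text, and the function name is hypothetical.

```python
import re
from typing import List


def tokenise(name: str) -> List[str]:
    """Split a class or property name into token strings.

    Assumed rules: connector and special symbols separate tokens, and inside
    each fragment digit runs, acronyms and capitalised words are kept apart.
    """
    tokens: List[str] = []
    # Split at any run of characters that are neither letters nor digits.
    for fragment in re.split(r"[^A-Za-z0-9]+", name):
        # Acronym (e.g. 'OWL'), capitalised or lower-case word, or digit run.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", fragment))
    return tokens


example = tokenise("Social_%26_Science")
camel_case = tokenise("hasFirstName")
```

Digits are kept as tokens of their own, in line with the remark that they may carry information such as dates.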

For simplicity, we assume that all terms of ontology 
concepts (classes, properties) are described without 
abbreviations. The mapping process between differ- 
ent class and property names is then transformed to 
mapping between tokens. 

We first check whether the original child nodes
are equal ignoring case. If they are not, the tokens are
used instead to check whether they are equal. If not,
the similarity measure based on the edit distance is
adopted to calculate similarity. If the calculated
similarity value is above a threshold σ (for example, 0.95),
the compared nodes are considered to be similar. The
process continues to deal with the next pair of nodes
in the same way.

The edit distance formulated by Levenshtein
(Levenshtein, 1966) and the string matching method proposed
by Maedche & Staab (Maedche & Staab, 2002) are
employed here to calculate token similarity. The edit
distance is a well-established method to weigh the
difference between two strings. It measures the minimum
number of token insertions, deletions, and substitutions
required to transform one string into another using a
dynamic programming algorithm. The string matching
method is used to calculate token similarity based on
Levenshtein's edit distance:

\[ \mathrm{Sim}(X,Y) = \max\left(0,\ \frac{\min(|X|,|Y|) - \mathrm{ed}(X,Y)}{\min(|X|,|Y|)}\right) \in [0,1] \qquad (1) \]

where X, Y are token strings, |X| is the length of X,
min() and max() denote the minimum and maximum
value of two arguments, respectively, and ed() is the
edit distance.
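
Eq. (1) can be sketched in a few lines. The code below computes the edit distance over characters with the usual dynamic programming recurrence and then applies Eq. (1); the function names and the handling of empty strings (for which Eq. (1) is undefined) are assumptions made for this illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y, computed row by row."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(x: str, y: str) -> float:
    """Similarity of Eq. (1): max(0, (min(|X|,|Y|) - ed(X,Y)) / min(|X|,|Y|))."""
    shorter = min(len(x), len(y))
    if shorter == 0:
        # Assumption: two empty strings are identical, otherwise dissimilar.
        return 1.0 if x == y else 0.0
    return max(0.0, (shorter - edit_distance(x, y)) / shorter)


half = string_similarity("kitten", "sitting")  # ed = 3, min length = 6
```

Because the text first compares nodes ignoring case, a caller would probably lower-case both strings before applying the measure; that step is left out here.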

As the original ontology terms may have been
tokenised into many sub-terms, i.e., tokens, it is necessary
to separately calculate the similarity between each pair of
token strings. Assume that the first term has m tokens
and the second term has n tokens, with m > n; the
total similarity measure according to Eq. (1) is then:

\[ \mathrm{Sim}_{syn} = \omega_{orig}\,\mathrm{Sim}_{orig} + \sum_{i=1}^{m}\sum_{j=1}^{n} \omega_{ij}\,\mathrm{Sim}_{ij} \qquad (2) \]

where \(\mathrm{Sim}_{orig}\) is the similarity between the original
strings, \(\mathrm{Sim}_{ij}\) is the similarity between the i-th and
j-th tokens of the two source terms, and \(\omega_{orig}\), \(\omega_{ij}\) are the
weights for \(\mathrm{Sim}_{orig}\) and \(\mathrm{Sim}_{ij}\). The weights \(\omega_{orig}\)
and \(\omega_{ij}\) are supposed to sum to 1. Given a predefined
similarity threshold, if the acquired similarity value is
greater than or equal to the threshold, then the two tokens
are considered similar; otherwise they are not.

Semantic-Level Mapping Based on 
Ontology Structure 

Semantic heterogeneity occurs when there is a
disagreement about meaning, interpretation, or intended
use of the same or related data. Semantic relations
(Gahleitner, & Woess, 2004) are:

• different naming of the same content, i.e., synonyms,
• different abstraction levels: generic terms vs.
  more specific ones (name vs. first name and last
  name), i.e., hypernyms or hyponyms, and
• different structures about the same content (separate
  type vs. part of a type), i.e., meronyms.

In ontology mapping, WordNet is one of the most
frequently used sources of background knowledge.
Actually, it plays the role of an 'intermediate' to help
finding semantic heterogeneity. The WordNet library
can be accessed with the Java API JWNL (endnote 2). WordNet groups
English words into sets of synonyms called synsets,
provides short, general definitions, and records the
various semantic relations between these synonym sets.
Its purpose is to produce a combination of dictionary
and thesaurus that is more intuitively usable, and
to support automatic text analysis and artificial
intelligence applications. It is assumed that each sense in a
WordNet synset describes a concept. WordNet senses
are related among themselves via synonym, hyponym,
and hypernym relations. Terms lexicalising the same
concept (sense) are considered to be equivalent through
the synonym relationship, while hypernyms, hyponyms,
and meronyms are considered similar. With this rule,
the original ontology terms and their respective token
strings are first checked for having the same
part of speech, i.e., noun, verb, adjective, or adverb. The
next step is to judge whether they are synonyms,
hypernyms, hyponyms, or meronyms. Analogous to Eq.
(2), for a pair of nodes a similarity value is calculated
according to the weights of different similarity parts.
If it exceeds a given threshold, the pair is considered
to be similar.

If the above-mentioned syntactic and semantic 
WordNet mapping methods still could not find a map- 
ping between two terms, another semantic-level method 
based on tree-structured graphs is applied. 

As rendered with SWOOP (endnote 3) (a hypermedia-inspired
ontology browser and editor based on OWL ontologies,
which supports renderers to obtain class/property
hierarchy trees as well as definitions of and inferred
facts on OWL classes and properties), ontology
entities are represented by class/property hierarchy trees.
From class hierarchy trees, tree-structured graphs are
constructed. Based on the notion of structure graphs
(Lian, 2004), a tree-structured graph (ts-graph) is
defined as:

Definition 1. Given a tree of sets T, N the union of all
sets in T, and E the set of edges in T, then ts-g(T) = (N, E)
is called tree-structured graph of T,
if it holds (a, b) ∈ E if and only if a is a parent
element of b; a is called parent node, and b child
node.

In building a ts-graph, breadth-first traversal is ap- 
plied to traverse a tree hierarchy. To construct a ts-graph, 
we begin with the root node's first child node. All its 
child nodes and their relative parent nodes form edges as 
(parent node, child node) for the ts-graph. The process 
is repeated until the tree is completely traversed. 
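
The construction can be sketched as a breadth-first traversal that collects (parent node, child node) pairs. The dictionary representation of the class hierarchy, the class names and the function name below are assumptions made for illustration; for simplicity the sketch starts the traversal at the root itself.

```python
from collections import deque
from typing import Dict, List, Tuple


def ts_graph_edges(tree: Dict[str, List[str]], root: str) -> List[Tuple[str, str]]:
    """Collect the edge set E of a ts-graph by breadth-first traversal.

    Assumption: the hierarchy is given as a mapping from each class name
    to the list of its direct child classes.
    """
    edges: List[Tuple[str, str]] = []
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in tree.get(parent, []):
            edges.append((parent, child))  # (parent node, child node)
            queue.append(child)
    return edges


# Hypothetical class hierarchy, used only to show the output shape.
hierarchy = {"Thing": ["Team", "Player"], "Team": ["BaseballTeam"], "Player": []}
edge_set = ts_graph_edges(hierarchy, "Thing")
```

The resulting edge list plays the role of the edge sets E1 and E2 in the mapping algorithm.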

After the ts-graphs of the two ontologies to be matched
are built, the edge sets of both graphs are employed in
the mapping process. The relative positions of a pair of
nodes (one from each ontology) within their tree-structured
graphs determine the semantic-level mapping between
them. In Table 1 we summarise three types of
relationships between edges characterising properties, child
classes, and parent-and-child classes. By understanding
these properties, we can derive that entities having the
same properties are similar. This is not a rule always
holding true, but it is a strong indicator for similarity.

Experimental Results 

The implementation of our algorithm was written in
Java. It relies on SWOOP for parsing OWL files. All
tests were run on a standard PC (with Windows XP).
Our test cases include the ontologies "Baseball team" and
"Russia" provided by the institute AIFB (endnote 4).

Table 1. Relationships between classes

No.  Characteristic            Rules: Given two classes a and b
R1   Properties                If properties (data type property/object property ≠ null) of a and b are similar, a and b are also similar.
R2   Child classes             If all child classes of a and b are similar, a and b are also similar.
R3   Parent-and-child classes  If parent class and one of the child classes of a and b are similar, a and b are also similar.

Figure 2. Analysis of mapping results with ontologies Russia

To get an impression of the matchmaker's performance,
different measures have to be considered.
There are various ways to measure how well retrieved
information matches intended information. To evaluate
the quality of mapping results, we use standard
information-retrieval metrics: Recall (r), Precision (p), and
F-Measure (Melnik & Rahm, 2002), where Recall is
the ratio of the number of relevant entities retrieved
to the total number of relevant entities, Precision is
the ratio of the number of relevant records retrieved
to the total number of irrelevant and relevant records
retrieved, and

\[ \text{F-Measure} = \frac{2rp}{r + p} \qquad (3) \]
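
As a small worked example of these three metrics, the sketch below compares a set of mappings found by a matcher with a reference set. The mapping pairs are invented for illustration and the function name is hypothetical.

```python
from typing import Set, Tuple

Mapping = Tuple[str, str]


def precision_recall_f(found: Set[Mapping], reference: Set[Mapping]):
    """Return (precision, recall, F-Measure) for a set of retrieved mappings."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Eq. (3): harmonic mean of recall r and precision p.
    return precision, recall, 2 * recall * precision / (recall + precision)


# Invented example: 4 mappings retrieved, 3 in the reference, 2 of them correct.
found = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "y"), ("e", "v")}
p, r, f = precision_recall_f(found, reference)
```

Here the precision is 2/4, the recall is 2/3, and the F-Measure is 4/7, which shows how the F-Measure penalises the imbalance between the two.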



As shown in Figure 2, with the increase of the
threshold, the mapping precision for "Russia" gets higher. The
shortcoming of this method is its efficiency. Though
we trade off efficiency to get more effective mapping
results, our algorithm is still applicable to some offline
web applications, like information filtering (Hanani,
2001) according to users' profiles.



FUTURE TRENDS 

Currently, the results of similarity computations are
provided in the form of text documents. In order to present
mapping results more reasonably and understandably,
one objective of our future work is to treat the results of
similarity computations as an ontology. Another objective
of our future work is to address security problems
connected to ontology mapping, such as trust management
in the application of web services.



CONCLUSION 

The overall research goal presented in this article is to
develop a method for ontology mapping that combines
syntactic analysis, measuring the difference between
tokens by the edit distance, with semantic analysis
based on WordNet semantic relations and on the
similarity of the tree-structured graphs representing the
ontologies being compared. Empirically we have shown that our
synthesised mapping method works with relatively
high precision.

KEY TERMS

Ontology: As a means to conceptualise and structure
knowledge, ontologies are seen as the key to realise
the vision of the semantic web.

Ontology Mapping: Ontology mapping is required 
to achieve knowledge sharing and semantic integration 
in an environment with different underlying ontolo- 
gies. 

Precision: The ratio of the number of relevant 
records retrieved to the total number of irrelevant and 
relevant records retrieved. 

Recall: The ratio of the number of relevant entities 
retrieved to the total number of relevant entities. 

Semantic Web: Envisioned by Tim Berners-Lee,
the semantic web is a universal medium for data,
information, and knowledge exchange. It suggests
annotating web resources with machine-processable
metadata.

Similarity Measure: A method used to calculate 
the degree of similarity between mapping sources. 

Tokenisation: Tokenisation extracts the valid ontol- 
ogy entities from OWL descriptions. 

Tree-Structured Graph: A graphical structure 
to represent a tree with nodes and a hierarchy of its 
edges. 



ENDNOTES

1 http://www.w3.org/TR/owl-features/

2 http://sourceforge.net/projects/jwordnet

3 www.mindswap.org/2004/SWOOP/

4 The datasets are available from http://www.aifb.uni-karlsruhe.de/WBS/meh/mapping/.



INTRODUCTION

Models and algorithms designed to mimic the
information processing and knowledge acquisition of
the human brain are generically called artificial or formal
neural networks (ANNs), parallel distributed processing
(PDP), neuromorphic or connectionist models. The
term network is common today: computer networks
exist, communications are referred to as networking,
corporations and markets are structured in networks.
The concept of ANN was initially coined as a hopeful
vision of anticipating artificial intelligence (AI)
synthesis by emulating the biological brain.

ANNs are an alternative to symbolic programming,
aiming to implement neural-inspired concepts in
AI environments (neural computing) (Hertz, Krogh, &
Palmer, 1991), whereas cognitive systems attempt to
mimic the actual biological nervous systems
(computational neuroscience). All conceivable neuromorphic
models lie in between and are supposed to be a simplified
but meaningful representation of some reality. In order
to establish a unifying theory of neural computing and
computational neuroscience, mathematical theories
should be developed along with specific methods of
analysis (Amari, 1989) (Amit, 1990). The following
outlines a tentative mathematically closed framework
for neural modeling.



BACKGROUND 

ANNs may be regarded as dynamic systems (discrete
or continuous), whose states are the activity patterns,
and whose controls are the synaptic weights, which
control the flux of information between the processing
units (adaptive systems controlled by synaptic
matrices). ANNs are parallel in the sense that most
neurons process data at the same time. This process
can be synchronous, if the processing time of an
input neuron is the same for all units of the net, and
asynchronous otherwise. Synchronous models may be
regarded as discrete models. As biological neurons are
asynchronous, they require a continuous time treatment
by differential equations.

Alternatively, ANNs can recognize the state of 
environment and act on the environment to adapt to 
given viability constraints (cognitive systems con- 
trolled by conceptual controls). Knowledge is stored 
in conceptual controls rather than encoded in synaptic 
matrices, whereas learning rules describe the dynamics 
of conceptual controls in terms of state evolution in 
adapting to viability constraints. 

The concept of paradigm referring to ANNs typically 
comprises a description of the form and functions of 
the processing unit (neuron, node), a network topology 
that describes the pattern of weighted interconnec- 
tions among the units, and a learning rule to establish 
the values of the weights (Domany, 1988). Although 
paradigms differ in details, they still have a common 
subset of selected attributes (Jansson, 1991) like 
simple processing units, high connectivity, parallel 
processing, nonlinear transfer function, feedback paths, 
non-algorithmic data processing, self-organization, 
adaptation (learning) and fault tolerance. Some extra 
features might be: generalization, useful outputs from 
fuzzy inputs, energy saving, and potential overall high 
speed operation. 

The digital paradigm dominating computer science 
assumes that information must be digitized to avoid 
noise interference and signal degradation. In contrast, 
a neuron is highly analog in the sense that its computa- 
tions are based on spatiotemporal integrative processes 
of smoothly varying ion currents at the trigger zone 
rather than on bits. Yet neural systems are highly ef- 
ficient and reliable information processors. 

Memory and Learning 

The specificity of neural processes consists in their 
distributive and collective nature. The phenomenon
by which biological neural networks (NNs) change in
response to extrinsic stimuli is called self-organization.
The flexible nature of the human brain, represented 
by self-organization, seems to be responsible for the 
learning function which is specific to living organisms. 
Essentially, learning is an adaptive self-organizing 
process. From the training assistance point of view, 
there are supervised and unsupervised neural classifiers. 
Supervised classifiers seek to characterize predefined 
classes by defining measures that maximize in-class 
similarity and out-class dissimilarity. Supervision may 
be conducted either by direct comparison of output with 
the desired target and estimating error, or by specify- 
ing whether the output is correct or not (reinforce- 
ment learning). The measure of success in both cases 
is given by the ability to recover the original classes 
for similar but not identical input data. Unsupervised 
classifiers seek similarity measures without any pre- 
defined classes performing cluster analysis or vector 
quantization. Neural classifiers organize themselves 
according to their initial state, types and frequency of 
the presented patterns, and correlations in the input 
patterns by setting up some criteria for classification 
(Fukushima, 1975) reflecting causal mechanisms. 
There is no general agreement on the measure of their 
success since likelihood optimization always tends to 
favor single instance classes. 

Classification as performed by ANNs has essentially 
a dual interpretation reflected by machine learning 
too. It could mean either the assignment of input pat- 
terns to predefined classes, or the construction of new 
classes from a previously undifferentiated instance set 
(Stutz & Cheesman, 1994). However, the assignment 
of instances to predefined classes can produce either 
the class that best represents the input pattern as in the 
classical decision theory, or the classifier can be used 
as a content-addressable or associative memory, where 
the class representative is desired and the input pattern 
is used to determine which exemplar to produce. While 
the first task assumes that inputs were corrupted by 
some processes, the second one deals with incomplete 
input patterns when retrieval of full information is the 
goal. Most neural classifiers do not require simultane- 
ous availability of all training data and frequently yield 
error rates comparable to Bayesian methods without 
needing prior information. An efficient memory might 
store and retrieve many patterns, so its dynamics must 
allow for as many states of activity which are stable 
against small perturbations as possible. Several approaches
dealing with uncertainty such as fuzzy logic,
probabilistic, hyperplane, kernel, and exemplar-based 
classifiers can be incorporated into ANN classifiers in 
applications where only few data are available (Ng & 
Lippmann, 1991). 
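
The content-addressable behaviour described above can be illustrated with a minimal sketch. It assumes a Hopfield-style network with bipolar units and a Hebbian outer-product storage rule; this is one common choice and is not taken from the article. The function names and the two stored patterns are hypothetical.

```python
def train_hebbian(patterns):
    """Build a symmetric weight matrix from bipolar (+1/-1) patterns."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, probe, max_sweeps=10):
    """Update units one at a time until no unit changes state."""
    state = list(probe)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for j in range(n):
            total = sum(w[j][i] * state[i] for i in range(n))
            new = 1 if total >= 0 else -1
            if new != state[j]:
                state[j] = new
                changed = True
        if not changed:
            break
    return state


# Two illustrative, mutually orthogonal patterns (assumed data).
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)

# Corrupt the first stored pattern in one position and let the net relax.
probe = list(stored[0])
probe[0] = -probe[0]
restored = recall(weights, probe)
```

In this small case the corrupted probe relaxes back to the first stored pattern, and each stored pattern is itself a stable state, which is the stability against small perturbations that the text asks of an efficient memory.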

The capacity of analog neural systems to operate 
in unpredictable environments depends on their abil- 
ity to represent information in context. The context of 
a signal may be some complex collections of neural 
patterns, including those that constitute learning. The 
interplay of context and adaptation is a fundamental 
principle of the neural paradigm. As only variations and 
differences convey information, permanent change is 
a necessity for neural systems rather than a source of 
difficulty as it is for digital systems. 



MATHEMATICAL FRAMEWORK OF 
NEURONS AND ANNS MODELING 

One general framework for investigating neural systems
is mean field theory (Cooper & Scofield, 1988)
from statistical physics, which is suited to highly
interconnected systems such as cortical regions. However,
there is a big gap between the formal level of description
used for associative memory models and the complexity of
neural dynamics in biological nets. Neural modeling needs
no information concerning correlations of input data;
rather, nonlinear processing units and a sufficiently large
number of variable parameters ensure the flexibility to
adapt to any relationship between input and output data.
Models can be altered externally, by adopting a different 
axiomatic structure, and internally, by revealing new 
inside structural or functional relationships. Ranking 
several neuromorphic models is ultimately carried out 
based on some measure of performance. 

Neuron Modeling 

Central problems in any artificial system designed to
mimic NNs arise from (i) biological features to be preserved,
(ii) connectivity matrix of the processing units,
whose size increases with the square of their number,
and (iii) processing time, which has to be independent
of the network size. Biologically realistic models of
neurons might minimally include:

• Continuous-valued transfer functions (graded
response), as many neurons respond to their
input in a continuous way, though the nonlinear
relationship between the input and the output of
cells is a universal feature;

• Nonlinear summation of the inputs and significant
logical processing performed along the dendritic
tree;

• Sequences of pulses as output, rather than a simple
output level. A single state variable $y_j$ representing
the firing rate, even if continuous, ignores much
information (e.g., pulse phase) that might be encoded
in pulse sequences. However, there is no
relevant evidence that phase plays a significant
role in most neuronal circuits;

• Asynchronous updating and variable delay of
data processing, that is, the time unit elapsing
per processing step, $t \to t + 1$, is variable among
neurons;

• Variability of synaptic strengths caused by the
amount of transmitter substance released at a
chemical synapse, which may vary unpredictably.
This effect is partially modeled by stochastic
generalization of the binary neural models' dynamics.



Most neuromimetic models are based on the McCulloch
and Pitts (1943) neuron as a binary threshold unit:

$$ y_j(t+1) = \Theta\left( \sum_{i=1}^{N} w_j^i x_i(t) - \theta_j \right) \tag{1} $$



where $y_j$ represents the state of neuron $j$ (either 1 or
0) in response to input signals $\{x_i\}_i$, $\theta_j$ stands for
a certain threshold characteristic of each neuron
$j$, time $t$ is considered discrete, with one time unit
elapsing per processing step, and $\Theta$ is the unit step
(Heaviside) function:

$$ \Theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{2} $$



The weights $w_j^i$, $1 \le i \le N$, represent the strengths
of the synapses connecting neuron $i$ to neuron $j$, and
may be positive (excitatory) or negative (inhibitory).
The weighted sum

$$ \sum_{i=1}^{N} w_j^i x_i(t) $$

of the inputs presented at time $t$ to unit $j$ must reach or
exceed the threshold $\theta_j$ for the neuron $j$ to fire. Though
extremely simplified, a synchronous assembly of such
formal neurons is theoretically capable of universal
computation (Welstead, 1994) for suitably chosen
weights $\{w_j^i\}$, i.e., it may perform computations
that conventional computers do, yet not necessarily
so rapidly or conveniently.

Figure 1. Typical sigmoid transfer functions

A general expression that includes some of the above 
features derived from the digital model (1) is: 



$$ y_j = f_j\left( \sum_{i=1}^{N} w_j^i x_i - \theta_j \right) \tag{3} $$

where $y_j$ is the continuous-valued state (activation) of
unit $j$ and $f_j$ is a general transfer function. Threshold
nodes are required for universal approximation and the
activation function ought to be nonlinear with bounded
output (Fig. 1). The neurons are updated asynchronously
in random order at random times.
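
As a minimal sketch of the two neuron models numbered (1) and (3), the following Python functions evaluate a binary threshold unit and a graded unit for a single processing step. The function names are hypothetical, and the logistic sigmoid is only an assumed example of a bounded nonlinear transfer function; the article does not prescribe a particular one.

```python
import math


def heaviside(x):
    """Unit step of Eq. (2): 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def binary_neuron(weights, inputs, threshold):
    """Binary threshold unit of Eq. (1) for one processing step."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return heaviside(net - threshold)


def graded_neuron(weights, inputs, threshold, transfer=None):
    """Continuous-valued unit of Eq. (3); logistic sigmoid by default."""
    if transfer is None:
        transfer = lambda z: 1.0 / (1.0 + math.exp(-z))
    net = sum(w * x for w, x in zip(weights, inputs))
    return transfer(net - threshold)


# With weights (1, 1) and threshold 2 the binary unit computes logical AND.
and_table = [binary_neuron([1, 1], [a, b], 2) for a in (0, 1) for b in (0, 1)]
```

The AND example shows the "reach or exceed" convention at work: the unit fires only when the weighted sum equals the threshold of 2, which happens for the input (1, 1) alone.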

Mathematical Methods for ANNs 
Modeling 

Several neural models and parallel information process- 
ing systems inspired by brain mechanisms were pro- 
posed. Almost all practical applications were achieved 
by simulation on conventional digital computers (von 
Neumann), so the real parallel processing advantages 
and massive unit densities hoped for were lost. Math- 
ematical methods approaching various types of ANNs 
in a unified way and results from linear and nonlinear 
control systems used to obtain learning algorithms 
could be grouped in four categories: 

1. Tensor products and pseudo-inverses of linear 
operators, which represent the specific structural 
connectionism and provide a mathematical expla- 
nation of the Hebbian nature of many learning 
algorithms (Hebb, 1949). This is due to the fact 
that derivatives of a wide class of nonlinear maps 
defined on spaces of synaptic matrices are ten- 
sor products and because the pseudo-inverse of 
a tensor product of linear operators is the tensor 
product of their pseudo-inverses; 



2. Convex and nonsmooth analysis, which is particularly
suited to nonlinear networks in proving the convergence
of two main types of learning rules. The
first class consists of algorithms derived from
gradient methods and includes the backpropagation
update rule, whereas the second class deals
with algorithms based on Newton's method;

3. Control and viability theory (Aubin, 1991), which
deals with neural systems that learn viable solutions
as control systems satisfying given viability
(state) constraints. The purpose is to derive algorithms
of control systems emulated by ANNs with
feedback regulation. Three classes of learning
rules are envisaged: (i) external learning rules
based on gradient methods of optimization problems
involving nonsmooth functions, (ii) internal
learning rules based on the viability theory, and
(iii) uniform algorithms;

4. Probability theory and Bayesian statistics. Bayesian
statistics and neural modeling may seem
extremes of the data-modeling spectrum. ANNs 
are nonlinear parallel computational devices and 
their training by example to solve prediction and 
classification problems is quite a purpose-specific 
procedure. Contrarily, Bayesian statistics is heav- 
ily based on coherent inference and clearly defined 
axioms. Yet both approaches aim to create models 
in good accordance with the data. ANNs can be 
interpreted as more flexible versions of traditional 
regression techniques in the sense of capturing 
regularities in the data that the linear models are 
not able to handle. However, over-flexible ANNs
may discover non-existent correlations in the data.
Bayesian inference provides a means to infer how
flexible a model is warranted by the data, and it
suppresses the tendency to find spurious structure
in the data by incorporating Occam's razor,
which prefers the simpler model when competing
models produce the same result. Learning
in ANNs is interpreted as inference on the
most probable parameters for a model, given the 
training data. The search in the model space can 
also be treated as an inference problem of relative 
probability for alternative models, given the data. 
Bayesian inference for ANNs can be implemented 
numerically by deterministic methods involving 
Gaussian approximations (MacKay, 1992), or by 
Monte Carlo methods (Neal, 1996). 
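
The gradient-derived class of learning rules mentioned in the list above can be sketched with the simplest case, a delta rule for a single linear unit. This is an illustrative assumption rather than the article's own algorithm; the function name, learning rate, epoch count and toy data are hypothetical.

```python
def train_delta_rule(samples, n_inputs, rate=0.1, epochs=200):
    """Gradient descent on squared error for a single linear unit."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            err = target - y
            # Gradient of 0.5 * err**2 with respect to w[i] is -err * x[i],
            # so stepping against the gradient adds rate * err * x[i].
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
    return w


# Targets generated by the hypothetical rule y = 2*x1 - x2.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
learned = train_delta_rule(samples, n_inputs=2)
```

Because the toy targets are exactly linear in the inputs, the learned weights approach (2, -1). Backpropagation applies the same kind of gradient step layer by layer through nonlinear units.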

Let $N$ formal neurons link, directly for one-layer
networks and indirectly for multi-layered ones, an
input space $X$ of signals to an output space $Y$. The
state space of the system is the product $X \times Y$ of the
input-output pairs $(x, y)$, which are generically called
patterns or configurations in pattern recognition (PR),
data analysis, and classification problems. When $X = Y$
and the inputs of the patterns coincide with the outputs,
$(x, y)$, $x = y$, the system is called autoassociative;
if the input and output patterns are different, $(x, y)$,
$y \neq x$, then the system is heteroassociative. Among all
possible input-output patterns, a subset $K \subset X \times Y$ is
chosen as training set. Most often, the input and output
spaces are finite dimensional linear spaces, $X = \mathbb{R}^N$
and $Y = \mathbb{R}^M$, whereas the input signals may obey some
state constraints:

• Real numbers for fuzzy applications, preferably
in the intervals [0, 1] or [-1, +1];
• Binary numbers that belong to {0, 1};
• Bipolar numbers that belong to {-1, +1}.

If neurons are labeled by $j = 1, 2, \ldots, N$, then let $P(N)$,
of cardinality $2^N$, denote the family of subsets of neurons
called conjuncts (or coalitions) of neurons. Any connection
links a postsynaptic neuron $j$ to conjuncts $S \subset P(N)$
of presynaptic neurons. Each conjunct $S$ preprocesses
(or gates) the afferent signals $\{x_i\}_i$ produced by
the presynaptic neurons through a function:

$$ x \mapsto \varphi_S(x) \tag{4} $$



If conjuncts are reduced to individual neurons $S = \{i\}$,
then the role of control is played by the synaptic
matrix:

$$ W = \left( w_j^S \right)_{S \subset P(N),\; j = 1, 2, \ldots, N} \tag{5} $$



where $w_j^S$ represents the entries from $S$ to neuron $j$.
The modulus of the synaptic weight $w_j^S$ represents the
strength, and its sign gives the nature, of the connection from
conjunct $S$ to the formal neuron $j$: it is counted positively if
the synapse is excitatory, and negatively if it is inhibitory.
Accordingly, the neuron $j$ receives the signal

$$ \sum_{S \subset P(N)} w_j^S \varphi_S(x) $$

(Fig. 2), which determines its state of activity.

Figure 2. Conjuncts of neurons {2,3} and {4,5,6} gate the inputs to neuron 1

Hence the propagation rule that characterizes the network
dynamics is:

$$ x_j(t+1) = g_j\left( \{ w_j^S(t)\, \varphi_S(x(t)) \}_{S \subset P(N)} \right) \tag{6} $$

for synchronous neurons (discrete dynamical system),
and:

$$ x_j'(t) = g_j\left( \{ w_j^S(t)\, \varphi_S(x(t)) \}_{S \subset P(N)} \right) \tag{7} $$



for asynchronous neurons (continuous dynamical 
system), where in most cases: 






(8) 



Here $g_j$ integrates the afferent signals $\{w_j^S \varphi_S(x)\}_S$
sent to the neuron $j$ by the afferent neurons through
their outputs $\{x_i\}_i$, preprocessed by the conjunct $S$, and
delivered to neuron $j$ via the weight $w_j^S$. Usually, the
synaptic weights $w_j^S = 0$ when $j \in S$, whereas $w_j^S \neq 0$
when $j \in S$ is associated with autoexcitation. However,
when $w_j^S = 0$ for every conjunct $S \neq \{j\}$, the loss term
$g_j(0, 0, \ldots, w_j^{\{j\}} \varphi_{\{j\}}(x), 0, \ldots, 0)$ represents some kind of
forgetting, like decaying frequency, while the neuron
in question $j$ is not excited by the others.
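
The gating of inputs by conjuncts can be sketched as follows. The sketch assumes that each conjunct preprocesses its members' outputs by multiplying them, which is only one possible choice of gate; the weights and output values are illustrative, and all function names are hypothetical.

```python
from itertools import combinations
from math import prod


def all_conjuncts(n):
    """Enumerate every nonempty subset of the neuron labels 1..n.

    The empty set is skipped here, so 2**n - 1 conjuncts are produced.
    """
    labels = range(1, n + 1)
    for size in range(1, n + 1):
        for subset in combinations(labels, size):
            yield frozenset(subset)


def gate(conjunct, x):
    """Assumed preprocessing: multiply the outputs of the conjunct's members."""
    return prod(x[i] for i in conjunct)


def afferent_signal(weights_j, x):
    """Weighted sum of the gated signals reaching one postsynaptic neuron.

    weights_j maps a conjunct (frozenset of labels) to its weight; conjuncts
    that are absent are treated as having weight zero.
    """
    return sum(w * gate(s, x) for s, w in weights_j.items())


# Outputs of neurons 2..6, keyed by label (values are illustrative only).
x = {2: 1.0, 3: 0.5, 4: 1.0, 5: 1.0, 6: 0.0}
# Neuron 1 listens to the conjuncts {2,3} and {4,5,6}, as in Figure 2.
weights_1 = {frozenset({2, 3}): 0.8, frozenset({4, 5, 6}): -0.4}
signal_1 = afferent_signal(weights_1, x)
```

Storing only the conjuncts with nonzero weight in a dictionary avoids enumerating the whole family of subsets, whose size grows exponentially with the number of neurons.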

Several neural systems can be expressed within this
framework (Aubin, 1991).

1. Associative memories are defined by the lack of
preprocessing, that is, $\varphi_S(x) = 0$ if $|S| > 1$ or
$\varphi_S(x) = x_i$ if $S = \{i\}$, where $|S|$ stands for the
number of elements in conjunct $S$; then:

$$ y_j = \sum_{i=1}^{N} w_j^i x_i + c_j \tag{9} $$

2. Associative memories with gates are defined by
preprocessing and $g_j$ affine:

$$ y_j = \sum_{S \subset P(N)} w_j^S \varphi_S(x) + c_j \tag{10} $$




3. Boolean associative memories correspond to
$X = \mathbb{R}^N$, $Y = \mathbb{R}$, $K = \{0,1\}^N \times \{0,1\}$,
and fuzzy associative memories to
$X = \mathbb{R}^N$, $Y = \mathbb{R}$, $K = [0,1]^N \times [0,1]$,
hence:

$$ y = \sum_{S \subset P(N)} w^S \prod_{i \in S} x_i \tag{11} $$

4. Nonlinear automata are defined by various forms
of $g_j$:

(12)



When appropriate, thresholds $\theta \in Y$ may be integrated
in the processing function $g$:

$$ g(z) = h(z - \theta) \tag{13} $$

If the threshold is part of the controls to be adjusted
during training, it may be incorporated as
an entry of an extended synaptic matrix:

$$ W = \left( W_j^i \right)_{i = 0, 1, \ldots, N;\; j = 1, 2, \ldots, M} \in L(\mathbb{R} \times X, Y), \qquad
W_j^i = \begin{cases} w_j^i, & i = 1, 2, \ldots, N;\; j = 1, 2, \ldots, M \\ \theta_j, & i = 0;\; j = 1, 2, \ldots, M \end{cases} \tag{14} $$



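Particularly simple is the perceptron:
$X = \mathbb{R}^N$, $Y = \mathbb{R}$, $\theta \in Y$, $\varphi_S(x) = 0$ if $|S| > 1$, then:

$$ y = \begin{cases} 0 & \text{if } \sum_{i=1}^{N} w^i x_i < \theta \\ 1 & \text{if } \sum_{i=1}^{N} w^i x_i \ge \theta \end{cases} \tag{15} $$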
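
The perceptron decision rule, and the idea of folding the threshold into an extended weight vector, can be sketched as below. The sign convention used here (a constant extra input of -1 paired with the threshold as its weight) is an assumption made for the illustration, as are the function names and the numeric values.

```python
def perceptron_output(weights, x, threshold):
    """Return 1 when the weighted sum reaches the threshold, else 0."""
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1 if net >= threshold else 0


def extend(weights, threshold):
    """Fold the threshold into the weight vector as an extra leading entry.

    The matching input vector gets a constant leading component of -1, so the
    extended unit compares its weighted sum against zero.
    """
    return [threshold] + list(weights)


def perceptron_output_extended(ext_weights, x):
    """Evaluate the extended unit on an input without the constant entry."""
    ext_x = [-1.0] + list(x)
    net = sum(w * xi for w, xi in zip(ext_weights, ext_x))
    return 1 if net >= 0 else 0


weights, threshold = [0.6, 0.6], 1.0
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
plain = [perceptron_output(weights, x, threshold) for x in inputs]
folded = [perceptron_output_extended(extend(weights, threshold), x) for x in inputs]
```

Both formulations give the same outputs on these inputs. The practical benefit of the extended form is that a learning rule can adjust the threshold together with the other weights.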



FUTURE TRENDS 



Some promising neural-inspired approaches to feature 
extraction and clustering were also proposed (Mao & 
Jain, 1995), which are adaptive online and may ex- 
hibit additional desirable properties such as robustness 
against outliers (Xu & Yuille, 1995) as compared to 
more traditional feature extraction methods. 

The connection between Bayesian inference and 
neural models gives new perspectives to the assump- 
tions and approximations made on ANNs and algorithms 
when used as associative memories. Advances in neural
modeling and training algorithm design that address the
dynamic range and sensitivity problems encountered by
large analog systems, along with fast evolution in VLSI
implementation techniques, could lead to practical real-time
systems derived from the topology and parallel
distributed processing performed by biological NNs.



CONCLUSION 

Though ANNs are able to perform a large variety of 
tasks, the problems handled practically could be loosely 
divided into four basic types (Zupan & Gasteiger, 1994):
auto- or hetero-association, classification, mapping 
(Kohonen, 1982), and modeling. 

In neural classifiers, the set of examples used for 
training should necessarily come from the same (pos- 
sibly unknown) distribution as the set used for testing 
the networks, in order to provide reliable generaliza- 
tion in classifying unknown patterns. Valid results are 
only produced if the training examples are adequately 
selected and their number is comparable to the number 
of effective parameters in the net. Quite a few classes 
of learning algorithms have the convergence guaran- 
teed; moreover, they require substantial computational 
resources. 

Generally, ANNs are used as parts of larger systems,
where they serve as preprocessing or labeling/interpretation
subsystems. In many cases, the flexibility and non-algorithmic
processing of ANNs outweigh their drawbacks
and make them suitable for modeling rather
complex systems involving plenty of information.



REFERENCES



Amari, S. (1989). Dynamical stability of formation of
cortical maps in dynamic interactions in neural networks.
In Arbib, M. A., & Amari, S. (Eds.), Research work
in neural computing, 1 (pp. 15-34). Springer-Verlag.

Amit, D. J. (1990). Attractor neural networks and biological
reality: Associative memory and learning. Future
Generation Computer Systems, 6(2), 111-119.

Aubin, J.-P. (1991). Viability Theory. Birkhäuser.

Cooper, L. N. & Scofield, C. L. (1988). Mean-field 
theory of a neural network. Proceedings of the National 
Academy of Sciences of the USA, 85(6), 1973-1977. 

Domany, E. (1988). Neural networks: A biased overview.
Journal of Statistical Physics, 51(5-6), 743-775.

Fukushima, K. (1975). Cognitron: A self-organizing 
multilayered neural network. Biological Cybernetics, 
20, 121-136. 

Hebb, D. O. (1949). The organization of behavior, A 
neurophysiological theory. John Wiley & Sons Inc. 

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduc- 
tion to the theory of neural computation. Addison- 
Wesley Publishing Company. 

Jansson, P. A. (1991). Neural networks: An overview. 
Analytical Chemistry, 63(6), 357-362. 

Kohonen, T. (1982). Self-organized formation of topologically
correct feature maps. Biological Cybernetics,
43, 59-69.

MacKay, D. J. C. (1992). A practical Bayesian framework
for backpropagation networks. Neural Computation,
4, 448-472.

Mao, J. & Jain, A. K. (1995). Artificial neural networks
for feature extraction and multivariate data projection.
IEEE Transactions on Neural Networks, 6(2),
296-317.

McCulloch, W. S. & Pitts, W. (1943). A logical calcu- 
lus of ideas immanent in nervous activity. Bulletin of 
Mathematical Biophysics, 5, 115-133. 






Neal, R. M. (1996). Bayesian learning for neural networks.
Springer-Verlag.

Ng, K. & Lippmann, R. (1991). Practical characteristics
of neural network and conventional pattern
classifiers. In R. Lippmann, & D. Touretzky (Eds.),
Advances in neural information processing systems
(pp. 970-976). San Francisco, CA: Morgan Kaufmann
Publishers Inc.

Stutz, J. & Cheesman, P. (1994). Autoclass - A Bayesian
approach to classification. In J. Skilling, & S. Sibisi
(Eds.), Maximum entropy and Bayesian methods (pp.
117-126). Cambridge: Kluwer Academic Publishers.

Sutton, R. S. & Barto, A. G. (1998). Introduction to 
reinforcement learning. Cambridge, MA: MIT Press. 

Xu, L. & Yuille, A. L. (1995). Robust principal element 
analysis by self-organizing rules based on statistical 
physics approach. IEEE Transactions on Neural Net- 
works, 5(1), 131-143. 

Welstead, S. T. (1994). Neural network and fuzzy logic 
applications in C/C++. John Wiley & Sons. 

Zupan, J. & Gasteiger, J. (1994). Neural networks for 
chemists. An introduction. VCH Verlagsgesellschaft 
mbH, Germany. 



KEY TERMS 

Artificial Neural Networks (ANNs): Highly paral- 
lel networks of interconnected simple computational 
elements (cells, nodes, neurons, units), which mimic 
biological neural networks.



Complex Systems: Systems made of several interconnected
simple parts which altogether exhibit a
high degree of complexity, from which a higher
order behaviour emerges.

Emergence: Modalities in which complex systems 
like ANNs and patterns come out of a multiplicity of 
relatively simple interactions. 

Learning (Training) Rule: Iterative process of 
updating the weights from cases (instances) repeatedly 
presented as input. Learning (adaptation) is essential 
in PR where the training data set is limited and new 
environments are continuously encountered. 

Paradigm of ANNs: Set of (i) pre-processing units' 
form and functions, (ii) network topology that describes 
the number of layers, the number of nodes per layer, 
and the pattern of weighted interconnections among the 
nodes, and (iii) learning (training) rule that specifies 
the way weights should be adapted during use in order 
to improve network performance. 

Relaxation: Process by which ANNs minimize
an objective function using semi- or non-parametric
methods to iteratively update the weights.

Robustness: Property of ANNs to accomplish
their tasks reliably when handling incomplete and/or corrupted
data. Moreover, the results should be consistent
even if some part of the network is damaged.

Self-Organization Principle: A process in which 
the internal organization of a system that continually 
interacts with its environment increases in complexity 
without being driven by an outside source. Self-organiz- 
ing systems typically exhibit emergent properties. 
INTRODUCTION

Technological advances in high-throughput techniques
and efficient data gathering methods, coupled with
computational biology efforts, have resulted in a vast amount
of life science data often available in distributed and
heterogeneous repositories. These repositories contain
information such as sequence and structure data, annotations
for biological data, results of complex computations,
genetic sequences and multiple bio-datasets.
However, the heterogeneity of these data has created
a need for research in resource integration and
platform-independent processing of investigative queries
involving heterogeneous data sources.

When processing huge amounts of data, information
integration is one of the most critical issues, because
it is crucial to preserve the intrinsic semantics of all the
merged data sources. This integration would allow the
proper organization of data, fostering analysis of and
access to the information needed for critical tasks, such
as the processing of micro-array data to study protein
function and detailed medical research on
protein structures to facilitate drug design
(Ignacimuthu, 2005). Furthermore, the DNA micro-array
research community urgently requires technology to
allow up-to-date micro-array data information to be
found, accessed and delivered in a secure framework
(Sinnott, 2007).

Several research disciplines, such as Bioinformatics,
where information integration is critical, could benefit
from harnessing the potential of a new approach: the
Semantic Web (SW). The SW term was coined by
Berners-Lee, Hendler and Lassila (2001) to describe
the evolution of a Web that consisted largely of documents
for humans to read towards a new paradigm
that included data and information for computers to
manipulate. The SW is about adding machine-understandable
and machine-processable metadata to Web
resources through its key enabling technology: ontologies
(Fensel, 2002). An ontology is a formal, explicit
and shared specification of a conceptualization. The
SW was conceived as a way to solve the need for data
integration on the Web.

This article expounds SAMIDI, a Semantics-based
Architecture for Micro-array Information and Data
Integration. The most remarkable innovation offered by
SAMIDI is the use of semantics as a tool for leveraging
different vocabularies and terminologies and fostering
integration. SAMIDI is composed of a methodology
for the unification of heterogeneous data sources, based on
the analysis of the requirements of the unified data set,
and a software architecture.



BACKGROUND 

This section introduces Bioinformatics and its need
to process massive amounts of data, the benefits of
integrating the existing sources of biological
information, and semantics as a tool for integration.

Bioinformatics

The term Bioinformatics was coined by Hwa Lim in the 
late 1980s, and later popularized through its associa- 
tion with the human genome project (Goodman, 2002). 
Bioinformatics is the application of information science 
and technologies for the management of biological 
data (Denn & MacMullen, 2002) and it describes any 
use of computers to store, compare, retrieve, analyze 
or predict the composition of the structure of biomol- 
ecules (Segall & Zhang, 2006). Research on Biology 
requires Bioinformatics to manipulate and discover new 
biological knowledge at several levels of increasing 
complexity. Biological data are produced through high- 
throughput methods (Vyas & Summers, 2005), which 
means that they have to be represented and stored in 
different formats, such as micro-arrays. 

Micro-Array Data Sources 

A DNA micro-array is a collection of microscopic DNA
spots attached to a solid surface forming an array for
the purpose of expression profiling, which monitors
expression levels for thousands of genes simultaneously.
Those features are read by a scanner that measures the
level of activation, and the data is downloaded onto
a computer for subsequent analysis (Cohen, 2005).
Micro-arrays allow investigating millions of genes
simultaneously (Segall & Zhang, 2006). A biological
experiment may require hundreds of micro-arrays, where
a single micro-array generates up to a million fragments
of data (Murphy, 2002). This fact makes data analysis
and management a major challenge for gene expression
studies using micro-arrays (Xu, Maresh, Giardina &
Pincus, 2004).

The need to manage data generated from Bioinfor- 
matics is crucial. Understanding biological processes 
necessitates access to collections of potentially dis- 
tributed, separately owned and managed biological 
data sets (Sinnott, 2007). These data sources reside
in different storages, hardware platforms, database
management systems, data models and data languages
(Chen, Promparmote & Maire, 2006), which makes
their integration impossible. To make things worse,
the incompatibility is not limited to the use of different
data technologies but extends to semantics.
This heterogeneity can
be of two sorts: syntactic and semantic (Verschelde,
Dos Santos, Deray, Smith & Ceusters, 2004). Syntactic
heterogeneity refers to differences in data models and
data languages and can be easily resolved. Semantic
heterogeneity refers to the underlying meanings of the
data represented. It gives rise to naming conflicts and
structural conflicts.

This incompatibility, and the necessity of sharing
and aggregating information among the existing
micro-array data sources, lead researchers to seek
data integration.

Micro-Array Data Integration 

Data analysis and management represent a major chal- 
lenge for gene expression studies using micro-arrays 
(Xu, Maresh, Giardina & Pincus, 2004). Micro-array 
technology is still rather new and standards are not
established (Murphy, 2002). This lack of standardization
impedes micro-array data exchange. However,
several projects have been started with a common goal:
to facilitate the exchange and analysis of micro-array
data. MIAME (Minimum Information About Micro-array
Experiment) is an XML based standard for the
description of micro-array experiments. It is gaining
importance because it is required by numerous journals
for the submission of articles reporting micro-array
experiment results. The purpose of MIAME is to define
the core information needed for the description of an 
array based gene expression monitoring experiment. 
MAGE (Micro-Array Gene Expression) is a standard 
micro-array data model and exchange format that is 
able to capture information specified by MIAME. 

Integration and Semantics 

The ambiguity of terms, both within and between different
databases and terminologies, makes the task of integrating
bioinformatics data highly error prone (Verschelde
et al., 2004). Converting all this information into a
common data format will likely never be achieved
and, therefore, the solution to the effective information
management problem will necessarily go through
the establishment of a common understanding. This
is where semantics comes into play, bridging
nomenclature and terminological inconsistencies to
comprehend underlying meaning in a unified manner.
The key elements that enable semantic interoperability
are ontologies: semantic models of the data that
interweave human understanding of symbols with their
machine processability (Della Valle, Cerizza, Bicer,
Kabak, Laleci & Lausen, 2005). Ontologies make it possible to
organise terms used in heterogeneous data sources
according to their semantic relationships so that heterogeneous
data fragments can be mapped into a consistent
frame of reference (Buttler, Coleman, Critchlow, Fileto,
Han, Pu, Rocco & Xiong, 2002).

Applying semantics allows capturing the meaning
of data one single time. Without semantics, each data
element will have to be interpreted several times from
its design and implementation until its use, which invites
errors. Finally, semantics allows turning a large set
of data sources into an integrated, coherent and unique
body of information. The architecture of the information
itself contains a record to keep the meaning and locate 
each data asset, enabling the automation of overlapping 
and redundancy analysis. 

One of the most important contributions of ontol- 
ogy to the unification of biological data schemas is the 
MGED (Microarray Gene Expression Data) ontology 
(Whetzel, Parkinson, Causton, Fan, Fostel, Fragoso, 
Game, Heiskanen, Morrison, & Rocca-Serra, 2006) 
that was prompted by the heterogeneity among MI- 
AME and MAGE formats. MGED is a conceptual 
model for micro-array experiments that establishes 
concepts, definitions, terms and resources for standard- 
ized description of a micro-array experiment in support 
of MAGE. MGED has been accepted as a unifying 
terminology by the micro-array research community, 
which makes it a perfect candidate for becoming a 
universal understanding model. 



THE SAMIDI APPROACH 

SAMIDI is both a methodology to allow the conver- 
sion of a set of micro-array data sources into a unified 
representation of the information they include, and a 
software architecture. A general overview of SAMIDI 
is depicted in Figure 1. 

The Unifying Information Model (UIM) 

The UIM brings together all the physical data schemas
related to the data sources to be integrated. It does not
replicate any of the data models, but is built to represent
an agreed-upon scientific view and vocabulary which
will be the foundation for understanding the data. The
UIM captures the major concepts present in each schema
so that, once a semantic mapping has been applied, the
physical schemas of the various data sources can be
related to the Model itself. Thus, the semantic mapping
encapsulates the meaning of the data, taking into account
the agreed-upon terminology. The UIM also provides
the basis for the creation of new data assets, ensuring
that they are consistent with the underlying schemas, and
serves as a reliable reference for understanding the
interrelationship between seemingly unrelated sources
and for automatically planning the translation of data
elements between them.
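As a rough illustration of how a semantic mapping can relate physical schemas to a shared model, the following minimal sketch renames the fields of two toy sources to common concepts. The source names, field names and concept names (Gene, ExpressionValue) are hypothetical and are not taken from SAMIDI or from the MGED ontology.

```python
# Two hypothetical micro-array sources whose fields differ in name only.
SOURCE_A = [{"gene_id": "BRCA1", "expr_level": 2.4}]
SOURCE_B = [{"GeneSymbol": "TP53", "Intensity": 1.1}]

# Semantic mapping: (source, physical field) -> concept of the unifying model.
# All concept names here are assumptions made for illustration.
MAPPING = {
    ("A", "gene_id"): "Gene",
    ("A", "expr_level"): "ExpressionValue",
    ("B", "GeneSymbol"): "Gene",
    ("B", "Intensity"): "ExpressionValue",
}


def to_uim(source, record):
    """Rename the fields of one record to their mapped concepts."""
    return {MAPPING[(source, field)]: value for field, value in record.items()}


def integrate(sources):
    """Translate every record of every source and keep its provenance."""
    unified = []
    for name, records in sources.items():
        for record in records:
            row = to_uim(name, record)
            row["source"] = name  # each resource stays traceable
            unified.append(row)
    return unified


unified = integrate({"A": SOURCE_A, "B": SOURCE_B})
```

Because the mapping is kept as data rather than code, adding a new source in this sketch only requires new mapping entries; the records of the existing sources are not touched.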



Figure 1. SAMIDI

The Semantic Information Management Methodology (SIMM)

Its aim is to fill the existing gap between the fragmented
information scenario represented by the set
of micro-array data sources to be integrated and the
required semantic data integration. The methodology
is arranged in such a way that each of the phases generates
substantial added value by itself, while concurrently
progressing towards full semantic integration.
The stages of the methodology (Figure 2) are
detailed next:

1. Capture Requirements: The project scope is es- 
tablished by surveying the relevant data sources 
to be integrated and determining the information 
requirements of the aimed information model. 

2. Collect Metadata: Data assets are classified and 
catalogued while the relevant (according to the 
organization's use of data) metadata are col- 
lected. 

3. Build UIM: The structure of the UIM is determined
by representing the desired business world-view,
a comprehensive and consistent vocabulary and
a set of business rules.

4. Rationalize data semantics: The meaning of the 
represented data is mapped into the UIM. 

5. Deploy: The UIM, the metadata and the semantics
are shared among the intended stakeholders and
customized to fulfill their needs.

6. Exploit: Business processes are created, ensur- 
ing that the architecture has achieved the data 
management, data integration and data quality 
objectives. 



The effective application of the SIMM requires a 
support information system. Key components of the 
supporting system should include: 

A repository for storing the collected metadata
on data assets, schemas and models.

A set of semantic tools for integrated ontology
modeling, used for the creation of the UIM and the
semantic mapping of the data schemas to the
model.

The standard business terminology stemming from
the UIM should be used across the supporting
system.

Data management capabilities of the system
should include authoring and editing the information
model; discovering data assets for any
particular business concept; creating qualitative
and quantitative reports about data assets; testing
and simulating the performance of the UIM; and
impact analysis in support of change.

The system should fully support data integration
by automatically generating code for queries and
translation scripts between mapped schemas,
harnessing the common understanding acquired
from the data semantics.

Data quality should be approached by supporting
the identification and decommissioning of
redundant data assets, comparison mechanisms
for ensuring consistency among semantically dissimilar
data, and validation/cleansing of individual
sources against the central repository of rules.

The system should allow bi-directional communication
with other systems for exchanging
metadata and models by means of adaptors and
standards such as the XML Metadata Interchange
standard. Similarly, the system should be able to
collect metadata and other data assets from relational
databases or other kinds of repositories.

Capability of active data integration. The system
should have a Run-Time interface for the automatic
generation and exporting of queries, translation
scripts, schemas and cleansing scripts.

The user interface should provide a rich thick
client for power users in the data management
group.

The system should include a transversal platform
supporting shared functionalities such as version
control, collaboration tools, access control and
configuration for all metadata and active content
in the system.

Figure 2. SIMM

The SAMIDI Software Architecture 

A detailed description of the components comprised 
in the SAMIDI software architecture (see Figure 3) is 
presented now: 

Figure 3. SAMIDI architecture

SearchBot: Software agent whose mission is to
perform a methodical, systematic and automatic
browsing of the integrated information sources.

Semantic Engine: Integrated set of tools for the
semantic mapping of the data schemas to the
UIM using the MGED ontology. It provides
a semi-automatic mechanism for mapping
schemas to concepts or categories in the UIM,
so as to lessen the workload of a process that
requires human intervention. Fully automatic mapping
has been discarded, since it is not advisable
given the semantic incompatibilities
and ambiguities among the different schemas
and data formats. The objective of the Semantic
Engine is to bridge the gap between cost-efficient
machine-learning mapping techniques and pure
human interaction.

YARS: The YARS (Yet Another RDF Store)
(Harth & Decker, 2005) system is a semantic data
store that allows semantic querying and offers a
higher abstraction layer to enable fast storage and
retrieval of large amounts of metadata descriptions
while keeping a small footprint and a lightweight
architecture approach.

GUI: Enables user-system interaction. It collects
requests that follow certain search criteria, transfers
the appropriate commands to the RunTime Manager
and displays the results it returns.

Query Factory: Builds queries against the YARS storage
system using a query language. The semantics
of the query are defined by an interpretation of
the most suitable results of the query instead of
strictly rendering a formal syntax. YARS stores
RDF triples, and the Query Factory, for pragmatic
reasons, implements the SPARQL Query Language
for RDF (Prud'hommeaux & Seaborne, 2004).

RunTime Manager: This component coordinates
the interactions among the rest of the components.
First, it communicates with the Semantic
Engine to verify that the information collected
by the SearchBot is adequately mapped onto the
MGED ontology as a UIM and stored in YARS
using RDF syntax. Second, it accepts the users'
search requests through the GUI and hands them
over to the Query Factory, which in turn queries
YARS to retrieve all the metadata descriptions
related to the particular search criteria. By retrieving
a large amount of metadata from
all the integrated data sources, the user benefits
from a knowledge-aware search response which
is mapped to the underlying terminology and unified
criteria of the UIM, with the added advantage
that all resources can be tracked and identified
separately.
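The interplay between stored RDF triples and a query can be sketched with a tiny in-memory triple matcher. This is only an illustration of triple-pattern matching, the idea underlying SPARQL basic graph patterns; it does not use YARS or a SPARQL engine, and every identifier in it (the `ex:` and `mged:` prefixed names, the experiment ids, the function names) is hypothetical.

```python
# Metadata descriptions as (subject, predicate, object) triples.
# All identifiers below are made up for illustration.
TRIPLES = {
    ("exp1", "rdf:type", "mged:Experiment"),
    ("exp1", "ex:organism", "Homo sapiens"),
    ("exp1", "ex:source", "ArrayExpress"),
    ("exp2", "rdf:type", "mged:Experiment"),
    ("exp2", "ex:organism", "Mus musculus"),
    ("exp2", "ex:source", "GEO"),
}


def match(pattern, triples=TRIPLES):
    """Return the triples matching a pattern; None is a wildcard,
    playing the role of a variable in a triple pattern."""
    return sorted(
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


def experiments_for(organism):
    """Find experiments on an organism and report where each came from."""
    subjects = [s for s, _, _ in match((None, "ex:organism", organism))]
    return {s: match((s, "ex:source", None))[0][2] for s in subjects}
```

For example, `experiments_for("Homo sapiens")` joins two patterns on the shared subject, so the answer still records which source each description came from.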



FUTURE TRENDS 

We believe that the SW and SW Services (SWS)
paradigms promise a new level of data and process
integration that can be leveraged to develop novel
high-performance data and process management systems for
biological applications.

Using semantic technologies as the key to the
interoperation of various datasets enables the integration
of the vast amount of biological and biomedical
data. In a nutshell, knowledge-oriented
biomedical data integration would lead to
Intelligent Biomedical Data Integration, which will
bring biomedical research to its full potential.

A future direction for the SAMIDI effort is to integrate it
into a SWS scenario in order to achieve seamless integration
from the service or process integration
perspective as well. This would enable access to a number
of heterogeneous data resources that are exposed through
a Web Service interface, and it would open the scope
and goals of SAMIDI to a broader base. Promising
integration efforts in that direction have already been
undertaken by the Biomedical Information and Integration
Discovery with SWS (BIRD) platform (Gomez,
Rico, Garcia-Sanchez, Liu & Terra, 2007), which
fosters the intelligent interaction between natural language
user intentions and the existing SWS execution
environments. BIRD is a platform designed to interact
with humans as a gateway or a man-in-the-middle
towards SWS execution environments. The main goal
of the system is to help users express their needs in
terms of information retrieval and achieve information
integration by means of SWS. BIRD allows users to
state their needs via natural language or using a list
of terms extracted from the Gene Ontology, infers the
goals derived from the users' wishes and sends them to
the suitable SWS execution environment, which will
retrieve the outcome resulting from the integration of the
applications being accessed (e.g. all the biomedical
publications and medical databases).



CONCLUSION 

SAMIDI represents a tailor-made contribution aimed
at tackling the problem of discovering, searching and
integrating multiple micro-array data sources by harnessing the
distinctive features of semantic technologies. The main
contribution of SAMIDI is that it decomposes
the otherwise unmanageable problem of integrating different
and independent micro-array data sources.

SAMIDI is a first step towards fostering and extending the
idea of using semantic technologies for the integration
of different data sources, not only those originating from
micro-array research but also those stemming from other
biomedical research areas.

KEY TERMS

Bioinformatics: Application of information science
and technologies for the management of biological data
and the use of computers to store, compare, retrieve,
analyze or predict the composition or the structure of
biomolecules.

DNA Micro-Array: Collection of microscopic
DNA spots attached to a solid surface forming an
array for the purpose of expression profiling, which
monitors expression levels for thousands of genes
simultaneously.

MAGE: Standard micro-array data model and
exchange format that is able to capture information
specified by MIAME.

MGED: Ontology for micro-array experiments that
establishes concepts, definitions, terms and resources
for standardized description of a micro-array experiment
in support of MAGE.

MIAME: XML based standard for the description
of micro-array experiments, which stands for Minimum
Information About a Micro-array Experiment.

Ontology: The specification of a conceptualization
of a knowledge domain. It is a controlled vocabulary
that describes objects and the relations among them in
a formal way, and has a grammar for using the vocabulary
terms to express something meaningful within a
specified domain of interest.

Semantic Information Model Methodology: Set
of activities, together with their inputs and outputs,
aimed at the transformation of a collection of micro-array
data sources into a semantically integrated and
unified representation of the information stored in the
data sources.

Unifying Information Model (UIM): Construct
that brings together all the physical data schemas related
to the data sources to be integrated. It is built to represent
an agreed-upon scientific view and vocabulary which
will be the foundation to understand the data.

INTRODUCTION

The development of autonomous mobile robots is
continuously gaining importance, particularly in the
military for surveillance as well as in industry for
inspection and material handling tasks. Another emerging
market with enormous potential is mobile robots for
entertainment.

A fundamental requirement for autonomous mobile
robots in most of their applications is the ability to
navigate from a point of origin to a given goal. The mobile
robot must be able to generate a collision-free path that
connects the point of origin and the given goal. Some
of the key algorithms for mobile robot navigation are
discussed in this article.

BACKGROUND

Many algorithms have been developed over the years for
the autonomous navigation of mobile robots. These
algorithms are generally classified into three different
categories: global path planners, local navigation
methods and hybrid methods, depending on the type of
environment that the mobile robot operates within and
the robot's knowledge of the environment.

In this article, some of the key algorithms for the
navigation of a mobile robot are reviewed, and their
advantages and disadvantages are discussed.
The algorithms that are reviewed include the navigation
function, roadmaps, vector field histogram, artificial
potential field, hybrid navigation and the integrated
algorithm. Note that all the navigation algorithms that
are discussed in this article assume that the robot is
operating in a planar environment.

GLOBAL PATH PLANNERS

Global path planning algorithms are a group of
navigation algorithms that plan an optimal path from a
point of origin to a given goal in a known environment.
This group of algorithms requires the environment to
be free from dynamic and unforeseen obstacles. In this
section, two key global path planning algorithms,
navigation functions and roadmaps, are discussed.

Navigation Functions

The most widely used global path planning algorithm
is perhaps the navigation function computed with the
"wave-front expansion" algorithm (J.-C Latombe, 1991; Howie
Choset et al, 2005), owing to its practicality, ease
of implementation and robustness. The navigation
function N is the Manhattan distance to the goal from each
cell of the free space in the environment. The algorithm requires
the information about the environment provided to the robot
to be represented as an array of grid cells.



Figure 1. The shaded cells are the 1-Neighbors of the
cell (x, y) and the numbers show the priority of the
neighboring cells

Figure 2. Path generated by the navigation function

The navigation function assigns a numeric value N
to each cell, with the goal cell having the lowest value
and the other unoccupied cells having progressively
higher values, such that a steepest descent from any
cell provides the path to the goal. The value of an
unoccupied cell increases with its distance from the
goal. Each grid cell is either free or occupied space,
denoted by $gC_{free}$ and $gC_{occupied}$ respectively.
First, the value of N is set to '0' at the goal cell
$gC_{goal}$. Next, the value of N is set to '1' for every
1-Neighbor (see Figure 1 for the definition of 1-Neighbors)
of $gC_{goal}$ which is in $gC_{free}$. It
is assumed that the distance between two 1-Neighbors
is normalized to 1. In general, every unprocessed
$gC_{free}$ 1-Neighbor of a grid cell with value N
(e.g., '1') is set to N+1 (e.g., '2'). This
is repeated until all the grid cells are processed.

Finally, a path to the goal is generated by following
the steepest descent of the N values. To prevent the path
from grazing the obstacles, the grid cells which are less
than a safety distance a from the obstacles are omitted
in the computation of the navigation function. Figure
2 shows a path generated by the navigation function.
The black cells are the obstacles and the grey cells are
the unsafe regions.
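The wave-front expansion and the steepest descent described above can be sketched as a breadth-first fill over the grid. This is a minimal sketch under stated assumptions: the 1-Neighbors are taken to be the four edge-adjacent cells (consistent with a Manhattan distance), the safety-distance step is left out, and the function and constant names are illustrative only.

```python
from collections import deque

FREE, OCCUPIED = 0, 1
STEPS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed 4-connected 1-Neighbors


def neighbours(grid, cell):
    """Yield the in-bounds 1-Neighbors of a cell."""
    r, c = cell
    for dr, dc in STEPS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            yield nr, nc


def navigation_function(grid, goal):
    """Wave-front expansion: N is 0 at the goal and grows by 1 per step.
    Occupied and unreachable cells keep the value None."""
    N = [[None] * len(grid[0]) for _ in grid]
    N[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for nr, nc in neighbours(grid, (r, c)):
            if grid[nr][nc] == FREE and N[nr][nc] is None:
                N[nr][nc] = N[r][c] + 1
                queue.append((nr, nc))
    return N


def steepest_descent(grid, N, start):
    """Follow decreasing N values from start to the goal (N == 0)."""
    if N[start[0]][start[1]] is None:
        return None  # the start cannot reach the goal
    path = [start]
    while N[path[-1][0]][path[-1][1]] != 0:
        r, c = path[-1]
        path.append(next((nr, nc) for nr, nc in neighbours(grid, (r, c))
                         if N[nr][nc] == N[r][c] - 1))
    return path


grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
N = navigation_function(grid, goal=(0, 3))
path = steepest_descent(grid, N, start=(2, 0))
```

Because every reachable cell with N > 0 has at least one 1-Neighbor with value N - 1, the descent cannot get stuck, which is why the navigation function is free of local minima.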

Roadmaps 

A roadmap is a network of one-dimensional curves
that captures the connectivity of free space in the
environment (J.-C Latombe, 1991; Danner et al, 2000;
Foskey et al, 2001; Isto P., 2002; T. Simeon et al, 2004;
Xiaobing Zou et al, 2004; Howie Choset et al, 2005;
Bhattacharya et al, 2007). Once a roadmap has been
constructed, it is used as a set of standardized paths.
Path planning is thus reduced to connecting the initial
and goal positions to points in the roadmap. Various
methods based on this general idea have been proposed.
They include the visibility graph (Danner et al, 2000;
Isto P., 2002; T. Simeon et al, 2004), Voronoi diagram
(Foskey et al, 2001; Xiaobing Zou et al, 2004;
Bhattacharya et al, 2007), freeway net and silhouette (J.-C
Latombe, 1991; Howie Choset et al, 2005).

The visibility graph is the simplest form of road- 
map. This algorithm assumes that the environment is 
made up of only polygonal obstacles. The nodes of a 
visibility graph include the point of origin, goal and all 
the vertices of the obstacles in the environment. The 
graph edges are straight line segments that connect 
any two nodes within the line-of-sight of each other. 
Finally, the shortest path from the start to goal can be 
obtained from the visibility graph. 
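A visibility graph for a small scene can be sketched as follows. The sketch makes several simplifying assumptions that are not in the text: obstacles are convex polygons given as vertex lists, no three relevant points are collinear, and the start and goal lie outside every obstacle. The shortest path is then extracted with Dijkstra's algorithm; all function names are illustrative.

```python
import heapq
import math
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of triangle abc."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _crosses(p, q, a, b):
    """True if segments pq and ab cross at an interior point of both."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def visibility_graph(start, goal, polygons):
    nodes = [start, goal] + [v for poly in polygons for v in poly]
    owner = {v: i for i, poly in enumerate(polygons) for v in poly}
    edges = [(poly[k], poly[(k + 1) % len(poly)])
             for poly in polygons for k in range(len(poly))]
    graph = {n: [] for n in nodes}
    for p, q in combinations(nodes, 2):
        same = p in owner and q in owner and owner[p] == owner[q]
        if same and (p, q) not in edges and (q, p) not in edges:
            continue  # diagonal through the interior of a convex obstacle
        if any(_crosses(p, q, a, b) for a, b in edges):
            continue  # line of sight is blocked by an obstacle edge
        w = math.dist(p, q)
        graph[p].append((q, w))
        graph[q].append((p, w))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over the visibility graph."""
    dist, prev, heap = {start: 0.0}, {}, [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


obstacle = [(1, -2), (2, -2), (2, 1), (1, 1)]
graph = visibility_graph((0, 0), (3, 0), [obstacle])
route = shortest_path(graph, (0, 0), (3, 0))
```

In this scene the direct line from start to goal is blocked, so the route passes over the two upper corners of the rectangle, illustrating that shortest paths in a visibility graph bend only at obstacle vertices.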

Advantages and Disadvantages 

The advantage of navigation functions, roadmaps
and other global path planning algorithms is that a
continuous collision-free path can always be found by
analyzing the connectivity of the free space. However,
these algorithms require the environment to be known
and static. Any change in the environment could
invalidate the generated path. Hence, navigation
functions and other global path planning algorithms
are usually not suitable for navigation in an initially
unknown environment or in one with dynamic and
unforeseen obstacles.



LOCAL NAVIGATION METHODS 

In contrast to the global path planners, local navigation
methods do not require a known map of the environment
to be provided to the robot. Instead, local navigation
methods rely on current and local information
from sensors to give a mobile robot online navigation
capability. In this section, two of the key algorithms
for local navigation, the artificial potential field and
the vector field histogram, are evaluated.

Artificial Potential Field 

The artificial potential field method (O.Khatib, 1986),
first introduced by Khatib, is perhaps the best known
algorithm for local navigation of mobile robots due to
its simplicity and effectiveness. The robot is represented
as a particle in the configuration space q moving under
the influence of an artificial potential produced by the
goal configuration $q_g$ and the scalar distance to the
obstacles. Typically the goal generates an attractive
potential such as

$$U_g(q) = \frac{1}{2} K_g (q - q_g)^T (q - q_g) \qquad (1)$$

which pulls the robot towards the goal, and each obstacle
i produces a repulsive potential such as

$$U_{i,o}(q) = \begin{cases} \frac{1}{2} K_o \left( \frac{1}{d_i} - \frac{1}{d_0} \right)^2 & \text{if } d_i < d_0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

which pushes the robot away from the obstacle. In cases
where there is more than one obstacle, the total repulsive
force is computed as the sum of all the repulsive forces
produced by the obstacles. $K_g$ and $K_o$ are the respective
gains of the attractive and repulsive potentials, and $d_i$ is the
scalar distance between the robot and obstacle i. The
repulsive potential only has an effect on the robot
when it moves to a distance less than $d_0$ from the obstacle.
This implies that $d_0$ is the minimum safe distance from
the obstacle that the robot tries to maintain.

The negated gradient of the potential field gives the
artificial force acting on the robot.

$$F(q) = -\nabla U(q) \qquad (3)$$

Figure 3. Robot motion influenced by the potential field

Figure 3 shows the attractive force

$$F_g(q) = -K_g (q - q_g) \qquad (4)$$

that is generated from the goal and the repulsive
force

$$F_{i,o}(q) = \begin{cases} K_o \left( \frac{1}{d_i} - \frac{1}{d_0} \right) \frac{1}{d_i^2} \frac{\partial d_i}{\partial q} & \text{if } d_i < d_0 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

that is generated from an obstacle i. $F_R$ is the resultant
of all the repulsive forces and the attractive force. Note that
n denotes the total number of obstacles which are at a
distance less than $d_0$ from the robot. At every position,
the direction of this force is considered as the most
promising direction of motion for the robot.

$$F_R(q) = F_g(q) + \sum_{i=1}^{n} F_{i,o}(q) \qquad (6)$$
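The forces of a quadratic attractive potential and a Khatib-style repulsive potential can be evaluated numerically as in the sketch below. It assumes a planar robot, point obstacles (so that the gradient of the distance is the unit vector pointing from the obstacle to the robot), and illustrative gain values; none of the names or numbers come from the article.

```python
import math


def attractive_force(q, q_goal, k_g=1.0):
    """F_g(q) = -K_g (q - q_g): pulls the robot towards the goal."""
    return (-k_g * (q[0] - q_goal[0]), -k_g * (q[1] - q_goal[1]))


def repulsive_force(q, obstacle, k_o=1.0, d0=1.0):
    """Repulsion from one point obstacle, active only when d_i < d_0."""
    dx, dy = q[0] - obstacle[0], q[1] - obstacle[1]
    d = math.hypot(dx, dy)
    if d >= d0 or d == 0.0:
        return (0.0, 0.0)
    magnitude = k_o * (1.0 / d - 1.0 / d0) / d ** 2
    # (dx / d, dy / d) is the gradient of the distance for a point obstacle.
    return (magnitude * dx / d, magnitude * dy / d)


def resultant_force(q, q_goal, obstacles, k_g=1.0, k_o=1.0, d0=1.0):
    """F_R(q): attractive force plus the sum of all repulsive forces."""
    fx, fy = attractive_force(q, q_goal, k_g)
    for obstacle in obstacles:
        rx, ry = repulsive_force(q, obstacle, k_o, d0)
        fx, fy = fx + rx, fy + ry
    return (fx, fy)
```

With the robot at (0, 0), the goal at (2, 0) and a single obstacle at (0.5, 0), the attraction is 2 along +x while the repulsion is 4 along -x, so the resultant points away from the goal; configurations of this kind are where attraction and repulsion can balance and trap the robot.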



Vector Field Histogram 



The vector field histogram method (Y.Koren et al, 1991;
J.Borenstein, 1991; Zhang Huiliang et al, 2003) requires
the environment to be represented as a tessellation of
grid cells. Each grid cell holds a numerical value that
ranges from 0 to 15. This value represents the certainty
that the part of the environment covered by the grid cell
is occupied: 0 indicates absolute certainty that the cell is not
occupied and 15 indicates absolute certainty that the
cell is occupied. A two-stage data reduction process is
carried out repeatedly to compute the desired motion
of the robot at every instant of time.

In the first stage, the values of the grid cells that
are in the vicinity of the robot's momentary location
are reduced to a one-dimensional polar histogram.
Each bin of the polar histogram corresponds to a
direction as seen from the current location of the robot,
and it contains a value that represents the total sum of
the grid cell values along that direction. The values
of the polar histogram are also known as the polar
obstacle density and they represent the presence of
obstacles in the respective directions.

In the second stage, the robot selects, among the bins
with a low polar obstacle density, the one whose direction
is closest to the goal. The robot moves in the direction
represented by the chosen bin because this direction is free
from obstacles and it will bring the robot closer to the goal.
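The two stages can be sketched as below. This is a simplified illustration that follows the description above (plain sums of certainty values per direction) rather than the full published method; the bin count, the radius of the vicinity, the threshold and all function names are assumptions. Angles are measured with the column index as x and the row index as y.

```python
import math


def polar_histogram(grid, robot, n_bins=36, radius=5.0):
    """Stage 1: sum the certainty values (0-15) of nearby cells per direction."""
    hist = [0] * n_bins
    rr, rc = robot
    for r, row in enumerate(grid):
        for c, certainty in enumerate(row):
            if (r, c) == robot or certainty == 0:
                continue
            if math.hypot(r - rr, c - rc) > radius:
                continue  # outside the robot's vicinity
            angle = math.atan2(r - rr, c - rc) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += certainty
    return hist


def choose_direction(hist, goal_angle, threshold=0):
    """Stage 2: among low-density bins, pick the one closest to the goal."""
    width = 2 * math.pi / len(hist)
    candidates = [k for k, value in enumerate(hist) if value <= threshold]
    if not candidates:
        return None  # every direction is blocked

    def gap(k):
        diff = abs((k + 0.5) * width - goal_angle) % (2 * math.pi)
        return min(diff, 2 * math.pi - diff)

    return (min(candidates, key=gap) + 0.5) * width


grid = [[0] * 5 for _ in range(5)]
grid[2][4] = 15  # a certainly occupied cell straight ahead of the robot
hist = polar_histogram(grid, robot=(2, 2), n_bins=8)
heading = choose_direction(hist, goal_angle=0.0)
```

Here the bin that contains the goal direction is blocked, so the sketch returns the centre of the neighbouring free bin instead of the goal direction itself.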



Advantages and Disadvantages 

The advantage of the artificial potential field, vector
field histogram and other local navigation methods is
that they do not include an initial processing step aimed
at capturing the connectivity of the free space in a
concise representation. Hence prior knowledge of the
environment is not needed. At any instant in time, the
path is determined based on the immediate surroundings
of the robot. This allows the robot to avoid
any dynamic obstacles in its vicinity.

The major drawback of the local navigation methods
is that they are basically steepest descent optimization
methods. This renders the mobile robot susceptible
to local minima (Y.Koren et al, 1991; J.-O. Kim et al,
1992; Liu Chengqing et al, 2000; Liu Chengqing, 2002;
Min Gyu Park et al, 2004). A local minimum in the
potential field method occurs when the attractive and
the repulsive forces cancel each other out. The robot
is immobilized when it falls into a local minimum,
and loses the capability to reach its goal. Many methods
have been proposed to solve the local minima problem
(J.-O. Kim et al, 1992; Liu Chengqing et al, 2000;
Liu Chengqing, 2002; Min Gyu Park et al, 2004). For
example, Liu Chengqing (Liu Chengqing et al, 2000;
Liu Chengqing, 2002) has proposed the virtual obstacle
method, where the robot detects the local minima and
fills the area with artificial obstacles. Consequently, the
method closes all concave obstacles and thus avoids
local minima failures. Another method was proposed by
Jin-Oh Kim (J.-O. Kim et al, 1992) to solve the local
minima problem. This method uses local minima free
harmonic functions based on fluid dynamics to build
the artificial potentials for obstacle avoidance.



HYBRID METHODS 

Another group of algorithms suggests a combination of
the local navigation and global path planning methods.
These algorithms aim to combine the advantages of
both the local and global methods, and also to eliminate
some of their weaknesses. In this section, two key
hybrid methods, hybrid navigation and the
integrated algorithm, are reviewed.

Hybrid Navigation

Figure 4 shows an illustration of the hybrid navigation
algorithm (Lim Chee Wang, 2002; Lim Chee Wang et
al, 2002). This algorithm combines the navigation
function with the potential field method. It aims to eliminate
local minima failures and at the same time perform online
collision avoidance with dynamic obstacles.

Figure 4. Illustration of the hybrid navigation algorithm

The robot first computes the path joining its current
position to the goal using the navigation function. The
robot then places a circle with an empirical radius
centered at its current position. The cell that corresponds
to the intersection of the circle with the navigation
function path is known as the attraction point. If there
is more than one intersection, the attraction point is
the cell with the lowest N value.

The robot advances towards the attraction point using
the potential field method, and the circle moves along
with the robot, which causes the attraction point to
change. As a result, the robot is always chasing a
dynamic attraction point which progresses towards
the goal along the local minima free navigation function
path. The radius of the circle is made larger to intersect
the navigation function path in cases where no
intersections are found. The radius of the circle is reduced
to less than the distance between the robot and its
goal when the distance between the robot and its goal
becomes smaller than the radius of the circle. This is
to make sure that the N value of the next intersection
will be smaller than the current N value.
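The selection of the attraction point can be sketched on a discrete path as follows. The sketch assumes that the navigation function path is given as a list of cells ordered from start to goal (so N decreases along the list), treats a cell as lying on the circle when its distance to the robot is within a tolerance of the radius, and uses illustrative values for the tolerance, the growth factor and the shrink factor; none of these details are specified in the text.

```python
import math


def attraction_point(path, robot, radius, tol=0.5, growth=1.25, shrink=0.9):
    """Return the path cell on the circle around the robot with the lowest N.

    path   -- cells of the navigation function path, ordered start -> goal
    robot  -- current robot position (row, col)
    radius -- empirical radius of the circle, must be positive
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    d_goal = math.dist(robot, path[-1])
    if d_goal < radius:
        radius = shrink * d_goal  # keep the circle inside the goal distance
    farthest = max(math.dist(robot, cell) for cell in path)
    while radius <= farthest + tol:
        on_circle = [cell for cell in path
                     if abs(math.dist(robot, cell) - radius) <= tol]
        if on_circle:
            return on_circle[-1]  # last along the path = lowest N value
        radius *= growth  # no intersection found: enlarge the circle
    return path[-1]


path = [(0, c) for c in range(10)]  # a straight path; the goal is (0, 9)
```

The returned cell would then serve as the temporary goal of the potential field method, and the function would be called again after every move of the robot.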

Integrated Algorithm 

In recent years, the integrated algorithm (Lee Gim Hee,
2005; Lee Gim Hee et al, 2007) has been proposed to
give a mobile robot the ability to plan local minima free
paths and perform online collision avoidance with dynamic
obstacles in an unknown environment. The algorithm
modifies the frontier-based exploration method (Brian
Yamauchi, 1997), which was originally used for map
building, into a path planning algorithm for an unknown
environment. The modified frontier-based exploration
method is then combined with the hybrid navigation
algorithm into a single framework.

Figure 5 shows an overview of the integrated
algorithm. The robot first builds a local map (see Part II of this
article for details on map building) of its surroundings.
It then decides whether the goal is reachable based on
the acquired local map. The goal is reachable if it is in
free space, and is not reachable if it is in the unknown
space. Note that an unknown region is a part of the map
which has not been explored during the map building
process. If the goal is reachable, the robot advances
towards it using the hybrid navigation algorithm. If the
goal is not reachable, the robot advances towards a
sub-goal and builds another local map at the sub-goal. This
map is added to the previous local maps to form
a larger map of the environment. The process goes on
until the robot finds the goal within a free space.

Figure 5. The integrated algorithm

The sub-goal is computed in three steps. First, the
path that joins the robot's current position and
the goal is computed using the navigation function. The unknown
cells are taken to be free space in the computation of
the navigation function. Second, all the frontiers in the
map are computed. The boundary between free space and
an unknown region is known as a frontier. A frontier
is made up of a group of adjacent frontier cells. A
frontier cell is defined as any $gC_{free}$ cell on the map with
at least two unknown cells $gC_{unknown}$ as its immediate
neighbors. The total number of frontier cells that make
up a frontier must be larger than the size of the robot
for that frontier to be valid. Third, the frontier that
intersects the navigation function path is selected
and its centroid is chosen as the sub-goal.
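The frontier and sub-goal computation can be sketched as follows. The sketch assumes that the immediate neighbors of a cell are its eight surrounding cells, that the size of the robot is expressed as a number of cells, and that the navigation function path is already available as a list of cells; the cell codes and function names are illustrative.

```python
UNKNOWN, FREE, OCCUPIED = -1, 0, 1
NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]  # assumed 8-connected neighborhood


def frontier_cells(grid):
    """Free cells with at least two unknown immediate neighbors."""
    rows, cols = len(grid), len(grid[0])
    cells = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            unknown = sum(1 for dr, dc in NEIGHBOURS
                          if 0 <= r + dr < rows and 0 <= c + dc < cols
                          and grid[r + dr][c + dc] == UNKNOWN)
            if unknown >= 2:
                cells.add((r, c))
    return cells


def group_frontiers(cells, robot_size):
    """Group adjacent frontier cells; keep groups larger than the robot."""
    frontiers, remaining = [], set(cells)
    while remaining:
        stack = [remaining.pop()]
        group = set(stack)
        while stack:
            r, c = stack.pop()
            for dr, dc in NEIGHBOURS:
                n = (r + dr, c + dc)
                if n in remaining:
                    remaining.remove(n)
                    group.add(n)
                    stack.append(n)
        if len(group) > robot_size:
            frontiers.append(group)
    return frontiers


def sub_goal(grid, path, robot_size):
    """Centroid of the valid frontier crossed by the navigation path."""
    for frontier in group_frontiers(frontier_cells(grid), robot_size):
        if frontier.intersection(path):
            return (sum(r for r, _ in frontier) / len(frontier),
                    sum(c for _, c in frontier) / len(frontier))
    return None


grid = [[0, 0, 0, -1, -1] for _ in range(4)]  # right part still unexplored
path = [(1, c) for c in range(5)]  # path computed with unknown cells as free
```

In this small map the free cells of the third column form a single frontier of four cells; the path crosses it, so `sub_goal(grid, path, robot_size=2)` yields the centroid of that column.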



Advantages and Disadvantages 

The hybrid navigation algorithm has the advantage
of eliminating local minima failures and at the same
time performing online collision avoidance with dynamic
obstacles. However, it requires the environment to be
fully known for the search of a navigation function path
to the goal. The algorithm will fail in a fully unknown
environment. It also does not possess any capability to
re-plan the navigation function path during an operation.
Therefore any major change to the environment
could cause the algorithm to fail.

The integrated algorithm has the advantages of
planning local minima free paths and performing online collision
avoidance with dynamic obstacles in totally unknown
environments. In addition, the algorithm gives the
mobile robot a higher level of autonomy since it does
not depend on humans to provide a map of the environment.
However, these advantages come at the cost
of a tedious implementation. This is because the
integrated algorithm requires both the hybrid navigation
algorithm and a good mapping algorithm to be
implemented at the same time.

CONCLUSION

Mobile robot navigation involves more than planning 
a path from a point of origin to a given goal. A mobile 
robot must be able to follow the planned path closely 
and avoid any dynamic or unforeseen obstacles during 
its journey to the goal. Some of the key algorithms 
that give a mobile robot navigation capability were 
discussed in this article. These algorithms include the 
navigation function, roadmaps, artificial potential 
field, vector field histogram, hybrid navigation and the 
integrated algorithm. 



FUTURE TRENDS 

Researchers working on mobile robot navigation
generally assume that on-board sensors have perfect
sensing capability. In reality,
these sensors are corrupted with noise, and this usually
has adverse effects on the performance of the
navigation algorithms. The greatest challenge for a
robust implementation of the navigation algorithms
is therefore to minimize the adverse effects caused by
sensor uncertainty.



REFERENCES 

Bhattacharya, Priyadarshi, & Gavrilova, Marina L.
(2007). Voronoi diagram in optimal path planning.
4th International Symposium on Voronoi Diagrams in
Science and Engineering, pp. 38-47.

Brian Yamauchi. (1997). A frontier-based approach 
for autonomous exploration. Proceedings of the IEEE 
International Symposium on the Computational Intel- 
ligence in Robotics and Automation, pp. 146-151. 

Danner, T., & Kavraki, L.E. (2000). Randomized
planning for short inspection paths. IEEE International
Conference on Robotics and Automation, pp.
971-976.

Foskey, M., Garber, M., Lin, M.C., & Manocha, D. (2001).
A Voronoi-based hybrid motion planner. Proceedings
of International Conference on Intelligent Robots and
Systems, pp. 55-60.



Howie Choset, Kevin M. Lynch, Seth Hutchinson,
George Kantor, Wolfram Burgard, Lydia E. Kavraki,
& Sebastian Thrun. (2005). Principles of robot motion:
Theory, algorithms, and implementations. MIT
Press.

Isto P. (2002). Constructing probabilistic roadmaps 
with powerful local planning and path optimization. 
IEEE/RSJ International Conference on Intelligent 
Robots and System. Vol. 3, pp. 2323 - 2328. 

J. Borenstein, & Y. Koren. (1991). The vector field
histogram - Fast obstacle avoidance for mobile robots.
IEEE Journal of Robotics and Automation. Vol. 7, No.
3, pp. 278-288.

J.-C Latombe. (1991). Robot motion planning. Kluwer 
Academic Publishers. 

J.-O. Kim, & P. K. Khosla. (1992). Real-time obstacle 
avoidance using harmonic potential functions. Proceed- 
ings of IEEE Transactions on Robotics and Automation. 
Vol. 8, pp. 338-349.

Lee Gim Hee, "Navigation of a mobile robot in an 
unknown environment ", Thesis for Bachelor of Engi- 
neering, National University of Singapore 2005. 

Lee Gim Hee, Lim Chee Wang, & Marcelo H. Ang Jr. 
(2007). An integrated algorithm for autonomous navi- 
gation of a mobile robot in an unknown environment. 
Third Humanoid, Nanotechnology, Information Tech- 
nology, Communication and Control Environment and 
Management (HNICEM) International Conference. 

Lim Chee Wang. (2002). Motion planning for mobile 
robots. Thesis for Master of Engineering, National 
University of Singapore. 

Lim Chee Wang, Lim Ser Yong, & Marcelo H. Ang 
Jr. (2002). Hybrid of global path planning and local 
navigation implemented on a mobile robot in indoor 
environment. Proceedings of the IEEE International 
Symposium on Intelligent Control, pp. 821-826. 

Liu Chengqing. (2002). Sensor based local path plan- 
ning for mobile robots. Master's Thesis, National 
University of Singapore. 

Liu Chengqing, Marcelo H. Ang Jr, H. Krishnan, &
Lim Ser Yong (2000). Virtual obstacle concept for local
minima recovery in potential field based navigation.
Proceedings of the IEEE Conference on Robotics and
Automation. Vol. 2, pp. 983-988.

Min Gyu Park, & Min Cheol Lee. (2004). Real-time
path planning in unknown environment and a virtual
hill concept to escape local minima. 30th Annual Conference
of IEEE Industrial Electronics Society. Vol. 3,
pp. 2223-2228.

O.Khatib. (1986). Real-time obstacle avoidance for 
manipulators and mobile robots. International Journal 
of Robotic Research. Vol. 5, No. 1, pp.90-98. 

T. Simeon, J.-P. Laumond, & C. Nissoux. (2004). Vis- 
ibility-based probabilistic roadmaps for motion plan- 
ning. Journal of Advanced Robotics, Brill Academic 
Publishers, pp. 477-493. 

Xiaobing Zou, & Zixing Cai. (2004). Incremental 
environment modeling method based on approximate 
Voronoi diagram. Fifth World Congress on Intelligent 
Control and Automation. Vol.5, pp. 4618 - 4622. 

Y. Koren, & J. Borenstein. (1991). Potential field methods
and their inherent limitations for mobile robot navigation.
Proceedings of the IEEE Conference on Robotics
and Automation, pp. 1398-1404.

Zhang Huiliang, & Huang Shell Ying. (2003). Dynamic
map for obstacle avoidance. Proceedings of
IEEE Intelligent Transportation Systems. Vol. 2, pp.
1152-1157.



KEY TERMS 

Global Path Planner: A group of navigation algo- 
rithms for planning an optimal path that connects a point 
of origin to a given goal in a known environment. 

Graph Edge: A graph edge is usually drawn as a
straight line in a graph to connect the nodes. It is used
to represent connectivity between two or more nodes
and may carry additional information such as the
Euclidean distance between the nodes.

Graph Node: A graph node is also known as a graph
vertex. It is a point on which the graph is defined and
may be connected by graph edges.

Hybrid Methods: A group of navigation methods
that combine global path planning and local
navigation algorithms. The objective is to combine the
advantages and eliminate the inherent weaknesses of both
groups of algorithms.

Local Minima: It is also known as relative minima. 
Local minimum refers to a minimum within some 
neighborhood and it may not be a global minimum. 

Local Navigation Methods: A group of naviga- 
tion algorithms that do not require a known map of 
the environment to be provided to the robot. Instead, 
local navigation methods rely on current and local in- 
formation from sensors to give a mobile robot online 
navigation capability. 

Manhattan Distance: The distance between two
points measured along axes at right angles. For example,
given two points $p_1$ and $p_2$ in a two-dimensional plane at
$(x_1, y_1)$ and $(x_2, y_2)$ respectively, the Manhattan distance
between $p_1$ and $p_2$ is given by $|x_1 - x_2| + |y_1 - y_2|$.
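
As a small illustration of the Manhattan distance definition, the formula can be written as a short Python function. The function name and the sample points are chosen here purely for illustration and do not come from the article.

```python
def manhattan_distance(p1, p2):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p1, p2))


# Example with two hypothetical points in the plane:
# |1 - 4| + |2 - 6| = 3 + 4 = 7
print(manhattan_distance((1, 2), (4, 6)))
```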



INTRODUCTION

In addition to the capability to navigate from a point
of origin to a given goal while avoiding all static and
dynamic obstacles, a mobile robot must possess two
further competencies in order to be useful: map building
and localization.

A mobile robot acquires information about its environment
through the process of map building. Maps for
mobile robots are commonly divided into occupancy-grid
maps and topological maps. Occupancy-grid maps seek
to represent the geometric properties of the environment.
Occupancy-grid mapping was first suggested by
Elfes in 1987 and the idea was published in his Ph.D.
thesis (A. Elfes, 1989). Topological mapping
was first introduced in 1985 as an alternative to
occupancy-grid mapping by R. Chatila and J.-P. Laumond
(R. Chatila, & J.-P. Laumond, 1985). Topological maps
describe the connectivity of different locations in the
environment.

The pose of a mobile robot must be known at all
times for it to navigate and build a map accurately.
This is the problem of localization, which was first
described in the late 1980s by R. Smith et al (R. Smith et
al, 1980). Some key algorithms for map building and
localization will be discussed in this article.



BACKGROUND 

Map building is the process of acquiring information of 
the environment via sensory data and representing the 
acquired information in a format that is comprehensible 
to the robot. The acquired map of the environment 
can be used by the robot to improve its performance 
in navigation. 

Localization is the process of finding the pose of
the robot in the environment. It is perhaps the most
important competency that a mobile robot must possess.
This is because the robot must know its pose in
the environment before it can plan its path to the goal
or follow a planned path towards the goal.

In this article, two key approaches to map building,
occupancy-grid and topological mapping, are discussed.
Occupancy-grid and topological maps are two
different methodologies for representing the environment
in a robot's memory. Two key localization methods,
localization with the Kalman filter and with the particle
filter, are also reviewed.



MAP BUILDING 

As seen from the integrated algorithm in Part I of
the article, a mobile robot must be able to acquire maps
of an unknown environment to achieve a higher level of
autonomy. Map building is the process by which sensory
information about the surroundings is made comprehensible
to a mobile robot. In this section, two key approaches
to map building, occupancy-grid and topological
mapping, are discussed.

Occupancy-Grid Maps 

Occupancy-grid maps (H.P. Moravec, 1988; H.P.
Moravec et al, 1989; A. Elfes, 1987; A. Elfes, 1989;
S. Thrun et al, 2005) represent the environment as a
tessellation of grid cells. Each of the grid cells corresponds
to an area in the physical environment and holds
an occupancy value which indicates the probability of
whether the cell is occupied or free. The occupancy
value of the $i$-th grid cell at the current time $t$ is denoted
by $p_{t,i}$. Note that $p_{t,i}$ must be within the range of 0 to 1
following the axioms of probability. $p_{t,i} \in [0, 0.5)$
indicates the confidence level of a cell being empty, where
0 indicates absolute certainty that the cell is empty.
$p_{t,i} \in (0.5, 1]$ indicates the confidence level of a cell being
occupied, where 1 indicates absolute certainty that the
cell is occupied. $p_{t,i} = 0.5$ indicates that the cell is an
unexplored area.

A robot does not have any knowledge of the world
when it is first placed in an unknown environment.
It is therefore intuitive to set $p_{t,i} = 0.5$ for all $i$ at time
$t = 0$. The map is updated via the log odds (S. Thrun et
al, 2005) representation of occupancy. The advantage
of the log odds representation is that it can avoid numerical
instabilities for probabilities near 0 or 1. The $i$-th grid
cell that intercepts the sensor line of sight is updated
according to

$$l_{t,i} = l_{t-1,i} + l_{sensor} \qquad (1)$$

where $l_{t-1,i}$ is the log odds computed from the occupancy
value of the cell at $t-1$:

$$l_{t-1,i} = \log \frac{p_{t-1,i}}{1 - p_{t-1,i}} \qquad (2)$$

$l_{sensor} = l_{occ}$ if the cell corresponds to the sensor
measurement and $l_{sensor} = l_{free}$ if the range to the cell is shorter
than the sensor measurement. The other cells in the
map remain unchanged.

Figure 1(a) illustrates the update process for the
map. The cell that corresponds to the sensor measurement
is shaded black and all the cells that intercept the
sensor measurement beam are shaded white. Figure
1(b) shows a case where the sensor measurement
equals the maximum sensor range and $l_{sensor} = l_{free}$ for all
cells that intercept the sensor beam. This is because
it is assumed that no obstacle is detected if the sensor
measurement equals the maximum sensor range. $l_{occ}$ and
$l_{free}$ are computed from

$$l_{occ} = \log \frac{p_{occ}}{1 - p_{occ}} \quad \text{and} \quad l_{free} = \log \frac{p_{free}}{1 - p_{free}} \qquad (3)$$

where $p_{occ}$ and $p_{free}$ denote the probabilities of the sensor
measurement correctly deducing whether a grid cell
is occupied or empty. The two probabilities must add
up to 1 and their values depend on the accuracy of the
sensor. $p_{occ}$ and $p_{free}$ will have values closer to 1 and 0
respectively for an accurate sensor. The values of $p_{occ}$ and
$p_{free}$ have to be determined experimentally and remain constant
in the map building process.



Figure 1. Updating an occupancy grid map (a) when an obstacle is detected (b) when a maximum range mea- 
surement is detected, i.e. it is assumed that in this case no obstacle is detected 

Figure 2. Occupancy grid map of the corridor along
block EA level 3 in the Faculty of Engineering of the 
National University of Singapore (NUS) 




The occupancy value of a grid cell is easily recovered
from

$$p_{t,i} = 1 - \frac{1}{1 + \exp\{l_{t,i}\}} \qquad (4)$$



Figure 2 shows an occupancy grid map of the corridor 
along block EA level 3 in the Faculty of Engineering of 
the National University of Singapore (NUS) acquired 
with a laser range finder. The black regions denote 
obstacles, white regions denote free space and grey 
regions denote unexplored areas. 
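
The log odds update described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the values P_OCC = 0.7 and P_FREE = 0.3, the dictionary-based grid, and the function names are all hypothetical and not taken from the article, which says only that the two probabilities sum to 1 and are determined experimentally. Ray tracing (finding which cells a beam crosses) is assumed to be done elsewhere.

```python
import math

# Assumed sensor parameters (hypothetical values; they must sum to 1).
P_OCC = 0.7
P_FREE = 0.3


def log_odds(p):
    """Log odds of a probability, as in Equations 2 and 3."""
    return math.log(p / (1.0 - p))


def probability(l):
    """Recover the occupancy value from log odds, as in Equation 4."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))


L_OCC = log_odds(P_OCC)
L_FREE = log_odds(P_FREE)


def update_beam(grid, beam_cells, hit):
    """Apply Equation 1 to every cell crossed by one sensor beam.

    grid       -- dict mapping a cell index to its log odds; a missing
                  cell has log odds 0, i.e. occupancy 0.5 (unexplored)
    beam_cells -- cells ordered from the robot outwards; the last one
                  lies at the measured range
    hit        -- False for a maximum-range reading (no obstacle seen)
    """
    last = len(beam_cells) - 1
    for k, cell in enumerate(beam_cells):
        l_sensor = L_OCC if (hit and k == last) else L_FREE
        grid[cell] = grid.get(cell, 0.0) + l_sensor


# Three identical readings along one beam that ends on an obstacle.
grid = {}
beam = [(0, 0), (1, 0), (2, 0), (3, 0)]
for _ in range(3):
    update_beam(grid, beam, hit=True)

print(round(probability(grid[(3, 0)]), 3))  # end cell moves towards 1
print(round(probability(grid[(0, 0)]), 3))  # crossed cell moves towards 0
```

Cells never touched by a beam keep log odds 0, which corresponds to the occupancy value 0.5 used for unexplored areas.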

Topological Maps 

Unlike occupancy-grid maps, topological maps (D.
Kortenkamp et al, 1994; H. Choset, 1996; H. Choset
et al, 1996) do not attempt to represent the geometric
information of the environment. Instead, topological
maps represent the environment as a graph. An example
of a topological map is shown in Figure 3. Significant
features such as walls, corners, doors or
corridors are represented as nodes $m_i$, and connectivity
between adjacent features is represented as edges $u_{jk}$.
In many topological maps, distances between adjacent
features are also represented by the edges connecting
the nodes. The success of topological maps depends
greatly on the efficiency of feature extraction.
Examples of feature extraction algorithms can be found
in (Martin David Adams, 1999; Sen Zhang et al, 2003;
João Xavier et al, 2005).

Figure 3. Example of a topological map. The features are represented as nodes $m_i$. The connectivity and distance
between features are represented as edges $u_{jk}$

Topological maps are a better choice for mapping if
memory space is a major concern. This is because less
memory is required to store the nodes than
the large number of grid cells in occupancy-grid maps.
The lower memory consumption of the
topological map, however, comes at the cost of
lower accuracy. This is because some important
information, such as the precise location of the free spaces
in the environment, may not be represented in the maps.
The limited accuracy of topological maps thus restricts
the robot's capability for fast and safe navigation.
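
To make the graph representation concrete, the sketch below stores a small topological map as an adjacency dictionary whose edges carry distances, and finds the shortest route between two features. The feature names and distances are invented for illustration, and the use of Dijkstra's algorithm for the route search is an assumption of this sketch rather than something the article prescribes.

```python
import heapq

# Hypothetical topological map: nodes are features, edge values are
# distances between adjacent features (in metres).
topo_map = {
    "door_A": {"corridor_1": 2.0},
    "corridor_1": {"door_A": 2.0, "corner_1": 5.0, "door_B": 3.5},
    "corner_1": {"corridor_1": 5.0, "door_B": 4.0},
    "door_B": {"corridor_1": 3.5, "corner_1": 4.0},
}


def shortest_route(graph, start, goal):
    """Dijkstra's algorithm over the feature graph.

    Returns (total distance, list of nodes), or (inf, []) if the goal
    cannot be reached from the start.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, edge_length in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(
                    queue, (dist + edge_length, neighbour, path + [neighbour])
                )
    return float("inf"), []


distance, route = shortest_route(topo_map, "door_A", "corner_1")
print(distance, route)
```

Only four nodes and their edges are stored here, which reflects the memory advantage discussed above; the price is that nothing is known about the free space between the features.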



LOCALIZATION

Most mobile robots localize their pose $x_t$ with respect to
a given map based on odometry readings. Unfortunately,
wheel slippage and drift cause incremental localization
errors (J. Borenstein et al, 1995; J. Borenstein et
al, 1996). These errors cause the mobile robot to lose
track of its own pose and hence lose the ability to
navigate autonomously from one given point in the map
to another. The solution to the localization problem is
to make use of information about the environment from
additional sensors. Examples of sensors used are the laser
range finder and the sonar sensor, which measure the distance
between the robot and the nearest obstacles in the
environment. The extended Kalman filter (EKF) and the
particle filter are two localization algorithms that use
odometry and additional sensory data of the environment
to localize a mobile robot. Both algorithms are
probabilistic methods that allow uncertainties from the
robot pose estimate and sensor readings to be accounted
for in a principled way.

Localization with Extended Kalman Filter

The EKF (John J. Leonard et al, 1991; A. Kelly, 1994; G.
Welch et al, 1995; Martin David Adams, 1999; S. Thrun
et al, 2005) is perhaps the most established algorithm
for localization of mobile robots because of its robustness
and efficiency. The EKF is a recursive algorithm
for estimating the pose of the robot with noisy sensor
readings. A key feature of the EKF is that it maintains
a posterior belief $bel(x_t)$ of the pose estimate, which
follows a Gaussian distribution, represented by a mean
$x_t$ and covariance $P_t$. The mean $x_t$ represents the most
likely pose of the robot at time $t$ and the covariance $P_t$
represents the error covariance of this estimate. The EKF
consists of two steps: the prediction and update steps.
In the prediction step, the predicted belief $\overline{bel}(x_t)$ is
first computed using a motion model which describes
the state dynamics of the robot. $\overline{bel}(x_t)$ is subsequently
transformed into $bel(x_t)$ by incorporating the sensor
measurements in the update step.

As mentioned above, the predicted belief $\overline{bel}(x_t)$, which
is represented by the predicted mean $\bar{x}_t$ and covariance
$\bar{P}_t$, is computed in the prediction step given by

$$\bar{x}_t = f(x_{t-1}, u_t) \qquad (5)$$

$$\bar{P}_t = F_t P_{t-1} F_t^T + Q_t \qquad (6)$$

where $f(.)$ is the motion model of the mobile robot, $F_t$
is the Jacobian of $f(.)$ evaluated at $x_{t-1}$, $Q_t$ is the covariance
of the motion model and $u_t$ is the control data of
the robot.

$\overline{bel}(x_t)$ is subsequently transformed into $bel(x_t)$ by
incorporating the sensor measurement $z_t$ in the update
step of the EKF shown in Equations 7, 8 and 9.

$$K_t = \bar{P}_t H_t^T (H_t \bar{P}_t H_t^T + R_t)^{-1} \qquad (7)$$

$$x_t = \bar{x}_t + K_t (z_t - h(\bar{x}_t, m)) \qquad (8)$$

$$P_t = (I - K_t H_t) \bar{P}_t \qquad (9)$$

$K_t$, computed in Equation 7, is called the Kalman gain.
It specifies the degree to which $z_t$ should be incorporated
into the new pose estimate. Equation 8 computes $x_t$ by
adjusting $\bar{x}_t$ in proportion to $K_t$ and the deviation of $z_t$
from the predicted measurement $h(\bar{x}_t, m)$. It is important
to note that the sensor measurement $z_t = [z_t^1 \; z_t^2 \; \ldots]^T$
refers to the coordinates of a set of observed landmarks
instead of the raw sensor readings, and the sensor
measurement model $h(.)$ gives the predicted measurement
from the given topological map $m$ and $\bar{x}_t$. $H_t$ is the
Jacobian of $h(.)$ evaluated at $\bar{x}_t$, and $R_t$ is the covariance
of the sensor measurement noise. Finally, the covariance
$P_t$ of the posterior belief $bel(x_t)$ is computed in Equation
9 by adjusting for the information gain resulting from
the sensor measurements.
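
The prediction and update steps can be illustrated with a deliberately simplified, one-dimensional sketch in which every matrix in Equations 5 to 9 reduces to a scalar. The scenario is an assumption made for this example only: a robot moves along a line, its motion model is f(x, u) = x + u, and it measures the range to a single landmark that sits at a known position and a fixed sideways offset. The noise values Q and R, the 10% odometry drift, and all names are hypothetical.

```python
import math


def ekf_step(x_prev, P_prev, u, z, landmark, offset, Q, R):
    """One scalar EKF iteration for a robot moving along a line.

    Assumed models (illustrative only):
      motion       f(x, u) = x + u
      measurement  h(x)    = sqrt((landmark - x)^2 + offset^2)
    """
    # Prediction step (Equations 5 and 6); F = df/dx = 1 here.
    x_pred = x_prev + u
    F = 1.0
    P_pred = F * P_prev * F + Q

    # Predicted measurement and its Jacobian at the predicted mean.
    dx = landmark - x_pred
    z_pred = math.hypot(dx, offset)
    H = -dx / z_pred

    # Update step (Equations 7, 8 and 9).
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - z_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new


# The robot really moves 1.0 per step, but odometry reports 1.1.
true_x, x_est, P_est = 0.0, 0.0, 0.01
odometry_only = 0.0
landmark, offset = 10.0, 2.0
for _ in range(5):
    true_x += 1.0
    u = 1.1
    odometry_only += u
    z = math.hypot(landmark - true_x, offset)  # noise-free range reading
    x_est, P_est = ekf_step(x_est, P_est, u, z, landmark, offset, Q=0.05, R=0.01)

print("true:", true_x, "odometry only:", odometry_only, "EKF:", x_est)
```

In this toy run the odometry-only estimate drifts steadily away from the true position, while the EKF estimate is pulled back by the range measurement at every step.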

Localization with Particle Filter 

In recent years, there has been increasing interest in the
use of the particle filter (S. Thrun et al, 2001; C. Kwok et
al, 2002; D. Fox et al, 2003; Ioannis M. Rekleitis, 2004;
S. Thrun et al, 2005) over the EKF for robot localization.
This increased interest is likely due to four reasons.
First, raw sensor measurements of the environment
are used in particle filter localization, whereas EKF
localization requires feature extraction. Second, the
particle filter is more robust because, unlike the EKF, it
does not assume a Gaussian distribution for the posterior
belief $bel(x_t)$. Third, the particle filter is able to recover
from localization failure. Localization failure occurs
if the robot suddenly loses track of its pose during
the localization process. Localization failure is also
known as the kidnapped robot problem. Fourth, unlike the
EKF, there is no need to derive complicated Jacobians
for the particle filter.

The intuition behind the particle filter is to represent
the posterior belief $bel(x_t)$ by a finite sample set of $M$
weighted particles. This sample set is drawn according
to $bel(x_t)$. The particle set is denoted by

$$\chi_t = \{\chi_t^{[1]}, \chi_t^{[2]}, \ldots, \chi_t^{[M]}\} \quad \text{where} \quad \chi_t^{[m]} = [x_t^{[m]} \;\; w_t^{[m]}]^T \qquad (10)$$

denotes the $m$-th particle. Here, $x_t^{[m]}$ is a random variable
that represents a hypothesized state and $w_t^{[m]}$ is a
non-negative value called the importance factor, which
represents the weight of each particle. Similar to the
EKF, the particle filter consists of the prediction and
update steps. In the prediction step, samples of the
particles are drawn from a motion model of the robot
to represent the predicted belief $\overline{bel}(x_t)$. The particles
are then weighted according to the sensor measurements
in the update step. Finally, $\overline{bel}(x_t)$ is transformed into
the posterior belief $bel(x_t)$ by resampling the particles
according to their weights.

Table 1 shows an iteration of the recursive particle
filter algorithm for localization. The inputs to the particle
filter are the set of particles representing the previous
state belief $\chi_{t-1}$, the most recent control action $u_t$ and
measurement data $z_t$. Line 3 is the prediction step that
generates the hypothetical state $x_t^{[m]}$ by sampling from
the motion model $p(x_t | u_t, x_{t-1}^{[m]})$ of the robot. The set of
particles obtained after $M$ iterations represents $\overline{bel}(x_t)$.
Line 4 computes $w_t^{[m]}$ from the sensor measurement
model. The importance factor accounts for the mismatch
between $\overline{bel}(x_t)$ and $bel(x_t)$. Finally, the resampling
process from line 7 to 9 draws with replacement $M$
particles from the temporary set $\bar{\chi}_t$ with a probability
proportional to the importance factors. The distribution
of $\overline{bel}(x_t)$ is transformed into $bel(x_t)$ by incorporating
the importance factors in the resampling process.

Table 1. Pseudo algorithm for mobile robot localization with particle filter

1. $\bar{\chi}_t = \chi_t = \emptyset$;
2. for m = 1 to M do
3.     generate random sample of $x_t^{[m]}$ from $p(x_t | u_t, x_{t-1}^{[m]})$;
4.     $w_t^{[m]} = p(z_t | x_t^{[m]}, m)$;
5.     $\bar{\chi}_t^{[m]} = [x_t^{[m]} \;\; w_t^{[m]}]^T$;
6. end;
7. for m = 1 to M do
8.     draw $\chi_t^{[m]}$ from $\bar{\chi}_t$ with probability proportional to $w_t^{[1]}, w_t^{[2]}, \ldots, w_t^{[M]}$;
9. end;
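
The iteration in Table 1 can be sketched in Python for a simplified one-dimensional case. The setting is assumed for illustration only: the robot moves along a line and measures its distance to a single landmark at a known position; the motion and sensor models are Gaussian with hypothetical standard deviations, and all function and variable names are invented here. The comments refer to the line numbers of Table 1.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Density of a normal distribution, used as the sensor model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
        sigma * math.sqrt(2.0 * math.pi)
    )


def particle_filter_step(particles, u, z, landmark,
                         motion_sigma, sensor_sigma, rng):
    """One iteration of Table 1 for a robot on a line.

    particles -- list of hypothesized positions (the previous set)
    u         -- odometry displacement since the last step
    z         -- measured distance to the landmark
    """
    states, weights = [], []                        # line 1
    for x_prev in particles:                        # lines 2 to 6
        # Line 3: sample from the (assumed Gaussian) motion model.
        x = x_prev + u + rng.gauss(0.0, motion_sigma)
        # Line 4: importance factor from the sensor model.
        w = gaussian(z, abs(landmark - x), sensor_sigma)
        states.append(x)                            # line 5
        weights.append(w)
    if sum(weights) == 0.0:
        # All weights underflowed; fall back to uniform resampling.
        weights = [1.0] * len(states)
    # Lines 7 to 9: draw M particles with replacement, with
    # probability proportional to the importance factors.
    return rng.choices(states, weights=weights, k=len(states))


rng = random.Random(0)
particles = [rng.uniform(-0.1, 0.1) for _ in range(500)]
true_x, landmark = 0.0, 10.0
for _ in range(5):
    true_x += 1.0            # actual motion
    u = 1.1                  # odometry over-reports the motion
    z = landmark - true_x    # noise-free distance reading
    particles = particle_filter_step(
        particles, u, z, landmark,
        motion_sigma=0.2, sensor_sigma=0.1, rng=rng,
    )

estimate = sum(particles) / len(particles)
print("true:", true_x, "estimate:", estimate)
```

Because resampling already encodes the weights in the particle density, the pose estimate here is simply the mean of the resampled particles.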



Figure 4(a) to (d) shows an implementation result of
a robot localizing itself in a corridor. The particle set is
initialized to the known initial pose of the robot shown
in Figure 4(a). The particles are initialized uniformly
within a circle of radius 100 mm centered at the initial
position of the robot. The orientation of
the particles is also initialized uniformly within ±5° of
the initial orientation of the robot. This is to allow for
possible errors in estimating the initial pose of the
robot. Figure 4(b) to 4(d) show that the error from the
odometry grows as the robot travels a greater distance.
If the robot relied solely on the odometry readings, it
would believe that it is traveling through occupied space,
which is obviously wrong. It is apparent that the particle filter
gives a more reasonable pose estimate because the robot
is always moving within the free space.

Figure 4. Implementation of the particle filter to solve the localization problem. Notice that the error from the
odometry grows as the robot travels a greater distance

It was mentioned earlier that the particle filter is
able to recover from localization failure. An example
of localization failure is when the robot is pushed by
a human, resulting in a mismatch between the true and
estimated pose of the robot. Fortunately, the problem
can be detected by observing the total weight of
the particles after each iteration. Localization failures will
cause sharp drops in the total weight of the particles.
The particles are re-initialized uniformly in the free
space after a sharp drop in the total weight
is detected. The particles will then eventually converge
to the true pose of the robot.
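
The recovery strategy described above can be sketched as two small helper functions. Everything specific in this sketch is an assumption: the article does not say how a "sharp drop" is quantified, so the comparison against a fraction of a recent average (drop_ratio) is a hypothetical choice, as are the unit-sized free cells and the function names.

```python
import math
import random


def localization_failed(total_weight, recent_average, drop_ratio=0.1):
    """Flag a failure when the total particle weight falls far below
    its recent average. The threshold drop_ratio is an assumed value."""
    return total_weight < drop_ratio * recent_average


def reinitialize_uniformly(free_cells, num_particles, rng):
    """Spread particles uniformly over the free space.

    free_cells -- list of (column, row) indices of unit-sized free cells
    Returns a list of (x, y, heading) pose hypotheses.
    """
    particles = []
    for _ in range(num_particles):
        cx, cy = rng.choice(free_cells)
        x = cx + rng.random()            # random position inside the cell
        y = cy + rng.random()
        heading = rng.uniform(-math.pi, math.pi)
        particles.append((x, y, heading))
    return particles


rng = random.Random(1)
free_cells = [(0, 0), (1, 0), (2, 0), (2, 1)]
if localization_failed(total_weight=0.5, recent_average=100.0):
    particles = reinitialize_uniformly(free_cells, 200, rng)
    print("re-initialized", len(particles), "particles")
```

After re-initialization the ordinary prediction, weighting and resampling steps continue unchanged, and the particles are expected to concentrate around the true pose again.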

The particle filter is a powerful algorithm for solving
the localization problem. However, it must be noted
that the number of particles used to represent beliefs
is an important parameter for the efficiency of the particle
filter in recovering from localization failures. A large
number of particles is necessary to recover from localization
failures in large environments, and in many cases
the maximum number of particles is restricted by the
available computing resources. This problem is also
known as the curse of dimensionality.



CONCLUSION 

A mobile robot has to possess three competencies to 
achieve full autonomy: navigation, map building and 
localization. Over the years, many algorithms have 
been proposed and implemented with notable success 
to give mobile robots all three competencies. Some
of the key algorithms such as the navigation function, 
roadmaps, artificial potential field, vector field histo- 
gram, hybrid navigation and the integrated algorithm 
for navigation; occupancy grid and topological based 
mapping; as well as the Kalman filter and particle 
filter for localization are reviewed in both Part I and 
II of this article. 



FUTURE TRENDS 

While the navigation, map building and localization 
algorithms are implemented with notable success, the 
scale and structure of the environments for these algo- 
rithms to work are limited. Hence, the future challenges 
for mobile robot autonomy lie in implementing
the algorithms in larger-scale and more complex
environments such as cities or jungles.



KEY TERMS

Curse of Dimensionality: This term was first 
used by Richard Bellman. It refers to the problem of 
exponential increase in volume associated with adding 
extra dimensions to a mathematical space. 

Gaussian Distribution: It is also known as normal 
distribution. It is a family of continuous probability dis- 
tributions where each member of the family is described 
by two parameters: mean and variance. This form of 
distribution is used by the localization with extended 
Kalman filter algorithm to describe the posterior belief 
distribution of the robot pose. 

Jacobians: The Jacobian is the matrix of all first-order
partial derivatives of a vector-valued function. Its
importance lies in the
fact that it represents the best linear approximation to
a differentiable function near a given point.

Odometry: A method of position estimation for
a wheeled vehicle during navigation by counting the
number of revolutions taken by the wheels that are in
contact with the ground.

Posterior Belief: It refers to the probability dis- 
tribution of the robot pose estimate conditioned upon 
information such as control and sensor measurement 
data. The extended Kalman filter and particle filter 
are two different methods for computing the posterior 
belief. 

Predicted Belief: It is also known as the prior belief . 
It refers to the probability distribution of the robot pose 
estimate interpreted from the known control data and 
in the absence of the sensor measurement data. 

Recursive Algorithm: It refers to a type of computer
function that is applied within its own definition. The
extended Kalman filter and particle filter are recursive
algorithms because the outputs from the filters at the
current time step are used as inputs in the next time
step.



INTRODUCTION



In recent years there has been a surge of interest
in applications of modal logics to the specification and
validation of complex systems. This holds in particular
for combined logics of knowledge, time and actions
for reasoning about multiagent systems (Dixon, Nalon
& Fisher, 2004; Fagin, Halpern, Moses & Vardi, 1995;
Halpern & Vardi, 1986; Halpern, van der Meyden &
Vardi, 2004; van der Hoek & Wooldridge, 2002; Lomuscio
& Penczek, 2003; van der Meyden & Shilov,
1999; Shilov, Garanina & Choe, 2006; Wooldridge,
2002). In the next paragraph we explain what logics
of knowledge, time and actions are from the viewpoint
of mathematicians and philosophers. This provides a
historical perspective and a scientific context for these
logics.

For mathematicians and philosophers, logics of actions,
time, and knowledge can be introduced in a few
sentences. A logic of actions (e.g., Elementary Propositional
Dynamic Logic (Harel, Kozen & Tiuryn, 2000))
is a polymodal variant of the basic modal logic K (Bull
& Segerberg, 2001) interpreted over arbitrary
Kripke models. A logic of time (e.g., Linear Temporal
Logic (Emerson, 1990)) is a modal logic with a number
of modalities that correspond to "next time", "always",
"sometimes", and "until", interpreted in Kripke
models over partial orders (discrete linear orders for
LTL in particular). Finally, a logic of knowledge, or
epistemic logic (e.g., Propositional Logic of Knowledge
(Fagin, Halpern, Moses & Vardi, 1995; Rescher, 2005)),
is a polymodal variant of another basic modal logic, S5
(Bull & Segerberg, 2001), interpreted over Kripke
models where all binary relations are equivalences.



BACKGROUND: MODAL LOGICS

All modal logics are languages that are characterized
by syntax and semantics. Let us define a very
simple modal logic in this way. This logic is called
Elementary Propositional Dynamic Logic (EPDL).

Let true, false be Boolean constants, and let Prp and Rel
be disjoint sets of propositional and relational variables
respectively. The syntax of classical propositional
logic consists of formulas which are constructed from
propositional variables and the Boolean connectives "$\neg$"
(negation), "&" (conjunction), "$\vee$" (disjunction), "$\to$"
(implication), and "$\leftrightarrow$" (equivalence) in accordance
with the standard rules. EPDL has additional formula
constructors, modalities, which are associated with
relational variables: if $r$ is a relational variable and $\varphi$
is a formula of EPDL then

$([r]\varphi)$ is a formula which is read as "box $r$-$\varphi$" or
"after $r$ always $\varphi$";

$(\langle r \rangle \varphi)$ is a formula which is read as "diamond
$r$-$\varphi$" or "after $r$ sometimes $\varphi$".

The semantics of EPDL is defined in models, which
are called labeled transition systems by computer
scientists and Kripke models¹ by mathematicians
and philosophers. A model $M$ is a pair $(D, I)$ where
the domain (or the universe) $D \neq \emptyset$ is a set, while the
interpretation $I$ is a pair of mappings $(P, R)$. Elements
of the domain $D$ are called states by computer scientists
and worlds by mathematicians and philosophers. The
interpretation maps propositional variables to sets of
states, $P: Prp \to 2^D$, and relational variables to binary
relations on states, $R: Rel \to 2^{D \times D}$. We write $I(p)$ and
$I(r)$ instead of $P(p)$ and $R(r)$ whenever it is implicit
that $p$ and $r$ are propositional and relational variables
respectively.

Every model $M = (D, I)$ can be viewed as a directed
graph with nodes and edges labeled by propositional
and relational variables respectively. Its nodes are the states
of $D$. A node $s \in D$ is marked by a propositional variable
$p \in Prp$ iff $s \in I(p)$. A pair of nodes $(s_1, s_2) \in D \times D$ is
an edge of the graph iff $(s_1, s_2) \in I(r)$ for some relational
variable $r \in Rel$; in this case the edge $(s_1, s_2)$ is marked
by this relational variable $r$. Conversely, a graph with
nodes and edges labeled by propositional and relational
variables respectively can be considered as a model.

For every model $M = (D, I)$ the entailment (validity,
satisfiability) relation $\models_M$ between states and formulas
can be defined by induction on formula structure:

for every state $s$, $s \models_M true$ and not $s \models_M false$;
for any state $s$ and propositional variable $p$, $s \models_M p$
iff $s \in I(p)$;
for any state $s$ and formula $\varphi$, $s \models_M (\neg\varphi)$ iff it is
not the case that $s \models_M \varphi$;
for any state $s$ and formulas $\varphi$ and $\psi$,
$s \models_M (\varphi \,\&\, \psi)$ iff $s \models_M \varphi$ and $s \models_M \psi$;
for any state $s$, relational variable $r$, and formula $\varphi$,
$s \models_M ([r]\varphi)$ iff $s' \models_M \varphi$ for every state $s'$
such that $(s, s') \in I(r)$;
$s \models_M (\langle r \rangle \varphi)$ iff $(s, s') \in I(r)$ and $s' \models_M \varphi$
for some state $s'$.



Semantics of the above kind is called possible 
worlds semantics. 

Let us explain EPDL pragmatics by the following 
puzzle example. 

Alice and Bob play the Number Game. Positions in
the game are integers in [1..109]. An initial position
is a random number. Alice and Bob make alternating
moves: Alice, Bob, Alice, Bob, etc. The available moves
are the same for both: if the current position is $n \in [1..99]$
then $(n+1)$ and $(n+10)$ are the possible next positions. A
player wins the game iff the opponent is the first to
enter [100..109]. Problem: Find all initial positions
where Alice has a winning strategy.

The Kripke model for the game is quite obvious:

States correspond to game positions, i.e. integers in [1..109].
Propositional variable fail is interpreted by [100..109].
Relational variable move is interpreted by possible moves.

The formula $\neg fail \,\&\, \langle move \rangle (\neg fail \,\&\, [move] fail)$ is valid
in those states where the game is not lost, there exists
a move after which the game is not lost, and then all
possible moves lead to a loss in the game. Hence
this EPDL formula is valid in those states where Alice
has a 1-round winning strategy against Bob.
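
The possible worlds semantics and the Number Game model can be checked mechanically with a short Python sketch. The encoding is an assumption of this example, not part of the article: formulas are nested tuples such as ("box", "move", f), and a model is a triple of a domain, a dictionary for propositional variables and a dictionary for relational variables. The function sat returns the set of states in which a formula is valid, following the inductive definition of the entailment relation clause by clause.

```python
def sat(model, formula):
    """Return the set of states of `model` in which `formula` is valid.

    model   -- (domain, props, rels): props maps a propositional
               variable to a set of states, rels maps a relational
               variable to a set of (s, s') pairs
    formula -- nested tuple: ("true",), ("false",), ("var", p),
               ("not", f), ("and", f, g), ("box", r, f), ("dia", r, f)
    """
    domain, props, rels = model
    op = formula[0]
    if op == "true":
        return set(domain)
    if op == "false":
        return set()
    if op == "var":
        return set(props[formula[1]])
    if op == "not":
        return set(domain) - sat(model, formula[1])
    if op == "and":
        return sat(model, formula[1]) & sat(model, formula[2])
    if op in ("box", "dia"):
        target = sat(model, formula[2])
        successors = {s: set() for s in domain}
        for s, s2 in rels[formula[1]]:
            successors[s].add(s2)
        if op == "box":
            # every r-successor satisfies the subformula
            return {s for s in domain if successors[s] <= target}
        # some r-successor satisfies the subformula
        return {s for s in domain if successors[s] & target}
    raise ValueError("unknown connective: %r" % (op,))


# Kripke model of the Number Game.
domain = range(1, 110)
props = {"fail": set(range(100, 110))}
rels = {"move": {(n, n + 1) for n in range(1, 100)}
                | {(n, n + 10) for n in range(1, 100)}}
game = (domain, props, rels)

not_fail = ("not", ("var", "fail"))
one_round_win = ("and", not_fail,
                 ("dia", "move",
                  ("and", not_fail, ("box", "move", ("var", "fail")))))

winning = sorted(sat(game, one_round_win))
print(winning)  # positions 89 and 98
```

From 89 or 98 Alice can move to 99, after which both of Bob's moves (to 100 or 109) enter [100..109], which is exactly what the formula expresses.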



COMBINING KNOWLEDGE, ACTIONS 
AND TIME 

Logic of Knowledge 

Logics of knowledge are also known as epistemic logics. One of the
simplest epistemic logics is the Propositional Logic of Knowledge for
n > 0 agents ($PLK_n$) (Fagin, Halpern, Moses & Vardi, 1995). A special
terminology, notation and Kripke models are used in this framework.
The set of relational symbols Rel in $PLK_n$ consists of the natural
numbers [1..n] representing names of agents. Notation for modalities
is: if $i \in [1..n]$ and $\varphi$ is a formula, then $(K_i \varphi)$
and $(S_i \varphi)$ are used instead of $([i]\varphi)$ and
$(\langle i \rangle\varphi)$. These formulas are read as "(an agent) i
knows $\varphi$" and "(an agent) i can suppose $\varphi$". For every
agent $i \in [1..n]$ in every model M = (D, I), the interpretation I(i)
is an "indistinguishability relation", i.e. an equivalence relation
(see endnote 2) between states that the agent i cannot distinguish.
Every model M where all agents are interpreted in this way is denoted
as $(D, \sim_1, \ldots, \sim_n, I)$ with explicit $I(1) = \sim_1$, ...,
$I(n) = \sim_n$ instead of the brief standard notation (D, I). An agent
knows some "fact" $\varphi$ in a state s of a model M if the fact is
valid in every state s' of this model that the agent cannot
distinguish from s:



  $s \models_M (K_i \varphi)$ iff $s' \models_M \varphi$ for every state $s' \sim_i s$.



Similarly, an agent can suppose a "fact" $\varphi$ in a state
s of a model M, if the fact is valid in some state s' of
this model that the agent cannot distinguish from s:

  $s \models_M (S_i \varphi)$ iff $s' \models_M \varphi$ for some state $s' \sim_i s$.

The above possible worlds semantics of knowledge 
is due to pioneering research (Hintikka, 1962). 
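
The two epistemic clauses can be illustrated on a small finite model. This is a minimal sketch under stated assumptions: the six-state model, the parity-based indistinguishability relation and the function names are hypothetical and chosen only for illustration.

```python
def knows(states, same, fact, s):
    """K_i fact: the fact holds in every state indistinguishable from s."""
    return all(fact(t) for t in states if same(s, t))

def can_suppose(states, same, fact, s):
    """S_i fact: the fact holds in some state indistinguishable from s."""
    return any(fact(t) for t in states if same(s, t))

# Hypothetical model: states 0..5; the agent observes only the parity of
# a state, so states of equal parity are indistinguishable. Equality of
# parity is reflexive, symmetric and transitive, i.e. an equivalence.
states = range(6)
same_parity = lambda s, t: s % 2 == t % 2
below_five = lambda t: t < 5

print(knows(states, same_parity, below_five, 0))        # True
print(knows(states, same_parity, below_five, 1))        # False
print(can_suppose(states, same_parity, below_five, 1))  # True
```

In state 1 the agent cannot rule out state 5, so it does not know that the state is below five, although it can still suppose so.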

Temporal Logic with Actions 

Another propositional polymodal logic is Computation
Tree Logic with actions (Act-CTL). Act-CTL
is a variant of the basic propositional branching-time
temporal logic Computation Tree Logic (CTL)
(Emerson, 1990; Clarke, Grumberg & Peled, 1999).
In Act-CTL the set of relational symbols consists of
action symbols Act. Each action symbol can be interpreted
by an "instant action" that is executable in one
indivisible moment of time.

Act-CTL notation for basic modalities is: if $b \in Act$
and $\varphi$ is a formula, then $(A_b X \varphi)$ and $(E_b X \varphi)$
are used instead of $([b]\varphi)$ and $(\langle b \rangle\varphi)$.
But the syntax of Act-CTL also has some other special constructs
associated with action symbols: if $b \in Act$ and $\varphi$ and $\psi$
are formulas, then $(A_b G \varphi)$, $(A_b F \varphi)$, $(E_b G \varphi)$,
$(E_b F \varphi)$, $A_b(\varphi\ U\ \psi)$ and $E_b(\varphi\ U\ \psi)$
are also formulas of Act-CTL. In formulas of Act-CTL the prefix "A"
is read as "for every future", "E" as "for some future", the suffix
"X" as "next state", "G" as "always" or "globally", "F" as "sometimes"
or "future", the infix "U" as "until", and a sub-index "b"
is read as "in b-run(s)".

We have already explained the semantics of $(A_b X \varphi)$ and
$(E_b X \varphi)$ by reference to $([b]\varphi)$ and
$(\langle b \rangle\varphi)$. Constructs "$A_b G$", "$A_b F$", "$E_b G$",
and "$E_b F$" can be expressed in terms of "$A_b(\ldots U \ldots)$" and
"$E_b(\ldots U \ldots)$", for example:
$(E_b F \varphi) \leftrightarrow E_b(true\ U\ \varphi)$. Thus let us define
below the semantics of "$A_b(\ldots U \ldots)$" and "$E_b(\ldots U \ldots)$"
only. Let M = (D, I) be a model. If $b \in Act$ is an action symbol, then
a partial b-run is a sequence of states $s_0, \ldots, s_k, s_{k+1}, \ldots \in D$
(maybe infinite) such that $(s_k, s_{k+1}) \in I(b)$ for every
consecutive pair of states within this sequence. If $b \in Act$ is an
action symbol, then a b-run is an infinite partial b-run or a finite
partial b-run that cannot be continued (see endnote 3). Then the
semantics of the constructs "$A_b(\ldots U \ldots)$" and
"$E_b(\ldots U \ldots)$" can be defined as follows:

  $s \models_M A_b(\varphi\ U\ \psi)$ iff for every b-run $s_0, \ldots, s_k, \ldots$ that
  starts in s (i.e. $s_0 = s$) there exists some $n \geq 0$ for
  which $s_n \models_M \psi$ and $s_k \models_M \varphi$ for every $k \in [0..(n-1)]$;
  $s \models_M E_b(\varphi\ U\ \psi)$ iff for some b-run $s_0, \ldots, s_k, \ldots$ that
  starts in s (i.e. $s_0 = s$) there exists some $n \geq 0$ for
  which $s_n \models_M \psi$ and $s_k \models_M \varphi$ for every $k \in [0..(n-1)]$.



The standard branching-time temporal logic CTL
can be treated as Act-CTL with a single implicit action
symbol.
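
On a finite model, the two "until" constructs can be computed as least fixpoints, which is the usual approach in CTL model checking. The sketch below is an illustration under assumptions, not the algorithm of the cited papers: the function name until, the dictionary succ (the interpretation I(b) of one action symbol b) and the four-state example are hypothetical.

```python
def until(states, succ, phi, psi, every_run):
    """States satisfying A_b(phi U psi) if every_run, else E_b(phi U psi).

    succ[s] lists the b-successors of s. A state without successors ends
    a b-run, so it satisfies either construct only if psi holds in it.
    """
    z = {s for s in states if psi(s)}
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in z or not phi(s):
                continue
            nxt = succ[s]
            if every_run:
                ok = bool(nxt) and all(t in z for t in nxt)
            else:
                ok = any(t in z for t in nxt)
            if ok:
                z.add(s)
                changed = True
    return z

# Hypothetical model: 0 -> 1 -> 2, 0 -> 3, and 3 loops on itself forever.
states = [0, 1, 2, 3]
succ = {0: [1, 3], 1: [2], 2: [], 3: [3]}
true = lambda s: True
goal = lambda s: s == 2

some_run = until(states, succ, true, goal, every_run=False)  # E_b(true U goal)
all_runs = until(states, succ, true, goal, every_run=True)   # A_b(true U goal)
print(sorted(some_run), sorted(all_runs))  # [0, 1, 2] [1, 2]
```

State 0 satisfies the existential construct through the run 0, 1, 2, but not the universal one, because the run 0, 3, 3, ... never reaches the goal.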

Combined Logic of Knowledge, Actions 
and Time 

There are many combined polymodal logics for
reasoning about multiagent systems. Perhaps the most
advanced is Belief-Desire-Intention (BDI) logic
(Wooldridge, 1996; Wooldridge, 2002). An agent's
beliefs correspond to information the agent has about 
the world. (This information may be incomplete or 
incorrect. An agent's knowledge in BDI is just a true 
belief.) An agent's desires correspond to the allocated 
tasks. An agent's intentions represent desires that it 
has committed to achieving. Admissible actions are 
actions of individual agents; they may be constructed 
from primitive actions by means of composition, non- 
deterministic choice, iteration, and parallel execution. 
But semantics of BDI and reasoning in BDI are quite 
complicated for a short encyclopedia article. 

In contrast, let us discuss below a simple example
of a combined logic of knowledge, actions and time,
namely the Propositional Logic of Knowledge and
Branching Time for n > 0 agents, Act-CTL-$K_n$ (Garanina,
Kalinina & Shilov, 2004; Shilov, Garanina &
Choe, 2006; Shilov & Garanina, 2006). First we provide
a formal definition of Act-CTL-$K_n$, then discuss some
pragmatics, and then, in the next section, introduce
model checking as a reasoning mechanism.

Let [1..n] be a set of agents (n > 0), and Act be a
finite alphabet of action symbols. The syntax of Act-CTL-$K_n$
admits epistemic modalities $K_i$ and $S_i$ for every
$i \in [1..n]$, and branching-time constructs $A_b X$, $E_b X$, $A_b G$,
$E_b G$, $A_b F$, $E_b F$, $A_b(\ldots U \ldots)$, and $E_b(\ldots U \ldots)$
for every $b \in Act$. Semantics is defined in terms of entailment in
environments. An (epistemic) environment is a tuple
$E = (D, \sim_1, \ldots, \sim_n, I)$ such that $(D, \sim_1, \ldots, \sim_n)$
is a model for $PLK_n$, and (D, I) is a model for Act-CTL. The
entailment relation $\models_E$ is defined by induction according to
the standard definition for propositional connectives
(see semantics of EPDL), and the above definitions of
epistemic modalities and branching time constructs.

We are mostly interested in trace-based perfect recall
synchronous environments generated from background
finite environments. "Generated" means that possible
"worlds" are runs of finite-state machine(s). There
are several ways to define the semantics of
combined logics on runs. In particular, there are two
extreme cases: Forgetful Asynchronous Systems
(FAS) and Synchronous systems with Perfect Recall
(PRS). "Perfect recall" means that every agent has a
log-file with all his/her observations along a run, while
"forgetful" means that information of this kind is not
available. "Synchronous" means that every agent can
distinguish runs of different lengths, while "asynchronous"
means that some runs of different lengths may
be indistinguishable.

It is quite natural that in the FAS case the combined
logic Act-CTL-$K_n$ can express as much as it can express
in the background finite system. In contrast, in the
PRS case Act-CTL-$K_n$ becomes much more expressive
than in the background finite environment. The importance
of combined logics in the framework of trace-based
semantics with synchronous perfect recall relies upon
their characteristic as logics of an agent's learning or
knowledge acquisition. We would like to illustrate this
characteristic by the following single-agent (see endnote 4)
Fake Coin Puzzle FCP(N,M).

A set consists of (N+1) enumerated coins. The last coin
is a valid one. A single coin with a number in [1..N] is
fake, but the other coins with numbers in [1..(N+1)] are
valid. All valid coins have the same weight, which differs
from the weight of the fake. Is it possible to identify the
fake by balancing coins M times at most?

In FCP(N,M) the agent (i.e. a person who has to
solve the puzzle) knows neither the number of the
fake coin, nor whether it is lighter or heavier than the valid
coins. Nevertheless, this number is in [1..N], and the
fake coin is either lighter (l) or heavier (h). The agent
can make balancing queries and read balancing results
after each query. Every balancing query is an action
$b_{(L,R)}$ which consists in balancing two disjoint sets of
coins: with numbers $L \subseteq [1..N+1]$ on the left pan, and
with numbers $R \subseteq [1..N+1]$ on the right pan, |L| = |R|.
There are three possible balancing results: "<", ">",
and "=", which mean that the left pan is lighter than, heavier
than, or equal to the right pan, respectively. Of course,
there are initial states (marked by ini) which represent
a situation when no query has been made.

Let us summarize. The agent acts in the environment
generated from a finite space [1..N]×{l,h}×{<, >, =,
ini}. His/her admissible actions are balancing queries
$b_{(L,R)}$ for disjoint $L, R \subseteq [1..N+1]$ with |L| = |R|. The
only information available to the agent (i.e., which
gives him/her an opportunity to distinguish states) is a
balancing result. The agent should learn fake_coin_number
from a sequence which may start from any initial
state and then consists of M queries and corresponding
results. Hence the single-agent logic Act-CTL-$K_1$ seems to
be a very natural framework for expressing FCP(N,M)
as follows: to validate or refute whether



  $s \models_E \underbrace{E_B X \ldots E_B X}_{M\ \text{times}} \Big( \bigvee_{f \in [1..N]} K_1(fake\_coin\_number = f) \Big)$

for every initial state s, where E is a PRS environment
generated from the finite space [1..N]×{l,h}×{<, >, =,
ini}, and B is a balancing query $\bigcup_{L,R} b_{(L,R)}$.
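
For small N and M the puzzle itself can be explored by brute force. The sketch below is not the model checking procedure of the cited papers; it is a plain exhaustive search, written under the assumption that the agent needs an adaptive strategy that works whatever the hidden state is. The perfect recall of the agent is represented by the set of hypotheses (fake number, lighter or heavier) still consistent with all results seen so far, and the function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def fcp(n, m):
    """True if the fake among coins 1..n can be identified in at most m
    balancings, with coin n+1 known to be valid (exhaustive search)."""
    coins = range(1, n + 2)
    # All balancing queries: disjoint pans L and R of equal size.
    queries = [(L, R)
               for k in range(1, (n + 1) // 2 + 1)
               for L in combinations(coins, k)
               for R in combinations([c for c in coins if c not in L], k)]

    def result(hyp, L, R):
        """Balancing result for the left pan under hypothesis (fake, weight)."""
        fake, weight = hyp
        if fake in L:
            return '<' if weight == 'l' else '>'
        if fake in R:
            return '>' if weight == 'l' else '<'
        return '='

    @lru_cache(maxsize=None)
    def solvable(hyps, left):
        # The agent knows the fake number once all consistent hypotheses agree.
        if len({fake for fake, _ in hyps}) <= 1:
            return True
        if left == 0:
            return False
        for L, R in queries:
            parts = {}
            for h in hyps:
                parts.setdefault(result(h, L, R), []).append(h)
            if all(solvable(frozenset(p), left - 1) for p in parts.values()):
                return True
        return False

    start = frozenset((f, w) for f in range(1, n + 1) for w in 'lh')
    return solvable(start, m)

print(fcp(2, 1), fcp(3, 1), fcp(4, 2))  # True False True
```

For example, with N = 2 a single balancing of coin 1 against the valid coin 3 suffices: any imbalance means coin 1 is fake, and equality means coin 2 is fake.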



FUTURE TRENDS: MODEL CHECKING 
FOR COMBINED LOGICS 

The model checking problem for a combined logic
(Act-CTL-$K_n$ in particular) and a class of epistemic
environments (e.g., PRS or FAS environments) is to
validate or refute $s \models_E \varphi$, where E is a finitely-generated
environment in the class, s is an "initial state" of
the environment E, and $\varphi$ is a formula of the logic.
The above re-formulation of FCP(N,M) is a particular
example of a model checking problem for a formula
of Act-CTL-$K_n$ and some finitely-generated perfect
recall environment.

Papers (Meyden & Shilov, 1999) and (Garanina,
Kalinina & Shilov, 2004) have demonstrated that if the
number of agents n > 1, then the model checking problem
in perfect recall synchronous systems is very hard or
even undecidable. In particular, it has non-elementary
(see endnote 5) upper and lower time bounds for Act-CTL-$K_n$.
Papers (Meyden & Shilov, 1999) and (Shilov, Garanina &
Choe, 2006) have suggested tree-like data structures
to make model checking of combinations
of temporal and action logics with the propositional logic
of knowledge $PLK_n$ "feasible". Alternatively, (van der Hoek &
Wooldridge, 2002; Lomuscio & Penczek, 2003) have
suggested either simplifying the language of the logics to
be combined, or considering agents with "bounded"
recall.

CONCLUSION

Combinations of temporal logics and logics of actions
with logics of knowledge have become an active research
topic due to the importance of studying the interactions
between knowledge and actions for reasoning about
real-time multiagent systems. A comprehensive survey
of logics, techniques, and results was out of the scope of
this article. The primary aim of the present article was to
provide a semi-formal introduction to the field of combined
modal logics and to discuss their utility for reasoning
about multiagent systems. The emphasis has been placed
on model checking of trace-based knowledge-temporal
specifications of perfect recall synchronous systems.

KEY TERMS

Environment: A labeled transition system that 
provides an interpretation for logic of knowledge, ac- 
tions and time simultaneously. 

Labeled Transition System or Kripke Model:
An oriented labeled graph (possibly infinite). Nodes of
the graph are called states or worlds; some of them are
marked by propositional symbols that are interpreted
to be valid in these nodes. Edges of the graph are
marked by relational symbols that are interpreted by
these edges.

Logic of Actions: A polymodal logic that associates
modalities like "always" and "sometimes" with action
symbols that are to be interpreted in labeled transition
systems by transitions. The so-called Elementary
Propositional Dynamic Logic (EPDL) is a sample logic
of actions.

Logic of Knowledge or Epistemic Logic: A polymodal
logic that associates modalities like "know" and
"suppose" with enumerated agents or groups of agents.
Agents are to be interpreted in labeled transition systems
by equivalence "indistinguishability" relations. The
so-called Propositional Logic of Knowledge of n agents
($PLK_n$) is a sample epistemic logic.

Logic of Time or Temporal Logic: A polymodal 
logic with a number of modalities that correspond to 
"next time", "always", "sometimes", and "until" to be 
interpreted in labeled transition systems over discrete 
partial orders. For example, Linear Temporal Logic 
(LTL) is interpreted over linear orders. 

Model Checking Problem: An algorithmic problem 
to validate or refute a property (presented by a formula) 
in a state of a model (from a class of Kripke structures). 
For example, model checking problem for combined 
logic of knowledge, actions and time in initial states of 
perfect recall finitely generated environments. 



Multiagent System: A collection of communicating
and collaborating agents, where every agent has
some knowledge, intentions, abilities, and possible
actions.

Perfect Recall Synchronous Environment: An
environment for modeling the behavior of a perfect recall
synchronous system.

Perfect Recall Synchronous System: A multiagent
system where every agent always records his/her
observations at all moments of time while the system runs.



ENDNOTES 

1 Due to the pioneering papers of Saul Aaron Kripke
(born in 1940) on models for modal logics.

2 A symmetric, reflexive, and transitive binary
relation on D.

3 That is, for the last state s there is no state s' such
that $(s, s') \in I(b)$.

4 For a multiagent example, refer to the Muddy Children
Puzzle (Fagin, Halpern, Moses & Vardi, 1995).

5 I.e. it is not bounded by a tower of exponents with
any fixed height.

INTRODUCTION

The concept of modularity is a main concern for the
generation of artificially intelligent systems. Modularity
is a ubiquitous organization principle found
everywhere in natural and artificial complex systems
(Callebaut, 2005). Evidence from biological and
philosophical points of view (Caelli and Wen, 1999)
(Fodor, 1983) indicates that modularity is a requisite
for complex intelligent behaviour. Besides, from an
engineering point of view, modularity seems to be the
only way to construct complex structures.
Hence, if complex neural programs for complex
agents are desired, modularity is required.

This article introduces the concepts of modularity 
and module from a computational point of view, and 
how they apply to the generation of neural programs 
based on modules. Two levels, strategic and tactical, at 
which modularity can be implemented, are identified. 
How they work and how they can be combined for 
the generation of a completely modular controller for 
a neural network based agent is presented. 



BACKGROUND 

When designing a controller for an agent, there exist
two main approaches: a single module contains all the
agent's required behaviours (monolithic approach), or
global behaviour is decomposed into a set of simpler
sub-behaviours, each one implemented by one module
(modular approach). Monolithic controllers implement
in a single module all the required mappings between
the agent's inputs and outputs. As an advantage, it is
not necessary to identify the required sub-behaviours or
the relations between them. As a drawback, if the
complexity of the controller is high, it could be impossible
in practice to design such a controller without
obtaining large interferences between different parts of
it. Instead, when a modular controller is used, the global
controller is built from a group of sub-controllers,
so the required sub-controllers and their interactions for
generating the final global output must be defined.

Despite the disadvantages of the modular approach
(Boers, 1992), complex behaviour cannot be achieved
without some degree of modularity (Azam, 2000).
Modular controllers allow the acquisition of new
knowledge without forgetting previously acquired
knowledge, which represents a big problem for monolithic
controllers when the number of knowledge
rules to be learned is large (De Jong et al., 2004). They
also minimize the effects of the credit assignment
problem, where the learning mechanism must provide
a learning signal based on the current performance of
the controller. This learning signal must be used to
modify the controller parameters in a way that improves
the controller behaviour. In large controllers, it becomes
difficult to find which parameters of the controller to change
based on the global learning signal. Modularization
helps to keep the controllers small, minimizing
the effect of the credit assignment problem.

Modular approaches allow for a complexity reduction
of the task to be solved (De Jong et al., 2004). While
in a monolithic system the optimization of all variables
is performed at the same time, resulting in a large
optimization space, in modular systems optimization
is performed independently for each module, resulting
in reduced search spaces. Modular systems are
scalable, in the sense that former modules can be used
for the generation of new ones when problems are
more complex, or new modules can simply be added to
the already existing ones. It also implies that modular
systems are robust, since damage to one module
results in a loss of the abilities given by that module,
but the rest of the system keeps functioning.
Modularity can be a solution to the problem of neural
interference (Di Ferdinando et al., 2000), which is
encountered in monolithic networks. This phenomenon
is produced when an already trained network loses
part of its knowledge when it is either re-trained to
perform a different task, called temporal cross-talk
(Jacobs et al., 1991), or trained on two or more different
tasks at the same time, the effect being called spatial
cross-talk (Jacobs, 1990). Modular systems allow reusing modules
in different activities, without re-implementing the
function represented for each different task (De Jong
et al., 2004) (Garibay et al., 2004).

Modularity 

From a computational point of view, modularity is
understood as the property of some complex computational
tasks of being divisible into simpler subtasks.
Each of those simpler subtasks is then performed by
a specialized computational system called a module,
and the solution of the complex task is generated from
the solutions of the simpler subtask modules (Azam,
2000). From a mathematical point of view, modularity
is based on the idea of a subset of system variables
which may be optimized independently of the other
system variables (De Jong et al., 2004). In any case,
the use of modularity implies that a structure exists in
the problem to be solved.

In modular systems, each of the system modules 
operates primarily according to its own intrinsically 
determined principles. Modules within the whole 
system are tightly integrated but independent from 
other modules following their own implementations. 
They have either distinct or the same inputs, but they 
generate their own response. When the interactions 
between modules are weak and modules act indepen- 
dently from each other, the modular system is called 
nearly decomposable (Simon, 1969). Other authors 
have identified this type of modular systems as sepa- 
rable problems (Watson et al., 1998). This is by far 
one of the most studied types of modularity, and it 
can be found everywhere from business to biological 
systems. In nearly decomposable modular systems, the 
final optimal solution of a global task is obtained as 
a combination of the optimal solutions of the simpler 
ones (the modules). 
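
The benefit of a separable problem can be illustrated with a toy optimization. This is a minimal sketch under assumptions: the two cost terms g and h and the search grid are hypothetical and stand for two independent modules.

```python
from itertools import product

grid = range(-5, 6)          # candidate values for each variable

def g(x):                    # cost term that depends only on module 1
    return (x - 2) ** 2

def h(y):                    # cost term that depends only on module 2
    return (y + 3) ** 2

# Monolithic search: all 11 * 11 = 121 combinations are evaluated.
best_joint = min(product(grid, grid), key=lambda p: g(p[0]) + h(p[1]))

# Modular search: each variable is optimized alone, 11 + 11 = 22 evaluations.
best_modular = (min(grid, key=g), min(grid, key=h))

print(best_joint, best_modular)  # (2, -3) (2, -3)
```

Because the global cost is a sum of terms with no shared variables, combining the optimal solutions of the modules gives the optimal global solution, while the number of evaluations grows with the sum rather than the product of the module search spaces.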

However, the existence of a decomposition for a problem
doesn't imply that the sub-problems are completely
independent from each other. In fact, a system may be
modular and still have interdependencies between
modules. A decomposable problem is defined as a
problem that can be decomposed into other sub-problems,
but where the optimal solution of one of those sub-problems
depends on the optimal solution of some of the others
(Watson, 2002). Solving such modular systems
is more difficult than solving a typical separable modular
system, and they are usually treated as monolithic in
the literature.

Module 

Most of the works that use modularity use the definition
of module given by (Fodor, 1983), which is very
similar to the concept of object in object-oriented
programming: a module is a domain-specific processing
element, which is autonomous and cannot influence
the internal working of other modules. A module can
influence another only by its output, that is, the result of
its computation. Modules do not know about a global
problem to solve or global tasks to accomplish, and
are driven by specific stimuli. The final response of a
modular system to a global task is
given by the integration of the responses of the different
modules by a special unit. The global architecture of
the system defines how this integration is performed.
The integration unit must decide how to combine the
outputs of the modules to produce the final answer of
the system, and it is not allowed to feed information
back into the modules.



MODULAR NEURAL NETWORKS 

When modularity is applied to the design of a modular
neural network (MNN) based controller, three general
steps are commonly observed: task decomposition,
training and multi-module decision-making (Auda and
Kamel, 1999). Task decomposition is about dividing
the required controller into several sub-controllers, and
assigning each sub-controller to one neural module.
Modules should be trained either in parallel, or in
different processes following a sequence indicated by the
modular design. Finally, when the modules have been
prepared, a multi-module decision-making strategy is
implemented which indicates how all those modules
should interact in order to generate the global controller
response. This modularization approach can be seen
as operating at the level of the task.

The previous general steps for modularity only
apply to a modularization of nearly decomposable or
separable problems. Decomposable problems, those
where strong interdependencies between modules
exist, are not considered under that decomposition
mechanism, and they are treated as monolithic ones.
This article introduces the differentiation between
two modular levels: the current modularization level,
which concentrates on task sub-division, and a new
modularization performed at the level of the devices
or elements. Those approaches are called strategic and
tactical, respectively.

Strategic and Tactical Modularity 

Borrowing the concepts from game theory, strategy
deals with what has to be done in a given situation in
order to perform a task, by dividing the global target
solution into all the sub-targets required to accomplish
the global one. Tactics, on the other hand, deals with
how plans are going to be implemented, that is,
how to use the resources available at that moment to
accomplish each of those sub-targets.

Strategic modularity in neural controllers is defined
as the modular approach that identifies which
sub-goals are required for an agent in order to solve
a global problem. Each sub-goal identified is implemented
by a monolithic neural net. In contrast, tactical
modularity in neural controllers is defined as the approach
that identifies which inputs and outputs are necessary
for the implementation of a given goal, and designs
a single module for each input and output. In tactical
modularity, modularization is performed at the level
of the elements (any meaningful input or output of
the neural controller) that are actually involved in the
accomplishment of the task.

To our knowledge, all the research based on neural
modularity and divide-and-conquer principles focuses
its division at the strategic level, that is, how to
divide the global problem into its sub-goals. Each of
those sub-goals is then implemented by means of
a single neural controller, the final goal being generated
by combining the outputs of those sub-goals in some
way. The current article proposes, first, the definition
of two different levels of modularity, and second, the
use of tactical modularity as a new level of modularization
that makes room for decomposable modularity.
It is expected that tactical modularization will help
in the generation of complex neural controllers when
many inputs and outputs must be taken into account.
This will be confirmed below, where the use of the two
types of modularity is compared against monolithic
approaches.

Implementing Modularity 

Strategic modularity can be implemented by any of the 
modular approaches that already exist in the literature. 
See (Auda and Kamel, 1999) for a complete descrip- 
tion. Any of the modularization methods described 
there is strategic, although it was not given that name, 
and they can, in general, be integrated with tactical 
modularity. 

The term strategic is used for those modular ap- 
proaches in order to differentiate them from the new 
proposed modularity. 

Tactical modularity defines modularity at the level
of the elements involved in the generation of a sub-goal.
By elements, we understand the inputs required
to generate the sub-goal and the outputs that define
the sub-goal solution. Each of those elements gives rise
to a tactical module, implemented by a simple neural
network. That is, tactical modularity is implemented
by designing a completely distributed controller composed
of small processing modules around each of the
meaningful elements of the problem.

The schematics of a tactical module are shown in
Figure 1. Each tactical module is connected to its
associated element, controlling it, and processing the
information coming in, for input elements, or going out,
for output elements. This kind of connectivity means
that the processing element is the one that decides
which commands must be sent to the output element,
or how a value received from an input element must
be interpreted. The processing element is said to be
responsible for its associated element.

In order to generate a complete answer for the
sub-goal, all the tactical modules are connected to each
other, the output of each module being sent back to all the
others. With this connectivity, each module
is aware of what the others are doing, which allows
the different modules to coordinate in the generation of
a common answer and avoids a central coordinator.
The resulting architecture is a completely distributed
MNN, where neural modules are independent but
implement strong interactions with the other modules.
Figure 2 shows an example of connectivity in the
generation of a tactical modular neural controller for
a simple system composed of two input elements and
two outputs.

Figure 1. Schematics of a tactical module for one input element (left) and for one output element (right)

Figure 2. Connectivity of a tactical modular controller
with two input elements and two output elements

Training tactical modules is difficult due to the
strong relationships between the different modules.
Training methods used in strategic modules based
on error propagation are not suitable, so a genetic
algorithm is used to train the nets, because it allows
the network weights to be found without defining an error
measurement, just by specifying a cost function (Di
Ferdinando et al., 2000).
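
The connectivity and the cost-driven training described above can be sketched in a few lines. Everything below is an illustrative assumption rather than the implementation of the cited works: each tactical module is reduced to a single tanh unit, the class and function names are hypothetical, the task in cost is invented, and the training loop is an elitist mutation-only evolutionary search, a simplification that stands in for a full genetic algorithm.

```python
import math
import random

class TacticalController:
    """One small tanh unit per element; each unit sees the others' last outputs."""

    def __init__(self, n_inputs, n_outputs, seed=0):
        rng = random.Random(seed)
        self.n_inputs = n_inputs
        self.n = n_inputs + n_outputs
        # Per module: 1 weight for its own element, n-1 for the others, 1 bias.
        self.weights = [[rng.uniform(-1, 1) for _ in range(self.n + 1)]
                        for _ in range(self.n)]
        self.reset()

    def reset(self):
        self.state = [0.0] * self.n

    def step(self, sensors):
        # Output modules have no sensor reading, so their element value is 0.
        values = list(sensors) + [0.0] * (self.n - self.n_inputs)
        new_state = []
        for i, w in enumerate(self.weights):
            others = self.state[:i] + self.state[i + 1:]
            x = [values[i]] + others + [1.0]
            new_state.append(math.tanh(sum(a * b for a, b in zip(w, x))))
        self.state = new_state
        return self.state[self.n_inputs:]   # commands for the actuators


def evolve(ctrl, cost, generations=50, offspring=10, sigma=0.2, seed=1):
    """Keep a mutated weight set only if it lowers the cost function."""
    rng = random.Random(seed)
    best, best_cost = ctrl.weights, cost(ctrl)
    for _ in range(generations):
        for _ in range(offspring):
            ctrl.weights = [[w + rng.gauss(0, sigma) for w in row] for row in best]
            c = cost(ctrl)
            if c < best_cost:
                best, best_cost = ctrl.weights, c
    ctrl.weights = best
    return best_cost


def cost(ctrl):
    """Hypothetical task: drive two actuators to (0.5, -0.5) from a fixed input."""
    ctrl.reset()
    for _ in range(3):
        out = ctrl.step([1.0, -1.0])
    return sum((o - t) ** 2 for o, t in zip(out, (0.5, -0.5)))


ctrl = TacticalController(n_inputs=2, n_outputs=2)
before = cost(ctrl)
after = evolve(ctrl, cost)
print(before, after)
```

No error signal is propagated through the modules; the search only needs the scalar value of the cost function, which is the reason given above for preferring an evolutionary method when the modules interact strongly.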

Combination of Different Levels 

The use of one kind of modularity does not prevent,
in principle, the use at the same time of the other type
of modularity. In fact, strategic and tactical modularity
can be used separately or in conjunction with each
other. When the solution required from the controller
is simple, then either a strategic or a tactical
modularization can be used. In those cases, it is suggested that
the selection of the kind of modularity be based on the
complexity of the problem. For simple problems with
a small number of elements, a monolithic controller
will suffice. If the number of elements is high, then
a tactical modular controller will be the best option.
Finally, for very complex tasks with many elements,
a combination of strategic and tactical modularization
could be preferable.

When combining both levels in one neural controller,
the strategic modularization should be performed first,
to identify the different sub-goals required for the
implementation. Next, a tactical modularization should
be completed, implementing each of those sub-goals
by a group of tactical modules. The number of tactical
modules for each strategic module will depend on
the elements involved in the resolution of the specific
sub-goal.

Application Examples 



So far, strategic and tactical modularity have been
mainly applied to robot control. The input elements
are sensors and the output elements are actuators. In a
first experiment, tactical modularity was applied to the
control of a Khepera robot learning to solve the garbage
collector problem (Tellez and Angulo, 2006) (Tellez
and Angulo, 2007). It involved the coordination of 11
elements (seven sensors and four actuators), creating
11 tactical modules. Different levels of modularization
were compared on this task, including monolithic,
strategic, tactical and a combination of both. The results
showed that the combination of both levels obtained
the best results (see Figure 3).

In additional experiments, tactical modularity was
implemented for an Aibo robot. In this case, 31 tactical
modules were required to generate the controller. The
controller was generated to solve different tasks such as
standing up, standing and pushing the ground (Tellez et
al., 2005). The approach also produced one
of the first MNN controllers able to make Aibo walk
(Tellez et al., 2006).



FUTURE TRENDS 

Within the evolutionary robotics paradigm, it is very
difficult to generate complex behaviours when the robot
used is quite complex, with a huge number of sensors
and actuators. The use of tactical modularity together
with strategic modularity is introduced as a possible solution
to the problem of generating complex behaviours in
complex robots. Even if some examples have been
provided with a quite complex robot, it is necessary to
see if the system can scale to systems with hundreds
of elements.

Additional applications include its use in more
classical domains such as pattern recognition and speech
recognition.

Figure 3. This figure represents the maximal performance value obtained by different types of modular approaches.
Approach (a) is a monolithic approach, (b) and (c) are two different types of strategic approaches, (d) is the tactical
approach, and (f) is a reduced version of the tactical approach.



CONCLUSION 

The level of modularity in neural controllers can be 
greatly increased if tactical modularity is taken into 
account. This type of modularity complements typical 
modularization approaches based on strategic 
modularization, by dividing strategic modules into their 
minimal components and assigning a single neural 
module to each of them. This modularization allows 
the implementation of decomposable problems within 
a modularized structure. Both types of modularization 
can be combined in order to obtain a highly modular 
neural controller, which shows better results in complex 
robot control. 
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KEY TERMS 

Cost Function: A mathematical function used to 
determine how well or how badly a neural network has 
performed during the training phase. The cost function 
usually indicates what is expected from the neural 
controller. 

Element: Any variable of the program that contains 
a value that is used to feed into the neural network con- 
troller (input element) or to contain the answers of the 
neural network (output element). The input elements 
are usually the variables that contain the information 
from which the output will be generated. The output 
elements contain the output of the neural controller. 



Evolutionary Robotics: A technique for the creation 
of neural controllers for autonomous robots, based on 
genetic algorithms. 

Genetic Algorithm: An algorithm that simulates the 
natural evolutionary process, applied to the generation of 
the solution of a problem. It is usually used to obtain 
the value of parameters that are difficult to calculate by other 
means (for example, the neural network weights). 
It requires the definition of a cost function. 

Modularization: A process to determine the sim- 
plest meaningful parts that compose a task. There is 
no formal process to implement modularization, and 
in practice, it is very arbitrary. 

Neural Controller: A computer program based 
on artificial neural networks. The neural controller is 
a neural net, or a group of them, that acts upon a series 
of meaningful inputs and generates one or several 
outputs. 
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INTRODUCTION 

In approximately the last fifty years, advances in 
computers and the availability of images in digital 
form have made it possible to process and analyze 
them in automatic (or semi-automatic) ways. Alongside 
general signal processing, the discipline of 
image processing has acquired great importance for 
practical applications as well as for theoretical 
investigations. Some general image processing references 
are (Castleman, 1979) (Rosenfeld & Kak, 1982) (Jain, 
1989) (Pratt, 1991) (Haralick & Shapiro, 1992) (Russ, 
2002) (Gonzalez & Woods, 2006). 

Mathematical Morphology, which was founded by 
Serra and Matheron in the 1960s, has distinguished itself 
from other types of image processing in the sense that, 
among other aspects, it has focused on the importance of 
shapes. The principles of Mathematical Morphology 
can be found in numerous references such as (Serra, 
1982) (Serra, 1988) (Giardina & Dougherty, 1988) 
(Schmitt & Mattioli, 1993) (Maragos & Schafer, 1990) 
(Heijmans, 1994) (Soille, 2003) (Dougherty & Lotufo, 
2003) (Ronse, 2005). 



BACKGROUND 

Morphological processing especially uses set-based 
approaches, and it is not frequency-based. This is in 
fact in sharp contrast with linear signal processing 
(Oppenheim, Schafer, & Buck, 1999), which deals 
mainly with the frequency content of an input signal. 
Let us mention also that Mathematical Morphology (as 
the name suggests) normally employs a mathematical 
formalism. 

Morphological filtering is a type of image filtering 
that focuses on increasing transformations. Shapes can 
be satisfactorily processed by morphological filters. 
Starting with elementary transformations that are based 
on Minkowski set operations, other more complex trans- 
formations can be realized. The theory of morphological 
filtering is soundly based on mathematics. 



This article provides an overview of morphological 
filtering. The main families of morphological filters are 
discussed, taking into consideration the possibility of 
computing hierarchical image simplifications. Both 
the binary (or set) and gray-level function frameworks 
are considered. 

In the remainder of this section, some fundamental 
notions of morphological processing are discussed. The 
underlying algebraic structure and the associated 
operations, which establish the distinguishing characteristics 
of morphological processing, are described. 



UNDERLYING ALGEBRAIC STRUCTURE 
AND BASIC OPERATIONS 

In morphological processing, the underlying algebraic 
structure is a complete lattice (Serra, 1988). A complete 
lattice is a set of elements with a partial ordering 
relationship, which will be denoted as $\leq$, and with two 
operations defined called supremum (sup) and infimum 
(inf): 

• The sup operation computes the smallest element 
  that is larger than or equal to the operands. Thus, 
  if a, b are two elements of a lattice, "a sup b" is 
  the element of the lattice that is larger than or equal 
  to both a and b, and there is no smaller element that is 
  so. 

• The inf operation computes the greatest element 
  that is smaller than or equal to the operands. 

Moreover, every subset of a lattice has an infimum 
element and a supremum element. 

For sets and gray-level images, these operations 
are: 

• Sets (or binary images) 

  o Order relationship: $\subseteq$ (set inclusion). 
  o "A sup B" is equal to "$A \cup B$", where A and B are sets. 
  o "A inf B" is equal to "$A \cap B$". 

• Gray-level images (images with intensity values 
  within a range of integers) 

  o Order relationship: for two functions f, g: 

    $$f \leq g \iff f(x) \leq g(x), \text{ for all pixels } x$$

    where the right-hand-side $\leq$ refers to the 
    order relationship of integers. 
  o The sup of f and g is the function: 

    $$(f \sup g)(x) = \max\{f(x), g(x)\}$$

    where "max" denotes the computation of 
    the maximum of integers. 
  o The inf of f and g is the function: 

    $$(f \inf g)(x) = \min\{f(x), g(x)\}$$

    where "min" symbolizes the computation 
    of the minimum of integers. 
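
As a minimal sketch of the gray-level lattice operations above, the following Python fragment treats an image as a list of rows of integers. The function names `sup`, `inf` and `leq`, and the two small example images, are illustrative assumptions and do not come from the original text.

```python
def sup(f, g):
    """Pointwise supremum: (f sup g)(x) = max{f(x), g(x)}."""
    return [[max(a, b) for a, b in zip(rf, rg)] for rf, rg in zip(f, g)]


def inf(f, g):
    """Pointwise infimum: (f inf g)(x) = min{f(x), g(x)}."""
    return [[min(a, b) for a, b in zip(rf, rg)] for rf, rg in zip(f, g)]


def leq(f, g):
    """Partial order: f <= g iff f(x) <= g(x) at every pixel x."""
    return all(a <= b for rf, rg in zip(f, g) for a, b in zip(rf, rg))


# Two hypothetical 2x2 images that are not comparable with each other.
f = [[1, 5], [3, 2]]
g = [[4, 0], [3, 7]]

upper = sup(f, g)
lower = inf(f, g)
```

Note that `f` and `g` are not ordered with respect to each other (the order is only partial), yet both lie between `lower` and `upper`.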



TRANSFORMATION PROPERTIES 

The concept of ordering is key in non-linear 
morphological processing, which focuses especially on those 
transformations that preserve ordering. An increasing 
transformation $\Psi$ defined on a lattice satisfies that, for 
all a, b: 

$$a \leq b \Rightarrow \Psi(a) \leq \Psi(b)$$

The following two properties concern the ordering 
between the input and the output. If I denotes an 
input image, an image operator $\Psi$ is extensive if and 
only if, $\forall I$, 

$$I \leq \Psi(I)$$

A related property is the anti-extensivity property. 
An operator $\Psi$ is anti-extensive if and only if, $\forall I$, 

$$I \geq \Psi(I)$$

The concept of idempotence is a fundamental notion 
in morphological image processing. An operator $\Psi$ is 
idempotent if and only if, $\forall I$, 

$$\Psi(I) = \Psi\Psi(I)$$

Within the non-linear morphological framework, 
the important duality principle states that, for each 
morphological operator, there exists a dual one with 
respect to the complementation operation. 

Two operators $\Psi$ and $Q$ are dual if 

$$\Psi = C Q C$$

The complementation operation C, for sets, 
computes the complement of the input. In the case of 
gray-level images, a related operation is the image inversion, 
which inverts an image by reversing the intensity values 
with respect to the middle point of the intensity value 
range. 

The following concept of pyramid applies to multi-scale 
transformations. A family of operators $\{\Psi_i\}$, where 
$i \in S = \{1, \ldots, n\}$, forms a multi-level pyramid if 

$$\forall j, k \in S,\ j \geq k,\ \exists l \text{ such that } \Psi_j = \Psi_l \Psi_k$$

In words, the set of transformations $\{\Psi_i\}$ constitutes 
a pyramid if any level j of the hierarchy can be reached 
by applying a member of $\{\Psi_i\}$ to a finer (smaller index) 
level k. 



STRUCTURING ELEMENTS 

A structuring element is a basic tool used by 
morphological operators to explore and to process the shapes 
and forms that are present in an input image. Normally, 
flat structuring elements, which are sets that define a 
shape, are employed. Two usual shapes (square and 
diamond) are displayed next (the "x" symbol denotes 
the center): 

[Figure: (a) Square 3x3; (b) Diamond 3x3] 



If B denotes a structuring element, its transpose 
is $\check{B} = \{(-x, -y) : (x, y) \in B\}$ (i.e., B inverted with respect to 
the coordinate origin). If a structuring element B is 
centered and symmetric, then $\check{B} = B$. 
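
The transposition just described can be sketched in a few lines of Python, assuming (as an illustration only) that a flat structuring element is stored as a set of integer (x, y) offsets relative to its center. The `corner` element is a hypothetical non-symmetric example added to show a case where the transpose differs from the original.

```python
def transpose(B):
    """Reflect a structuring element (a set of (x, y) offsets) through the origin."""
    return {(-x, -y) for (x, y) in B}


# The two usual 3x3 shapes, centered at the origin.
square = {(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)}
diamond = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}

# Hypothetical non-symmetric element: its transpose is a different set.
corner = {(0, 0), (1, 0), (0, 1)}
```

For the centered, symmetric `square` and `diamond`, `transpose` returns the same set; for `corner` it does not.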
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DILATIONS AND EROSIONS 

Dilations and erosions are the most basic 
transformations in morphological processing. Dilations $\delta$ are 
increasing operators that satisfy, $\forall I, I'$, 

$$\delta(I \sup I') = \delta(I) \sup \delta(I')$$

Respectively, erosions $\varepsilon$ are increasing operators 
that satisfy, $\forall I, I'$, 

$$\varepsilon(I \inf I') = \varepsilon(I) \inf \varepsilon(I')$$

Dilations and erosions by structuring element 
perform, respectively, sup and inf operations over an input 
image that depend on a structuring element B. These 
dilations and erosions are symbolized, respectively, 
by $\delta_B$ and $\varepsilon_B$, and they originate from the Minkowski 
set addition and subtraction. Let us first discuss the 
set framework. 

In the set framework, if A denotes an input set, the 
$\delta_B(A)$ dilation computes the locus of points x where the 
B structuring element translated to x has a non-empty 
intersection with (i.e., "touches") input set A: 

$$\delta_B(A) = \{x \mid B_x \cap A \neq \emptyset\}$$

$B_x$ symbolizes the structuring element B translated to 
point (or pixel) x (i.e., $B_x = \{x' \mid x' - x \in B\}$, where "-" 
symbolizes the vector subtraction). 

Figure 1 shows a set example in $\mathbb{R}^2$. Input set A 
(composed of two connected components) and 
structuring element B (a circle) are displayed in part (a). The 
$\delta_B(A)$ dilation is shown in part (b). 

Figure 1. Dilation $\delta_B(A)$: (a) set A and structuring element B; (b) $\delta_B(A)$ 

The previous expression can be formulated using 
the sup operation as: 

$$\delta_B(A) = \bigcup_{b \in B} A_{-b} = \sup_{b \in B} A_{-b}$$

Using the lattice framework, the expression for 
functions is formally identical to the expression for sets: 

$$\delta_B(I) = \sup_{b \in B} I_{-b}$$

Note: the sup operator is that of the function lattice. 

The function expression can be written in another 
way that gives a more operational expression to compute 
the value of the result of $\delta_B(I)$ at each pixel x of I: 

$$[\delta_B(I)](x) = \max_{b \in B} \{I(x + b)\}$$

The sup operation has been replaced by the "max" 
operation that computes the maximum of a set of 
integers. Note that "$[\delta_B(I)](x)$" is the intensity value 
of pixel x in that image. The "+" symbol denotes the 
vector addition. 

Some important properties of dilations $\delta_B$ are 
the following: 

• For sets, dilation $\delta_B$ is commutative, i.e., if A 
  denotes an input set, then $\delta_B(A) = \delta_A(B)$. 
• If a structuring element B contains the coordinate 
  origin, then $\delta_B$ is extensive. 
• The dilation by a structuring element is 
  associative, i.e., if B is the result of $\delta_C(D)$ (or $\delta_D(C)$), then 
  $\delta_B(I) = \delta_C(\delta_D(I)) = \delta_D(\delta_C(I))$. 

If A denotes an input set and B denotes a structuring 
element, the $\varepsilon_B(A)$ erosion computes the locus of 
points x where the B structuring element translated to x is 
completely included within input set A: 

$$\varepsilon_B(A) = \{x \mid B_x \subseteq A\}$$

Figure 2 displays a set example of $\varepsilon_B(A)$, where A 
is that of the previous dilation example. 
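
The two set definitions (a translated B that touches A for the dilation, a translated B that fits inside A for the erosion) can be sketched directly on finite sets of integer points. This is only an illustration under stated assumptions: points are (x, y) tuples, B is non-empty, and the names `translate`, `dilate` and `erode` as well as the example sets are hypothetical.

```python
def translate(B, x):
    """B_x: structuring element B translated to point x."""
    return {(bx + x[0], by + x[1]) for (bx, by) in B}


def dilate(A, B):
    """delta_B(A): points x whose translate B_x has a non-empty intersection with A."""
    # B_x touches A iff x + b = a for some a in A, b in B, i.e. x = a - b.
    return {(ax - bx, ay - by) for (ax, ay) in A for (bx, by) in B}


def erode(A, B):
    """epsilon_B(A): points x whose translate B_x is completely included in A."""
    # Any valid x satisfies x + b0 in A for a fixed b0 in B, which bounds the search.
    b0 = next(iter(B))
    candidates = {(ax - b0[0], ay - b0[1]) for (ax, ay) in A}
    return {x for x in candidates if translate(B, x) <= A}


# Hypothetical example: a 3x3 block of points and the 3x3 diamond.
A = {(x, y) for x in range(3) for y in range(3)}
B = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}

dilated = dilate(A, B)
eroded = erode(A, B)
```

Because this B contains the origin, the erosion is included in A and A is included in the dilation, in line with the anti-extensivity and extensivity properties stated in the text.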

The following expressions of erosions $\varepsilon_B$ are 
analogous to those already introduced for dilations. 

The expression for sets formulated by means of the 
inf operation is: 

$$\varepsilon_B(A) = \bigcap_{b \in B} A_{-b} = \inf_{b \in B} A_{-b}$$

The expressions for functions are: 

$$\varepsilon_B(I) = \inf_{b \in B} I_{-b}$$

$$[\varepsilon_B(I)](x) = \min_{b \in B} \{I(x + b)\}$$
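
The operational max and min expressions can be illustrated on a 1-D gray-level signal with a 3-point structuring element. This is a hedged sketch, not the original worked example: the signal values are made up, B is given as a tuple of integer offsets that contains the origin, and samples falling outside the signal are simply skipped (one possible border convention among several).

```python
def dilate_1d(I, B):
    """[delta_B(I)](x) = max over b in B of I(x + b); out-of-range samples are skipped."""
    n = len(I)
    return [max(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def erode_1d(I, B):
    """[epsilon_B(I)](x) = min over b in B of I(x + b); out-of-range samples are skipped."""
    n = len(I)
    return [min(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


B = (-1, 0, 1)                 # 3-point structuring element containing the origin
I = [1, 3, 2, 5, 4, 4, 0]      # hypothetical 1-D gray-level signal

dilated = dilate_1d(I, B)      # [3, 3, 5, 5, 5, 4, 4]
eroded = erode_1d(I, B)        # [1, 1, 2, 2, 4, 0, 0]
```

The dilation widens the bright peaks and the erosion widens the dark valleys; since B contains the origin, `eroded` is pointwise below `I` and `dilated` is pointwise above it.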

Some important properties of erosions $\varepsilon_B$ are the 
following: 

• If the coordinate origin belongs to a structuring 
  element B, then $\varepsilon_B$ is anti-extensive. 
• The erosion by a structuring element is 
  associative, i.e., if B is the result of $\delta_C(D)$ (or $\delta_D(C)$), then 
  $\varepsilon_B(I) = \varepsilon_C(\varepsilon_D(I)) = \varepsilon_D(\varepsilon_C(I))$. 

In fact, the expressions for erosions are the duals of 
the corresponding expressions for dilations; $\delta_B$ and $\varepsilon_B$ are dual of 
each other: 

$$\delta_B = C \varepsilon_B C$$



A simple 1-D example of a gray-level dilation and 
erosion (where B has 3 points) is the following: 

[Figure: 1-D gray-level example, including panel $\varepsilon_B(I)$] 

Figure 2. Erosion $\varepsilon_B(A)$ 



FROM SET OPERATORS TO FUNCTION 
OPERATORS 

Flat operators are function operators $\Psi$ that can be 
derived from a set operator $\Psi'$ and that satisfy the following 
threshold superposition property: 

$$[\Psi(I)](x) = \sup\{u : x \in \Psi'(U_u(I))\}$$

where $U_u$ is the thresholding operator at level u, and 
I is an image. The thresholding operator at level u is 
defined as 

$$U_u(I) = \{x : I(x) \geq u\}$$

Let us define a variant of the thresholding operator 
that outputs a binary function (instead of a set): 
$(U'_u(I))(x)$ is 1 if $I(x) \geq u$, and 0 otherwise. Then, a flat 
operator $\Psi$ that commutes with thresholding is said 
to satisfy: 

$$U'_u \Psi = \Psi U'_u$$
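
The threshold superposition property can be checked numerically on a small case. The sketch below builds a flat dilation from a set dilation by stacking thresholds and compares it with the direct max formula. It assumes a 1-D signal of non-negative integers, a structuring element containing the origin, and a domain restricted to the signal's indices; all function names are illustrative.

```python
def threshold(I, u):
    """U_u(I) = {x : I(x) >= u}."""
    return {x for x, v in enumerate(I) if v >= u}


def set_dilate(A, B, n):
    """Set dilation {x : B_x intersects A}, restricted to the domain {0, ..., n-1}."""
    return {a - b for a in A for b in B if 0 <= a - b < n}


def flat_dilation(I, B):
    """[Psi(I)](x) = sup{u : x in Psi'(U_u(I))}, with Psi' the set dilation."""
    n = len(I)
    levels = range(0, max(I) + 1)
    return [max(u for u in levels if x in set_dilate(threshold(I, u), B, n))
            for x in range(n)]


def direct_dilation(I, B):
    """Reference: [delta_B(I)](x) = max over b in B of I(x + b)."""
    n = len(I)
    return [max(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


B = (-1, 0, 1)
I = [1, 3, 2, 5, 4, 4, 0]
stacked = flat_dilation(I, B)
direct = direct_dilation(I, B)
```

Both routes give the same signal here, which is what the superposition property predicts for a flat operator.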



BASIC MORPHOLOGICAL FILTERS 
Openings and Closings 

In morphological processing, a filter is an increasing 
and idempotent transformation. 

The two most fundamental filters in morphological 
processing arise when there is an order between the input 
and the filter output. They are the so-called openings 
and closings, symbolized, respectively, by $\gamma$ and $\varphi$. 

• An opening $\gamma$ is an anti-extensive morphological 
  filter. 

• A closing $\varphi$ is an extensive morphological 
  filter. 

The names "algebraic openings" and "algebraic 
closings" are also used in the literature to refer to these 
most general types of openings and closings. 

The computation of openings and closings that use 
structuring elements as the "shape probes" to process 
input image shapes is discussed next. They are defined 
in terms of dilations and erosions by structuring 
element. 

For an input set A, an opening by structuring 
element B, symbolized by $\gamma_B$, is the set of points x that 
belong to a translated structuring element that fits a set 
A, i.e., that is included in A. Let us establish how $\gamma_B$ 
is computed in the next definition, which applies both 
to sets and images. 

An opening by structuring element B, symbolized 
by $\gamma_B$, is defined by 

$$\gamma_B = \delta_{\check{B}} \varepsilon_B$$

i.e., $\gamma_B$ is the sequential composition of an erosion $\varepsilon_B$ 
followed by a dilation $\delta_{\check{B}}$, where $\check{B}$ denotes B 
transposed. 

This type of filter first erodes an input image by 
the $\varepsilon_B$ erosion, and then the subsequent $\delta_{\check{B}}$ dilation 
generally recovers in some sense the parts of the input 
image that have persisted. Nevertheless, not everything 
is normally recovered, and the output image is always 
less than or equal to the input image. 

The definition of the dual closing follows. A closing 
by structuring element B, symbolized by $\varphi_B$, is 
defined by 

$$\varphi_B = \varepsilon_{\check{B}} \delta_B$$

i.e., $\varphi_B$ is the sequential composition of a dilation 
$\delta_B$ followed by an erosion $\varepsilon_{\check{B}}$, where $\check{B}$ denotes B 
transposed. 
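
The two compositions can be sketched for 1-D gray-level signals as follows. The code is only an illustration under stated assumptions: structuring elements are tuples of integer offsets that contain the origin, samples outside the signal are skipped, and the example signal is hypothetical.

```python
def erode(I, B):
    """[epsilon_B(I)](x) = min of I(x + b) over b in B (in-range samples only)."""
    n = len(I)
    return [min(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def dilate(I, B):
    """[delta_B(I)](x) = max of I(x + b) over b in B (in-range samples only)."""
    n = len(I)
    return [max(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def transpose(B):
    """B inverted with respect to the origin."""
    return tuple(-b for b in B)


def opening(I, B):
    """Erosion by B followed by dilation by the transposed B."""
    return dilate(erode(I, B), transpose(B))


def closing(I, B):
    """Dilation by B followed by erosion by the transposed B."""
    return erode(dilate(I, B), transpose(B))


B = (-1, 0, 1)
I = [1, 3, 2, 5, 4, 4, 0]
opened = opening(I, B)   # [1, 2, 2, 4, 4, 4, 0]
closed = closing(I, B)   # [3, 3, 3, 5, 4, 4, 4]
```

On this signal the opening removes the narrow peaks (values 3 and 5) while staying below the input, the closing fills the narrow valleys while staying above it, and applying either filter a second time changes nothing.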

Alternated Filters 

The sequential compositions of an opening $\gamma$ and a 
closing $\varphi$ are called alternated sequential compositions. 

A morphological alternated filter is a sequential 
composition of an opening and a closing, i.e., 

$$\varphi\gamma \text{ and } \gamma\varphi$$

are alternated filters. 

An important fact is that there is generally no 
ordering between the input and output of alternated 
filters, i.e., 

$$I \not\leq \varphi\gamma(I) \not\leq I$$

In addition, there is generally no ordering between 
$\varphi\gamma$ and $\gamma\varphi$. 

Alternated filters are quite useful in image processing 
and analysis because they combine in some way the 
effects of both openings and closings in one filter. 
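
The lack of ordering between input and output can be seen on a small 1-D signal. The sketch below assumes a symmetric 3-point structuring element (so that transposition can be ignored), skips out-of-range samples at the borders, and uses a made-up signal; none of these choices come from the original text.

```python
def erode(I, B):
    n = len(I)
    return [min(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def dilate(I, B):
    n = len(I)
    return [max(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


B = (-1, 0, 1)  # symmetric, so B transposed equals B


def opening(I):
    return dilate(erode(I, B), B)


def closing(I):
    return erode(dilate(I, B), B)


def leq(f, g):
    """Pointwise order between two signals of equal length."""
    return all(a <= b for a, b in zip(f, g))


I = [1, 3, 2, 5, 4, 4, 0]
phi_gamma = closing(opening(I))   # opening first, then closing
gamma_phi = opening(closing(I))   # closing first, then opening
```

Here `phi_gamma` is above `I` at some samples and below it at others, so neither `leq(I, phi_gamma)` nor `leq(phi_gamma, I)` holds, while applying the alternated filter again leaves `phi_gamma` unchanged.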

Parallel Combination Properties 

The class of openings (or, respectively, of closings) is 
closed under the sup (respectively, inf) operation. In 
other words: 

• The sup of openings is an opening. 
• The inf of closings is a closing. 

Different structuring elements can be combined to 
achieve a desired shape filtering effect. For example, 
the effect of a sup of openings such as ($\gamma_A$ sup $\gamma_B$), 
which is itself an opening, can be quite different from 
either $\gamma_A$ or $\gamma_B$. 
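
A small set example makes the difference concrete. In the sketch below (an illustration with assumed names and data), a plus-shaped set is opened with a horizontal and with a vertical 3-point segment; each opening keeps only one arm, whereas their sup (the union, for sets) keeps the whole shape.

```python
def erode(A, B):
    """Points x such that B translated to x is included in A."""
    b0 = next(iter(B))
    candidates = {(ax - b0[0], ay - b0[1]) for (ax, ay) in A}
    return {x for x in candidates
            if all((x[0] + bx, x[1] + by) in A for (bx, by) in B)}


def dilate(A, B):
    """Points x such that B translated to x intersects A."""
    return {(ax - bx, ay - by) for (ax, ay) in A for (bx, by) in B}


def transpose(B):
    return {(-bx, -by) for (bx, by) in B}


def opening(A, B):
    """Erosion by B followed by dilation by the transposed B."""
    return dilate(erode(A, B), transpose(B))


# Hypothetical plus-shaped input set and two segment-shaped structuring elements.
horizontal_arm = {(x, 0) for x in range(-2, 3)}
vertical_arm = {(0, y) for y in range(-2, 3)}
plus = horizontal_arm | vertical_arm
H = {(-1, 0), (0, 0), (1, 0)}
V = {(0, -1), (0, 0), (0, 1)}

sup_of_openings = opening(plus, H) | opening(plus, V)
```

Opening with `H` alone returns the horizontal arm and opening with `V` alone the vertical arm; their union restores the full plus shape.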



GRANULOMETRIES AND 
ANTI-GRANULOMETRIES 

The granulometry concept formalizes the size 
distribution notion. Size distributions are families of 
transformations $\Psi_i$ with a size parameter i that satisfy 
the following axioms: 

• Increasingness 

• Anti-extensivity 

• Absorption: if $\Psi_i$, $\Psi_j$ belong to a size distribution, where $i \leq j$, then 

$$\Psi_i \Psi_j = \Psi_j \Psi_i = \Psi_{\max(i,j)}$$

Figure 3. Granulometry: (a) input image I; (b) $\gamma_4(I)$; (c) $\gamma_6(I)$; (d) $\gamma_8(I)$ 



In morphological filtering, the so-called 
granulometries are families of transformations that satisfy the 
size distribution axioms above. A family of openings 
$\{\gamma_i\}$, where $i \in S = \{1, \ldots, n\}$, is a granulometry if, for 
all $i, j \in S$, 

$$i \leq j \Rightarrow \gamma_i \geq \gamma_j$$

i.e., an ordered family of openings constitutes a 
granulometry. 

The dual concept of a granulometry is called an 
anti-granulometry, which is an ordered family of closings, 
as defined next. A family of closings $\{\varphi_i\}$, where $i \in S 
= \{1, \ldots, n\}$, is an anti-granulometry if, for all $i, j \in S$, 

$$i \leq j \Rightarrow \varphi_i \leq \varphi_j$$

Quite often, both a granulometry and an 
anti-granulometry are computed. 

Normally, quantitative increasing measures are 
computed at each output. These measure values build 
a curve that can be used to characterize an input image 
if the measure criterion is appropriate. 

$$M(\gamma_N(I)) \leq \ldots \leq M(\gamma_1(I)) \leq M(I) \leq M(\varphi_1(I)) \leq \ldots \leq M(\varphi_N(I))$$

An example of an increasing criterion is the area 
or number of pixels in binary images, or the volume 
in non-binary images. 

To build a family of openings and a family of closings 
by structuring elements that constitute, respectively, 
a granulometry and an anti-granulometry, an appropriate 
family of structuring elements $\{iB \mid i \in \{0, \ldots, N\}\}$ that 
ensures the ordering of openings and closings is needed. 
Particularly, the family of structuring elements must 
satisfy the following property: 

$$\gamma_{(i-1)B}(iB) = iB, \text{ for } i \geq 1$$



Figure 3 illustrates the granulometry concept. Three 
opening outputs have been displayed in parts (b), (c) 
and (d), particularly the outputs corresponding to 
openings $\gamma_4$, $\gamma_6$ and $\gamma_8$, where the subindex indicates the size 
of the structuring element (Inesta & Crespo, 2003). A 
subindex i refers to a square of side 2i+1. There is an 
ordering between the four images: part (a) $\geq$ part (b) 
$\geq$ part (c) $\geq$ part (d). 



MULTI-LEVEL MORPHOLOGICAL 
FILTERING 

Alternating Sequential Filters 

Granulometries and anti-granulometries make it possible to build 
complex filters composed of ordered openings and 
closings. 

An alternating sequential filter ASF is an ordered 
sequential composition of alternated filters $\varphi_j\gamma_j$ or $\gamma_j\varphi_j$, 
such as 

$$ASF_i = \varphi_i\gamma_i \ldots \varphi_j\gamma_j \ldots \varphi_1\gamma_1$$

$$ASF'_i = \gamma_i\varphi_i \ldots \gamma_j\varphi_j \ldots \gamma_1\varphi_1$$

where $i \geq j \geq 1$, and where $\gamma_j$ and $\varphi_j$ belong, respectively, 
to a granulometry and an anti-granulometry. 

Alternating sequential filters satisfy the following 
absorption property: if $i \geq j$, then 

$$ASF_i \, ASF_j = ASF_i$$

$$ASF_j \, ASF_i \leq ASF_i$$

$$ASF'_i \, ASF'_j = ASF'_i$$

$$ASF'_j \, ASF'_i \geq ASF'_i$$
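
An alternating sequential filter of the first kind can be sketched for 1-D signals by looping over increasing sizes. As before, this is an illustration under stated assumptions: the size-i structuring element is a centered segment of 2i+1 points, out-of-range samples are skipped at the borders, and the test signal is hypothetical.

```python
def erode(I, B):
    n = len(I)
    return [min(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def dilate(I, B):
    n = len(I)
    return [max(I[x + b] for b in B if 0 <= x + b < n) for x in range(n)]


def opening(I, i):
    B = range(-i, i + 1)          # centered, symmetric segment of 2i+1 points
    return dilate(erode(I, B), B)


def closing(I, i):
    B = range(-i, i + 1)
    return erode(dilate(I, B), B)


def asf(I, n):
    """ASF_n = phi_n gamma_n ... phi_1 gamma_1 (size 1 is applied first)."""
    out = list(I)
    for i in range(1, n + 1):
        out = closing(opening(out, i), i)
    return out


def leq(f, g):
    return all(a <= b for a, b in zip(f, g))


I = [0, 3, 3, 0, 5, 5, 5, 5, 0, 2, 0]
level_1 = asf(I, 1)
level_2 = asf(I, 2)
```

On this signal, applying the size-2 filter to the size-1 output gives the same result as applying it to the input, and applying the size-1 filter to the size-2 output gives something no larger than the size-2 output, matching the first two absorption relations.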

Morphological Pyramids 

Multi-level (or multi-scale) operators are families of 
transformations that depend on a scale parameter i. 
Within morphological filters, cases that satisfy the 
pyramid condition are: 

granulometries, 
anti-granulometries, and 
alternating sequential filters. 
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FUTURE TRENDS 

Operators that consider connectivity aspects have 
been an active research and work area in morphologi- 
cal processing. Connectivity integrates easily in the 
morphological filtering framework using the connected 
class concept (and the associated opening) introduced 
in (Serra, 1988). 

The class of connected filters (Serra & Salembier, 
1993) (Crespo, Serra, & Schafer, 1993) (Vincent, 1993) 
(Salembier & Serra, 1995) (Crespo, Serra, & Schafer, 
1995) (Breen & Jones, 1996) (Crespo & Schafer, 
1997) (Crespo & Maojo, 1998) (Garrido, Salembier, 
& Garcia, 1998) (Heijmans, 1999) (Crespo, Maojo, 
Sanandres, Billhardt, & Munoz, 2002) (Crespo & 
Maojo, 2008), which preserve shapes particularly 
well, has been successfully used in image processing 
and analysis applications. In more recent years, certain 
types of connected filters, such as the so-called levelings 
(Meyer, 1998) (Meyer, 2004), whose origin in the set 
framework can be traced back to (Crespo et al., 1993) 
(Crespo & Schafer, 1997), have been the focus of new 
research efforts. 



CONCLUSION 

This article has provided a summary of morphological 
filtering, which is qualitatively different from linear 
filtering. These differences are clear when morphological 
filtering is approached by analysing the underlying 
algebraic framework and the key importance of ordering 
and increasingness. 

Morphological filtering provides a distinctive 
type of image analysis that is appropriate for dealing with 
shapes. Although in its origin morphological filtering 
was especially associated with set processing (and many 
concepts are originally set-based), it extends to 
non-binary gray-level functions. 
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KEY TERMS 

Duality: The duality principle states that, for each 
morphological operator, there exists a dual one. In 
sets, the duality is established with respect to the set 
complementation operation (see further details in the 
text). 

Extensivity: A transformation is extensive when 
its output is larger than or equal to the input. 
Anti-extensivity is the opposite concept: a transformation is 
anti-extensive when its output is smaller than or equal 
to the input. 

Idempotence: A transformation $\Psi$ is said to be 
idempotent if, when sequentially applied twice, it 
does not change the output of the first application, 
i.e., $\Psi\Psi = \Psi$. 

Image Transformation: An operation that 
processes an input image and produces an output image. 

Increasingness: A transformation is increasing 
when it preserves ordering. If $\Psi$ is increasing, then 
$a \leq b \Rightarrow \Psi(a) \leq \Psi(b)$. 

Lattice: A complete lattice is a set of elements with 
a partial ordering relationship and two operations called 
supremum and infimum. 

Morphological Filter: An increasing and 
idempotent transformation. 

Multi-Scale Transformation: A transformation that 
displays some characteristics controllable by means of 
(at least) a parameter, which is called the size or scale 
parameter. 
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INTRODUCTION 

Since McCulloch and Pitts' seminal work (McCulloch 
& Pitts, 1943), several models of discrete neural 
networks have been proposed, many of them able 
to assign a discrete value (other than 
unipolar or bipolar) to the output of a single neuron. 
These models have focused on a wide variety of 
applications. One of the most important models was 
developed by J. Hopfield in (Hopfield, 1982), which 
has been successfully applied in fields such as pattern 
and image recognition and reconstruction (Sun et al., 
1995), design of analog-digital circuits (Tank & Hopfield, 
1986), and, above all, in combinatorial optimization 
(Hopfield & Tank, 1985) (Takefuji, 1992) (Takefuji & 
Wang, 1996), among others. 

The purpose of this work is to review some appli- 
cations of multivalued neural models to combinatorial 
optimization problems, focusing specifically on the 
neural model MREM, since it includes many of the 
multivalued models in the specialized literature. 



BACKGROUND 

In Hopfield and Tank's pioneering work (Hopfield & 
Tank, 1985), neural networks were applied for the first 
time to solve combinatorial optimization problems, 
specifically the well-known travelling salesman problem. 
They developed two types of networks, discrete and 
continuous, although the latter has mostly been chosen 
to solve optimization problems, on the grounds that it helps 
to escape more easily from local optima. Since then, the 
search for better neural algorithms to face the diverse 
problems of combinatorial optimization (many of them 
belonging to the class of NP-complete problems) has 
been the objective of researchers in this field. 

This method of optimization consists of minimizing 
an energy function, whose parameters and constraints 
are obtained by means of identification with the 
objective function of the optimization problem. In this case, 
the energy function has the form: 

$$E(S) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} s_i s_j + \sum_{i=1}^{N} \theta_i s_i$$

where N is the number of neurons of the network, $w_{i,j}$ 
is the synaptic weight between neurons j and i, and $\theta_i$ 
is the threshold or bias of the neuron i. 
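
As a minimal sketch of this energy function, the fragment below evaluates it for a two-neuron bipolar network. The weight matrix, thresholds and states are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def hopfield_energy(W, theta, s):
    """E(S) = -1/2 * sum_ij w_ij s_i s_j + sum_i theta_i s_i."""
    n = len(s)
    pair_term = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    bias_term = sum(theta[i] * s[i] for i in range(n))
    return -0.5 * pair_term + bias_term


# Hypothetical two-neuron network with a symmetric excitatory connection.
W = [[0, 1], [1, 0]]
theta = [0.5, -0.5]

aligned = hopfield_energy(W, theta, [1, 1])     # -1.0
opposed = hopfield_energy(W, theta, [1, -1])    #  2.0
```

With a positive weight between the two neurons, the state in which both outputs agree has the lower energy.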

In the discrete version of Hopfield's model, 
component $s_i$ of the state vector $S = (s_1, \ldots, s_N)$ can take 
values in $\mathcal{M} = \{-1, 1\}$ (constituting the bipolar model) 
or in $\mathcal{M} = \{0, 1\}$ (unipolar model). In the continuous 
version, $\mathcal{M} = [-1, 1]$ or $\mathcal{M} = [0, 1]$. This continuous 
version, although it has traditionally been the most 
used for optimization problems, presents certain 
drawbacks: 

• Special mechanisms, possibly in the form of 
  constraints, must be added to ensure 
  that, in the final state of the network, all the 
  components of state vector S belong to {-1, 1} 
  or {0, 1}. 

• The traditional dynamics used in this model, 
  when implemented on a digital computer, does not 
  guarantee the decrease of the energy function in 
  every iteration, so it is not ensured that the final 
  state is a minimum of the energy function 
  (Galan-Marin, 2000). 
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However, the biggest problem of this model (the 
discrete as well as the continuous one) is the possibil- 
ity to converge to a non feasible state, or to a local 
(not global) minimum. Wilson and Pawley (1988) 
demonstrated, through massive simulations, that, for 
the travelling salesman problem of 10 cities, only 8% 
of the solutions were feasible, and most not good. 
Moreover, this proportion got worse when problem 
size was increased. 

After this, many works focused on improving 
Hopfield's network: 

• By modifying the energy function (Xu & Tsai, 1991). 

• By adjusting the numerous parameters present in 
  the network, as in (Lai & Coghill, 1988). 

• By using stochastic techniques in the dynamics 
  of the network (Kirkpatrick et al., 1983) (Aarts 
  & Korst, 1988). 

In particular, researchers tried to improve the 
efficiency of Hopfield's network for the travelling salesman 
problem, achieving acceptable results, but inferior to those of 
Operations Research techniques (Takahashi, 1997). 
The reason for these disappointing results is that the 
linear formulation used by these techniques is a great 
advantage in comparison with neural networks, which 
unavoidably use a quadratic energy function, impeding 
the use of subpath deletion techniques (Smith, 1996) 
and giving rise to a larger number of 
local minima. 

Another line of research was devoted to the 
improvement of Hopfield-type recurrent networks and their 
application to diverse optimization problems, in which 
some results proved to be better than those obtained 
by traditional Operations Research techniques (Smith 
& Krishnamoorthy, 1998). Takefuji's work (Takefuji, 
1992) (Lee et al., 1992) (Takefuji & Wang, 1996), with 
a great number of publications in international media, 
must be highlighted. Their results have been surpassed 
by the OCHOM model (Galan-Marin & Munoz-Perez, 
2001). 



MULTIVALUED DISCRETE RECURRENT 
MODEL. APPLICATION TO 
COMBINATORIAL OPTIMIZATION 
PROBLEMS 

A new generalization of Hopfield's model arises in the 
works (Merida-Casermeiro, 2000) (Merida-Casermeiro 
et al., 2001), where the MREM (Multivalued REcurrent 
Model) model is presented. 

The Neural MREM Model 

This model presents two essential features that make it 
very versatile and that increase its applicability: 

• The output of each neuron, $s_i$, is a value of the 
  set $\mathcal{M} = \{m_1, m_2, \ldots, m_L\}$, which is not necessarily 
  numeric. 

• The concept of similarity function f between 
  neuron outputs is introduced: f(x,y) represents 
  the similarity between neuron states x and y. 

This way, the energy function of this model is as 
follows: 

$$E(S) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} f(s_i, s_j) + \sum_{i=1}^{N} \theta_i(s_i)$$

where $\theta_i : \mathcal{M} \to \mathbb{R}$ is a generalization of the thresholds 
of each neuron. 

Thanks to the features mentioned above, certain 
optimization problems (such as the travelling 
salesman problem) have a better representation in this model 
than in the unipolar or bipolar Hopfield models and 
their successors. 

It is clear that MREM includes Hopfield's models 
(with outputs in $\mathcal{M} = \{-1, 1\}$ or in $\mathcal{M} = \{0, 1\}$) if we 
consider the similarity function given by the product 
f(a,b) = ab. Other multivalued models, like MAREN 
or SOAR (Erdem & Ozturk, 1996) (Ozturk & Abut, 
1997), are also generalized by MREM. 
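
The role of the similarity function can be sketched as follows. The code evaluates the MREM energy for a two-neuron network, first with the product similarity (which gives back the Hopfield value) and then with non-numeric states. The network, the threshold functions and the "same state" similarity are hypothetical choices made for illustration.

```python
def mrem_energy(W, theta, s, f):
    """E(S) = -1/2 * sum_ij w_ij f(s_i, s_j) + sum_i theta_i(s_i)."""
    n = len(s)
    pair_term = sum(W[i][j] * f(s[i], s[j]) for i in range(n) for j in range(n))
    bias_term = sum(theta[i](s[i]) for i in range(n))
    return -0.5 * pair_term + bias_term


W = [[0, 1], [1, 0]]
no_bias = [lambda v: 0, lambda v: 0]       # threshold functions that are identically zero


def product(a, b):
    """Similarity f(a, b) = ab, which recovers the Hopfield energy."""
    return a * b


def same(a, b):
    """Hypothetical similarity for non-numeric states: 1 if equal, 0 otherwise."""
    return 1 if a == b else 0


bipolar = mrem_energy(W, no_bias, [1, -1], product)              # 1.0, as in Hopfield's model
matching = mrem_energy(W, no_bias, ["red", "red"], same)         # -1.0
different = mrem_energy(W, no_bias, ["red", "green"], same)      # 0
```

Because the states only enter through f and the threshold functions, the same energy expression works whether the outputs are numbers or labels.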

The dynamics for this network is chosen according 
to the problem to be tackled. 
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Application to Several Combinatorial 
Optimization Problems 



This multivalued model has been successfully applied 
to diverse optimization problems, outperforming the 
best-established algorithms. Several of these 
applications can be found in (Merida-Casermeiro et al., 2003) 
(Merida-Casermeiro & Lopez-Rodriguez, 2005) 
(Lopez-Rodriguez et al., 2006). 

These problems are typical representatives of the 
NP-complete complexity class, indicating their degree 
of difficulty in resolution. 

The Travelling Salesman Problem 

The Travelling Salesman Problem (TSP) is one of the 
best-known and most studied combinatorial optimization 
problems, due to its wide range of real-life applications 
and intrinsic complexity. 

Real-life applications cover aspects such as automatic 
routing for robots and hole location in printed circuit 
design (Reinelt, 1994), as well as gas turbine checking, 
machine task scheduling or crystallographic analysis 
(Bland & Shallcross, 1987), among others. 

This problem can be stated as follows: given N cities 
$X_1, \ldots, X_N$ and distances $d_{i,j}$ between each pair of cities $X_i$ 
and $X_j$, the objective is to find the shortest closed tour 
visiting each city once. 

In order to get the TSP solved by this neural model, 
two identifications must be done: 

• A network state must be identified with a solution 
  to the TSP: since a solution to the N cities TSP 
  can be represented as a permutation in the set of 
  numbers $\{1, \ldots, N\}$, the net will be formed by N 
  neurons, taking values in the set $\mathcal{M} = \{1, \ldots, N\}$, 
  such that state vector $S = (s_1, \ldots, s_N)$ represents a 
  permutation of $\{1, \ldots, N\}$. With this representation, 
  $s_i = k$ means that the kth city will be visited in the ith 
  place. 

• The energy function must be identified with the total 
  distance of a tour: if we let $f(x,y) = -2d_{x,y}$ and 

  $$w_{i,j} = \begin{cases} 1, & (j = i+1) \lor ((i = N) \land (j = 1)) \\ 0, & \text{otherwise} \end{cases}$$

  the energy function obtained is 

  $$E(S) = \sum_{i=1}^{N-1} d_{s_i, s_{i+1}} + d_{s_N, s_1}$$

  the total distance of the tour represented by state 
  vector S. 
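
The identification of the energy with the tour length can be checked numerically. The sketch below evaluates the quadratic energy term with the similarity and weights just defined and compares it with a direct sum of consecutive distances. It assumes 0-based city and position indices (the text uses 1-based ones), zero thresholds, and a hypothetical distance matrix for four cities.

```python
def tsp_energy(D, s):
    """-1/2 * sum_ij w_ij f(s_i, s_j) with f(x, y) = -2 d_xy and successor weights."""
    n = len(s)

    def w(i, j):
        # 1 when j is the position visited right after i (cyclically), else 0.
        return 1 if j == (i + 1) % n else 0

    def f(x, y):
        return -2 * D[x][y]

    return -0.5 * sum(w(i, j) * f(s[i], s[j]) for i in range(n) for j in range(n))


def tour_length(D, s):
    """Total distance of the closed tour that visits the cities in the order given by s."""
    n = len(s)
    return sum(D[s[i]][s[(i + 1) % n]] for i in range(n))


# Hypothetical symmetric distances between four cities.
D = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]

short_tour = [0, 1, 2, 3]
long_tour = [0, 2, 1, 3]
```

For both permutations the energy equals the tour length (4 and 6 here), so minimizing the energy over feasible states is the same as looking for the shortest tour.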

Computational dynamics is based on starting with 
a random feasible initial state vector and updating 
neuron outputs so as to keep the current state vector inside 
the set of feasible states. To this end, at each iteration, a 
2-opt update is made on the current state vector; that 
is, every pair of neurons p, q with $q > p + 1$ is studied, 
checking in parallel whether there exists a cross 
between segments $(s_p, s_{p+1})$ and $(s_q, s_{q+1})$. In this case, 
the next relation holds: 

$$d_{s_p, s_{p+1}} + d_{s_q, s_{q+1}} > d_{s_p, s_q} + d_{s_{p+1}, s_{q+1}}$$

Then, the trajectory between cities $s_{p+1}$ and $s_q$ is 
inverted, that is, if S is the current state, the new state 
vector S' will be defined by: 

$$s'_i = \begin{cases} s_{q+p+1-i}, & p + 1 \leq i \leq q \\ s_i, & \text{otherwise} \end{cases}$$
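
The 2-opt update can be sketched as a plain sequential loop. This is an assumption-laden illustration rather than the model's actual dynamics: the text describes the pairs as being checked in parallel, whereas the code below scans them one after another and applies each improving reversal immediately; indices are 0-based, and the distance matrix is hypothetical.

```python
def tour_length(D, s):
    n = len(s)
    return sum(D[s[i]][s[(i + 1) % n]] for i in range(n))


def two_opt(D, s):
    """Repeat 2-opt reversals until no pair of tour edges satisfies the crossing relation."""
    s = list(s)
    n = len(s)
    improved = True
    while improved:
        improved = False
        for p in range(n - 2):
            for q in range(p + 2, n):
                a, b = s[p], s[p + 1]
                c, d = s[q], s[(q + 1) % n]
                if a == d:
                    continue  # the two edges share a city, nothing to uncross
                if D[a][b] + D[c][d] > D[a][c] + D[b][d]:
                    # Invert the trajectory between positions p+1 and q.
                    s[p + 1:q + 1] = s[p + 1:q + 1][::-1]
                    improved = True
    return s


# Hypothetical symmetric distances between four cities.
D = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]

start = [0, 2, 1, 3]
result = two_opt(D, start)
```

Starting from the tour of length 6, a single reversal of positions 1 to 2 produces the tour of length 4. Each accepted reversal strictly shortens the tour, which is why the loop terminates, and the state remains a permutation throughout.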



As an additional technique for improvement, 3-opt 
updates have also been considered: the tour is 
decomposed into three consecutive arcs, A, B and C, which 
are then recombined in all possible ways: {ABC, ACB, 
AB'C, ABC', AB'C', AC'B, ACB', AC'B'}, where 
A', B', C' are the reversed arcs corresponding to A, B, and 
C, respectively. Note that {ABC, AB'C, ABC', AC'B'} 
are 2-opt updates, so there is no need to check them 
again. 

The next state of the net will be the combination 
that decreases most the energy function. Further details 
in (MeridaCasermeiro et al., 2003). 

In (MeridaCasermeiro et al., 2003), some experi- 
mental results are provided, for problems from the 
TSPLIB repository (see Table 1). This model is com- 
pared against KNIES (Aras et al., 1999), a model based 
on Kohonen's self organizing map. MREM proved to 
outperform KNIES, obtaining in many cases almost 
optimal solutions. 



Figure 1. Best solution found by MREM (left, error-13%) and optimal solution (right)






Table 1. Results of KNIES and MREM for the TSP for some instances from TSPLIB

                        KNIES      MREM
Instance    Optimum   Best (%)   Best (%)   Av. (%)   t (sec)
eil51           426       2.86       0.23      2.43      3.12
st70            675       1.51       0.00      1.89      9.01
eil76           538       4.98       1.30      3.43     10.80
rd100          7910       2.09       0.00      3.02     61.70
eil101          629       4.66       1.43      3.51     27.76
lin105        14379       1.29       0.00      1.71     28.83
pr107         44303       0.42       0.15      0.82     49.79
pr124         59030       0.08       0.00      1.23     59.51
bier127      118282       2.76       0.42      2.06     66.29
kroA200       28568       5.71       3.49      6.70    318.44



The Graph Partition Problem 

Let $\mathcal{G} = (V, \mathcal{E})$ be an undirected graph without
self-connections. $V = \{v_i\}$ is the set of vertices and $\mathcal{E}$ is
the set of $n_e$ edges. For each edge $(v_i, v_j) \in \mathcal{E}$ there is
a weight $c_{i,j} \in \mathbb{R}^+$. All weights can be expressed by a
symmetric real matrix C, with $c_{i,j} = 0$ when the arc
$(v_i, v_j)$ does not exist.

MaxCut Problem: to find a partition of V into K
disjoint sets $A_i$ such that the sum of the weights of the
edges from $\mathcal{E}$ that have their endpoints in different
elements of the partition is maximum. Therefore, the
function to maximize is

$$\sum_{m < n} \; \sum_{v_i \in A_m,\, v_j \in A_n} c_{i,j}$$



To solve the MaxCut problem with MREM, we need
N neurons, one per node in V. The output of neuron
i, $s_i \in \mathcal{M} = \{1, 2, \dots, K\}$, will denote that the i-th node is
assigned to $A_{s_i}$.

Since it is equivalent to maximize the cost of
edges cut by the partition and to minimize the cost of
edges with endpoints in the same set of the partition,
the objective function can be modelled as an energy
function by taking $w_{i,j} = -2c_{i,j}$ and $f(x, y) = \delta_{x,y}$ (that
is, f(x, y) = 1 if, and only if, x = y; otherwise it is 0),
considering $\theta_i = 0$.

The dynamics used in (Merida-Casermeiro & Lopez-Rodriguez,
2005) was named best2.

best2 consists in obtaining the greatest decrease of the
energy function by changing the state of only two neurons
at a time. If neurons p and q are to be updated, the
energy increments $\Delta E(i, j)$ obtained when $s_p = i$ and $s_q = j$, for
$i, j \in \{1, \dots, K\}$, are computed. Then, the state of minimum
increase is chosen as the new network state.
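A best2 update can be illustrated with the short sketch below. It is a simplification under stated assumptions: set labels run from 0 to K-1, the energy is taken as the total weight of edges whose endpoints share a set (each edge counted once), and the full energy is recomputed for every candidate instead of using the increments $\Delta E(i, j)$. The names `same_set_cost` and `best2_step` are hypothetical.

```python
import itertools


def same_set_cost(state, c):
    """Total weight of edges whose two endpoints are assigned to the same set."""
    n = len(state)
    return sum(c[i][j] for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j])


def best2_step(state, c, k, p, q):
    """Try every label pair for neurons p and q and keep the lowest-energy state."""
    best_state, best_energy = list(state), same_set_cost(state, c)
    for a, b in itertools.product(range(k), repeat=2):
        candidate = list(state)
        candidate[p], candidate[q] = a, b
        e = same_set_cost(candidate, c)
        if e < best_energy:
            best_state, best_energy = candidate, e
    return best_state, best_energy


# Hypothetical 4-node weighted cycle: edges (0,1)=3, (1,2)=1, (2,3)=2, (0,3)=1
c = [[0, 3, 0, 1],
     [3, 0, 1, 0],
     [0, 1, 0, 2],
     [1, 0, 2, 0]]
state, e = best2_step([0, 0, 0, 0], c, k=2, p=0, q=2)
print(state, e)
```

Minimizing the weight kept inside the sets is equivalent to maximizing the weight of the cut, since the two quantities add up to the total edge weight.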

Using this dynamics, in (Merida-Casermeiro &
Lopez-Rodriguez, 2005) the MREM model is compared
against some other networks, such as OCHOM (Galan-Marin
& Munoz-Perez, 2001), obtaining the best results
in the authors' experiments (see Table 2).



Table 2. Results for MaxCut comparing MREM and OCHOM 



N 


dens 




MREM 




( 


DCHOM 






Best 


Av. 


t 


Best 


Av. 


t 




9,95 


276,8 


256,28 


9,95 


276,8 


242,15 


9,9923 




9,25 


1913,2 


979,84 


9,96 


999,6 


926,26 


9,9926 


59 


9,5 


1778,8 


1724,98 


9,96 


1778,8 


1694,44 


9,9933 




9,75 


2663,6 


2475,48 


9,95 


2646 


2432,47 


9,9936 




9,9 


2941,8 


2876,18 


9,96 


2949,4 


2865,83 


9,9931 




9,95 


999,2 


917,72 


9,15 


958,8 


867,64 


9,9964 




9,25 


3719,2 


3629,9 


9,14 


3725,5 


3571,24 


9,9986 


199 


9,5 


6711,6 


6637,98 


9,13 


6695,8 


6585,54 


9,9126 




9,75 


9816,2 


9524,1 


9,14 


9816,2 


9444,33 


9,9118 




9,9 


11348,8 


11215,96 


9,14 


11391,3 


11148,4 


9,9199 




9,95 


2999,8 


1933,6 


9,26 


1929,6 


1837,43 


9,9147 




9,25 


7999 


7897,16 


9,26 


7949,2 


7699,35 


9,9258 


159 


9,5 


14791,4 


14531,96 


9,24 


14658,4 


14489,5 


9,9299 




9,75 


21126,2 


29899,94 


9,22 


21124 


29997,6 


9,9252 




9,9 


24926 


24589,62 


9,22 


24859,7 


24533,1 


9,9256 




9,95 


3411,4 


3321,84 


9,38 


3499,5 


3316,28 


9,9276 




9,25 


13741 


13533,9 


9,35 


13617,9 


13439,7 


9,9468 


299 


9,5 


25759,8 


25599,18 


9,34 


25779,8 


25526,8 


9,9451 




9,75 


37938,6 


36789,2 


9,32 


36932 


36683,4 


9,9486 




9,9 


43584,8 


43296,26 


9,33 


43429,6 


43194,6 


9,9462 
The 2-Pages Graph Layout Problem

In recent years, several graph representation problems
have been studied in the literature. Most of them are
related to the linear graph layout problem, in which
the vertices of a graph are placed along a horizontal
"node line", or "spine" (dividing the plane into two
half-planes or "pages"), and then edges are added to this
representation as specified by the adjacency matrix. The
objective of this problem is to minimize the number of
crossings produced by such a layout.

Some examples of problems associated with this linear
graph layout problem (or 2-pages crossing number
problem, 2PCNP) are the bandwidth problem (Chinn et
al., 1982), the book thickness problem (Kainen, 1990),
the pagenumber problem (Malitz, 1994), the boundary
VLSI layout problem (Ullman, 1984) and the single-row
routing problem (Raghavan & Sahni, 1983), as well as printed
circuit board layout (Sinden, 1966) and automated
graph drawing (Tamassia et al., 1988).

In (Lopez-Rodriguez et al., 2007), a neural model,
derived from MREM, is designed to solve this problem.
One of the differences between this model and the algorithms
developed in the literature is that there is no need to assign
a good ordering of the vertices in a preprocessing
step. The model computes this optimal node order as well
as the relative position of the arcs.

To solve the 2PCNP, the authors considered
two MREM neural models:

• The first network is formed by N neurons,
N being the number of nodes in the graph. The neuron
outputs (the state vector) indicate the node ordering
on the line. Thus, $s_i = k$ is interpreted as the
k-th node being placed in the i-th position on the
node line. Hence, the output of each neuron can
take values in the set $\mathcal{M}_1 = \{1, 2, \dots, N\}$.
• The second network is formed by as many
neurons as edges in the graph, M. The output of
each neuron belongs to the set $\mathcal{M}_2 = \{-1, 1\}$.
For the arc $(v_i, v_j)$, $s_{(v_i, v_j)} = -1$ indicates that
the edge is drawn in the lower half-plane,
and $s_{(v_i, v_j)} = +1$, in the upper one.

Initially, the state of the net of vertices is randomly
selected as a permutation of $\{1, 2, \dots, N\}$. At any time,




Table 3. Comparison between MREM and the heuristics mentioned in Cimikowski's work

Graph            N     M    MREM    CN    e-len   1-page   greedy
K6               6    21       3     3        3        4        5
K7               7    28       9     9       11        9       13
K8               8    36      18    18       18       30       27
K9               9    45      36    36       42       50       50
K10             10    55      60    60       80       92       80
C20(1,2)        20    40             2
C20(1,2,3)      20    60      19    24       36       48       40
C20(1,2,3,4)    20    80      74    74       90      118      108
C22(1,2,3)      22    66      22    26       40       54       44
C22(..., 7)     22    88     198   200      306      294      286
C24(1,3)        24    48      11    14       22       16       22
C26(1,3)        26    52      11    16       24       16       24
C28(1,3,5)      28    84      80    86      138      138      130
C30(1,3,5)      30    90      92    96      148      150      140
Figure 2. Optimal layouts for graphs K (left) and K (right)





the net is looking for a better solution than the current
one, in terms of minimizing the energy function. This is
achieved by permuting the outputs of two neurons (node
positions) and changing the location of an edge (from
the upper half-plane to the lower one, and vice versa).
In (Lopez-Rodriguez et al., 2007), this new model is
compared against some heuristics (Cimikowski, 2002)
specially designed for this problem. MREM obtained
the best solutions in the experiments, improving the
best known solution in some cases (Table 3).
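The quantity minimized in this problem, the number of crossings of a 2-page layout, can be computed with the sketch below. It is not the energy function of the cited model, only an illustration of the objective: two edges drawn on the same page cross exactly when their endpoints interleave along the node line. The function name and the 0-based node labels are assumptions.

```python
def count_crossings(order, edges, page):
    """Count crossings of a 2-page layout.

    order[i] is the node placed at position i on the node line,
    edges is a list of (u, v) pairs and page[e] is +1 or -1 for edge e.
    """
    pos = {node: i for i, node in enumerate(order)}
    spans = [tuple(sorted((pos[u], pos[v]))) for u, v in edges]
    crossings = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            if page[i] != page[j]:
                continue  # edges on different pages never cross
            (a, b), (c, d) = spans[i], spans[j]
            # Same-page edges cross when exactly one endpoint of one edge
            # lies strictly between the endpoints of the other
            if a < c < b < d or c < a < d < b:
                crossings += 1
    return crossings


# Complete graph on 4 nodes: one crossing on a single page, none on two pages
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
one_page = count_crossings([0, 1, 2, 3], edges, [1, 1, 1, 1, 1, 1])
two_pages = count_crossings([0, 1, 2, 3], edges, [1, 1, 1, 1, -1, 1])
print(one_page, two_pages)
```

Swapping two entries of `order` or flipping one entry of `page` corresponds to the two kinds of moves described above.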



FUTURE TRENDS 

Recurrent neural networks can be used to solve many 
optimization problems. Researchers and practitioners 
can benefit from the application of the neural model 
MREM to diverse optimization problems. 

Other problems to which these models can be applied
include data classification and image
compression by vector quantization, among others. It must be
noted that many graph-based problems can be easily
formulated in terms of minimizing the energy function
of this model: degree-constrained minimum spanning
tree, maximum clique, etc.



CONCLUSION 

The first works on optimization by neural networks were
inspired by Hopfield's models. These models did not
obtain good results when compared to the well-known
Operations Research techniques.

Many researchers focused on developing new neural
models to improve the performance of Hopfield-type
networks in this kind of task.

The problem with these binary models is that all the
information given by the problem has to be specified
by means of only two values ({0, 1} or {-1, 1}), so some
information is lost.

Multivalued neural models are designed to represent
the information of the problem by means of more
than two values, achieving a better representation of
the problem.

With this improvement, computational dynamics of
multivalued models can be easily designed to solve a
given optimization problem. These advantages make
this kind of network a very powerful ally in tackling
combinatorial problems.

The MREM model is a multivalued model that generalizes
many other models, so it can be easily used to
solve optimization problems, as shown in the text.

Some applications of the model are well-known
NP-complete optimization problems, like the Traveling
Salesman Problem, the Graph Partition Problem,
and the 2 Pages Crossing Number Problem. As shown
in the references, this model is able to outperform
the best algorithm to date in each of the mentioned
problems.

KEY TERMS

2 Pages Graph Layout Problem: Problem of find- 
ing an ordering of the nodes of a graph on a straight 
line, and assigning, to each edge, a location in any of 
the two halfplanes induced by that line, such that the 
number of crossings between edges is minimum. 

Artificial Neural Network: Structure for distrib- 
uted and parallel processing of information, formed by 
a series of units (which may possess a local memory 
and make local information processing operations), 
interconnected via one-way communication channels, 
called connections. 

Computational Dynamics: Updating scheme of 
the neuron outputs in a neural model. 

Energy Function: Objective function of the opti- 
mization problem solved by a neural model. 

MaxCut Problem: Problem of finding a partition of 
the set of nodes of a weighted graph, such that the sum 
of the costs corresponding to edges, with end-points in 
different sets of the partition, is maximum. 

Multivalued Discrete Neural Model: A model
of neural networks in which neuron outputs may take
values in the set $\mathcal{M} = \{m_1, \dots, m_L\}$, instead of $\mathcal{M} = \{-1, 1\}$
or $\mathcal{M} = \{0, 1\}$.

Travelling Salesman Problem: Problem of finding 
the shortest closed tour that visits a series of N cities. 
Each city must be visited exactly one time. 






Multilayer Optimization Approach for Fuzzy 
Systems 




Ivan N. Silva 

University of Sao Paulo, Brazil 

Rogerio A. Flauzino 

University of Sao Paulo, Brazil 



INTRODUCTION 

The design of fuzzy inference systems involves
several decisions by the designer, since it is
necessary to determine, in a coherent way, the number
of membership functions for the inputs and outputs,
the specification of the fuzzy rule set of the system,
and the strategies for rule aggregation
and defuzzification of the output sets. The need for
systematic procedures to assist designers is
widespread, because trial and error is often the only
technique available (Figueiredo & Gomide, 1997).

In general terms, for applications involving system
identification and fuzzy modeling, it is convenient to
use energy functions that express the error between the
desired results and those provided by the fuzzy system.
An example is the use of the mean squared error or
normalized mean squared error as energy functions.
In the context of system identification, besides the
mean squared error, data regularization indicators can
be added to the energy function in order to improve the
system response in the presence of noise (from training
data) (Guillaume, 2001).

In the absence of a tuning set, as happens
in the parameter adjustment of a process controller, the
energy function can be defined by functions that consider
the desired requirements of a particular design
(Wan, Hirasawa, Hu & Murata, 2001), i.e., maximum
overshoot signal, settling time, rise time, undamped
natural frequency, etc.

From this point of view, this article presents a new
methodology based on error backpropagation for the
adjustment of fuzzy inference systems, which can
then be designed as a three-layer model. Each one
of these layers represents a task performed by the
fuzzy inference system, namely fuzzification, fuzzy
rule inference and defuzzification. The adjustment
procedure proposed in this article is performed through
the adaptation of the free parameters of each one of
these layers, in order to minimize the energy function
previously specified.

In principle, the adjustment can be made layer by
layer separately. The operational differences associated
with each layer, where the parameter adjustment of one
layer does not influence the performance of the others, allow
each layer to be adjusted on its own. Thus, the tuning routine of
the fuzzy inference system acquires greater flexibility
when compared to the training process used in
artificial neural networks. This methodology is interesting
not only for the results obtained
through computer simulations, but also for its generality
with respect to the kind of fuzzy inference system used.
Therefore, such a methodology can be extended both to
the Mamdani architecture and to that suggested by
Takagi-Sugeno.



BACKGROUND 

In recent years, a wide and growing interest in
applications involving fuzzy logic has been observed.
These applications range from consumer products,
such as cameras, video camcorders, washing machines
and microwave ovens, to industrial applications
such as process control, medical instrumentation and
decision support systems (Ramot, Friedman, Langholz
& Kandel, 2003).

Fuzzy inference systems can be treated as methods
that use the concepts and operations defined by
fuzzy set theory and by fuzzy reasoning methods (Sugeno
& Yasukawa, 1993). Basically, these operational
functions include fuzzification of inputs, application of
inference rules, aggregation of rules and defuzzification,
which produces the crisp outputs of the fuzzy
system (Jang, 1993). At present, several
researchers are engaged in studies related to design
techniques for fuzzy inference systems.

The first type of design technique for fuzzy inference
systems focuses on modeling a process
from expert knowledge bases, where
both the antecedent and consequent terms of the rules are
always fuzzy sets, thus offering a high semantic level
and good interpretability (Mamdani & Assilian,
1975). However, applying this technique
to the mapping of complex systems with
several input and output variables has been an arduous
task, which can produce inaccurate results as well as poor
performance (Guillaume, 2001; Becker, 1991).

The second type of design technique for fuzzy
inference systems comprises those that
learn automatically from data
representing the behavior of the input and
output variables of the process. This design
strategy therefore uses a collection of input and output values
obtained from the process to be modeled, which differs
from the first design strategy, where the fuzzy system is
defined using only the expert knowledge acquired from
observation of the respective system. In general,
the methods derived from this second strategy can be
interpreted as techniques for the automatic generation
of fuzzy rules, which use the available data
in their adjustment (or training) procedures.

Among the main approaches belonging to this
second design strategy, a notable one is the
ANFIS (Adaptive-Network-based Fuzzy Inference
System) algorithm proposed by Jang (1993), which
is applicable to fuzzy architectures whose rules have
real polynomial functions as consequent terms,
such as those presented by Takagi & Sugeno
(1985) and Sugeno & Kang (1988). More recent
approaches, such as those proposed by Panella & Gallo
(2005), Huang & Babri (2006) and Li & Hori (2006),
also belong to this design strategy.

However, representing a process through
these automatic architectures can reduce the interpretability
of the resulting rule base,
whose consequent terms are expressed in most
cases by polynomial functions instead of linguistic
variables (Kamimura, Takagi & Nakanishi, 1994).

Thus, there has been wide motivation for developing adjustment
algorithms for fuzzy inference systems in which the consequent terms
of the fuzzy rules are also represented by fuzzy sets.



MAIN FOCUS OF THE CHAPTER 

Considering the operational functions performed by
fuzzy inference systems, it is convenient to represent
them by a three-layer model. Thus, the fuzzy inference
system presented in this article can be represented by
the sequential composition of three layers, i.e., input
layer, inference layer and output layer.

Figure 1. Fuzzy inference system composed by two inputs and one output

The input layer connects
the input variables (coming from outside) with the
fuzzy inference system, performing their respective
fuzzifications through proper membership functions.
In the inference layer, the fuzzified input
variables are combined with each other, according
to the defined rules, using the operations
defined by fuzzy theory. The set resulting from this
aggregation process is then defuzzified to produce the
output of the fuzzy inference system. The aggregation and
defuzzification processes of the fuzzy system output
are both carried out by the output layer. It is important to
observe that, although the output layer
performs the two processes described above, it is
also responsible for storing the membership functions
of the output variables. As an illustration, Fig. 1 shows
the proposed multilayer model, which has
two inputs and one output, with three fuzzy rules in
its inference layer.

The following subsections give further details
about how fuzzy inference systems can be
represented by a three-layer model.

Input Layer 

Input fuzzification has the purpose of determining
the membership degree of each input with respect to the fuzzy
sets associated with each input variable. As many
fuzzy sets as necessary can be associated with each input
variable of the fuzzy system. Thus, for a fuzzy system
with only one input and N fuzzy sets, the
output of the input layer will be a column vector with
N elements, which represent the membership
degrees of this input in relation to those fuzzy sets. If
we denote the single input of this fuzzy system by x,
then the output of the input layer will be the
vector $I_1$ given by:

$$I_1(x) = \left[ \mu_{A_1}(x) \;\; \mu_{A_2}(x) \;\; \cdots \;\; \mu_{A_N}(x) \right]^T \tag{1}$$

where $\mu_{A_k}(\cdot)$ is the membership function defined
for input x that refers to the k-th fuzzy set
associated with it.

The generalization of the input layer concept to a
fuzzy system having p input variables can be achieved
if we consider each input as being modeled by a sub-layer
of the input layer. Taking this into account,
the output vector of the input layer $I(\mathbf{x})$ is then
defined by:

$$I(\mathbf{x}) = \left[ I_1(x_1)^T \;\; I_2(x_2)^T \;\; \cdots \;\; I_p(x_p)^T \right]^T \tag{2}$$

where $x_i$ is the i-th input of the fuzzy system and $I_k(\cdot)$
is the k-th vector of membership functions, associated
with the input $x_k$. Fig. 1 illustrates the input layer
for a fuzzy system composed of two inputs, which are
mapped by the vectors $I_1$ and $I_2$.

Several membership functions can be
used in the proposed approach. One of the
requirements for those functions is that they are normalized
in the closed domain [0,1].
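As an illustration of equations (1) and (2), the sketch below builds the output of the input layer for Gaussian membership functions, each described by a mean and a standard deviation. The helper names `gaussian` and `input_layer`, and the parameter values in the example, are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    """Gaussian membership degree in [0, 1], equal to 1 at x == mean."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def input_layer(x, params):
    """Stacked membership vector I(x) of equation (2).

    x is the list of p inputs; params[k] is the list of (mean, sigma)
    pairs of the fuzzy sets associated with input k.
    """
    return [gaussian(xk, mean, sigma)
            for xk, fuzzy_sets in zip(x, params)
            for mean, sigma in fuzzy_sets]


# Two inputs: the first with two fuzzy sets, the second with one
params = [[(0.0, 1.0), (1.0, 1.0)], [(1.0, 2.0)]]
vector = input_layer([0.0, 1.0], params)
print(vector)
```

Each sub-list of `params` plays the role of one sub-layer, and the returned list is their concatenation.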

Inference Layer 

The inference layer of a fuzzy system processes
the fuzzy inference rules defined
for it. It also provides a knowledge
base for the process.

In this paper, the fuzzy inference system initially
contains all possible inference rules. The tuning
algorithm therefore has the task of weighting these rules.
Weighting the inference rules is a suitable way
to represent the most important rules, or even to allow
conflicting rules to be related to each other without
any loss of linguistic completeness. Thus, the i-th fuzzy
rule can be expressed as follows:

$$R_i(I(\mathbf{x})) = w_i \, r_i(I(\mathbf{x})) \tag{3}$$

where $R_i(\cdot)$ is the function representing the fuzzy
weighting of the i-th fuzzy rule, $w_i$ is the weight of the
i-th fuzzy rule and $r_i(\cdot)$ represents the fuzzy value of
the i-th fuzzy rule. Fig. 1 shows the composition
involving $r_i(\cdot)$ and $R_i(\cdot)$ for the three fuzzy rules
belonging to the inference layer.

Output Layer 

The output layer of the fuzzy inference system
aggregates the inference rules and defuzzifies
the fuzzy set generated from the aggregation of
these rules.

Besides the operational aspects, the aggregation
and defuzzification methods must consider the hardware
performance requirements in order to reduce the
computational effort needed for processing the fuzzy
system. In this paper, the output layer of the inference
system is also adjusted. The adjustment of this layer
is carried out in a similar way to that of the input
layer of the fuzzy system. As an example, the
procedures involved in the output
layer are also shown in Fig. 1.

Adjustment of the Fuzzy Inference 
System 

Consider a fuzzy system with two inputs, each one described
by three Gaussian membership functions, with a total
of five inference rules, and with an output defined
by two Gaussian membership functions. For each
Gaussian membership function, two free
parameters must be considered, i.e., the mean and the
standard deviation. Consequently, the number of free
parameters of the input layer is 12. A weighting factor
is associated with each inference rule, resulting
in a total of 5 free parameters in the inference layer.
For the output layer, the same considerations
used for the input layer are valid. Therefore, four free
parameters are associated with the output layer.

Therefore, the mapping f between the input space
x and the output space y may be defined by:

$$y = f(\mathbf{x}, \mathbf{mf}_{In}, \mathbf{w}, \mathbf{mf}_{Out}) \tag{4}$$

where $\mathbf{mf}_{In}$ is the parameter vector associated with the
input membership functions, $\mathbf{w}$ is the weight vector of the
inference rules, and $\mathbf{mf}_{Out}$ is the parameter vector associated
with the output membership functions. Therefore,
$\mathbf{mf}_{In}$, $\mathbf{w}$ and $\mathbf{mf}_{Out}$ represent the free parameters of the
fuzzy system, which can be rewritten as follows:

$$y = f(\mathbf{x}, \Theta) \tag{5}$$

where $\Theta$ is the vector resulting from the concatenation of the
free parameters of the fuzzy system, i.e.

$$\Theta = \left[ \mathbf{mf}_{In}^T \;\; \mathbf{w}^T \;\; \mathbf{mf}_{Out}^T \right]^T \tag{6}$$

The energy function to be minimized, considering
the fixed tuning set {x, d}, is defined by:

$$\xi \triangleq \xi_f(\Theta) \tag{7}$$

where $\xi_f$ represents the energy function associated with
the fuzzy inference system f.

Unconstrained Optimization Techniques 

Let $\xi(\Theta)$ be an energy function that is differentiable with
respect to the free parameters of the fuzzy inference system.
The objective is to find an optimum solution $\Theta^*$
such that:

$$\xi(\Theta^*) \le \xi(\Theta) \tag{8}$$

Therefore, to satisfy the condition
expressed in (8), it is necessary to solve an unconstrained
optimization problem to obtain the solution
$\Theta^*$, which is given by:

$$\Theta^* = \arg\min_{\Theta} \xi(\Theta) \tag{9}$$

The condition that expresses the optimum solution
in (9) can also be rewritten as follows:

$$\nabla \xi(\Theta^*) = 0$$

where $\nabla$ is the gradient operator defined by:

$$\nabla \xi(\Theta) = \left[ \frac{\partial \xi}{\partial \Theta_1}, \frac{\partial \xi}{\partial \Theta_2}, \dots, \frac{\partial \xi}{\partial \Theta_m} \right]^T \tag{10}$$

Several techniques are available for solving unconstrained
optimization problems. A detailed description
of these methods can be found in Bertsekas (1999). The
selection of the most appropriate method depends on the
complexity of the energy function. For
example, the Gauss-Newton method for unconstrained
optimization is more applicable to problems where
the energy function is defined by:

$$\xi(\Theta) = \frac{1}{2} \sum_{i=1}^{m} e^2(i) \tag{11}$$

where e(i) is the absolute error in relation to the i-th
tuning pattern.

In this paper, a derivation of the Gauss-Newton
method is used for tuning the fuzzy inference system,
which is defined by the following expression:

$$\Theta_{next} = \Theta_{now} - (J^T J)^{-1} g \tag{12}$$

where g is the gradient of $\xi$ defined in (10) and J is
the Jacobian matrix of the error e defined in (11). The optimization
algorithm used was the Levenberg-Marquardt
method (Marquardt, 1963), which can efficiently handle
ill-conditioned matrices $J^T J$ by altering equation (12)
as follows:

$$\Theta_{next} = \Theta_{now} - (J^T J + \lambda I)^{-1} g \tag{13}$$

The matrices J and the vectors g were computed
using the finite-difference
method.
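A single Levenberg-Marquardt update with a finite-difference Jacobian can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the error is taken as model output minus desired value, the energy is assumed to be half the sum of squared errors so that $g = J^T e$, forward differences with a hypothetical step `h` are used, and the callable `model(x, theta)` stands in for the fuzzy system $f(\mathbf{x}, \Theta)$.

```python
def errors(theta, xs, ds, model):
    """Error vector e: model output minus desired output for each pattern."""
    return [model(x, theta) - d for x, d in zip(xs, ds)]


def jacobian_fd(theta, xs, ds, model, h=1e-6):
    """Forward-difference Jacobian J[i][j] = d e_i / d theta_j, plus e itself."""
    e0 = errors(theta, xs, ds, model)
    columns = []
    for j in range(len(theta)):
        shifted = list(theta)
        shifted[j] += h
        ej = errors(shifted, xs, ds, model)
        columns.append([(a - b) / h for a, b in zip(ej, e0)])
    J = [[columns[j][i] for j in range(len(theta))] for i in range(len(e0))]
    return J, e0


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def lm_step(theta, xs, ds, model, lam):
    """theta_next = theta_now - (J^T J + lam * I)^(-1) g, with g = J^T e."""
    J, e = jacobian_fd(theta, xs, ds, model)
    m, n = len(theta), len(e)
    g = [sum(J[i][j] * e[i] for i in range(n)) for j in range(m)]
    A = [[sum(J[i][a] * J[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(m)] for a in range(m)]
    step = solve(A, g)
    return [t - s for t, s in zip(theta, step)]


# Toy stand-in for the fuzzy system: a line with parameters (offset, slope)
line = lambda x, th: th[0] + th[1] * x
theta = lm_step([0.0, 0.0], [0, 1, 2, 3], [1, 3, 5, 7], line, lam=1e-9)
print(theta)
```

For this linear toy model a single step with a very small `lam` essentially recovers the generating parameters; larger values of `lam` shorten the step and make the linear system better conditioned.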

Simulation Results 

This section presents simulation results of the proposed
methodology for the Mamdani fuzzy model. In the two
following examples, the fuzzy system is used to model
nonlinear functions. In the first example, a fuzzy inference
system is used to predict the Mackey-Glass time
series. In the second example, a two-dimensional sine
function is modeled by the fuzzy inference system.

Example 1: Modeling the Mackey-Glass
Function

Using the adjustment methodology presented in this
paper, a fuzzy inference system of Mamdani type was
developed with the objective of predicting the Mackey-Glass
time series (Mackey & Glass, 1977), which is defined
by:

$$\frac{dx(t)}{dt} = -b \cdot x(t) + \frac{a \cdot x(t - \tau)}{1 + x(t - \tau)^c} \tag{14}$$

where the values of the constants are usually assumed
to be a = 0.2, b = 0.1 and c = 10. The value of the delay
constant $\tau$ was 17. The tuning set consisted of
500 patterns. The fuzzy inference system had four input
variables, which correspond to the values x(t
- 18), x(t - 12), x(t - 6) and x(t). The output variable
adopted was x(t + 6).
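The tuning patterns described above can be generated with a sketch like the following. Several details are assumptions rather than facts from the text: the series is integrated with a simple Euler scheme of unit step, the history before t = 0 is held at a hypothetical initial value `x0 = 1.2`, and the function names are illustrative.

```python
def mackey_glass(n, a=0.2, b=0.1, c=10, tau=17, dt=1.0, x0=1.2):
    """Euler integration of equation (14); returns n samples after t = 0."""
    delay_steps = int(round(tau / dt))
    x = [x0] * (delay_steps + 1)  # constant history (an assumption)
    for _ in range(n):
        x_tau = x[-1 - delay_steps]
        x.append(x[-1] + dt * (-b * x[-1] + a * x_tau / (1 + x_tau ** c)))
    return x[delay_steps + 1:]


def make_patterns(series, count):
    """Build (inputs, target) pairs: [x(t-18), x(t-12), x(t-6), x(t)] -> x(t+6)."""
    patterns = []
    for t in range(18, 18 + count):
        inputs = [series[t - 18], series[t - 12], series[t - 6], series[t]]
        patterns.append((inputs, series[t + 6]))
    return patterns


series = mackey_glass(600)
patterns = make_patterns(series, 500)
print(len(patterns), patterns[0])
```

A finer integration step or a different initial history would change the numerical values of the series but not the structure of the patterns.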

The fuzzy inference system was defined with 4
fuzzy sets assigned to each input variable and also to
the output variable. A total of 64 inference rules were
used in the inference process.

The energy function of the system was defined
as the mean squared error between the desired
values x(t + 6) and the estimated values $\hat{x}(t + 6)$, i.e.

$$\xi = \frac{1}{L} \sum_{i=1}^{L} \left( x_i(t + 6) - \hat{x}_i(t + 6) \right)^2 \tag{15}$$

where L is the number of data used in the tuning process
(L = 500).

After minimization of (15), the membership functions
of the fuzzy inference system were adjusted as
illustrated in Fig. 2.

Fig. 3 presents the prediction results provided
by the fuzzy inference system for 1000 sample
points.

The mean squared error of estimation for the proposed
problem was 0.000598, with a standard deviation
of 0.02448. The prediction error for the 1000 sample
points is shown in Fig. 4.

For comparison, a fuzzy inference system adjusted
by ANFIS (Adaptive Neural-Fuzzy
Inference System) was also developed. This fuzzy inference system was
composed of 10 membership functions for each input,
with a knowledge base of 10 rules. Its
mean squared error of estimation for the proposed problem
was 0.000165, with a standard deviation of 0.0041.

Example 2: Modeling the Two-Input Sine 
Function 

In this example, the proposed methodology is used to
model a two-dimensional sine function defined by:

$$z = \mathrm{sinc}(x, y) = \frac{\sin(x) \cdot \sin(y)}{x \cdot y} \tag{16}$$

From grid points uniformly distributed over the input
range [-10,10] x [-10,10] of (16), 225 tuning data pairs
were obtained. The fuzzy inference system used here
contains 11 rules, with 8 membership functions assigned
to input variable x, 7 membership functions assigned to
input y and 3 membership functions assigned to output
z. The tuning data and the reconstructed surface are
illustrated in Fig. 5.

Figure 5. Tuning data (a) and reconstructed surface (b)
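A tuning set of this kind can be produced with the sketch below. It assumes a 15 x 15 grid, which is inferred only from the fact that 225 = 15 x 15, and it uses the limit value 1 for sin(x)/x at x = 0, since the grid contains the origin. The function names are illustrative.

```python
import math


def sinc2(x, y):
    """Two-input sinc, sin(x) * sin(y) / (x * y), with the limit 1 at zero."""
    sx = math.sin(x) / x if x != 0 else 1.0
    sy = math.sin(y) / y if y != 0 else 1.0
    return sx * sy


def tuning_grid(n=15, lo=-10.0, hi=10.0):
    """Uniform n x n grid over [lo, hi] x [lo, hi] with the target value z."""
    axis = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return [((x, y), sinc2(x, y)) for x in axis for y in axis]


pairs = tuning_grid()
print(len(pairs), pairs[0])
```

Each element of `pairs` holds one input point (x, y) and the corresponding desired output z.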



FUTURE TRENDS

The methodology for the adjustment of fuzzy inference
systems presented in this article can be considered very
promising, not only for the performance and precision
obtained through computer simulations, but also for
its interpretability in relation to the output variable,
which is a highly desirable feature of a fuzzy system.
Actually, it is the most prominent feature that distinguishes
fuzzy systems from many other modeling
techniques. We think that fuzzy system adjustment
architectures, such as the one proposed here, are ideally
suited for explaining solutions to users because both
premises (antecedents) and consequences of the rules
are defined by fuzzy sets.

Future research and application should return to and
concentrate on the linguistic features of fuzzy systems
and their capabilities for knowledge representation,
exploiting the tolerance for imprecision and uncertainty
to summarize data and focus on decision-relevant
information.



CONCLUSION 

This article outlined the basic foundations of the 
adjustment process of fuzzy inference systems using 
unconstrained optimization techniques. For the tuning 
to be efficient, the energy function must be properly 
specified for the adjustment process. To validate the 
proposed methodology, its results were compared with 
those provided by the ANFIS methodology and were 
also evaluated on mathematical modeling problems. 
The results obtained offer new perspectives for research 
on fuzzy inference systems, so that problems previously 
treated only by artificial neural networks may also be 
treated through fuzzy inference systems. 
KEY TERMS 

Backpropagation Algorithm: Learning algorithm 
of ANNs, based on minimizing the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 

Defuzzification: Process of producing a quantifiable 
result (crisp) in fuzzy logic. Typically, a fuzzy system 
will have a number of rules that transform a number 
of variables into a "fuzzy" result, that is, the result is 
described in terms of membership in fuzzy sets. 

Fuzzification: Process of transforming crisp values 
into grades of membership for linguistic terms of fuzzy 
sets. The membership function is used to associate a 
grade to each linguistic term. 

Fuzzy Logic: Type of logic dealing with reasoning 
that is approximate rather than precisely deduced from 
classical predicate logic. 

Fuzzy Rule: Linguistic constructions of type IF- 
THEN that have the general form "IF A THEN B", where 
A and B are (collections of) propositions containing 
linguistic variables. 






Fuzzy System: Approach of the computational 
intelligence that uses a collection of fuzzy membership 
functions and rules, instead of Boolean logic, to 
reason about data. 

Membership Function: Generalization of the 
indicator function in classical sets. In fuzzy logic, it 
represents the degree of truth as an extension of 
valuation. 
INTRODUCTION 

One of the basic terms in information engineering is 
data. In our approach, a data item is defined as the 
representation of an information atom stored in digital 
computers. Although an information atom can be 
considered as a subject-predicate-value triplet (Lassila, 
1999), data is usually given only with its value 
representation. This fact can lead to definitions where 
data is just numbers, words or pictures without context. 
For example, in (WO, 2007), data is defined as 
information in numerical form that can be digitally 
transmitted or processed. Interestingly, the term 'data' 
is often used without any exact terminological 
definition, with the effect that the term remains 
confusing and sometimes even contradicts the definitions 
given for it. Sieber and Kammerer (2006) 
introduce a new interpretation of data containing sev- 
eral levels. The lowest level belongs to data instances 
that describe the form and appearance of symbols. The 
intermediate level is the level of representatives which 
includes the applied encoding system. The highest level 
is related to the meaning with context description. All 
three levels are needed to get to know the informa- 
tion atom. For example the symbol '36' in a database 
determines only the value and representation system, 
but not the meaning. To cover the whole information 
atom, the database should store some additional data 
items to describe the original data. The main purpose 
of semantic data models is to describe both context and 
the main structure of data items in the problem area. 
These additional data items are called metadata. It is 
important to see that: 

metadata are data, 
metadata are relative, and 
metadata describe data. 



Metadata constitute a basis for bringing together data 
that are related in terms of content, and for processing 
them further. They can be understood as a pre-req- 
uisite for intelligent and efficient administration and 
processing, and not least as a focused, formal means 
of providing relevant data. 



BACKGROUND 

In data management systems, the context of a value is 
usually defined with the help of a storage structure. An 
identification name (a text value) is assigned to each 
position of the structure. The description of storage 
(structure, naming and constraints) is called schema. 
A big problem of structural data modeling is that it cannot 
provide all the information needed to understand 
the full context of the data. For example, a relational 
schema 

RT (NM INT, KNEV CHAR(20), RU DATE) 

alone is not enough to capture the meaning of the 
stored data items. 
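
The gap between structure and meaning can be made concrete with a small Python sketch. The schema dictionary mirrors the RT example above; the metadata entries are invented purely for illustration, since the source does not say what NM, KNEV or RU stand for.

```python
# Structural schema: names and types only, as in the RT example.
schema = {"RT": {"NM": "INT", "KNEV": "CHAR(20)", "RU": "DATE"}}

# Hypothetical metadata: these descriptions are assumptions made up
# for this sketch; the source does not state what the columns mean.
metadata = {
    ("RT", "NM"): "hypothetical: record number",
    ("RT", "KNEV"): "hypothetical: name of a person",
    ("RT", "RU"): "hypothetical: date of last update",
}


def describe(table: str, column: str) -> str:
    """Return the column type plus its meaning, or flag the missing context."""
    col_type = schema[table][column]
    meaning = metadata.get((table, column), "meaning unknown")
    return f"{table}.{column}: {col_type} ({meaning})"
```

Without the metadata table, `describe` can only report the type of each column, which is exactly the limitation of a purely structural schema.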

The main building blocks to describe the context 
in semantic data models (SDM) are concepts and 
relationships. The first widely known structure-oriented 
semantic models in database design are the 
Entity-Relationship (ER) model (Chen, 1976) and the EER 
(Thalheim, 2000) model. The ER model consists of 
three basic elements: entity (concept), relationship 
and attribute. The attributes are considered as structure 
elements of the entities; one attribute may belong to 
only one entity. The EER model is the extension of 
the ER model with IS_A and HAS_A relationships. 
Some other extensions are SIM, IFO and RM/T. One 
of the main drawbacks of structure-oriented SDM is 
their limited expressive power. 
Later, models like UML or ODL (Catell, 1997) were 
developed to cover the missing object-oriented elements. 
In the case of ODL, a class description can contain the 
following elements: attributes, methods, inheritance 
parameters, visibility, relationships and integrity rules. 
These models offer great expressive power for 
software engineering, but they are not very flexible for 
describing data models of higher abstraction. 

Global investigations were focused on SDM 
with simpler and more universal elements. The most 
widely known high-level semantic models are semantic 
networks and ontology models. A semantic network 
is represented with a directed graph where the vertices 
are the concepts and the edges are the relationships. 
The main differences between ontology models and 
the traditional SDM are the following: there is no 
fixed structural hierarchy among the concepts, the 
relationships are flexible, the model is independent of 
the application domain, the structure is mapped into a 
logical formula, and the model can be related to an 
inference engine. It is widely assumed that 
anything at a high level of information processing must 
be based on ontology (Sloman, 2003). Further details 
on current applications of ontology can be found, among 
others, in (Taniar, 2006). 

One of the first languages for ontology is RDF 
(Lassila, 1999). RDF is used to describe the concepts 
in a neutral, machine-readable format. According to 
the specification, the basic language elements are re- 
sources, literals and statements. There are two types of 
resources: entity resources and properties. A statement 
is a triplet (p,s,o), where p is a property, s is a resource 
and o is either a literal or a resource. In another 
approach, p is called the predicate, s is the subject and 
o is the object in the statement. As can be seen, a 
statement corresponds to an information atom. 
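
A statement of this kind can be sketched with plain Python tuples. The class names and the example.org URIs below are illustrative assumptions, not part of the RDF specification or of any RDF library.

```python
from typing import List, NamedTuple, Union


class Resource(NamedTuple):
    uri: str


class Literal(NamedTuple):
    value: str


class Statement(NamedTuple):
    """One information atom: subject, predicate (property) and object."""
    subject: Resource
    predicate: Resource              # a property is itself a resource
    obj: Union[Resource, Literal]    # the object is a literal or a resource


# Hypothetical URIs, used only for illustration.
stmt = Statement(
    subject=Resource("http://example.org/person/42"),
    predicate=Resource("http://example.org/terms/age"),
    obj=Literal("36"),
)


def objects_of(statements: List[Statement], subject: Resource,
               predicate: Resource) -> list:
    """Return every object asserted for the given subject and predicate."""
    return [s.obj for s in statements
            if s.subject == subject and s.predicate == predicate]
```

Here the bare value '36' gains its context only through the subject and the predicate it is attached to.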

A pioneer representative of the next generation of 
languages is OWL (Bechhofer, 2004), which can be 
considered as an extension of RDF that contains extra 
elements to describe, among others, typing, property 
characteristics, cardinality and behavioral properties. 
The OWL-DL language is based on Description Logic 
that describes the structural relationships of the domain 
in a logic language, which enables automatic reason- 
ing and constraint checking in the system. The applied 
logic language is based on first-order predicate logic. 
The most widely used products related to OWL are 
Protege, Pellet and KAON2. 



MULTI-LAYERED SEMANTIC MODELS 
Multi-Layered Schemas 

In the case of systems with complex functionality, one 
way to reduce complexity is to build up a modular 
system. Modularization is a successful concept in all 
engineering areas. Modularization can be vertical or 
horizontal. Vertical modularization is called layering. 
The basic properties of a layered system are the 
following: 

the elements are assigned to clusters (called layers); 
there exists a hierarchical relationship between the clusters; 
the relationships within the clusters differ from the relationships between the clusters; 
the clusters cooperate with each other in the role of a client or of a server. 

Every layer offers a set of functions that are built 
upon the services of the underlying layers. In the case 
of a multi-layered system, the implementation can 
reduce costs compared with a single-layer structure. 
Layering means modularization from the viewpoint of 
implementation and it has the following qualitative and 
quantitative benefits (Knoerschild, 2003): 

encapsulation (the layers are in great part self-contained, consistency), 
independence, 
flexibility (the layers can be replaced without affecting the other layers), 
cost reduction (simplicity in testing and in design, reusability). 
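
The client/server relationship between layers can be sketched as follows. The three class names are hypothetical and chosen only for this illustration; each layer uses nothing but the services of the layer directly below it.

```python
class StorageLayer:
    """Bottom layer: holds raw values and knows nothing about their meaning."""

    def __init__(self):
        self._cells = {}

    def put(self, key, value):
        self._cells[key] = value

    def get(self, key):
        return self._cells[key]


class SchemaLayer:
    """Middle layer: client of StorageLayer, server for the layer above."""

    def __init__(self, storage):
        self._storage = storage

    def write(self, table, column, row, value):
        self._storage.put((table, column, row), value)

    def read(self, table, column, row):
        return self._storage.get((table, column, row))


class SemanticLayer:
    """Top layer: adds a unit to each value using only SchemaLayer services."""

    def __init__(self, schema, units):
        self._schema = schema
        self._units = units

    def describe(self, table, column, row):
        value = self._schema.read(table, column, row)
        return f"{value} {self._units[(table, column)]}"
```

Because `SemanticLayer` never touches `StorageLayer` directly, the storage implementation could be swapped without changing the top layer, which is the flexibility benefit listed above.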

Layered structures are common nowadays in, among 
others, networking (Hnatyshin, 2007), image processing 
(Sunitha, 2007), process control (Zender, 2007) and 
software development (Kreku, 2006). 
Multi-Layered Nature of Human 
Recognition 

It was realized very early that human spatial cognition 
is based on a partially hierarchical conceptual view of 
space (McNamara, 1986). It is usual to perform the 
mapping of the spatial environment with a semantic 
hierarchy (Kuipers, 2000). In the proposal of Sloman 
(2003), the 
internal representation of the spatial environment is 
implemented with a three-layered model. The lowest 
layer is called metric layer. It establishes an absolute 
frame of reference. It consists of a navigation graph that 
describes the important positions of the environment. 
In the next, topological layer, the navigational nodes 
are mapped into areas, where an area corresponds to a 
set of connected nodes. An area denotes a compound 
spatial concept. The highest level belongs to the con- 
ceptual layer. In this layer the areas are mapped to 
general abstract concepts. This level corresponds to 
the ontology layer that provides different relationships 
and a reasoning engine. 
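
The three layers can be illustrated with a toy Python sketch. Treating an area as a connected component of the navigation graph is a simplifying assumption of this sketch, and the node names and concept labels are hypothetical.

```python
# Metric layer: a navigation graph (node -> set of neighbouring nodes).
navigation_graph = {
    "n1": {"n2"}, "n2": {"n1"},
    "n3": {"n4"}, "n4": {"n3"},
}


def topological_areas(graph):
    """Topological layer: group connected navigation nodes into areas."""
    areas, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        stack, area = [start], set()
        while stack:
            node = stack.pop()
            if node in area:
                continue
            area.add(node)
            stack.extend(graph[node] - area)
        seen |= area
        areas.append(frozenset(area))
    return areas


def conceptual_layer(areas, labels):
    """Conceptual layer: map each area to an abstract concept label."""
    return {labels[i]: area for i, area in enumerate(areas)}
```

Calling `conceptual_layer(topological_areas(navigation_graph), ["kitchen", "office"])` maps the two groups of nodes to two abstract concepts, on which an ontology could then define relationships.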

According to the current H-Cogaff view of human 
information processing architecture (Sloman, 2003) the 
cognition system consists of several regions perform- 
ing concurrent activities. The regions are structured 
into hierarchies. The perception hierarchy can activate 
for example different concepts at the same time for a 
single sensor input image. Visual perception can de- 
tect different levels of structure and different levels of 
concepts. The developed multi-layer ontology model 
consists of three layers: the reactive, the deliberate and 
the meta-management layer. 

In artificial intelligence, too, the application of 
multi-layered structures has gained popularity. 
In (Kamimura, 2003), the information theoretical 
competitive learning method was implemented with 
multi-layered networks to solve complex problems. 
Networks are composed of several competitive layers. 
In each competitive layer, information is maximized. 
This successive information maximization enables 
networks to extract features gradually. Experimental 
results confirmed that information can be maximized 
in multi-layered networks, and the networks can 
extract features that cannot be detected by single-layered 
networks. 



Multi-Layered Concept Models 

Traditional SDM models were intended to manage only 
a single-layer structure. None of the original versions 
of ER, RDF or OWL uses layers in the model. On the 
other hand, the layered structure has a lot of benefits: 

increase in simplicity of management, 
decrease in complexity, 
increase in flexibility, and 
increase in reusability. 

The first classic layered models for semantic 
networks were developed in the 1980s. The strong influence 
of psychological theories of human cognition can be 
easily observed in the proposed models. The layered 
model of Thompson (1990) consists of five layers, 
similar to Greenwald's (1988) model. The base layer 
represents sensor data (images, sounds, signs) and the 
temporal relationships between these data items. The 
next layer is devoted to basic concepts. The connection 
of a concept and of its sensory appearances may vary 
and may be very complex (transformations). This con- 
nection should perform more complex activity than just 
simple association. The next level is called the level 
of events. In this layer, the simple object instances are 
bound to series and sentences. This layer should contain 
the logic supporting ideas of time and causality. The 
next level generates abstract objects that cluster object 
instances together. The next model level describes the 
activities on the abstract concepts like planning and 
modeling. The highest level is related to the abstract 
concepts and abstract activities like reasoning and 
metadata management. 

In the proposal of Khosla (2004), the layering of 
ontology is strongly correlated with the functional lay- 
ering of the system. The paper describes the structure 
of a general soft-computing module. The most internal 
layer is the object layer to describe the data schema. 
On the top of the data layer are the distributed agent 
layer, the tool agent layer and the optimization agent 
layer. These layers perform, among others, 
preprocessing, transformation and decision making. The 
concept of multi-layering can be applied to the 
traditional data models, too. A layered UML model is 
presented, among others, in (Kreku, 2006). The layers 
here also correspond to the different functional areas 
within the application. The three proposed layers are 
the component layer, the HW architecture layer and the 
platform architecture layer. 

In (Sunitha, 2007), the investigation is focused on 
the SDM part only. The semantic data model is divided 
into three layers. The bottom part is the concept mea- 
sure layer that contains the descriptions of the concepts 
themselves. The middle layer is used to store the re- 
lationships (like specialization, classification) among 
the concepts. The top layer is for the context-related 
knowledge elements describing the environment of 
the application field. 

In some other proposals, layering refers not to the 
functional structure but to the abstraction levels. In 
the classical UML (Terrasse, 2001), a four-layered 
metamodel architecture is used. The bottom layer is 
the object layer and the next layer is the model layer. 
On the top of the model layer is the metamodel layer. 
At the top, the meta-metamodel layer can be found. 
The metaobject facility (MOF) model is based on a 
layered, conceptual metamodel structure. The content 
of a conceptual layer describes the elements in the next 
layer down. Both UML and MOF are based on class 
oriented representations. In (Melnik, 2000), a 
three-layer abstraction model is defined for the semantic web. 
These layers are the syntax layer, the object layer and 
the semantic layer. 

The semantic data model presented in (Sieber, 
2006) was invented as a model for bridging the gap 
between data and semiotics in terms of Peirce with a 
special focus on technical documentation processes. 
The concerned knowledge can be shared between many 
people in different departments, different production 
locations (including different countries) and in differ- 
ent applications. As a consequence of this, in such a 
process nearly everyone has two roles: one of owning 
and dating knowledge; and one of searching and need- 
ing dated knowledge. For mathematical purposes this 
model was extended (Sieber, 2007) using a semantic 
network based upon a lattice of concepts. The result is 
a multi-layer semantic data model that can be used to 
visualize more general decoding and encoding process- 
ing between signals and (semantic) concepts. 



FUTURE TRENDS 

The term layered ontology occurs very rarely in the 
literature. The main reason for this is that the basic 
ontology can describe any level of abstraction. Thus 
a single layer can cover any level of concepts. This 
monolithic structure will cause difficulties in 
integrating existing ontologies, as the overlap of concepts is 
harder to detect. The current research projects on 
ontology are usually aimed at the development of accurate 
ontologies for the different application domains. The 
main current areas are: medical systems, geographical 
systems, linguistics, social studies, enterprise infor- 
mation systems, logic, knowledge representation and 
automatic reasoning. Very few proposals deal with the 
application of modular, multi-layered ontologies. In 
(Purao, 2005), for example a database design specific 
domain is analyzed, where three layers are defined: 
the core (local) level, the neighborhood level and the 
global domain level. 

Although the importance of a domain independent 
ontology is visible and clear for everybody, the current 
works seem to neglect this requirement. According to 
(Mikroyannidis, 2006), ontology management in most 
information systems is based on simplicity, ontology 
layering is rarely used, and the requirements for ontol- 
ogy evolution and integration are usually neglected. 
We can predict that the modularized, layered ontology 
models will get more attention in the near future. 



CONCLUSION 

Traditional semantic models are based on a single-layer 
structure. This approach is a drawback in the devel- 
opment of complex systems. As the model of human 
cognition is built on a multi-layer approach, and the 
goal of semantic models is to describe concepts of our 
world, the multi-layer models seem to be more accurate 
to create a global semantic model. The current semantic 
models, like UML or ontology, provide some layering 
possibilities, but the detailed analysis of multi-layered 
semantic models is a task of the future. 
KEY TERMS 

Data Model: A formal description language to de- 
scribe and to manipulate the investigated data instances. 
It contains three components: a static structural part, 
an integrity part and a manipulation part. 

Multi-Layered Data Model: A data model where 
the model elements are assigned to levels. In the model, 
a hierarchy is defined between the levels. Regarding the 
element-level relationships, the intra-level relationships 
differ from the inter-level relationships. 

Ontology: A semantic data model that describes the 
concepts and their relationships. It contains a controlled 
vocabulary and a grammar for using the vocabulary 
terms. The ontology makes it possible to pose queries, 
make assertions and perform reasoning. The most popular 
forms to describe an ontology are RDF and OWL. 

OWL: A language to describe Web ontologies. It 
uses an XML format and it contains a formal descrip- 
tion logic component, too. It provides the following 
extra functionality: classification, type and cardinality 
constraints, thesauri, decidability. 

RDF: A semantic data model that describes the 
world with statements. A statement is a triplet having 
the following form: subject-predicate-object. 

Semantic Data Model: A high level data model. 
It is usually based on concepts and it uses a graphi- 
cal formalism. It contains only the key, the semantic 
properties of the data structure. It does not cover the 
details of the implementation. 

UML: A standardized general-purpose modeling 
language for object oriented software systems. It has 
a graphical notation and contains several diagrams: 
structure diagrams (class, object, component, pack- 
age) and behavioral diagrams (activity, use-case, state 
machine, interaction). 
INTRODUCTION 

Multi-class pattern recognition has a wide range of 
applications including handwritten digit recognition 
(Chiang, 1998), speech tagging and recognition (Atha- 
naselis, Bakamidis, Dologlou, Cowie, Douglas-Cowie 
& Cox, 2005), bioinformatics (Mahony, Benos, Smith & 
Golden, 2006) and text categorization (Massey, 2003). 
This chapter presents a comprehensive and competitive 
study in multi-class neural learning which combines 
different elements, such as multilogistic regression, 
neural networks and evolutionary algorithms. 

The Logistic Regression model (LR) has been widely 
used in statistics for many years and has recently been 
the object of extensive study in the machine learning 
community. Although logistic regression is a simple and 
useful procedure, it poses problems when applied 
to real classification problems, where frequently 
we cannot make the stringent assumption of additive 
and purely linear effects of the covariates. A technique 
to overcome these difficulties is to augment or replace 
the input vector with new variables, the basis functions, 
which are transformations of the input variables, and 
then to use linear models in this new space of derived 
input features. Methods like sigmoidal feed-forward 
neural networks (Bishop, 1995), generalized additive 
models (Hastie & Tibshirani, 1990), and PolyMARS 
(Kooperberg, Bose & Stone, 1997), which is a hybrid 
of Multivariate Adaptive Regression Splines (MARS) 
(Friedman, 1991) specifically designed to handle 
classification problems, can all be seen as different 
nonlinear basis function models. The major drawback of 
these approaches is the need to specify the type and the 
optimal number of the corresponding basis functions. 



Logistic regression models are usually fit by 
maximum likelihood, where the Newton-Raphson algorithm 
is the traditional way to compute the maximum 
likelihood estimates of the parameters. Typically, the 
algorithm converges, since the log-likelihood is concave. 
It is important to point out that the computation of the 
Newton-Raphson algorithm becomes prohibitive when 
the number of variables is large. 

Product Unit Neural Networks, PUNN, introduced 
by Durbin and Rumelhart (Durbin & Rumelhart, 
1989), are an alternative to standard sigmoidal neural 
networks and are based on multiplicative nodes instead 
of additive ones. 



BACKGROUND 

In the classification problem, measurements $x_i$, 
$i = 1, 2, \ldots, k$, are taken on a single individual (or 
object), and the individuals are to be classified into one 
of $J$ classes on the basis of these measurements. It is 
assumed that $J$ is finite, and the measurements $x_i$ are 
random observations from these classes. A training 
sample $D = \{(\mathbf{x}_n, \mathbf{y}_n);\ n = 1, 2, \ldots, N\}$ is available, 
where $\mathbf{x}_n = (x_{1n}, \ldots, x_{kn})$ is the vector of measurements 
taking values in $\Omega \subset \mathbb{R}^k$, and $\mathbf{y}_n$ is the class level of 
the $n$th individual. In this chapter, we will adopt the 
common technique of representing the class levels using 
a "1-of-J" encoding vector $\mathbf{y} = (y^{(1)}, y^{(2)}, \ldots, y^{(J)})$, such 
that $y^{(l)} = 1$ if $\mathbf{x}$ corresponds to an example belonging 
to class $l$ and $y^{(l)} = 0$ otherwise. Based on the training 
sample, we wish to find a decision function 
$C : \Omega \to \{1, 2, \ldots, J\}$ for classifying the individuals. In 
other words, $C$ provides a partition, say $D_1, D_2, \ldots, D_J$, 
of $\Omega$, where $D_l$ corresponds to the $l$th class, 
$l = 1, 2, \ldots, J$, and measurements belonging to $D_l$ will 
be classified as coming from the $l$th class. A 
misclassification occurs when a decision rule $C$ assigns an 
individual (based on the measurements vector) to a class 
$j$ when it is actually coming from a class $l \neq j$. 

To evaluate the performance of the classifiers we 
can define the Correctly Classified Rate by 

$$CCR = \frac{1}{N} \sum_{n=1}^{N} I\big(C(\mathbf{x}_n) = \mathbf{y}_n\big)$$

where I(.) is the zero-one loss function. A good 
classifier tries to achieve the highest possible CCR in a 
given problem. 

Suppose that the conditional probability that $\mathbf{x}$ 
belongs to class $l$ verifies: 

$$p(y^{(l)} = 1 \mid \mathbf{x}) > 0, \quad l = 1, 2, \ldots, J, \ \mathbf{x} \in \Omega$$

and set the function: 

$$f_l(\mathbf{x}, \theta_l) = \log \frac{p(y^{(l)} = 1 \mid \mathbf{x})}{p(y^{(J)} = 1 \mid \mathbf{x})}, \quad l = 1, 2, \ldots, J, \ \mathbf{x} \in \Omega$$

where $\theta_l$ is the weight vector corresponding to class $l$ 
and $f_J(\mathbf{x}, \theta_J) = 0$. Under a multinomial logistic 
regression, the probability that $\mathbf{x}$ belongs to class $l$ is 
then given by 

$$p(y^{(l)} = 1 \mid \mathbf{x}, \theta) = \frac{\exp f_l(\mathbf{x}, \theta_l)}{\sum_{j=1}^{J} \exp f_j(\mathbf{x}, \theta_j)}, \quad l = 1, 2, \ldots, J$$

where $\theta = (\theta_1, \theta_2, \ldots, \theta_{J-1})$. 

The classification rule coincides with the optimal 
Bayes' rule. In other words, an individual should be 
assigned to the class which has the maximum 
probability, given the vector measurement $\mathbf{x}$: 

$$C(\mathbf{x}) = \hat{l}, \quad \text{where } \hat{l} = \arg\max_{l} f_l(\mathbf{x}, \theta_l), \ \text{for } l = 1, \ldots, J.$$

On the other hand, because of the normalization 
condition we have: 

$$\sum_{l=1}^{J} p(y^{(l)} = 1 \mid \mathbf{x}, \theta) = 1$$

and the probability for one of the classes (in the 
proposed case, the last) need not be estimated (observe 
that we have considered $f_J(\mathbf{x}, \theta_J) = 0$). 



MULTILOGISTIC REGRESSION AND 
PRODUCT UNIT NEURAL NETWORKS 

Multilogistic Regression by using Linear and 
Product-Unit models (MLRPU) overcomes the nonlinear 
effects of the covariates by proposing a multilogistic 
regression model based on the combination of linear 
and product-unit models, where the nonlinear basis 
functions of the model are given by the product of the 
inputs raised to arbitrary powers. These basis functions 
express the possible strong interactions between the 
covariates, where the exponents are not fixed and may 
even take real values. In fitting the proposed model, 
the non-linearity of the PUNN implies that the 
corresponding Hessian matrix is generally indefinite and 
the likelihood has several local maxima. This 
justifies the use of an alternative heuristic procedure 
to estimate the parameters of the model. 

Non-Linear Model Proposed 

The general expression of the proposed model is given 
by: 

$$f_l(\mathbf{x}, \theta_l) = \alpha_0^l + \sum_{i=1}^{k} \alpha_i^l x_i + \sum_{j=1}^{m} \beta_j^l \prod_{i=1}^{k} x_i^{w_{ji}}, \quad l = 1, 2, \ldots, J-1$$

where $\theta_l = (\alpha^l, \beta^l, W)$, 
$\alpha^l = (\alpha_0^l, \alpha_1^l, \ldots, \alpha_k^l)$, 
$\beta^l = (\beta_1^l, \ldots, \beta_m^l)$ and 
$W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$, 
with $\mathbf{w}_j = (w_{j1}, w_{j2}, \ldots, w_{jk})$, $w_{ji} \in \mathbb{R}$. 
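
The model and the classification rule can be traced numerically with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: inputs are assumed strictly positive so that real exponents are well defined, class labels are the integers 1 to J, and all function names are hypothetical.

```python
import math


def product_unit(x, w):
    """Basis function prod_i x_i ** w_i (inputs assumed positive)."""
    return math.prod(xi ** wi for xi, wi in zip(x, w))


def f_l(x, alpha, beta, W):
    """alpha_0 + sum_i alpha_i * x_i + sum_j beta_j * B_j(x, w_j)."""
    linear = alpha[0] + sum(a * xi for a, xi in zip(alpha[1:], x))
    nonlinear = sum(b * product_unit(x, w) for b, w in zip(beta, W))
    return linear + nonlinear


def class_probabilities(x, alphas, betas, W):
    """Softmax over f_1..f_{J-1} plus the reference class with f_J = 0."""
    scores = [f_l(x, a, b, W) for a, b in zip(alphas, betas)] + [0.0]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x, alphas, betas, W):
    """Return the class index in 1..J with the maximum probability."""
    probs = class_probabilities(x, alphas, betas, W)
    return 1 + max(range(len(probs)), key=probs.__getitem__)


def ccr(X, labels, alphas, betas, W):
    """Correctly Classified Rate: fraction of observations classified right."""
    hits = sum(classify(x, alphas, betas, W) == y for x, y in zip(X, labels))
    return hits / len(X)
```

For example, with J = 2, one product unit with exponents (1, 1), `alphas = [[-1.0, 0.0, 0.0]]` and `betas = [[1.0]]`, the score is f_1 = x_1 x_2 - 1, so the point (2, 2) is assigned to class 1 and the point (0.5, 0.5) to class 2.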



As has been stated before, the nonlinear part of 
$f_l(\mathbf{x}, \theta_l)$ corresponds to Product Unit Neural Networks 
(PUNN), introduced by Durbin and Rumelhart (Durbin 
& Rumelhart, 1989) and subsequently developed by 
other authors (Janson & Frenzel, 1993), (Leerink, Giles, 
Horne & Jabri, 1995), (Ismail & Engelbrecht, 2000), 
(Martinez-Estudillo, Hervas-Martinez, 
Martinez-Estudillo & Garcia-Pedrajas, 2006), (Martinez-Estudillo, 
Martinez-Estudillo, Hervas-Martinez & 
Garcia-Pedrajas, 2006). Advantages of product-unit based neural 
networks include increased information capacity and 
the ability to form higher-order combinations of inputs. 
They are universal approximators and it is possible to 
obtain upper bounds of the VC dimension of product 
unit neural networks that are similar to those obtained 
for sigmoidal neural networks (Schmitt, 2001). Despite 
these advantages, product-unit based neural 
networks have a major drawback: their training is 
more difficult than the training of standard sigmoidal 
based networks (Durbin & Rumelhart, 1989). The 
main reason for this difficulty is that small changes 
in the exponents can cause large changes in the total 
error surface. Hence, networks based on product units 
have more local minima and a greater probability of 
becoming trapped in them. It is well-known (Janson & 
Frenzel, 1993) that back-propagation is not efficient in 
training product units. Several efforts have been made 
to develop learning methods for product units (Leerink, 
Giles, Horne & Jabri, 1995), (Martinez-Estudillo, 
Martinez-Estudillo, Hervas-Martinez, & Garcia-Pedrajas, 
2006), mainly in a regression context. 

Estimation of the Model Coefficients 

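In the supervised learning context, the components of 
the weight vectors $\theta = (\theta_1, \theta_2, \ldots, \theta_{J-1})$ are estimated 
from the training dataset D. To perform the maximum 
likelihood (ML) estimation of $\theta$, one can minimize the 
negative log-likelihood function 

$$L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{x}_n, \theta) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J} y_n^{(l)} f_l(\mathbf{x}_n, \theta_l) + \log \sum_{l=1}^{J} \exp f_l(\mathbf{x}_n, \theta_l) \right]$$

where, as before, $f_J(\mathbf{x}_n, \theta_J) = 0$. 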

The error surface associated with the proposed 
model is very convoluted, with numerous local optima. 
The non-linearity of the model with respect to 
the parameters $\theta_l$ and the indefinite character of the 
associated Hessian matrix make gradient-based methods 
unsuitable for maximizing the log-likelihood 
function. Moreover, the optimal number of basis 
functions of the model (i.e. the number of hidden nodes 
in the product-unit neural network) is unknown. Thus, 
the estimation of the parameter vector $\theta$ is carried out 
by means of a hybrid procedure described below. 

Below we describe in detail the different aspects 
of the MLRPU methodology. The process is structured 
in four steps: 

Step 1. We apply an evolutionary neural network 
algorithm to find the basis functions 

$$B(\mathbf{x}, W) = \{B_1(\mathbf{x}, \mathbf{w}_1), B_2(\mathbf{x}, \mathbf{w}_2), \ldots, B_m(\mathbf{x}, \mathbf{w}_m)\}$$

corresponding to the nonlinear part of $f(\mathbf{x}, \theta)$. We 
have to determine the number of basis functions 
$m$ and the weight vector $W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$. 

To apply evolutionary neural network techniques, 
we consider a PUNN with the following structure (Fig. 
1): an input layer with a node for every input variable, 
a hidden layer with several nodes, and an output layer 
with one node for each category. The activation function 
of the j-th node in the hidden layer is given by 

$$B_j(\mathbf{x}, \mathbf{w}_j) = \prod_{i=1}^{k} x_i^{w_{ji}}$$

where $w_{ji}$ is the weight of the connection between 
input node $i$ and hidden node $j$ and 
$\mathbf{w}_j = (w_{j1}, \ldots, w_{jk})$. The activation function of the 
output node $l$ is given by 

$$g_l(\mathbf{x}, \beta^l, W) = \beta_0^l + \sum_{j=1}^{m} \beta_j^l B_j(\mathbf{x}, \mathbf{w}_j)$$

where $\beta_j^l$ is the weight of the connection between the 
hidden node $j$ and the output node $l$. The transfer 
function of all output nodes is the identity function. 

Figure 1. Model of a product-unit based neural network 

The weight vector $W = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m)$ is estimated 
by means of an evolutionary programming algorithm 
detailed in (Hervas-Martinez, Martinez-Estudillo & 
Gutierrez, 2006), that optimizes the error function 
given by the negative log-likelihood for $N$ observations 
associated with the product-unit model: 

$$L^*(\beta, W) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J} y_n^{(l)} g_l(\mathbf{x}_n, \beta^l, W) + \log \sum_{l=1}^{J} \exp g_l(\mathbf{x}_n, \beta^l, W) \right]$$

Although in this step the evolutionary process 
obtains a concrete value for the $\beta$ vector, we only consider 
the estimated weight vector $\hat{W} = (\hat{\mathbf{w}}_1, \hat{\mathbf{w}}_2, \ldots, \hat{\mathbf{w}}_m)$ that 
builds the basis functions. The value for the $\beta$ vector 
will be determined in Step 3 together with the $\alpha$ 
coefficient vector. 

Step 2. We consider the following transformation of 
the input space by adding the nonlinear basis 
functions obtained by the evolutionary algorithm 
in step 1: 

$$H : \mathbb{R}^k \to \mathbb{R}^{k+m}, \quad (x_1, x_2, \ldots, x_k) \mapsto (x_1, x_2, \ldots, x_k, z_1, \ldots, z_m)$$

where $z_1 = B_1(\mathbf{x}, \hat{\mathbf{w}}_1), \ldots, z_m = B_m(\mathbf{x}, \hat{\mathbf{w}}_m)$. 

Step 3. We minimize the negative log-likelihood 
function for $N$ observations: 

$$L(\alpha, \beta) = \frac{1}{N} \sum_{n=1}^{N} \left[ -\sum_{l=1}^{J} y_n^{(l)} (\alpha^l \mathbf{x}_n + \beta^l \mathbf{z}_n) + \log \sum_{l=1}^{J} \exp(\alpha^l \mathbf{x}_n + \beta^l \mathbf{z}_n) \right]$$

where $\mathbf{x}_n = (1, x_{1n}, \ldots, x_{kn})$ and the coefficients 
of class $J$ are fixed at zero, in line with 
$f_J = 0$. Now, the Hessian matrix 
of the negative log-likelihood in the new variables 
$x_1, x_2, \ldots, x_k, z_1, \ldots, z_m$ is positive semidefinite. Then, 
we can apply Newton's method, also known, in 
this case, as Iteratively Reweighted Least Squares 
(IRLS). Although there are other methods for 
performing this optimization, none clearly 
outperforms IRLS (Minka, 2003). The estimated vector 
coefficient $\hat{\theta} = (\hat{\alpha}, \hat{\beta}, \hat{W})$ determines the model: 

$$f_l(\mathbf{x}, \hat{\theta}) = \hat{\alpha}_0^l + \sum_{i=1}^{k} \hat{\alpha}_i^l x_i + \sum_{j=1}^{m} \hat{\beta}_j^l \prod_{i=1}^{k} x_i^{\hat{w}_{ji}}, \quad l = 1, 2, \ldots, J-1$$
Step 4. In order to select the final model, we use a backward stepwise procedure, starting with the full model with all the covariates (initial and PU covariates) and successively pruning variables from the model until further pruning does not improve the fit.
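The covariate augmentation of Step 2 and the objective of Step 3 can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, inputs are assumed strictly positive so that real exponents are well defined, and the last class is assumed to be the reference class with score 0.

```python
import math

def product_unit(x, w):
    """Basis function B(x, w) = prod_i x_i ** w_i (inputs assumed strictly positive)."""
    return math.prod(xi ** wi for xi, wi in zip(x, w))

def augment(x, W):
    """Map (x_1..x_k) to (x_1..x_k, z_1..z_m), with z_j = B_j(x, w_j)."""
    return list(x) + [product_unit(x, w) for w in W]

def neg_log_likelihood(X, Y, coefs):
    """Mean multinomial negative log-likelihood.

    coefs[l] holds the intercept followed by one coefficient per covariate for
    class l; the last class is the reference with score 0 (an assumption here).
    Y contains integer class indices; the reference class has index len(coefs).
    """
    total = 0.0
    for x, y in zip(X, Y):
        scores = [c[0] + sum(ci * xi for ci, xi in zip(c[1:], x)) for c in coefs]
        scores.append(0.0)  # reference class
        total += math.log(sum(math.exp(s) for s in scores)) - scores[y]
    return total / len(X)

# Hypothetical example: k = 2 inputs, m = 1 product unit, two classes.
W = [[1.0, 0.5]]
x_aug = augment([2.0, 4.0], W)
```

With all coefficients set to zero, every class is equally likely and the function returns log 2 for a two-class problem, which is a convenient sanity check before running any optimizer on it.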

Application to Remote Sensing 

We have tested the proposed methodology on a real agronomical problem of precision farming: mapping weed patches in crop fields from remotely sensed data.

Remote sensing systems can provide a large amount of continuous field information at a reasonable cost. Remotely sensed imagery shows great potential for modelling different agronomic parameters for application in precision farming. One way of minimizing the impact of agriculture on environmental quality is the development of more efficient approaches for crop production determination and for site-specific weed management. Potential economic and environmental benefits of site-specific herbicide applications include reduced spray volume, herbicide costs and non-target spraying, and increased control of weeds (Thompson, Stafford & Miller, 1991; Medlin, Shaw, Gerard & LaMastus, 2000).

We address the problem of mapping weed patches through the analysis of aerial photographs. Images and data sets were provided by the Precision Farming and Remote Sensing Unit of the Department of Crop Protection, Institute of Sustainable Agriculture (CSIC, Spain), whose members reported previous results in predicting Ridolfia segetum Moris patches (Pena-Barragan, Lopez-Granados, Jurado-Exposito & Garcia-Torres, 2007). The data analyzed correspond to a study conducted in 2003 at the 42 ha farm Matabueyes, naturally infested by R. segetum. In a field study, the nature of 2,400 pixels was determined, and these were taken as ground-truth pixels: 800 pixels were classified as R. segetum and 800 pixels were classified as R. segetum-free pixels.

Input variables include the digital values of all bands in each available image, that is: Red (R), Green (G) and Blue (B) for the June image, and R, G, B and Near Infrared (NIR) for the May and July images. The experimental design was conducted using a stratified holdout cross-validation procedure, where the size of the training set was approximately 0.7n (1,120 pixels) and the size of the generalization set 0.3n (480 pixels), n being the size of the full dataset.
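A stratified holdout split of this kind can be sketched as follows. The helper name, the fixed seed and the label encoding (0 for weed-free, 1 for R. segetum) are assumptions made for illustration; the source does not describe how the split was implemented.

```python
import random

def stratified_holdout(labels, train_fraction=0.7, seed=0):
    """Split indices into training and generalization sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# 800 weed-free and 800 R. segetum pixels, as in the sizes quoted above.
labels = [0] * 800 + [1] * 800
train_idx, test_idx = stratified_holdout(labels)
```

Because the cut is made inside each class, both sets keep the 50/50 class balance of the ground-truth pixels (560 + 560 for training, 240 + 240 for generalization).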

In all experiments, the EA has been applied with the same parameters. SPSS 13.0 software (SPSS, 2005) was used to apply the IRLS algorithm and to select the most significant variables in the final model through a backward stepwise procedure.

The models compared in the different experiments are the following: firstly, we extract the best PUNN model of the EA (EPUNN); secondly, we obtain the standard Logistic Regression model using the initial covariables (LR); finally, we apply Logistic Regression only over the basis functions extracted from the EPUNN model (MLRPU) and over the same basis functions together with the initial covariables (MLRLPU).

Results 

Performance of each model has been evaluated using the Correctly Classified Ratio in the generalization set (CCR_G). Table 1 shows the classification matrices over the training and generalization sets for the three classification problems (one at each date: May, June and July) and the four models proposed (EPUNN, LR, MLRPU and MLRLPU). The best CCR_G results are obtained with MLRPU and MLRLPU in May and June, although in July the EPUNN model yields the best results. At all dates, differences between standard LR and hybrid LR (MLRPU and MLRLPU) are very significant. Table 2 includes the models obtained at the date that leads to the best classification results, that is, June.
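As a small worked example, the CCR can be recomputed from a classification matrix. The sketch below uses the EPUNN generalization counts for mid-June read from Table 1; the function name is a hypothetical choice.

```python
def ccr(confusion):
    """Correctly Classified Ratio (%): confusion[i][j] counts pixels of true
    class i that were predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# EPUNN, generalization set, mid-June (rows: true Y=0, Y=1; columns: predicted Y=0, Y=1).
june_epunn = [[236, 4], [2, 238]]
overall = ccr(june_epunn)
```

Here 474 of the 480 generalization pixels are classified correctly, i.e. 98.75%, which Table 1 reports as 98.7.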

Using these models we can obtain the probability of R. segetum presence at all pixels of the image, including non-ground-truth pixels. Figures 1, 2, 3 and 4 show the weed maps obtained at June using the four proposed models, EPUNN, LR, MLRPU and MLRLPU, respectively. Weed presence probability is represented on a scale from white (minimum probability, nearly 0) to dark green (maximum probability, nearly 1). From these maps, the agronomical expert can decide which probability threshold to use for applying herbicide. The MLRLPU and MLRPU models differentiate more clearly between high weed density zones and weed-free zones, so they are of greater interest from the point of view of intelligent site-specific herbicide application.
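To illustrate how such a map might be turned into a spraying decision, the sketch below evaluates the LR model of Table 2 on every pixel and applies a threshold. Several points are assumptions: the band symbols A and V of Table 2 are read here as blue and green, the scaling of the band values is not stated in the source, and the 0.5 threshold and the pixel values are purely hypothetical.

```python
import math

def lr_presence_probability(blue, green, red):
    """LR model of Table 2, with coefficients as printed there.
    Band scaling is an assumption; the source does not state it."""
    score = -0.694 + 8.282 * blue - 63.342 * green - 11.402 * red
    return 1.0 / (1.0 + math.exp(-score))

def herbicide_mask(pixels, threshold=0.5):
    """Return True where the predicted presence probability reaches the threshold."""
    return [[lr_presence_probability(*px) >= threshold for px in row]
            for row in pixels]

# Hypothetical 1 x 2 image of (blue, green, red) values.
image = [[(0.9, 0.0, 0.0), (0.0, 0.9, 0.9)]]
mask = herbicide_mask(image)
```

The same pattern applies to the other three models; only the scoring function changes.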
Table 1. Classification matrices (Y=0, R. segetum absence; Y=1, R. segetum presence) at all dates. Each cell lists four values in this order: best Evolutionary Product Unit Neural Network (EPUNN); Logistic Regression (LR); Logistic Regression only with Product Units (MLRPU), in brackets; and Logistic Regression with initial covariables and Product Units (MLRLPU), in square brackets.

| Phenological stage (date) | Ground truth response | Training: predicted Y=0 | Training: predicted Y=1 | Training: CCR (%) | Generalization: predicted Y=0 | Generalization: predicted Y=1 | Generalization: CCR (%) |
|---|---|---|---|---|---|---|---|
| Vegetative (mid-May) | Y=0 | 384, 352, (383), [394] | 176, 208, (177), [166] | 68.5, 62.9, (68.4), [70.4] | 164, 148, (164), [168] | 76, 92, (76), [72] | 68.3, 61.7, (68.3), [70] |
| Vegetative (mid-May) | Y=1 | 133, 171, (136), [141] | 427, 389, (424), [419] | 76.2, 69.5, (75.7), [74.8] | 65, 69, (67), [69] | 175, 171, (173), [171] | 72.9, 71.3, (72.1), [71.3] |
| Vegetative (mid-May) | CCR (%) | | | 72.4, 66.2, (72.1), [72.6] | | | 70.6, 66.5, (70.2), [70.6] |
| Flowering (mid-June) | Y=0 | 547, 529, (547), [552] | 13, 31, (13), [8] | 97.7, 94.5, (97.7), [98.6] | 236, 226, (237), [238] | 4, 14, (3), [2] | 98.3, 94.2, (98.8), [99.2] |
| Flowering (mid-June) | Y=1 | 7, 30, (9), [11] | 553, 530, (551), [549] | 98.8, 94.6, (98.4), [98] | 2, 12, (2), [2] | 238, 228, (238), [238] | 99.2, 95, (99.2), [99.2] |
| Flowering (mid-June) | CCR (%) | | | 98.2, 94.6, (98), [98.3] | | | 98.7, 94.6, (99), [99.2] |
| Senescence (mid-July) | Y=0 | 443, 296, (443), [447] | 117, 264, (117), [113] | 79.1, 52.9, (79.1), [79.8] | 195, 138, (425), [189] | 45, 102, (55), [51] | 81.2, 57.5, (88.5), [78.8] |
| Senescence (mid-July) | Y=1 | 105, 131, (111), [117] | 455, 429, (449), [443] | 81.2, 76.6, (80.2), [79.1] | 52, 60, (53), [50] | 188, 180, (187), [190] | 78.3, 75, (77.9), [79.2] |
| Senescence (mid-July) | CCR (%) | | | 80.1, 64.7, (79.6), [79.5] | | | 79.8, 66.3, (79.6), [79] |



Table 2. Models obtained at June for the determination of R. segetum presence probability (P) in order to obtain weed patch maps

| Methodology | #coef. | Model |
|---|---|---|
| EPUNN | 8 | P = 1/(1+exp(-(-0.424 + 75.419(V^4.633) + 0.322(R^1.888) + 14.990(A^3.496 V^-3.415)))) |
| LR | 4 | P = 1/(1+exp(-(-0.694 + 8.282(A) - 63.342(V) - 11.402(R)))) |
| MLRPU | 7 | P = 1/(1+exp(-(-17.227 + 143.012(V^4.633) + 0.636(R^1.888) + 23.021(A^3.496 V^-3.415)))) |
| MLRLPU | | P = 1/(1+exp(-(18.027 + 130.674(A) - 133.662(V) - 29.346(R) + 353.147(V^4.633) - 3.396(B^3.496 G^-3.415)))) |
FUTURE TRENDS

The concepts presented in this chapter offer the possibility of developing new models of multi-class generalized linear regression by considering different types of basis functions (Sigmoidal Units, Radial Basis Functions and Product Units) for the non-linear part of the proposed model. Moreover, future research could include ordinal logistic regression models or probit models with different basis functions.



Figure 1. EPUNN R.segetum presence probability 
map 



Figure 2. LR R.segetum presence probability map 





Figure 3. MLRPU R.segetum presence probability 
map 



Figure 4. MLRLPU R.segetum presence probability 
map 
CONCLUSION

To the best of our knowledge, the approach presented 
in this paper is a study in multi-class neural learning 
which combines three tools used in machine learning 
research: the logistic regression, the product-unit neural 
network model and the evolutionary neural network 
paradigm. Logistic regression is a well-tested statistical 
approach that performs well in two-class classification 
and can naturally be generalized to the multi-class 
case. On the other hand, product-unit neural network 
models are an alternative to standard sigmoidal neural 
networks with the ability to capture non-linear interac- 
tion between the input variables. Finally, evolutionary 
artificial neural networks present an interesting platform 
for optimizing both network performance and architecture simultaneously. An adequate combination of these three elements, carried out in several steps in our proposal, provides a competitive methodology for solving classification problems.
KEY TERMS

Artificial Neural Networks: A network of many 
simple processors ("units" or "neurons") that imitates 
a biological neural network. The units are connected 
by unidirectional communication channels, which 
carry numeric data. Neural networks can be trained 
to find nonlinear relationships in data, and are used 
in applications such as robotics, speech recognition, 
signal processing or medical diagnosis. 

Evolutionary Computation: Computation based 
on iterative progress, such as growth or development 
in a population. This population is selected in a guided 
random search using parallel processing to achieve the 
desired solution. Such processes are often inspired by 
biological mechanisms of evolution. 

Evolutionary Programming: One of the four major evolutionary algorithm paradigms, with no fixed structure or representation, in contrast with some of the other evolutionary paradigms. Its main variation operator is mutation.



Iteratively Reweighted Least Squares (IRLS): Numerical algorithm for minimizing any specified objective function using a standard weighted least squares method such as Gaussian elimination. It is widely applied in Logistic Regression.

Logistic Regression: Statistical regression model 
for Bernoulli-distributed dependent variables. It is 
a generalized linear model that uses the logit as its 
link function. Logistic regression applies maximum 
likelihood estimation after transforming the dependent 
into a logit variable (the natural log of the odds of the 
dependent occurring or not). 

Precision Farming: Use of new technologies, such as global positioning (GPS), sensors, satellites or aerial images, and information management tools (GIS) to assess and understand in-field variability in agriculture. Collected information may be used to evaluate optimum sowing density more precisely, to estimate fertilizer and other input needs, and to predict crop yields more accurately.

Product Unit Neural Networks: Alternative to 
standard sigmoidal neural networks, based on multi- 
plicative nodes instead of additive ones. Concretely, 
the output of each hidden node is the product of all its 
inputs raised to a real exponent. 

Remote Sensing: Small- or large-scale acquisition of information about an object or phenomenon, by the use of either recording or real-time sensing devices that are not in physical or intimate contact with the object (such as by way of aircraft, spacecraft, satellite, or ship).
INTRODUCTION

Real world optimization problems are often too complex 
to be solved through analytical means. Evolutionary 
algorithms, a class of algorithms that borrow paradigms 
from nature, are particularly well suited to address such 
problems. These algorithms are stochastic methods 
of optimization that have become immensely popular 
recently, because they are derivative-free methods, are 
not as prone to getting trapped in local minima (as they 
are population based), and are shown to work well for 
many complex optimization problems. 

Although evolutionary algorithms have convention- 
ally focussed on optimizing single objective functions, 
most practical problems in engineering are inherently 
multi-objective in nature. Multi-objective evolutionary 
optimization is a relatively new, and rapidly expanding 
area of research in evolutionary computation that looks 
at ways to address these problems. 

In this chapter, we provide an overview of some of 
the most significant issues in multi-objective optimiza- 
tion (Deb, 2001). 



BACKGROUND 

Arguably, Genetic Algorithms (GAs) are one of the most common evolutionary optimization approaches. These algorithms maintain a population of candidate solutions, called chromosomes, in each generation. Each chromosome corresponds to a point in the algorithm's search space. GAs use three Darwinian operators (selection, mutation, and crossover) to perform their search (Mitchell, 1998). Each generation is improved by systematically removing the poorer solutions while retaining the better ones, based on a fitness measure. This process is called selection. Binary tournament selection and roulette wheel selection are two popular methods of selection. In binary tournament selection, two solutions, called parents, are picked randomly from the population, with replacement, and their fitness compared, while in roulette wheel selection, the probability of a solution being picked is made directly proportional to its fitness.
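The two selection schemes can be sketched as below. This is a minimal illustration under stated assumptions: fitness is to be maximized, roulette wheel fitness values are non-negative and not all zero, and the function names and the `rng` argument are hypothetical.

```python
import random

def binary_tournament(population, fitness, rng):
    """Pick two solutions at random (with replacement) and keep the fitter one."""
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b

def roulette_wheel(population, fitness, rng):
    """Pick one solution with probability proportional to its fitness."""
    weights = [fitness(s) for s in population]  # assumed non-negative, not all zero
    return rng.choices(population, weights=weights, k=1)[0]

rng = random.Random(1)
population = [1, 2, 3, 4]
winner = binary_tournament(population, lambda s: s, rng)
```

Tournament selection only needs a ranking of the two candidates, whereas roulette wheel selection depends on the actual fitness magnitudes, so it is more sensitive to how the fitness is scaled.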

Following selection, the crossover operator is applied. Usually, two parent solutions from the current generation are picked randomly for producing offspring to populate the next generation of solutions. The offspring are created from the parent solutions in such a manner that they bear characteristics from both. The offspring chromosomes are probabilistically subject to another operator called mutation, which is the addition of small random perturbations. Only a few solutions undergo mutation. Evolutionary Strategies (ES) form another class of evolutionary algorithms that is closely related to GAs and uses similar operators.
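A sketch of the two variation operators for real-valued chromosomes follows. One-point crossover and Gaussian perturbation are illustrative choices made here, not operators prescribed by the text, and the parameter names `rate` and `sigma` are hypothetical.

```python
import random

def one_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def gaussian_mutation(chromosome, rate, sigma, rng):
    """Add a small Gaussian perturbation to each gene with probability `rate`."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g
            for g in chromosome]

rng = random.Random(0)
child1, child2 = one_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], rng)
mutated = gaussian_mutation([0.5, 0.5, 0.5], rate=0.1, sigma=0.05, rng=rng)
```

Each child carries a prefix from one parent and a suffix from the other, so both bear characteristics of both parents; a low mutation rate keeps most genes unchanged.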

Particle Swarm Optimization (PSO) is a more recent approach (Clerc, 2005). It is modeled after the social behavior of organisms such as a flock of birds or a school of fish, and is thus only loosely classified as an evolutionary approach. Each solution within the population in PSO, called a particle, has a unique position in the search space. In each generation, the position of each particle is updated by the addition of the particle's own velocity to it. The velocity of a particle, a vector, is then incremented towards the best location encountered in the particle's own history (called the individual best), as well as the best position in the current iteration (called the global best).
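One common way to write this update is sketched below. The inertia weight and the acceleration coefficients `c1` and `c2` are conventional parameters assumed here; they are not specified in the text. Note also that this sketch updates the velocity first and then the position, which is the usual ordering in PSO implementations.

```python
import random

def pso_step(position, velocity, personal_best, global_best, rng,
             inertia=0.7, c1=1.5, c2=1.5):
    """Return the new position and velocity of one particle."""
    new_velocity = [
        inertia * v
        + c1 * rng.random() * (pb - x)   # pull towards the individual best
        + c2 * rng.random() * (gb - x)   # pull towards the global best
        for x, v, pb, gb in zip(position, velocity, personal_best, global_best)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

rng = random.Random(0)
pos, vel = pso_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], rng)
```

A particle that already sits on both its individual best and the global best with zero velocity does not move, while any other particle is drawn towards a random blend of the two.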



EVOLUTIONARY ALGORITHMS FOR 
MULTI-OBJECTIVE OPTIMIZATION 

Multi-Objective Optimization 

When dealing with optimization problems with multiple objectives, the conventional theories of optimality cannot be applied. Instead, the concepts of dominance and Pareto-optimality are used. Without loss of generality, we will assume that the optimization problem involves the simultaneous minimization of several objectives only. If these objective functions are $f_i(\cdot)$, i = 1, ..., M, a solution x is said to dominate another solution y if and only if for all i, $f_i(x) \le f_i(y)$, with at least one of the inequalities being strict. In other words, x dominates y if and only if x is as good as y for all objectives and better in at least one. This relationship is written $x \succ y$. In the set of all feasible solutions, the subset whose members are not dominated by any other in the set is called the Pareto set. In other words, if S is the search space, the Pareto set P is given by $P = \{x \in S \mid \forall y \in S,\ y \succ x \text{ is false}\}$. The image of the Pareto set P in the M-dimensional objective function space is called the Pareto front, F. Thus, $F = \{(f_1(x), f_2(x), \ldots, f_M(x)) \mid x \in P\}$.

Figure 1. An illustration of convergence and diversity concepts in multi-objective optimization algorithms. The objective functions $f_1$ and $f_2$ are to be minimized.
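The dominance relation and the non-dominated subset of a finite set of objective vectors translate directly into code. The sketch below assumes minimization of every objective, as in the text; the function names are hypothetical.

```python
def dominates(fx, fy):
    """True if objective vector fx dominates fy (all objectives minimized)."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))

def non_dominated(points):
    """Members of `points` that are not dominated by any other member."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

objective_vectors = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
front = non_dominated(objective_vectors)
```

In this small example (3, 3) is dominated by (2, 2), and (4, 4) is dominated by every other point, so only the first three vectors remain. For a finite population this is an approximation of the Pareto front, not the true front of the underlying problem.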

The goal of a multi-objective optimization algorithm is twofold. Firstly, its output, the set of non-dominated solutions in the population, must be as close to the true Pareto front as possible. This feature is called convergence. Secondly, in addition to good convergence, the multi-objective evolutionary algorithm should also yield solutions that sample the front at approximately regularly spaced intervals, a feature that is usually referred to as diversity. Outputs where the solutions are clustered in a few regions of the front, while other regions are either omitted or poorly sampled, are not desirable. Figure 1 illustrates the concepts of good convergence and diversity.

In order to handle multi-objective optimization tasks, an evolutionary algorithm must be equipped to discriminate between solutions using either convergence or diversity as the criterion for comparison. When using convergence, the majority of current evolutionary algorithms make use of one of two basic ranking schemes that were originally put forth by Goldberg (Goldberg, 1989). The first is a method that shall be referred to here as domination counting. Within a population of solutions, the rank of any solution is the number of other solutions in the population that dominate it. Clearly, the non-dominated solutions in the population are assigned counts of zero. The second approach will be called non-dominated sorting. Here, ranks are assigned to each solution in a population in such a manner that solutions that have the same rank do not dominate one another, and each solution is assigned a lower rank than any solution that it dominates and, in turn, is ranked higher than those dominating it. As before, non-dominated solutions in the population are assigned a rank of zero. Both concepts are illustrated in Figure 2. The numbers corresponding to each solution in Figure 2 (left) are the domination counts. The figure shows specifically the solution with a count of 3 being dominated by three others (with ranks 0, and 2). In Figure 2 (right), the solutions with equal rank (the ranks being 0, 1 or 2) are grouped together. Solutions with a rank of 0 are non-dominated ones. They dominate those with ranks 1 or 2. When they are removed, those with ranks of 1 are no longer dominated. Removing rank 1 solutions makes the rank 2 solution non-dominated.
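Both ranking schemes can be sketched briefly. The non-dominated sorting below uses the simple "peel off the current front" procedure described above rather than any optimized subroutine, and all names are hypothetical.

```python
def dominates(fx, fy):
    """True if fx dominates fy (all objectives minimized)."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))

def domination_counts(points):
    """Rank of each point = number of other points that dominate it."""
    return [sum(dominates(q, p) for q in points) for p in points]

def non_dominated_ranks(points):
    """Repeatedly remove the current non-dominated set, assigning ranks 0, 1, 2, ..."""
    ranks = [None] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

population = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
counts = domination_counts(population)
ranks = non_dominated_ranks(population)
```

The two schemes agree on the non-dominated solutions (rank 0) but differ elsewhere: the point (4, 4) has a domination count of 4 because four points dominate it, yet its non-dominated sorting rank is only 2.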

Multi-objective evolutionary algorithms must also be equipped with the ability to distinguish solutions that are in sparser regions of the M-dimensional space of objective functions from those in denser ones. The three main approaches used to do so are illustrated in Figure 3. The first of these methods is to consider a bounding hypercube around each solution in the objective function space that does not enclose any other solution (Deb, Pratap, Agarwal & Meyarivan, 2002). Neighboring solutions will be located at some of the corners of this hypercube. This is shown in Figure 3 (left), where two solutions, a and b, have been enclosed by hypercubes. The perimeters of the hypercubes are considered as measures of diversity. Solutions whose bounding hypercubes have a larger perimeter are considered to be located in sparser regions than those with smaller ones. The second approach (Knowles & Corne, 2000) is to superimpose an M-dimensional hypergrid on the objective function space, and consider the number of solutions that lie in each of the hypergrid's cells as a measure of how dense the region around the cell is. In Figure 3 (middle), since b occupies the same cell as another solution, whereas a does not, the latter is regarded as being placed in a sparser region. The last approach computes the distance to the k-th nearest neighbor of each solution (Zitzler, Laumanns & Thiele, 2001). This situation is depicted in Figure 3 (right), where the solutions a and b have been connected to their nearest neighbors (k=1). Solutions that lie at greater distances from their neighbors are considered to be in sparser regions of the objective function space.
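The three diversity measures can be sketched as follows. Several details are assumptions made for illustration: the hypercube measure is computed as the sum of neighbor gaps per objective (half the perimeter) without normalization, boundary solutions are given an infinite value, the hypergrid uses a single hypothetical `cell_size`, and distances are Euclidean.

```python
import math
from collections import Counter

def hypercube_measure(points):
    """Per point: sum over objectives of the gap between its two neighbours."""
    n, m = len(points), len(points[0])
    measure = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: points[i][k])
        measure[order[0]] = measure[order[-1]] = math.inf  # boundary solutions
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            measure[cur] += points[nxt][k] - points[prev][k]
    return measure

def grid_counts(points, cell_size):
    """Per point: number of solutions sharing its hypergrid cell."""
    cells = [tuple(math.floor(v / cell_size) for v in p) for p in points]
    occupancy = Counter(cells)
    return [occupancy[c] for c in cells]

def kth_neighbor_distance(points, k=1):
    """Per point: Euclidean distance to its k-th nearest neighbour."""
    return [sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[k - 1]
            for i, p in enumerate(points)]

front = [(0.0, 4.0), (1.0, 2.0), (1.5, 1.5), (4.0, 0.0)]
sparsity = hypercube_measure(front)
```

Larger values of the hypercube measure and of the k-th neighbor distance indicate sparser regions, whereas for the hypergrid a smaller cell count indicates a sparser region.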

Since elitism, the guaranteed survival of the fittest 
solutions per generation, shows faster convergence in 
single objective evolutionary algorithms, this feature 
has also been incorporated into most current multi- 
objective evolutionary algorithms. Elitism is ensured 
by means of an archive that stores the best solutions 
in each generation. Quite often, the archived solutions 
are reinserted back into the main population. Archiving 
is implemented via schemes that we will refer to, col- 
lectively, as elite preservation. 

A Few Recent Evolutionary Algorithms 

(i) NSGA-II (Deb, Pratap, Agarwal & Meyarivan, 2002): NSGA-II (Non-dominated Sorting Genetic Algorithm) maintains a population and a separate archive, each of size N. In each generation, the population is merged with the archive. This merged set is then subject to elite preservation. The new archive is filled by taking the N best-ranked solutions obtained from elite preservation. The same N individuals are also subject to tournament selection, crossover, and mutation to form the population for the next generation.

Elite preservation is implemented in NSGA-II by using non-dominated sorting for convergence, and the hypercube method for diversity. It proposes an O(MN^2) subroutine for fast non-dominated sorting to assign ranks to the solutions in the merged population. Starting with the lowest ranked solutions, the archive is filled until it reaches its full capacity N.

Figure 2. Discriminating between solutions based on convergence

Figure 3. Discriminating between solutions based on diversity

Diversity is invoked to further discriminate between mutually non-dominating solutions when not all solutions of an identical rank can be inserted into the archive. NSGA-II incorporates an algorithm of order O(MN log(N)) to do so. Because non-dominated sorting is the computationally rate-limiting portion in NSGA-II, the overall complexity is O(MN^2) per generation.
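The archive-filling rule can be illustrated with a short sketch that takes precomputed ranks and diversity values as input. It is a simplification: the function name is hypothetical, and the ranks and diversity values in the example are made up rather than produced by NSGA-II's own subroutines.

```python
import math

def fill_archive(ranks, diversity, capacity):
    """Return indices of the solutions kept in the archive: lower rank first,
    and within a rank, solutions from sparser regions (larger diversity) first."""
    order = sorted(range(len(ranks)), key=lambda i: (ranks[i], -diversity[i]))
    return order[:capacity]

# Hypothetical merged population of five solutions.
ranks = [0, 0, 0, 1, 2]
diversity = [math.inf, 4.0, math.inf, 1.0, 1.0]
archive = fill_archive(ranks, diversity, capacity=2)
```

With a capacity of 2, not all three rank-0 solutions fit, so the diversity value decides: the two boundary solutions are kept and the more crowded one is dropped.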

(ii) SPEA-2 (Zitzler, Laumanns & Thiele, 2001): The SPEA-2 (Strength Pareto Evolutionary Algorithm) method is quite similar to NSGA-II. It also maintains a population and an archive of size N, merging both at the beginning of every generation and making use of elite preservation to identify the best solutions to undergo crossover and mutation for the next generation. In SPEA-2, individual fitness values are computed in a two-step manner that is an improved version of the basic domination counting approach discussed earlier. First to be computed is the strength of each solution in the merged population, i.e., the number of solutions it dominates. Then, the raw fitness of each individual is computed as the sum of the strengths of all the solutions that dominate it.

In (Zitzler, Laumanns & Thiele, 2001) it is argued that this method of computing fitness imparts SPEA-2 with some capability to preserve diversity. However, in itself, raw fitness is inadequate to do so, and so a separate term is added to it that explicitly takes diversity into account. This second term, added to each solution's fitness, is inversely related to the distance from the solution to its k-th nearest neighbor in objective function space. The overall algorithm complexity of SPEA-2 is O(MN^3) per generation.
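The two-step raw fitness computation can be sketched as follows (minimization assumed, names hypothetical). The additional diversity term is omitted from this sketch.

```python
def dominates(fx, fy):
    """True if fx dominates fy (all objectives minimized)."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))

def strength_and_raw_fitness(points):
    """Step 1: strength = number of solutions a point dominates.
    Step 2: raw fitness = sum of the strengths of the points dominating it."""
    n = len(points)
    strength = [sum(dominates(points[i], points[j]) for j in range(n))
                for i in range(n)]
    raw = [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
           for i in range(n)]
    return strength, raw

merged = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
strength, raw = strength_and_raw_fitness(merged)
```

Non-dominated solutions receive a raw fitness of zero, and a solution dominated by many strong solutions receives a high (poor) value, which is how this scheme refines plain domination counting.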

(iii) MOPSO (Coello Coello, 2004): MOPSO (Multi-objective Particle Swarm Optimization) is an approach for fast multi-objective optimization that is based on PSO. It imposes population diversity by means of the M-dimensional hypergrid described earlier, counting the number of solutions present in each of the hypergrid's cells. Solutions occupying cells with lower counts are preferred to those with higher counts. An even distribution of solutions along the Pareto front is achieved by biasing particles to update their velocities towards global best particles that are located in sparser cells of the hypergrid, i.e. with lower counts. This is done using a roulette wheel selection algorithm that picks a cell probabilistically using cell counts, such that the higher the cell count, the lower the probability of selection becomes. MOPSO also implements a mutation operator.
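The cell selection step might be sketched as below. Weighting each occupied cell by the reciprocal of its count is an assumption chosen to satisfy "the higher the cell count, the lower the probability"; the exact weighting used by MOPSO is not given in the text, and the names are hypothetical.

```python
import random

def pick_sparse_cell(cell_counts, rng):
    """Roulette wheel over occupied cells, favouring cells with fewer solutions."""
    cells = list(cell_counts)
    weights = [1.0 / cell_counts[c] for c in cells]  # assumed inverse weighting
    return rng.choices(cells, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {"sparse": 1, "dense": 9}
picks = [pick_sparse_cell(counts, rng) for _ in range(2000)]
```

With these counts the sparse cell is chosen about nine times out of ten, so global best particles from sparsely populated regions of the front guide the swarm more often.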

(iv) ParEGO (Knowles, 2006): ParEGO (Parallel Efficient Global Optimization) has been explicitly designed for problems where evaluating the objective function is highly expensive in terms of computer time. ParEGO is therefore designed to converge in as few function evaluations as possible. The algorithm uses a Gaussian process model to approximate the fitness landscape, which is learned adaptively using supervised learning. For further details the reader is referred to (Knowles, 2006).

(v) FSGA (Koduru, Das, Welch & Roe, 2004): FSGA (Fuzzy Simplex Genetic Algorithm) has a complexity of O(MN^2) per generation, similar to NSGA-II. It differs from NSGA-II and SPEA-2 in the method used for elite preservation. A measure called fuzzy dominance is used for the purpose. A solution that is not dominated by any other is assigned a fuzzy dominance of zero. The poorer a solution is, the higher the fuzzy dominance value it is assigned. FSGA's fuzzy dominance is a numerical method that not only uses Pareto-optimality, but also considers the degree to which one solution dominates another, making effective use of differences between their values of the objective functions. It has been designed specifically so that FSGA can be hybridized readily with a local search algorithm.

PAES (the Pareto Archive Evolutionary Strategy) and PESA (Pareto Envelope based Selection Algorithm) are other successful multi-objective evolutionary algorithms that make use of the hypergrid to measure diversity (Knowles & Corne, 2000; Corne, Jerram, Knowles & Oates, 2001). Another algorithm, RDGA (Rank Density based Genetic Algorithm), uses this method along with a ranking scheme wherein a non-dominated individual has unit rank and others are assigned one plus the sum of the ranks of all solutions that dominate them (Lu & Yen, 2003). Very recently, fuzzy dominance has been successfully applied to another multi-objective PSO algorithm (Koduru, Das & Welch, 2007).



FUTURE TRENDS

Multi-objective evolutionary optimization is a rapidly expanding, new field of research. Although several interesting approaches have been proposed in the recent literature, further investigation is necessary before multi-objective algorithms can truly address the needs of the application domains.

One current research focus is in devising numerical metrics to compare solutions. This is particularly useful when the problem contains a large number of objectives. In higher dimensional objective function space, it is less likely to find a solution that dominates another, i.e. is better than or equal to another in all objectives. Under these circumstances, comparing solutions that are already within the Pareto front is essential. One such method has been suggested recently (Farina & Amato, 2004). This method counts the number of objectives along which one solution is better than and worse than another, and proposes fuzzy metrics based on the counts. However, such ideas have yet to be incorporated within evolutionary algorithms. A related direction of research is in devising schemes to compare solutions in the presence of uncertainty in objective functions. This research has obvious practical implications in engineering and other applications where measuring objectives such as cost, efficiency or expected lifetime is a difficult task (Fieldsend, Everson & Singh, 2005).

Another direction of certain future interest is in multi-objective optimization using novel biological paradigms. Only a few multi-objective PSO algorithms, such as MOPSO and the fuzzy dominance based PSO method (Koduru, Das & Welch, 2007), have been proposed; consequently, there is great interest in devising better PSO search strategies within the evolutionary computation community. Another class of algorithms based on computations involved in the vertebrate immune system is emerging, called Artificial Immune Systems (AIS). Although a few multi-objective AIS algorithms have been proposed recently (cf. Coello Coello & Cortes, 2005), there is substantial scope for improvement in this direction.

Other trends are in devising more difficult benchmark test problems. Huband et al. have proposed recent benchmarks (Huband, Hingston, Barone & While, 2006), and the performance of evolutionary methods for these functions needs to be investigated.
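The objective-wise counting that underlies the approach of (Farina & Amato, 2004) can be sketched as follows. Only the counting step is shown; the fuzzy metrics built on top of the counts are not reproduced here, and the function name is hypothetical.

```python
def better_equal_worse(fx, fy):
    """Count the objectives in which fx is better than, equal to, or worse
    than fy (all objectives minimized)."""
    better = sum(a < b for a, b in zip(fx, fy))
    worse = sum(a > b for a, b in zip(fx, fy))
    equal = len(fx) - better - worse
    return better, equal, worse

# Two mutually non-dominating solutions of a four-objective problem.
comparison = better_equal_worse((1, 5, 3, 2), (2, 4, 3, 1))
```

Neither solution dominates the other here, yet the counts (one objective better, one equal, two worse) still give a way to compare them, which is the situation that becomes common as the number of objectives grows.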



CONCLUSION 

We have provided an overview of the new and expanding field of multi-objective optimization, outlining some of the most significant approaches. We chose to describe NSGA-II and SPEA-2 as they are the most popular algorithms today. We also discussed the recent algorithm ParEGO, which is very promising for some specialized applications, as well as the even more recent FSGA, currently under development, which fills the need for hybrid multi-objective algorithms. We also outlined MOPSO, which is based on a new evolutionary paradigm, PSO. Lastly, we addressed future trends in evolutionary multi-objective optimization to complete the discussion.



REFERENCES 

Clerc, M., (2005). Particle Swarm Optimization. ISTE 
Press, UK. 

Coello Coello, C.A. (2004). Handling multiple objec- 
tives with particle swarm optimization. IEEE Transac- 
tions on Evolutionary Computation. 8(3): 256-279. 

Coello Coello, C.A. & Cortes N.C. (2005). Solving multiobjective optimization problems using an artificial immune system. Genetic Programming and Evolvable Machines. 6(2): 163-190.

Corne, D.W., Jerram, N.R., Knowles, J.D., & Oates, M.J. (2001). PESA-II: Region based selection in evolutionary multiobjective optimization. In Spector et al. (editors), Proceedings of the Genetic and Evolutionary Computation Conference. 283-290.

Deb, K. (2001). Multi-Objective Optimization Using 
Evolutionary Algorithms. Wiley: Chichester, U.K. 

Deb, K., Pratap, A., Agarwal, S., & Meyarivan T. 
(2002). A fast and elitist multi-objective genetic algo- 
rithm: NSGA-II. IEEE Transactions on Evolutionary 
Computation, 6(2): 182-197. 

Farina M. & Amato P. (2004). A fuzzy definition of "optimality" for many-criteria optimization problems. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans. 34(3): 315-326.

Fieldsend, J.E., Everson, R.M. & Singh, S. (2005). 
Multi-objective optimization in the presence of uncer- 
tainty, IEEE Congress on Evolutionary Computation, 
1: 243-250. 

Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.

Huband, S., Hingston, P., Barone, L., & While L. 
(2006). A review of multiobjective test problems and 
a scalable test problem toolkit. IEEE Transactions on 
Evolutionary Computation. 10(5): 477-506. 

Knowles, J.D., & Corne, D.W. (2000). Approximating 
the nondominated front using the Pareto archived evolu- 
tion strategy. Evolutionary Computation. 8: 149-172. 

Knowles, J. (2006). ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation. 10(1): 50-66.

Koduru, P., Das, S., Welch, S.M., & Roe, J. (2004). 
Fuzzy dominance based multi-objective GA-Simplex 
hybrid algorithms applied to gene network models. 
Proceedings of the Genetic and Evolutionary Computing 
Conference, Seattle, Washington, Kalyanmoy Deb et 
al. (editors), Springer-Verlag, LNCS 3102: 356-367. 

Koduru, P., Das, S., & Welch, S.M. (2007). 
Multi-objective and hybrid PSO using ε-fuzzy dominance. 
Proceedings of the ACM Genetic and Evolutionary 
Computing Conference, London, UK. (Eds. Dirk 
Thierens et al.): 853-860. 

Lu, H., & Yen, G.G. (2003). Rank-density-based 
multiobjective genetic algorithm and benchmark test 
function study. IEEE Transactions on Evolutionary 
Computation, 7(4): 325-343. 

Mitchell, M. (1998). An Introduction to Genetic Algo- 
rithms. MIT Press. 

Zitzler, E., Laumanns, M., & Thiele, L. (2002). SPEA-2: 
Improving the strength Pareto approach. Proceedings 
of EUROGEN 2001, Evolutionary Methods for Design, 
Optimization, and Control with Applications to 
Industrial Problems, K. Giannakoglou, D. Tsahalis, J. 
Periaux, P. Papailou, and T. Fogarty (editors), Athens, 
Greece: 95-100. 



KEY TERMS 

Elitism: A strategy in evolutionary algorithms 
where the best one or more solutions, called the elites, 
in each generation, are inserted into the next, without 
undergoing any change. This strategy usually speeds 
up the convergence of the algorithm. In a multi-objec- 
tive framework, any non-dominated solution can be 
considered to be an elite. 

Evolutionary Algorithm: A class of probabilistic 
algorithms that are based upon biological metaphors 
such as Darwinian evolution, and widely used in op- 
timization. 

Fitness: A measure that is used to determine the 
goodness of a solution for an optimization problem. 

Fitness Landscape: A representation of the search 
space of an optimization problem that brings out the 
differences in the fitness of the solutions, such that those 
with good fitness are "higher". Optimal solutions are 
the maxima of the fitness landscape. 

Generation: A term used in evolutionary algorithms 
that roughly corresponds to each iteration of the 
outermost loop. The offspring obtained in one generation 
become the parents of the next. 

Multi-Objective Optimization: An optimization 
problem involving more than a single objective function. 
In such a setting, it is not easy to discriminate between 
good and bad solutions, as a solution that is better 
than another in one objective may be poorer in another. 
Without any loss of generality, any optimization 
problem can be cast as one involving minimizations only. 

Objective Function: The function that is to be op- 
timized. In a minimization problem, the fitness varies 
inversely as the objective function. 

Population Based Algorithm: An algorithm that 
maintains an entire set of candidate solutions, each 
solution corresponding to a unique point in the search 
space of the problem. 

Search Space: Set of all possible solutions for any 
given optimization problem. Almost always, a 
neighborhood around each solution can also be defined in 
the search space. 



INTRODUCTION 

Traditionally, applying a neural network (Haykin, 1999) to a 
problem has required several steps before the desired network 
is obtained: data preprocessing, model selection, topology 
optimization and, finally, training. Each of these tasks usually 
demands a large amount of computational time and human 
interaction, particularly topology optimization and network 
training. Many proposals have tried to reduce this effort and to 
provide experts with a robust methodology. For example, Giles 
et al. (1995) provide a constructive method that iteratively 
optimizes the topology of a recurrent network. Other methods 
reduce the complexity of the network structure by removing 
unnecessary nodes and connections, as in (Morse, 1994). In 
recent years, evolutionary algorithms have proved to be 
promising tools for this problem, and many competitive 
approaches exist in the literature. For example, Blanco et al. 
(2001) proposed a master-slave genetic algorithm to train the 
network (master algorithm) and to optimize its size (slave 
algorithm). For a general view of the problem and of the use of 
evolutionary algorithms for neural network training and 
optimization, we refer the reader to (Yao, 1999). 

Although the literature on genetic algorithms and neural 
networks is very extensive, we would like to highlight the recent 
popularity of multi-objective optimization (Coello et al., 2002; 
Jin, 2006), especially for the simultaneous training and topology 
optimization of neural networks. These methods have performed 
well on this task in previous works, although most of them are 
proposed for feedforward models. They attempt to optimize the 
structure of the network (number of connections, hidden units or 
layers) while training the network at the same time. 
Multi-objective algorithms may provide important advantages in 
the simultaneous training and optimization of neural networks: 
they may force the search to return a set of optimal networks 
instead of a single one; they can speed up the optimization 
process; they may be preferred to a weight-aggregation 
procedure for handling the regularization problem in neural 
networks; and they are more suitable when the designer wants to 
combine different error measures for the training. A recent 
review of these techniques may be found in (Jin, 2006). 



BACKGROUND 

Multi-objective algorithms have become popular in recent years 
for the simultaneous training and topology optimization of 
neural networks, because of the innovations they bring to the 
problem. Certain authors have addressed it through the evolution 
of single ensembles, as for example with DIVACE-II (Chandra et 
al., 2006), which also implements different levels of 
coevolution. In other works, the networks are fully evolved and 
the evolutionary operators are designed to deal with both 
training and structure optimization. Regarding structure 
optimization, some authors reduce either the number of network 
neurons or the number of network connections. In the first group 
of methods (Abbass et al., 2001; Delgado et al., 2005; Gonzalez 
et al., 2003), the optimization is easier, since the codification 
of a network has fewer degrees of freedom than in the second 
group; their disadvantage is that the networks obtained are fully 
connected. The methods in the second group (Jin et al., 2004; 
Cuellar et al., 2007) attempt to reduce the number of 
connections, but they do not guarantee that the number of network 
nodes is also minimal. Nevertheless, experimental results have 
shown that the networks obtained with these proposals are small 
(Jin et al., 2004). 

The hybridization of multi-objective evolutionary algorithms 
with traditional gradient-based training algorithms has also 
provided promising results. While the evolutionary algorithm 
explores the solution space widely, the gradient-based algorithm 
directs the search towards promising areas during the evolution 
and exploits the solutions found. This hybridization is usually 
carried out by including the gradient-based training method as a 
local search operator in the evolutionary process, applied after 
the mutation and before the evaluation of the solutions. Some 
examples are the MPANN system developed by H.A. Abbass (2001) 
and the works by Y. Jin et al. (2006). 

In the next section, we study different aspects of the 
multi-objective optimization of neural networks. Specifically, 
we study the objectives to be achieved and the multi-objective 
algorithms used. We focus our analysis on recurrent neural 
networks (Haykin, 1999; Mandic and Chambers, 2001), since these 
models have a high complexity due to the recurrence. The 
experiments are illustrated with time-series prediction 
problems, since this type of problem has applications in many 
research and business areas and the neural models used are 
suitable for it, as suggested by previous works (Aussem, 1999). 



MULTI-OBJECTIVE EVOLUTIONARY 
ALGORITHMS FOR NEURAL 
NETWORKS TRAINING AND 
OPTIMIZATION 

The most recent multi-objective evolutionary algorithms are 
based on the concept of Pareto dominance as a criterion to 
determine whether a solution is optimal or not. Let 
$F(s) = (f_1(s), f_2(s), \ldots, f_k(s))$ be a set of k objectives 
to be achieved, and let $s_1$ and $s_2$ be two solutions. In a 
minimization problem, $s_2$ is said to be dominated by $s_1$ if, 
and only if: 

$$(\forall i : f_i(s_1) \le f_i(s_2),\ 1 \le i \le k) \land (\exists j : f_j(s_1) < f_j(s_2),\ 1 \le j \le k) \qquad (1)$$

The solutions that are not dominated by any other solution form 
the non-dominated set or Pareto frontier. The goal of any 
multi-objective algorithm is to find the solutions in the Pareto 
frontier. Thus, the selection of the objectives is a key aspect, 
since they guide the search across the search space towards the 
optimal solutions. However, the higher the number of objectives, 
the more complex the search space. In this work, we attempt to 
train and optimize the size of an Elman network (Mandic and 
Chambers, 2001) for time series prediction problems. This 
network type has an input layer, an output layer and a hidden 
layer. The values of the time series are presented over time at 
the network inputs, and the network output is expected to 
provide the future values of the series. The recurrent 
connections are in the hidden layer, so that the output of a 
hidden neuron at time t is also an input for all the hidden 
neurons at time t+1. The reader may find further information 
about dynamical recurrent neural networks applied to time series 
prediction in (Aussem, 1999; Mandic and Chambers, 2001). 
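The dominance test and the extraction of the non-dominated set can be sketched in a few lines of Python. This is only an illustrative sketch for minimization problems: the function names and the sample objective vectors are hypothetical, and the quadratic-time filter is not the sorting procedure used by NSGA2 or SPEA2.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a dominates b (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, hidden neurons, connections) vectors.
candidates = [(0.042, 3, 6), (0.041, 5, 17), (0.050, 4, 12), (0.028, 11, 43)]
front = pareto_front(candidates)
```

Here the third candidate is worse than the first one in every objective, so it is dropped; the other three trade error against size and all remain in the front.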




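$$f_1(s^*) = \min\{f_1(s)\} = \min\left\{\frac{1}{T}\sum_{t=1}^{T}\left(Y(t) - O(t)\right)^2\right\} \qquad (2)$$

$$f_2(s^*) = \min\{f_2(s)\} = \min\{h(s)\} \qquad (3)$$

$$f_3(s^*) = \min\{f_3(s)\} = \min\{n(s)\} \qquad (4)$$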



For the problem of neural network optimization and training, we 
consider three objectives (see equations (2)-(4)). The objective 
$f_1(s)$ minimizes the network error, while $f_2(s)$ is used to 
optimize the number of hidden neurons and $f_3(s)$ the number of 
network connections. In equation (2), T is the number of training 
patterns, Y(t) is the desired output for pattern t and O(t) is 
the network output. In equation (3), h(s) is the number of hidden 
neurons of the network s; in equation (4), n(s) is its number of 
network connections. Another issue related to the objectives is 
the network codification. For example, in works like (Abbass, 
2001), the objectives are ($f_1(s)$, $f_2(s)$), and fully 
connected networks are obtained. In these cases, the 
representation codifies the neurons in a binary vector and the 
network weights in a matrix with real values. In other works, 
like (Jin et al., 2006), the network connections are codified in 
a binary matrix and the network weights in a matrix with real 
values, since the objectives to be optimized are ($f_1(s)$, 
$f_3(s)$). If all the objectives are to be optimized, the 
representation should contain the network structure (hidden 
neurons and connections) and the weights. In our proposal, the 
number of hidden neurons is codified with an integer value 
following the guidelines in (Delgado et al., 2005; Cuellar et 
al., 2007), the connections are codified in a binary vector, and 
the network weights in a vector with real values. Figure 1 shows 
an example of the codification of an Elman network into a 
solution, where $V_{ij}$ are the network weights from input j to 
hidden neuron i, $U_{ir}$ is the recurrent weight from neuron r 
to neuron i and $W_{oi}$ are the weights from hidden neuron i to 
the output neuron o. A network connection is active if the 
corresponding gene is set to 1; otherwise, the connection is not 
active. 

Figure 1. Example for the codification of an Elman 
network with 1 input, 1 output, 2 hidden neurons and 
5 connections 
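To make the codification concrete, the following Python sketch decodes a solution (integer number of hidden neurons, binary connection vector, real weight vector) for a network with one input and one output, and evaluates the three objectives. The gene ordering, the tanh hidden activation, the linear output and the zero initial state are assumptions made for illustration only; the text does not specify them.

```python
import math
from typing import Sequence, Tuple


def objectives(h: int, mask: Sequence[int], weights: Sequence[float],
               series: Sequence[float]) -> Tuple[float, int, int]:
    """Return (f1, f2, f3) for a 1-input, 1-output Elman network.

    Assumed gene layout: V (input -> hidden, h genes), then U
    (hidden -> hidden, h*h genes, U[i][r] at offset i*h + r), then
    W (hidden -> output, h genes).
    """
    assert len(mask) == len(weights) == h + h * h + h
    # An inactive connection (mask gene 0) contributes a zero weight.
    w = [wt if m else 0.0 for wt, m in zip(weights, mask)]
    V, U, W = w[:h], w[h:h + h * h], w[h + h * h:]

    state = [0.0] * h              # hidden outputs at the previous time step
    sq_err = 0.0
    for t in range(len(series) - 1):
        x, target = series[t], series[t + 1]
        # Each hidden neuron sees the input and all previous hidden outputs.
        state = [
            math.tanh(V[i] * x + sum(U[i * h + r] * state[r] for r in range(h)))
            for i in range(h)
        ]
        output = sum(W[i] * state[i] for i in range(h))
        sq_err += (target - output) ** 2
    f1 = sq_err / (len(series) - 1)    # mean square error, equation (2)
    return f1, h, sum(mask)            # f2 = h(s), f3 = n(s)


# A network with 2 hidden neurons and 5 active connections, as in Figure 1
# (the weight values and the series are made up).
f1, f2, f3 = objectives(
    2,
    [1, 1, 1, 0, 0, 1, 1, 0],
    [0.5, -1.0, 0.2, 0.0, 0.0, -0.3, 1.5, 0.7],
    [0.1, 0.2, 0.3, 0.4],
)
```

Because inactive genes keep their weight value but are masked out, a connection that is switched back on by a later mutation recovers its previous weight; this is a design choice of the sketch, not something stated in the text.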

Evolutionary operators such as crossover and mutation should 
consider two different areas in a solution: structural 
recombination/mutation and genetic recombination/mutation. The 
genetic one is associated with the area of the network weights, 
while the structural one affects the network topology. 
Additionally, a local search operator could be included to 
improve the network performance locally in the area that 
codifies the network weights, as suggested in the previous 
section. In our experiments, we have used a simple recombination 
that generates two children from two parents, with no structural 
recombination: our tests indicated that structural recombination 
could provide a high exploitation of the solution space, and the 
selective pressure produced by the objectives could then lead to 
premature convergence. For the mutation, on the other hand, we 
have included three probabilities, since the structural mutation 
may have a high impact on the population of solutions: structural 
mutation is selected with probability $p_1$, and genetic mutation 
with probability $1 - p_1$. In the structural mutation, the 
number of hidden neurons is altered with probability $p_2$; 
otherwise, the active/inactive connections are mutated. Finally, 
a gene is altered with probability $p_3$. Figure 2 shows an 
example of the crossover, and Figure 3 shows an example of the 
structural mutation for the number of hidden neurons. In this 
last case, the solution grows to have three hidden neurons, and 
new genes are generated. The values of these genes are random 
numbers within the bounds of the gene. In Figure 3, these genes 
are marked with the symbol ?. 

Figure 2. Example of the crossover 

Figure 3. Example of mutation 
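One possible reading of the three-probability mutation scheme is sketched below in Python. Several points are assumptions rather than the authors' method: genes are stored in dictionaries keyed by connection so that a solution can grow or shrink, $p_3$ is interpreted as a per-gene probability, and an altered weight is simply redrawn uniformly within its bounds (the experiments mention a displacement mutation, which is not reproduced here).

```python
import random
from typing import Dict, List, Tuple

W_MIN, W_MAX = -5.0, 5.0   # weight range used in the experiments
H_MIN, H_MAX = 3, 12       # bounds on the number of hidden neurons


def connection_keys(h: int) -> List[Tuple]:
    """Connections of a 1-input, 1-output Elman network with h hidden neurons."""
    keys: List[Tuple] = [("V", i) for i in range(h)]
    keys += [("U", i, r) for i in range(h) for r in range(h)]
    keys += [("W", i) for i in range(h)]
    return keys


def random_solution(h: int, rng: random.Random) -> Dict:
    keys = connection_keys(h)
    return {"h": h,
            "mask": {k: rng.randint(0, 1) for k in keys},
            "weights": {k: rng.uniform(W_MIN, W_MAX) for k in keys}}


def mutate(sol: Dict, p1: float, p2: float, p3: float, rng: random.Random) -> Dict:
    h = sol["h"]
    mask, weights = dict(sol["mask"]), dict(sol["weights"])
    if rng.random() < p1:                      # structural mutation
        if rng.random() < p2:                  # alter the number of hidden neurons
            h = rng.randint(H_MIN, H_MAX)
            keys = connection_keys(h)
            # Surviving genes are kept; new genes get random values in bounds.
            mask = {k: mask.get(k, rng.randint(0, 1)) for k in keys}
            weights = {k: weights.get(k, rng.uniform(W_MIN, W_MAX)) for k in keys}
        else:                                  # flip active/inactive connections
            mask = {k: (1 - m if rng.random() < p3 else m) for k, m in mask.items()}
    else:                                      # genetic mutation on the weights
        weights = {k: (rng.uniform(W_MIN, W_MAX) if rng.random() < p3 else w)
                   for k, w in weights.items()}
    return {"h": h, "mask": mask, "weights": weights}


rng = random.Random(0)
solution = random_solution(3, rng)
for _ in range(200):
    solution = mutate(solution, 0.5, 0.5, 0.1, rng)
```

The loop applies the operator repeatedly with the parameter values (0.5, 0.5, 0.1) used in the experiments; after every call the solution still contains exactly the genes that its current number of hidden neurons requires.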

In the experiments, we study the benefits of including a higher 
number of objectives in the multi-objective algorithm, and the 
effects of the evolutionary algorithm. To illustrate our 
results, we have selected two socio-economic time series for 
forecasting: the evolution of the U.S. population from 1950 to 
2004, taken monthly (USPop), and the evolution of the euro/U.S. 
dollar variation from 1995 to 2004, taken monthly (EurDol). 80% 
of the data are used for training, and the remaining 20% for 
testing. Both time series may be downloaded for free from 
http://www.economagic.com. The number of hidden neurons of the 
networks in our experiments is bounded between 3 and 12. The 
networks have one input for the value of the time series at time 
t, and one output for the value of the time series at time t+1, 
to be predicted. We have carried out 30 experiments with the 
multi-objective algorithms, which are based on the algorithms 
NSGA2 (Deb et al., 2002) and SPEA2 (Zitzler et al., 2001). We 
label as NSGA2 and SPEA2 the algorithms that optimize the 
objectives ($f_1(s)$, $f_2(s)$), and as NSGA2.connect and 
SPEA2.connect the algorithms that optimize ($f_1(s)$, $f_2(s)$, 
$f_3(s)$). The stopping criterion is 10000 evaluated solutions, 
and the size of the population is 50. The mutation parameters are 
($p_1$, $p_2$, $p_3$) = (0.5, 0.5, 0.1), and the range for the 
genes containing network weights is [-5.0, 5.0]. We have used 
binary tournament selection, Wright's heuristic crossover and 
displacement mutation as evolutionary operators. 

Figure 4 shows the distribution of the performance of the neural 
networks obtained in the Pareto frontiers over the 30 
experiments, for each data set. Additionally, Table 1 shows the 
best Pareto frontiers obtained, where column 1 gives the 
algorithm, columns 2 and 5 the number of hidden neurons, columns 
3 and 6 the number of network connections, and columns 4 and 7 
the Mean Square Error (MSE) in the training. 

Figure 4. Boxplots for the distribution of network performance in USPop (a), and EurDol (b) 

Table 1. Best Pareto frontiers obtained in the data sets 

                 USPop                                EurDol
Algorithm        Hidden units  Connections  MSE      Hidden units  Connections  MSE
SPEA2.connect    3             6            0.042    3             11           0.013
                 5             17           0.041    4             7            0.012
                 11            43           0.028    5             16           0.009
                 [10, 44, 0.050] (*)
SPEA2            4             24           0.021    4             24           0.016
NSGA2.connect    3             10           0.092    3             4            0.013
                 [5, 17, 0.012] (*)
NSGA2            7             63           0.007    4             24           0.003

(*) Row with a single group of values (hidden units, connections, 
MSE); the data set it belongs to cannot be recovered from the 
source layout. 

We may observe that SPEA2.connect has obtained a wider Pareto 
frontier than NSGA2.connect in both problems. In some situations 
this may be desirable, since we are provided with a larger set of 
optimal networks from which we can select the most appropriate 
network to solve our problem. On the other hand, Figures 4.a and 
4.b suggest that including the optimization of the network 
connections in the multi-objective method may produce poorer 
results. In both problems, the best solutions are provided by the 
algorithm NSGA2, which returns fully connected networks. The 
same algorithm extended to optimize the number of connections, 
NSGA2.connect, provides networks with minimum size, but their 
performance is lower. Regarding the algorithms based on SPEA2, 
SPEA2.connect is the least robust algorithm, since its MSE 
distribution is the widest. However, the boxplot shows that the 
best solutions of this method may be similar to the ones from 
NSGA2. This suggests that we may find smaller networks using the 
three objectives in SPEA2.connect, at the cost of sacrificing 
some network performance and spending more computational time 
to obtain a suitable solution. Moreover, the networks obtained 
when objective $f_3(s)$ is included in the optimization process 
are very small compared with the fully connected networks from 
SPEA2 and NSGA2. 
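The size reduction can be checked with a small calculation. The Python sketch below counts the connections of a fully connected Elman network under the assumption that only input-to-hidden, hidden-to-hidden and hidden-to-output connections are counted (no bias terms); this assumption is not stated in the text, but it reproduces the values reported in Table 1 for the fully connected networks.

```python
def full_connections(hidden: int, n_inputs: int = 1, n_outputs: int = 1) -> int:
    """Connections of a fully connected Elman network (biases not counted)."""
    return hidden * n_inputs + hidden * hidden + hidden * n_outputs


# (hidden units, connections) reported for SPEA2 and NSGA2 in Table 1.
reported_full = [(4, 24), (7, 63)]
matches = [full_connections(h) == n for h, n in reported_full]

# Fraction of connections kept by the smallest SPEA2.connect network
# on USPop (3 hidden units, 6 connections).
kept = 6 / full_connections(3)
```

With 3 hidden units a fully connected network would have 15 connections, so the 6-connection network found by SPEA2.connect keeps 40% of them.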



FUTURE TRENDS 

We have seen in the previous section that including a larger 
number of objectives for the network optimization can reduce 
the size of the network, although the network performance 
obtained is poorer. This is to be expected, since the more 
objectives there are to optimize, the more complex the search 
space is and, therefore, the harder it is to find optimal 
solutions. The hybridization of multi-objective evolutionary 
algorithms with non-linear programming methods, which direct the 
search towards promising areas, has proved to work well in works 
that propose a lower number of objectives in the optimization. 
In the case studied in this work, the improvement provided by 
these procedures could be greater, since the search space is 
larger. Another important issue is the study of the evolution in 
terms of diversity and convergence: the objectives used usually 
introduce a high selective pressure in the population, especially 
the objectives for the topology optimization. This could be 
addressed by introducing components in the evolutionary process 
that control the balance between diversity and convergence, 
thereby improving the search process and the 
exploration/exploitation of the solution space. 

Another interesting line of work is the inclusion of objectives 
that improve other properties of neural networks, such as noise 
tolerance or generalization. For example, this has been 
suggested in (Graning et al., 2006), where an extra objective is 
introduced to improve the generalization of a feedforward 
network in binary classification. 



CONCLUSION 

In this work, we have studied the benefits and disadvantages of 
multi-objective training and full topology optimization of 
recurrent neural networks. We have tested the methods on time 
series prediction problems, and we have compared them with 
methods that do not optimize the number of connections. In 
general terms, all the algorithms have solved the problems 
suitably. The methods studied provide networks with a minimum 
number of hidden units and connections, and the network 
performance is good. However, these methods may produce poorer 
results than those that only optimize the number of hidden 
neurons and provide fully connected networks. With a higher 
computational time, the algorithms that optimize the topology in 
terms of both hidden neurons and connections may be competitive, 
providing networks with performance similar to those from 
techniques that do not optimize the number of network 
connections. Moreover, these methods have the advantage that the 
networks obtained are very small compared with fully connected 
networks. 



KEY TERMS 

Dynamical Recurrent Neural Networks: Artificial neural networks 
that include recurrent connections in the network structure. 
They are capable of processing patterns of undetermined size 
and/or indexed in time. The output of these networks at time t+1 
is computed using the network inputs at time t and the network 
state, provided by the recurrent connections. 

Ensembles: Self-contained area of a neural network (a neuron, a 
connection, a set of neurons with connections...) that, combined 
with other ensembles, is able to build a neural network that 
solves a problem. 

Evolutionary Algorithm: Optimization algorithm based on 
Darwinian natural evolution. 

Multi-Objective Optimization: Optimization of a problem that 
involves the satisfaction or optimization of two or more 
objectives, sometimes opposed to each other. 

Regularization: Optimization of both complexity and performance 
of a neural network following a linear aggregation or a 
multi-objective algorithm. 

Time-Series: Data sequence indexed in time. 

Time-Series Prediction: Problem that involves the prediction of 
the future values of a time series, considering a few values 
from the data set in the past. 



INTRODUCTION 

In a companion article of this Encyclopaedia: 'Nar- 
rative' Information, the Problem, we have introduced 
the problem of finding a complete and computationally 
efficient system for representing and managing 'nonfic- 
tional narrative information'. We have stressed there 
the important economic value of this multimedia type 
of information - that concerns, e.g., corporate memory 
documents, news stories, normative and legal texts, 
medical records, intelligence messages, surveillance 
videos or visitor logs, actuality photos, eLearning and 
Cultural Heritage material, etc. We have also empha- 
sised that the usual Computer Science tools - including 
those pertaining to the now very popular 'Semantic 
Web' domain, see (Bechhofer et al., 2004, Beckett, 
2004) - are not really suitable for dealing with this 
type of information. 



BACKGROUND 

In this article, we will present an Artificial Intelligence 
tool, NKRL (Narrative Knowledge Representation Lan- 
guage) that has been especially developed for dealing 
in an 'intelligent' way with the nonfictional narrative 
information. NKRL is, at the same time: 

• a knowledge representation system for describing in the best 
  possible detail the essential content (the 'meaning') of 
  complex nonfictional 'narratives'; 

• a system of reasoning (inference) procedures that, thanks to 
  the richness of the representation system, is able to 
  automatically establish 'interesting' relationships among the 
  represented data; 

• an implemented software environment that allows the user to 
  encode the original narratives in terms of the representation 
  language to create 'NKRL knowledge bases' in a specific 
  application domain and to exploit 'intelligently' these bases. 



The main innovation introduced by NKRL with respect to the usual 
ontological paradigms is the addition, to the traditional 
ontology of concepts (called HClass, 'hierarchy of classes', in 
NKRL's jargon), of an ontology of events, i.e., a new sort of 
hierarchical organization where the nodes correspond to n-ary 
structures called 'templates' (HTemp, 'hierarchy of templates'). 
A partial image of the 'upper level' of HClass, which follows the 
standard Protégé approach, see (Noy et al., 2000), is given in 
Figure 1; for HTemp, see Table 1 and Figure 2 below. 



A SHORT DESCRIPTION OF NKRL 

Instead of using the traditional (binary) attribute/value 
organization, the templates are generated from the n-ary 
combination of quadruples connecting together the symbolic name 
of the template, a predicate, and the arguments of the predicate 
introduced by named relations, the roles. The quadruples have in 
common the name and predicate components. Denoting with $L_i$ 
the generic symbolic label identifying a given template, with 
$P_j$ the predicate used in the template, with $R_k$ the generic 
role and with $a_k$ the corresponding argument, the core data 
structure for templates has the following general format (see 
also the companion article, 'Narrative' Information, the 
Problem): 

$$(L_i\ (P_j\ (R_1\ a_1)\ (R_2\ a_2)\ \ldots\ (R_n\ a_n))) \qquad (1)$$

Figure 1. A partial representation of the 'upper level' of 
HClass, the NKRL 'traditional' ontology of concepts. 



Predicates pertain to the set {BEHAVE, EXIST, 
EXPERIENCE, MOVE, OWN, PRODUCE, RECEIVE}, 
and roles to the set {SUBJ(ect), OBJ(ect), SOURCE, 
BEN(e)F(iciary), MODAL(ity), TOPIC, CONTEXT}. 
An argument of the predicate can consist of a simple 
'concept' or of a structured association ('expansion') 
of several concepts. Templates can be conceived as the 
formal representation of generic classes of elementary 
events like "move a physical object", "be present in a 
place", "produce a service", "send/receive a message", 
etc. When a particular event pertaining to one of these 
general classes must be represented, the correspond- 
ing template is instantiated to produce a predicative 
occurrence. 

To represent a simple narrative like "On November 20, 1999, in an 
unspecified village, an armed group of people has kidnapped 
Robustiniano Hablo", we must first select in the HTemp hierarchy 
the template corresponding to "execution of violent actions", 
see Figure 2 and Table 1 below. This example refers to a recent 
application of NKRL in a 'terrorism' context in the framework of 
a European project, see, e.g., (Zarri, 2005). 

As appears from Table 1a, the arguments of the predicate (the 
$a_k$ terms in (1)) are represented by variables with associated 
constraints expressed as HClass concepts or combinations of 
concepts. When deriving a predicative occurrence (an instance of 
a template) like mod3.c5 in Table 1b, the role fillers in this 
occurrence must conform to the constraints of the 
father-template. For example, ROBUSTINIANO_HABLO (the 
'BEN(e)F(iciary)' of the action of kidnapping) and 
INDIVIDUAL_PERSON_20 (the unknown 'SUBJECT', actor, initiator 
etc. of this action) are both 'individuals', instances of the 
HClass concept individual_person. The constituents included in 
square brackets, such as SOURCE in Table 1a, are optional. A 
'conceptual label' like mod3.c5 is the symbolic name used to 
identify the NKRL code corresponding to a specific predicative 
occurrence. 
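The relationship between a template, its constraints and a predicative occurrence can be illustrated with a small Python sketch. It is not the NKRL software: the class names, the `instantiate` method and the toy fragment of HClass (in particular the parent links) are hypothetical, and structured fillers (SPECIF lists), location attributes and dates are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Toy fragment of HClass: concept or individual -> parent (hypothetical links).
HCLASS: Dict[str, str] = {
    "individual_person": "human_being_or_social_body",
    "INDIVIDUAL_PERSON_20": "individual_person",
    "ROBUSTINIANO_HABLO": "individual_person",
    "kidnapping_": "violence_",
}


def is_a(term: Optional[str], concept: str) -> bool:
    """True if term is the concept itself or one of its descendants."""
    while term is not None:
        if term == concept:
            return True
        term = HCLASS.get(term)
    return False


@dataclass
class Occurrence:
    label: str                 # conceptual label, e.g. "mod3.c5"
    predicate: str
    fillers: Dict[str, str]    # role -> filler
    template: str


@dataclass
class Template:
    name: str
    predicate: str
    constraints: Dict[str, str]                 # role -> required HClass concept
    optional: Set[str] = field(default_factory=set)

    def instantiate(self, label: str, fillers: Dict[str, str]) -> Occurrence:
        for role, concept in self.constraints.items():
            if role not in fillers:
                if role in self.optional:
                    continue                    # bracketed roles may be omitted
                raise ValueError(f"missing mandatory role {role}")
            if not is_a(fillers[role], concept):
                raise ValueError(f"{fillers[role]} does not satisfy <{concept}>")
        return Occurrence(label, self.predicate, dict(fillers), self.name)


violence = Template(
    name="Produce:Violence",
    predicate="PRODUCE",
    constraints={"SUBJ": "human_being_or_social_body", "OBJ": "violence_",
                 "SOURCE": "human_being_or_social_body",
                 "BENF": "human_being_or_social_body"},
    optional={"SOURCE"},
)
occurrence = violence.instantiate(
    "mod3.c5",
    {"SUBJ": "INDIVIDUAL_PERSON_20", "OBJ": "kidnapping_",
     "BENF": "ROBUSTINIANO_HABLO"},
)
```

A filler that violates a constraint, for example a person used as OBJ, makes `instantiate` raise an error instead of producing an occurrence.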

The 'attributive operator', SPECIF(ication), is one of the four 
operators used in NKRL for the construction of 'structured 
arguments' ('complex fillers' or 'expansions'), see, e.g., 
(Zarri, 2003). The SPECIF lists, with syntax 
(SPECIF $e_i$ $p_1$ ... $p_n$), are used to represent the 
properties or attributes that can be asserted about the first 
element $e_i$, concept or individual, of the list. For example, 
in the SUBJ filler of mod3.c5, Table 1b, the attributes 
weapon_wearing and (SPECIF cardinality_ several_) are both 
associated with INDIVIDUAL_PERSON_20. 

The 'location attributes', represented in the predicative 
occurrences as lists, are linked with the arguments of the 
predicate by using the colon operator, ':', see the individual 
VILLAGE_1 in Table 1b. In the occurrences, the two operators 
date-1 and date-2 materialize the temporal interval normally 
associated with narrative events, see (Zarri, 1998) and, more 
generally, (Allen, 1981; Ferro et al., 2005). 

150 templates are permanently inserted into HTemp; Figure 2 
reproduces the 'external' organization of the PRODUCE branch of 
HTemp. This branch includes the Produce:Violence template used 
in Table 1. HTemp thus corresponds to a sort of 'catalogue' of 
narrative formal structures that are very easy to 'customize' to 
derive the new templates that could be needed for a particular 
application. 

What expounded until now illustrates the NKRL 
solutions to the problem of representing 'elementary' 
(simple) events. To deal now with those 'connectivity 
phenomena' that arise when several elementary events 
are connected through causality, goal, indirect speech 
etc. links - see also (Mani and Pustejovsky, 2004) - the 
basic NKRL knowledge representation tools have been 
complemented by more complex mechanisms that make 
use of second order structures, see (Zarri, 2003). For 
example, the binding occurrences consist of lists of 
symbolic labels (c.) of predicative occurrences; the lists 
are differentiated using specific binding operators like 
GOAL, CONDITION and CAUSE. Let us suppose that, 
in Table 1, we state now that: "...an armed group of 
people has kidnapped Robustiniano Hablo in order to 
ask his family for a ransom", where the new elementary 
event: "the unknown individuals will ask for a ransom" 
corresponds to a new predicative occurrence, e.g., mod3. 
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Table 1. Building up and querying predicative occurrences 




a) 

name: Produce:Violence 
father: Produce:PerformTask/Activity 
position: 6.35 
NL description: 'Execution of Violent Actions on the Filler of the BEN(e)F(iciary) Role' 

PRODUCE   SUBJ      var1: [(var2)] 
          OBJ       var3 
          [SOURCE   var4: [(var5)]] 
          BENF      var6: [(var7)] 
          [MODAL    var8] 
          [TOPIC    var9] 
          [CONTEXT  var10] 
          {[modulators], ≠abs} 

var1 = <human_being_or_social_body> 
var3 = <violence_> 
var4 = <human_being_or_social_body> 
var6 = <human_being_or_social_body> 
var8 = <criminality/violence_related_tool> | <machine_tool> | <violence_> | <weapon_> 
var9 = <h_class> 
var10 = <situation_> | <spatio/temporal_relationship> | <symbolic_label> 
var2, var5, var7 = <geographical_location> 

b) 

mod3.c5)  PRODUCE   SUBJ     (SPECIF INDIVIDUAL_PERSON_20 weapon_wearing 
                             (SPECIF cardinality_ several_)): (VILLAGE_1) 
                    OBJ      kidnapping_ 
                    BENF     ROBUSTINIANO_HABLO 
                    CONTEXT  #mod3.c6 
                    date-1:  20/11/1999 
                    date-2: 

Produce:Violence (6.35) 

On November 20, 1999, in an unspecified village (VILLAGE_1), an armed group of people kidnapped 
Robustiniano Hablo. 

c) 

PRODUCE 
    SUBJ : human_being : 
    OBJ  : violence_ 
    BENF : human_being : 
    {} 
    date1 : 1/1/1999 
    date2 : 31/12/1999 

Is there any information in the system concerning violent activities during 1999? 
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Figure 2A. Partial representation of the PRODUCE branch ofHTemp, the 'ontology of events' 



[Figure not reproduced: tree view 'HTEMP - Hierarchy of Templates', with visible nodes Produce: (including Produce:PositiveCondition/Result and Produce:Increment/Decrement), Exist:, Receive:, Behave:] 


c7. To represent this situation, we must add to the 
occurrences that represent the two elementary events a 
new binding occurrence, e.g., mod3.c8, to link together 
the conceptual labels mod3.c5 (corresponding to the 
kidnapping occurrence, see also Table 1b) and mod3.c7 
(corresponding to the new occurrence describing 
the intended result). mod3.c8 will then have the form: 
"mod3.c8) (GOAL mod3.c5 mod3.c7)". The meaning of 
mod3.c8 can be paraphrased as: "the activity described 
in mod3.c5 is directed towards (GOAL) the realization 
of mod3.c7". 
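
As a rough illustration of how a binding occurrence can be stored as a second order structure, the following Python sketch builds mod3.c8 as a list of labels headed by a binding operator. The class and function names are hypothetical and are not part of the NKRL software; only the labels, the GOAL/CONDITION/CAUSE operators and the printed form come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Only the binding operators explicitly named in the text are listed here.
BINDING_OPERATORS = {"GOAL", "CONDITION", "CAUSE"}


@dataclass(frozen=True)
class BindingOccurrence:
    """Hypothetical container: an operator applied to labels of other occurrences."""
    label: str                   # symbolic label of the binding occurrence itself
    operator: str                # GOAL, CONDITION or CAUSE
    arguments: Tuple[str, ...]   # labels of the occurrences being linked

    def __post_init__(self) -> None:
        if self.operator not in BINDING_OPERATORS:
            raise ValueError(f"unknown binding operator: {self.operator}")

    def render(self) -> str:
        """Return the occurrence in the textual form used in the article."""
        return f"{self.label}) ({self.operator} {' '.join(self.arguments)})"


def unresolved_labels(binding: BindingOccurrence,
                      occurrences: Dict[str, str]) -> List[str]:
    """List the argument labels that do not correspond to a stored occurrence."""
    return [arg for arg in binding.arguments if arg not in occurrences]


# Informal glosses stand in for the full predicative occurrences.
occurrences = {
    "mod3.c5": "an armed group of people has kidnapped Robustiniano Hablo",
    "mod3.c7": "the unknown individuals will ask for a ransom",
}

mod3_c8 = BindingOccurrence("mod3.c8", "GOAL", ("mod3.c5", "mod3.c7"))
print(mod3_c8.render())
print(unresolved_labels(mod3_c8, occurrences))
```

The binding occurrence stores only labels, never the events themselves, which is what makes it a second order structure: the elementary events stay unchanged and can take part in several bindings.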

Reasoning in NKRL ranges from the direct questioning 
of an NKRL knowledge base, by means of search 
patterns (formal queries over the contents of the 
knowledge base) that try to unify with the predicative 
occurrences of the base, to high-level inference 
procedures. A simple example of search pattern is supplied in 
Table 1c; it produces as an answer, among other things, 
the predicative occurrence mod3.c5 of Table 1b - see 
(Ellis, 1995, Corbett, 2003, etc.) for the techniques 
used to unify complex conceptual structures. With 
respect now to the high level procedures - a detailed 



paper on this topic is (Zarri, 2005) - the transformation 
rules try to 'adapt', from a semantic point of 
view, the original query/queries (search patterns) that 
failed to the real contents of the existing knowledge 
bases. The principle employed consists in using rules 
to automatically 'transform' the original query (i.e., 
the original search pattern) into one or more different 
queries (search patterns) that are not strictly 'equivalent' 
but only 'semantically close' to the original one. 
Let us suppose that, e.g., during the search for all the 
possible information linked with Robustiniano 
Hablo's kidnapping, we ask the system whether 
Robustiniano Hablo is wealthy. In the absence of a direct 
answer, the system will automatically 'transform' the 
original query using a rule like: "In a context of ransom 
kidnapping, the certification that a given character is 
wealthy or has a professional role can be substituted by 
the certification that: i) this character has a tight kinship 
link with another person, and ii) this second person is 
a wealthy person or a professional". The final 
result can then be paraphrased in this way: we do not 
know whether Robustiniano Hablo is wealthy, but we 
can say that his father is a wealthy businessperson, see 
(Zarri, 2005) for the details. 
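
The fallback from a failed search pattern to a 'semantically close' one can be sketched in Python as follows. This is a toy model and not the NKRL implementation: facts are plain triples rather than predicative occurrences, and the relation names, the `wealthy_` value and the `FATHER_OF_HABLO` identifier are assumptions introduced for the example.

```python
from typing import List, Optional, Tuple

Fact = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Toy knowledge base (hypothetical identifiers): no direct statement
# about the wealth of Robustiniano Hablo is stored.
KB: List[Fact] = [
    ("ROBUSTINIANO_HABLO", "has_father", "FATHER_OF_HABLO"),
    ("FATHER_OF_HABLO", "has_property", "wealthy_"),
]

KINSHIP_RELATIONS = {"has_father"}  # assumed set of 'tight kinship' links


def search(pattern: Pattern) -> List[Fact]:
    """Return the facts matching a pattern; None in a slot is a wildcard."""
    return [fact for fact in KB
            if all(p is None or p == v for p, v in zip(pattern, fact))]


def wealthy_via_kinship(person: str) -> List[Tuple[Fact, Fact]]:
    """Transformed query: a close relative of `person` is wealthy."""
    evidence = []
    for kin_fact in search((person, None, None)):
        if kin_fact[1] not in KINSHIP_RELATIONS:
            continue
        relative = kin_fact[2]
        for wealth_fact in search((relative, "has_property", "wealthy_")):
            evidence.append((kin_fact, wealth_fact))
    return evidence


def is_wealthy(person: str) -> Tuple[str, list]:
    """Try the original search pattern first, then the transformed one."""
    direct = search((person, "has_property", "wealthy_"))
    if direct:
        return "direct", direct
    indirect = wealthy_via_kinship(person)
    if indirect:
        return "transformed", indirect
    return "unknown", []


status, evidence = is_wealthy("ROBUSTINIANO_HABLO")
print(status, evidence)
```

The answer is tagged as 'transformed' rather than 'direct' because the substituted query is only semantically close to the original one, so the user should be told which of the two was actually satisfied.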

Hypothesis rules allow building up 'reasonable' 
logic/semantic connections among the data stored in 
an NKRL knowledge base using a number of pre-defined 
reasoning schemata, e.g., 'causal' schemata. For 
example, to mention a 'classic' NKRL example, after 
having directly retrieved through the use of a search 
pattern information like: "Pharmacopeia, a USA 
biotechnology company, has received 64,000,000 dollars 
from the German company Schering in connection 
with an R&D activity", we could automatically 
construct a sort of 'causal explanation' of this 
event by retrieving information like: i) "Pharmacopeia 
and Schering have signed an agreement concerning 
the production by Pharmacopeia of a new compound" 
and ii) "in the framework of the agreement previously 
mentioned, Pharmacopeia has actually produced the 
new compound". 

In Table 2, we give the informal description of 
the reasoning steps (called 'condition schemata' in a 
hypothesis context) that must be validated to prove 
that a generic 'kidnapping' corresponds, in reality, to 
a more precise 'kidnapping for ransom' environment. 
When several reasoning steps must be simultaneously 
validated, as in Table 2, a failure is always possible. To 
overcome this problem - and, at the same time, discover 
all the possible implicit information associated with the 
original data - the two inference modes, transformations 
and hypotheses, can be used in an integrated way, see 
(Zarri, 2005). In practice, we make use of 'transformations' 
within a 'hypothesis' context. This means that, 
whenever a 'search pattern' is derived from a 'condition 
schema' of a hypothesis to implement one of the steps 
of the reasoning process, we can use it 'as it is' - i.e., as 
originally coded when the inference rule was built 
up - but also in a 'transformed' form if the appropriate 



transformation rules exist within the system. 

Making use of the transformation rules already existing 
within the system, the hypothesis represented in an 
informal way in Table 2 becomes, in practice, potentially 
equivalent to the hypothesis of Table 3. For example, the 
proof that the kidnappers are part of a terrorist group or 
separatist organization (reasoning step Cond1 of Table 
2) can now be obtained indirectly, through transformation T3, 
by checking whether they are members of a specific 
subset of this group or organization. 
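
The integrated use of the two inference modes can be sketched as follows: each condition schema of the hypothesis is first tried 'as it is' and, if that fails, in any available 'transformed' form. This Python fragment is only an illustration under simplifying assumptions; the triples, relation names and identifiers are invented for the example, and only the Cond1..Cond4 steps and the idea behind transformation T3 come from the text.

```python
from typing import Callable, Dict, List, Set, Tuple

Fact = Tuple[str, str, str]
Check = Callable[[Set[Fact]], bool]

# Toy knowledge base; every identifier here is hypothetical.
KB: Set[Fact] = {
    ("KIDNAPPERS_1", "member_of", "SUBGROUP_A"),
    ("SUBGROUP_A", "part_of", "SEPARATIST_MOVEMENT_1"),
    ("SEPARATIST_MOVEMENT_1", "practices", "ransom_kidnapping"),
    ("SEPARATIST_MOVEMENT_1", "targets", "executives"),
    ("VICTIM_1", "is_a", "businessperson"),
}


def cond1_direct(kb: Set[Fact]) -> bool:
    """Cond1 as originally coded: the kidnappers belong to the movement."""
    return ("KIDNAPPERS_1", "member_of", "SEPARATIST_MOVEMENT_1") in kb


def cond1_t3(kb: Set[Fact]) -> bool:
    """Cond1 transformed (T3): they belong to a subset of the movement."""
    return any(s == "KIDNAPPERS_1" and r == "member_of"
               and (g, "part_of", "SEPARATIST_MOVEMENT_1") in kb
               for s, r, g in kb)


# Each step: (name, original check, list of transformed alternatives).
HYPOTHESIS: List[Tuple[str, Check, List[Check]]] = [
    ("Cond1", cond1_direct, [cond1_t3]),
    ("Cond2", lambda kb: ("SEPARATIST_MOVEMENT_1", "practices", "ransom_kidnapping") in kb, []),
    ("Cond3", lambda kb: ("SEPARATIST_MOVEMENT_1", "targets", "executives") in kb, []),
    ("Cond4", lambda kb: ("VICTIM_1", "is_a", "businessperson") in kb, []),
]


def evaluate(kb: Set[Fact]) -> Dict[str, str]:
    """Validate the steps in order; stop at the first one that fails."""
    outcome: Dict[str, str] = {}
    for name, direct, transformed in HYPOTHESIS:
        if direct(kb):
            outcome[name] = "direct"
        elif any(check(kb) for check in transformed):
            outcome[name] = "transformed"
        else:
            outcome[name] = "failed"
            break  # all the steps must be validated for the hypothesis to hold
    return outcome


result = evaluate(KB)
print(result)
```

Without the transformed form of Cond1 the whole hypothesis would fail at its first step, even though the knowledge base contains enough information to support it.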



FUTURE TRENDS 

NKRL is a fully implemented language/environment. 
The software exists in two versions, an ORACLE-sup- 
ported and a file-oriented one. Future improvements 
will concern mainly: 

• The addition of features that will allow us to query 
  the system in Natural Language. Very encouraging 
  experimental results have already been obtained in 
  this context thanks to the combined use of shallow 
  parsing techniques - see, e.g., (Koster, 2004) - and 
  of the standard NKRL inference capabilities. 
• On a more ambitious basis, the introduction of 
  some features for the semi-automatic construction 
  of the knowledge base of annotations/occurrences 
  making use of full NL techniques. Some preliminary 
  work in this context has been carried out making 
  use of the syntactic/semantic Cafetiere tools, see 
  (Black et al., 2003, 2004). 
• The introduction of optimisation techniques for the 
  (basic) chronological backtracking of the NKRL 
  InferenceEngine, in the style of the well-known 
  techniques developed in a Logic Programming 
  context, see, e.g., (Clark and Tärnlund, 1982). 




Table 2. Inference steps for the 'kidnapping for ransom' hypothesis 



(Cond1) The kidnappers are part of a separatist movement or of a terrorist organization. 
(Cond2) This separatist movement or terrorist organization currently practices ransom kidnapping of particular 
categories of people. 
(Cond3) In particular, executives or assimilated categories are concerned. 
(Cond4) It can be proved that the kidnapped person is really a businessperson or assimilated. 
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Even in its present form, NKRL has been able to 
deal successfully, in an 'intelligent information retrieval' 
mode, with very different 'narrative' domains 
- from the history of France to terrorism, from the Falklands 
War to the corporate domain, from the legal field to 
the beauty care domain or the analysis of customers' 
motivations, etc. 



CONCLUSION 

In this article, we have supplied some details about 
NKRL (Narrative Knowledge Representation Lan- 
guage), a fully implemented, up-to-date knowledge 
representation and inference system especially created 
for an 'intelligent' exploitation of narrative knowledge. 
The main innovation of NKRL consists in associat- 
ing with the traditional ontologies of concepts an 



'ontology of events', i.e., a hierarchical arrangement 
where the nodes correspond to n-ary structures called 
'templates'. 
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KEY TERMS 

Attributive Operator: The 'attributive operator', 
SPECIF(ication), is one of the four operators used in 
NKRL for the construction of 'structured arguments' 
('complex fillers' or 'expansions') of the conceptual 
predicates. The SPECIF lists, with syntax (SPECIF 
e_i p_1 ... p_n), are used to represent the properties or 
attributes that can be asserted about the first element e_i, 
concept or individual, of the list. 

Binding Occurrences: Second order structures used 
to deal with those 'connectivity phenomena' that arise 
when several elementary events are connected through 
causality, goal, indirect speech etc. links. They consist 
of lists of symbolic labels (c_i) of predicative occurrences; 
the lists are differentiated using specific binding 
operators like GOAL, CONDITION and CAUSE. 

Format of NKRL Templates: Templates take 
the form of n-ary combinations of quadruples con- 
necting together the 'symbolic name' of the template, 
a 'conceptual predicate' and the 'arguments' of the 
predicate introduced by named relations, the 'roles' 
(like SUBJ(ect), OBJ(ect), SOURCE, 
BEN(e)F(iciary), etc.). The quadruples have 
in common the 'name' and 'predicate' components. 
Denoting then with L_i the symbolic label identifying 
the template, with P_j the predicate, with R_k the generic 
role and with a_k the generic argument, the core data 
structure for templates has the format: 

(L_i (P_j (R_1 a_1) (R_2 a_2) ... (R_n a_n))). 

Templates are included in an inheritance hierarchy, 
HTemp(lates), which implements NKRL's 'ontology 
of events'. 

NKRL Inference Engine: A software module that 
carries out the different 'reasoning steps' included in 
hypotheses or transformations. It allows us to use these 
two classes of inference rules also in an 'integrated' 
mode, augmenting then the possibility of finding in- 
teresting (implicit) information. 

NKRL Inference Rules, Hypotheses: They are 
used to build up automatically 'reasonable' connections 
among the information stored in an NKRL knowledge 
base according to a number of pre-defined reasoning 
schemata, e.g., 'causal' schemata. 



NKRL Inference Rules, Transformations: These 
rules try to 'adapt', from a semantic point of view, 
a query that failed to the contents of the existing 
knowledge bases. The principle employed consists in 
using rules to automatically 'transform' the original 
query into one or more different queries that are not 
strictly 'equivalent' but only 'semantically close' to 
the original one. 

Ontology of Concepts vs. Ontology of Events: 

The ontologies of concepts concern the 'standard' 
hierarchical organizations of concepts to be used to 
model (in a 'static' way) a given domain. NKRL adds 
an 'ontology of events', i.e., a new sort of hierarchical 
organization where the nodes, represented by n-ary 
structures called 'templates', represent general classes 
of 'dynamical' events like "move a physical object", 
"produce a service", "send/receive a message", etc. 
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INTRODUCTION 

'Narrative' information concerns in general the account 
of some real-life or fictional story (a 'narrative') involv- 
ing concrete or imaginary 'personages'. In this article 
we deal with (multimedia) nonfictional narratives of 
an economic interest. This means, first, that we are 
not concerned with all sorts of fictional narratives that 
have mainly an entertainment value, and represent an 
imaginary narrator's account of a story that happened 
in an imaginary world: a novel is a typical example 
of fictional narrative. Secondly, our 'nonfictional nar- 
ratives' must have an economic value: they are then 
typically embodied into corporate memory documents, 
they concern news stories, normative and legal texts, 
medical records, intelligence messages, surveillance 
videos or visitor logs, actuality photos and video frag- 
ments for newspapers and magazines, eLearning and 
multimedia Cultural Heritage material, etc. 

Because of the ubiquity of these 'narrative', 'dy- 
namic' resources, it is particularly important to build 
up computer-based applications able to represent and 
to exploit in a general, accurate, and effective way the 
semantic content - i.e., the key 'meaning' - of these 
resources. 



BACKGROUND 

'Narratives' are currently a very 'hot' domain. 
From a theoretical point of view, they constitute the 
object of a full discipline, 'narratology', whose 
aim can be defined as that of producing an in-depth 
description of the 'syntactic/semantic structures' of 
narratives: the narratologist is in charge of 
dissecting narratives into their component parts in 
order to establish their functions, their purposes and 
the relationships among them. A good introduction to 
the full domain is (Jahn, 2005). 

Even if narratology is particularly concerned with 
literary analysis (and, therefore, with 'fictional' 
narratives), in recent years some of its varieties have also acquired 
a particular importance from a strict Artificial 
Intelligence (AI) and Computer Science (CS) point of 
view. Leaving aside the old dream of generating fiction 
by computer, see (Mehan, 1977) and, more recently, 
(Callaway and Lester, 2002), we can mention here two 
new disciplines, 'storytelling' and 'eChronicles', that 
are of interest from both a nonfictional narratives and 
an AI/CS point of view. 

Storytelling - see, e.g., (Soulier, 2006) - concerns 
in general the study of the different ways of conveying 
'stories' and events in words, images and sounds in order 
to entertain, teach, explain etc. Digital Storytelling deals 
in particular with the ways of introducing characters 
and emotions in the interactive entertainment domain, 
and concerns then videogames, massively multiplayer 
online games, interactive TV, virtual reality etc., see 
(Handler Miller, 2004). Digital Storytelling is, therefore, 
related to another, computer-based variant of narratol- 
ogy called Narrative Intelligence, a sub-domain of 
AI that explores topics at the intersection of Artificial 
Intelligence, media studies, and human computer in- 
teraction design (narrative interfaces, history databases 
management systems, artificial agents with narrative 
structured behaviour, systems for the generation and/or 
understanding of histories/narratives etc.), see (Mateas 
and Sengers, 2003). 

An eChronicle system can be defined in short as a way 
of recording, organizing and then accessing streams 
of multimedia events captured by individuals, groups, 
or organizations making use of video, audio and other 
sensors. The 'chronicles' gathered in this way may 
concern any sort of 'narratives' from meeting minutes 
to football games, sales activities, 'lifelogs' obtained 
from wearable sensors, etc. The technical challenges 
concern mainly the ways of aggregating the events into 
coherent 'episodes' making use of domain models as 
ontologies, and providing then access to this sort of 
material to the users at the required level of granularity. 
Note that exploration, and not 'normal' querying, is 
the predominant way of interaction with the chronicle 
repositories; more details can be found, e.g., in (Güven, 
Podlaseck and Pingali, 2005), (Westermann and Jain, 
2006). 

The solution (NKRL) proposed for the 'intelligent' 
management of nonfictional narratives in the companion 
article - 'Narrative' Information, the NKRL Solution 
- of the present one is considered as a fully-fledged 
eChronicle technique, see (Zarri, 2006). In NKRL, 
however, a fundamental aspect concerns the presence 
of powerful 'reasoning' techniques - an aspect that is 
not taken into consideration sufficiently in depth in 
eChronicles that are mainly interested in the accumula- 
tion of narrative materials more than in the 'intelligent' 
exploitation of their inner relationships. 



REPRESENTING THE 'NONFICTIONAL' 
NARRATIVES 

All the different sorts of 'nonfictional narratives' evoked 
in the previous Sections concern, in practice, the 
description of spatially and temporally characterised 
'events' that relate, at some level of abstraction, the 
behaviour or the state of some real-life 'actors' 
(characters, personages, etc.): these try to attain a specific 
result, experience particular situations, manipulate 
some (concrete or abstract) materials, send or receive 
messages, buy, sell, deliver etc. Note that: 

• The term 'event' is taken here in its most general 
  meaning, covering also strictly related notions 
  like fact, action, state, situation, episode, activity 
  etc. 
• The 'actors' or 'personages' involved in the events 
  are not necessarily human beings: we can have 
  narratives concerning, e.g., the vicissitudes in 
  the journey of a nuclear submarine (the 'actor', 
  'subject' or 'personage') or the various avatars 
  in the life of a commercial product. 
• Even if a large amount of nonfictional narratives 
  are embodied within natural language (NL) texts, 
  this is not necessarily true: narrative information 
  is really 'multimedia'. A photo representing a 
  situation that, verbalized, could be expressed as 
  "The US President is addressing the Congress" 
  is not of course an NL document, yet it surely 
  represents a narrative. 



An in-depth analysis of the existing Knowledge 
Representation solutions that could be used to represent 
and manage nonfictional narratives endowed with the 
above characteristics is beyond the scope of this 
article - see in this context, e.g., (Zarri, 2005). We will 
limit ourselves, here, to a few brief considerations. 

We can note, first of all, that the now so popular Se- 
mantic Web (W3C) languages like RDF (Resource De- 
scription Framework), see (Manola and Miller, 2004), 
and OWL (Web Ontology Language), see (McGuinness 
and Harmelen, 2004) are unable to fit the bill because 
their core formalism consists in practice of the classical 
'attribute - value' model. For these 'binary' languages 
then, a property can only be a binary relationship, link- 
ing two individuals or an individual and a value. When 
these languages must represent simple 'narratives' like 
"John has given a book to Mary", several difficulties 
arise. In this extremely simple sentence, e.g., "give" is 
an n-ary (ternary) relationship that, to be represented 
in a complete way, asks for the presence of a specific 
'semantic predicate' in the "give" or "transfer" style, 
where the 'arguments', "John", "book" and "Mary", 
of the predicate must be labelled with 'conceptual 
roles' such as, e.g., 'agent of give', 'object of give' 
and 'beneficiary of give' respectively. 
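
A small Python sketch can make the difficulty concrete. It compares an n-ary record of "give" events, where the role-labelled arguments stay together, with a naive flattening into independent binary pairs. The flattening scheme, the relation names and the second event (BOOK_2, PETER_) are assumptions introduced for the illustration, not constructs of RDF or OWL.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class GiveEvent:
    """n-ary structure: one predicate with its role-labelled arguments."""
    label: str        # identifier of the event
    agent: str        # 'agent of give'
    obj: str          # 'object of give'
    beneficiary: str  # 'beneficiary of give'


def naive_binary(events: Iterable[GiveEvent]) -> FrozenSet[Triple]:
    """Flatten each event into independent binary relationships."""
    triples = set()
    for event in events:
        triples.add((event.agent, "gives", event.obj))
        triples.add((event.agent, "gives_to", event.beneficiary))
    return frozenset(triples)


# Two different stories: which book went to which person is swapped.
story_a = [GiveEvent("ev1", "JOHN_", "BOOK_1", "MARY_"),
           GiveEvent("ev2", "JOHN_", "BOOK_2", "PETER_")]
story_b = [GiveEvent("ev1", "JOHN_", "BOOK_1", "PETER_"),
           GiveEvent("ev2", "JOHN_", "BOOK_2", "MARY_")]

# The n-ary records keep the two stories apart ...
print(story_a == story_b)
# ... while the naive binary flattening makes them indistinguishable.
print(naive_binary(story_a) == naive_binary(story_b))
```

Once the arguments are split into separate pairs, the link between a given object and its beneficiary is lost, which is why the predicate and its roles have to be kept together in a single structure.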

Efforts to extend the W3C languages by introducing 
some n-ary feature have not been very successful 
until now: see, in this context, a recent working paper 
from the W3C Semantic Web Best Practices and 
Deployment Working Group (SWBPD WG) about "Defining 
N-ary Relations on the Semantic Web" (Noy and 
Rector, 2006). This paper proposes some extensions to 
the binary paradigm to allow the correct representation 
of 'narratives' like: "Steve has temperature, which is 
high, but falling" or "United Airlines flight 3177 visits 
the following airports: LAX, DFW, and JFK". The 
technical solutions expounded in this paper are not 
very convincing and have aroused several criticisms. 
These have focused, mainly, on i) the fact that the 
majority of the solutions proposed do not deal, in reality, 
with the n-ary problem, but with (only loosely) related 
matters like the possibility of specifying a 'standard' 
binary relationship via the addition of properties, and 
ii) the arbitrary introduction, through reification 
processes, of fictitious (and inevitably ad hoc) 
'individuals' to represent the n-ary relations when these are 
actually dealt with. Moreover, the paper says nothing, 
e.g., about the way of dealing, in concrete 'narrative' 
situations, with those crucial 'connectivity phenomena' 
like causality, goal, indirect speech, co-ordination and 
subordination etc. that link together the basic pieces 
of information - e.g., the 'basic events' corresponding 
to the present illness state of Steve with other 'basic 
events' corresponding to the (possible or definite) 
'causes' of such state. 
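
The reification criticised in point ii) can be illustrated with a short Python sketch: to stay within binary relationships, the n-ary statement about Steve's temperature is routed through an invented intermediate 'individual'. The helper function, the node naming scheme and the property names are assumptions made for the illustration and should not be read as the exact vocabulary of the W3C paper.

```python
from itertools import count
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

_node_ids = count(1)  # generator of fresh identifiers for the invented nodes


def reify(subject: str, relation: str, parts: Dict[str, str]) -> List[Triple]:
    """Express an n-ary statement as binary triples around a fictitious node."""
    node = f"_:{relation}_{next(_node_ids)}"   # the ad hoc 'individual'
    triples: List[Triple] = [(subject, relation, node)]
    triples += [(node, prop, value) for prop, value in parts.items()]
    return triples


triples = reify("Steve", "has_temperature",
                {"temperature_value": "high", "temperature_trend": "falling"})
for triple in triples:
    print(triple)
```

The intermediate node has no counterpart in the narrative itself; it exists only to hold the pieces together, and nothing in this scheme says how such a node should be linked to other events, e.g., to the possible causes of Steve's state.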

Several solutions for representing narratives in 
computer-usable ways according to some sort of actual 
'n-ary model' have been described in the literature. 
For example, in the context of his work - between 
the mid-fifties and the mid-sixties - on the set up of 
a mechanical translation process based on the simula- 
tion of the thought processes of the translator, Silvio 
Ceccato (Ceccato, 1961) proposed a representation 
of narrative-like sentences as a network of triadic 
structures ('correlations') organized around specific 
'correlators' (a sort of roles). Ceccato is also credited 
as one of the pioneers of semantic network 
studies; basically, semantic networks are directed graphs 
(digraphs) where the nodes represent concepts, and the 
arcs different kinds of associative links, not only the 
'classical' IsA and property-value links, but also n-ary 
relationships. A panorama of the different conceptual 
solutions proposed in a semantic network context can 
be found in (Lehmann, 1992). 

In the seventies, a particularly popular n-ary 
semantic network approach was represented by 
the Conceptual Dependency theory of Roger Schank 
(Schank, 1973). In this theory, the underlying meaning 
('conceptualization') of narrative-like utterances is 
expressed as combinations of 'semantic predicates' chosen 
from a set of twelve 'primitive actions' (like INGEST, 
MOVE, ATRANS, the transfer of an abstract relationship 
like possession, ownership and control, PTRANS, 
physical transfer, etc.) plus states and changes of states, 
and seven role relationships ('conceptual cases'). 
Conceptual Graphs (CGs) is the representation system 
developed by John Sowa (Sowa, 1984, 1999) and derived, 
at least partly, from Schank's work and other early work 
in the Semantic Networks domain. CGs make use of a 
graph-based notation for representing 'concept-types' 
(organized into a type-hierarchy), 'concepts' (that are 
instantiations of concept types) and 'conceptual 
relations' that relate one concept to another. CGs can be 
used to represent in a formal way narratives like "A 
pretty lady is dancing gracefully" and more complex, 
second-order constructions like contexts, wishes and 
beliefs. CYC, see (Lenat et al., 1990), is one 
of the most controversial endeavours in the history 
of Artificial Intelligence. Started in the early 1980s as 
an MCC (Microelectronics and Computer Technology 
Corporation, Texas, USA) project, it ended about 15 
years later with the set up of an enormous knowledge 
base containing about a million hand-entered 'logical 
assertions', including both simple statements of facts and 
rules about what conclusions can be inferred if certain 
statements of facts are satisfied. The 'upper level' of 
the ontology that structures the CYC knowledge base 
is now freely accessible on the Web, see 
http://www.cyc.com/cyc/opencyc. A detailed analysis of the origins, 
developments and motivations of CYC can be found in 
(Bertino et al., 2001: 275-316). We can also mention 
here another 'modern' system, Topic Maps, see (Rath, 
2003), where information is represented using topics 
(representing any concept, from people to software 
modules and events), associations (the relationships 
between them), and occurrences (the relationships 
between topics and information resources relevant to 
them). They correspond, ultimately, to a sort of 
downgraded Semantic Network representation. 

Leaving now aside 'historical' solutions like those 
proposed by Schank or Ceccato, none of the existing n- 
ary solutions mentioned above seem to be able to satisfy 
completely the nonfictional narratives requirements, 
see again (Zarri, 2005) for more details. The universal 
purposes of CYC, the extremely large dimensions of 
its knowledge base and the extreme diversity of the 
contents of this base give rise to serious consistency 
problems, that have apparently restricted the develop- 
ment of concrete applications based on this technol- 
ogy to experimental projects mainly supported by the 
US Government. On the other hand, the knowledge 
representation language of CYC, CycL (substantially, 
a frame system rewritten in logical form) seems to be 
too rigid and uniform to adapt itself to the representa- 
tion of all the different facets (from general concepts 
and elementary events to the connectivity phenomena 
etc.) that characterise the narratives. Conceptual Graphs 
(CGs) could represent, at least in principle, a valid 
solution for dealing with nonfictional narrative information. 
However, it seems evident that work in a CGs context 
concerns mainly, with few exceptions, the 'academic' 
domain, and that the practically-oriented applications 
of CGs are particularly scarce. This becomes particularly 
evident when we consider that the CGs developers still 
lack an exhaustive and authoritative list of standard 
CGs structures in the form of 'canonical graphs' 
that could constitute a sort of 'catalogue' for dealing 
with practical problems; the set up of a tool like this 
seems never to have been planned. The existence of such a 
catalogue could be extremely important for practical 
applications in the narrative domain (and beyond), given that: 
i) a system-builder should not have to create from scratch the 
structural and inferential knowledge needed to describe 
and exploit the events proper to a (sufficiently) large 
class of narratives; ii) the reproduction and the sharing 
of previous results could become markedly easier. 

We can add to the above difficulties the existence 
of a series of general problems that are not associated 
with a specific system but that concern by and large all 
the existing n-ary solutions, like the lack of agreement 
about the list of 'roles' (conceptual cases) to be used 
when a narrative must be practically represented in 
conceptual format, or the differences of opinion about 
the use of 'primitives'. 



ACTUAL TRENDS 

In spite of the quite pessimistic considerations of the 
previous Section, conceiving a specific Knowledge Rep- 
resentation tool for dealing in practice with nonfictional 
narrative information is far from being impossible. 
Returning now to the "John gave a book..." example 
above - and leaving aside, for the moment, all 
the additional problems linked, e.g., with the existence 
of the 'connectivity phenomena' - it is not too 
difficult to see that a complete, n-ary representation that 
captures all the 'essential meaning' of this elementary 
narrative amounts to: 



• Define JOHN_, MARY_ and BOOK_1 as 
  'individuals', instances of general 'concepts' like 
  human_being and information_support or of 
  more specific concepts. Concepts and instances 
  (individuals) are, as usual, collected into a 'binary' 
  ontology (built up using a standard tool like, e.g., 
  Protégé). 
• Define an n-ary structure organised around a 
  conceptual predicate like, e.g., MOVE or 
  PHYSICAL_TRANSFER and associate the above 
  individuals (the arguments) to the predicate through 
  the use of conceptual roles that specify their 
  'function' within the global narrative. JOHN_ will 
  then be introduced by an AGENT (or SUBJECT) 
  role, BOOK_1 by an OBJECT (or PATIENT) 
  role, MARY_ by a BENEFICIARY role. 
  Additional information like "yesterday" could be 
  introduced by, e.g., a TEMPORAL_ANCHOR 
  role, etc. 
• 'Reify' the obtained n-ary structure by associating 
  with it a unique identifier in the form of a 
  'semantic label', to assure both i) the logical-semantic 
  coherence of the structure; ii) a rational 
  and efficient way of storing and retrieving it. 

Formally, an n-ary structure defined according to the 
above guidelines can be described as: 

(L_i (P_j (R_1 a_1) (R_2 a_2) ... (R_n a_n)))        (1) 

where L_i is the symbolic label identifying the particular 
n-ary structure (e.g., the global structure corresponding 
to the representation of the "John gave a book..." 
example), P_j is the conceptual predicate, R_k is the 
generic role and a_k the corresponding argument (e.g., the 
individuals john_, mary_ etc.). Note that each of the (R_k 
a_k) cells of (1), taken individually, represents a binary 
relationship in the W3C language style. The main point 
here is, however, that the whole conceptual structure 
represented by (1) must be considered globally. 

The solution represented formally by (1) is at the 
core of a complete and running conceptual tool for the 
representation and management of nonfictional 
narrative information called NKRL (Narrative Knowledge 
Representation Language), see (Zarri, 2005) and the 
companion article: 'Narrative' Information, the NKRL 
Solution. 


CONCLUSION 

We deal in this article with 'nonfictional narratives'. 
These are information resources of high economic 
importance that concern, e.g., 'corporate knowledge' 
documents, news stories, medical records, 
surveillance videos or visitor logs, etc. When we 
examine the existing (or past) general Knowledge 
Representation systems that could be used for dealing 
with nonfictional narratives, we can note that none of 
them seem to be able to satisfy completely the non- 
fictional narratives requirements. For example, the 
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W3C (Semantic Web) languages like RDF and OWL 
cannot fit the bill since they are binary-based types of 
representation while narratives ask, in general, for n-ary 
solutions. A specific, narrative-oriented formalism able 
to capture the essential 'meaning' of an 'elementary' 
narrative event exists, however, see (Zarri, 2005) and 
the companion article: 'Narrative' Information, the 
NKRL Solution. 
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KEY TERMS 

'Binary' Languages vs. n-ary Languages: Binary 
languages (like RDF and OWL) are based on the clas- 
sical 'attribute - value' model: they are called 'binary' 
because, for them, a property can only be a binary 
relationship, linking two individuals or an individual 
and a value. They cannot be used to represent in an 
accurate way the narratives that ask in general, on the 
contrary, for the use of n-ary knowledge representa- 
tion languages. 

Connectivity Phenomena: In the presence of 
several, logically linked elementary events, this term 
denotes the existence of a global 'narrative' informa- 
tion content that goes beyond the simple addition of 
the information conveyed by the single events. The 
connectivity phenomena are linked with the presence 
of logico-semantic relationships like causality, goal, 
co-ordination and subordination etc. 

Core Format of a Complete Solution for Repre- 
senting Narratives: Formally, an n-ary structure able 
to represent the 'essential meaning' of an 'elementary 
event' can be described as: 



(L_i (P_j (R_1 a_1) (R_2 a_2) ... (R_n a_n))) 

where L_i is the symbolic label identifying the particular 
formalized event, P_j is the conceptual predicate, R_k is 
the generic role and a_k the corresponding argument. 



"Narrative" Information Problem 



Examples of n-ary Languages: 'Historical' 
examples of n-ary languages are Ceccato's 'correlations', 
Schank's Conceptual Dependency theory, many Seman- 
tic Networks proposals, etc. Current n-ary systems are, 
e.g., Topic Maps, Sowa's Conceptual Graphs, Lenat's 
CYC, etc. None of them are able to satisfy completely 
the requirements for an 'intelligent' representation and 
management of nonfictional narrative information. 

Narrative Information: Concerns in general the 
account of some real-life or fictional story (a 'narrative') 
involving concrete or imaginary 'personages'. 

Narratology: Discipline that deals with narratives 
from a theoretical point of view. Sub-classes of nar- 
ratology that have a 'computational' interest are, e.g., 
Storytelling, Narrative Intelligence and the eChronicle 
systems. 

Nonfictional Narrative of an Economic Interest: 
In this case, the personages are 'real characters', and 
the narrative happens in the real world. Moreover, the 
narratives are now embodied in multimedia documents 
of an economic interest: corporate memory documents, 
news stories, normative and legal texts, medical re- 
cords, intelligence messages, surveillance videos or 
visitor logs, etc. 
INTRODUCTION 

During the 20th century, biology — especially molecular 
biology — has become a pilot science, so that many disci- 
plines have formulated their theories under models taken 
from biology. Computer science has become almost a 
bio-inspired field thanks to the great development of 
natural computing and DNA computing. 

Interactions between linguistics and biology were not 
frequent during the 20th century. Nevertheless, 
because of the "linguistic" consideration of the genetic 
code, molecular biology has taken several models from 
formal language theory in order to explain the structure 
and working of DNA. Such attempts have focused 
on the design of grammar-based approaches to define a 
combinatorics in protein and DNA sequences (Searls, 
1993). The linguistics of natural language has also made 
some contributions in this field through Collado 
(1989), who applied generativist approaches to the 
analysis of the genetic code. 

On the other hand, and from a strictly theoretical 
interest, several attempts at establishing structural 
parallelisms between DNA sequences and verbal language 
have been made (Jakobson, 1973; Marcus, 1998; 
Ji, 2002). However, there is a lack of theory that attempts 
to explain the structure of human language 
from the results of the semiosis of the genetic code. This 
is probably the only arrow that remains incomplete 
in order to close the path between computer science, 
molecular biology, biosemiotics and linguistics. 

Natural Language Processing (NLP), a subfield of 
Artificial Intelligence that concerns the automated 
generation and understanding of natural languages, can 
take great advantage of the structural and "semantic" 
similarities between those codes. Specifically, the 
systemic code units and methods of combination 
of the genetic code can be 
translated to the study of natural language. Therefore, 
NLP could become another "bio-inspired" science, by 
means of theoretical computer science, which provides the 
theoretical tools and formalizations that are necessary 
for approaching such an exchange of methodology. 

In this way, we obtain a theoretical framework where 
biology, NLP and computer science exchange methods 
and interact, thanks to the semiotic parallelism between 
the genetic code and natural language. 



BACKGROUND 

Most current approaches to natural language exhibit several 
features that invite the search for new formalisms 
able to account for natural languages in a simpler and more 
natural way. Two main facts lead us to look for 
a more natural computational system to give a formal 
account of natural languages: a) natural language 
sentences cannot be placed in any of the families of 
the Chomsky hierarchy (Chomsky, 1956) on which 
current computational models are largely based, and 
b) the rewriting methods used in a large number of natural 
language approaches do not seem very adequate, from 
a cognitive perspective, to account for the processing 
of language. 

If to these we add (1) that languages 
generated following a molecular computational 
model are placed in between the Context-Sensitive and 
Context-Free families; (2) that the genetic model offers 
simpler alternatives to rewriting rules; and (3) that 
genetics is a natural informational system, as natural 
language is, we have the ideal setting in which to propose 
biological models in NLP. 

The idea of using biological methods in the descrip- 
tion and processing of natural languages is backed up 
by a long tradition of interchanging methods in biology 
and natural/formal language theory: 

1. Results and methods in the field of formal 
language theory have been applied to biology: 
(1) Pawlak's (1965) dependency grammars as an 
approach in the study of protein formation; (2) 
transformational grammars for modeling gene 
regulation (Collado, 1989); (3) stochastic context-free 
grammars for modeling RNA (Sakakibara 
et al., 1994); (4) definite clause grammars 
and cut grammars to investigate gene structure, 
mutations and rearrangements (Searls, 
1989); (5) tree-adjoining grammars for predicting 
RNA structure of biological data (Uemura et al., 
1999). 

2. Natural languages as models for biology: (1) 
Watson's (1968) understanding of heredity as a 
form of communication; (2) Asimov's (1968) idea 
that nucleotide bases are letters and that they form 
an alphabet; (3) Jacob's (1970) consideration that 
the sense of the genetic message is given by the 
combination of its signs in words and by the 
arrangement of words in phrases; (4) Jakobson's 
(1970) ideas about taking the nucleotide bases 
as phonemes of the genetic code or about the binary 
oppositions in phonemes and in the nucleic 
code. 

3. Biological ideas in linguistics: (1) the "tree 
model" proposed by Schleicher (1863); (2) the 
"wave model" due to Schmidt (1872); (3) the 
"geometric network model" proposed by Forster 
(1997); or (4) the naturalistic metaphor in linguistics 
defended by Jakobson (1970, 1973). 

4. Using DNA as a support for computation is the 
basic idea of Molecular Computing (Paun et al., 
1998). Speculations about this possibility can be 
found in Feynman (1961), Bennett (1973) and 
Conrad (1995). 



BIOLOGICAL METHODS IN NLP 

Here, we present an overview of different bio-inspired 
methods that in recent years have been successfully 
applied to several NLP issues, from syntax to 
pragmatics. Those methods are taken mainly from 
computer science and are basically the following: 
DNA computing, membrane computing and networks 
of evolutionary processors. 



DNA Computing 

One of the most developed lines of research in natural 
computing is so-called molecular computing, a model 
based on molecular biology, which arose mainly after 
Adleman (1994). An active area in molecular computing 
is DNA computing (Paun et al., 1998), inspired by the 
way that DNA performs operations to generate, replicate 
or change the configuration of strings. 

Application of molecular computing methods to 
natural language syntax gives rise to molecular syntax 
(Bel-Enguix & Jimenez-Lopez, 2005a). Molecular 
syntax takes as a model two types of mechanisms 
used in biology in order to modify or generate DNA 
sequences: mutations and splicing. Mutations refer to 
changes performed in a single linguistic string, whether 
a phrase, sentence or text. Splicing is a process 
involving two or more linguistic sequences. It is 
a good framework for approaching syntax, from both 
the sentential and the dialogical perspective. 

Methods used by molecular syntax are based on 
basic genetic processes: cut, paste, delete and move. 
By combining these elementary rules, most of the complex 
structures of natural language can be obtained with a 
high degree of simplicity. 
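
The four elementary operations can be sketched on sequences of words. This is a minimal illustration under our own assumptions: the function signatures and the example sentence are hypothetical, since the text names the operations but does not define them formally.

```python
from typing import List, Tuple

Words = List[str]


def cut(s: Words, i: int) -> Tuple[Words, Words]:
    """Split a sequence into two parts at position i."""
    return s[:i], s[i:]


def paste(left: Words, right: Words) -> Words:
    """Join two sequences end to end."""
    return left + right


def delete(s: Words, i: int, j: int) -> Words:
    """Remove the segment s[i:j]."""
    return s[:i] + s[j:]


def move(s: Words, i: int, j: int, k: int) -> Words:
    """Take out the segment s[i:j] and reinsert it at position k of the rest."""
    segment, rest = s[i:j], delete(s, i, j)
    return rest[:k] + segment + rest[k:]


sentence = "Mary will read the book".split()
# Fronting the auxiliary turns the statement into a question pattern.
question = move(sentence, 1, 2, 0)
```

Here `question` is the word sequence "will Mary read the book", obtained with a single move; cutting and pasting at the same position returns the original sentence unchanged.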

This approach is a test of the generative power of 
splicing for syntax. It seems, according to the results 
achieved, that splicing is quite powerful for generating, 
in a very simple way, most of the patterns of 
traditional syntax. Moreover, the new perspectives and 
results it provides could mean a transformation in the 
general perspective of syntax. 

From here, we think that bio-NLP, applied in a 
methodical and clear way, is a powerful and simple 
model that can be very useful to a) formulate systems 
capable of generating most of the structures 
of language, and b) define a formalization that can be 
implemented and may be able to describe and predict 
the behavior of natural language structures. 

Membrane Computing 

Membrane Systems (MS) (Paun, 2000) are models 
of computation inspired by some basic features of 
biological membranes. They can be viewed as a new 
paradigm in the field of natural computing based on the 
functioning of membranes inside the cell. MS can be 
used as generative, computing or decidability devices. 
This computing model has several intrinsically 
interesting features such as the use of 
multisets, the inherent parallelism in its evolution, 
and the possibility of devising computations that can 
solve exponential problems in polynomial time. 

This framework provides a powerful tool for formalizing 
any kind of interaction, both among agents and 
between agents and their environment. One of the key ideas of 
MS is that generation is made by evolution. Therefore, 
most evolving systems can be formalized by means 
of membrane systems. 

Linguistic Membrane Systems (LMS) (Bel-Enguix 
& Jimenez-Lopez, 2005b) aim to model linguistic 
processes, taking advantage of the flexibility of MS 
and their suitability for dealing with some fields where 
contexts are a central part of the theory. LMS can be 
easily adapted to deal with different aspects of the 
description and processing of natural languages. The 
most developed applications of LMS are semantics 
and dialogue. 

MS are a good framework for developing a se- 
mantic theory because they are evolving systems by 
definition, in the same sense that we take meaning to 
be a dynamic entity. Moreover, MS provide a model 
in which contexts, either isolated or interacting, are 
an important element which is already formalized and 
can give us the theoretical tools we need. Semantic 
membranes may be seen as an integrative approach to 
semantics coming from formal languages, biology and 
linguistics. Taking into account results obtained in the 
field of computer science as well as the naturalness and 
simplicity of the formalism, it seems the formalization 
of contexts by means of membranes is a promising 
area of research for the future. Examples of application 
of MS to semantics can be found in Bel-Enguix and 
Jimenez-Lopez (2007). 

A topic in which context and interaction among agents 
are essential is the field of dialogue modeling and its 
applications to the design of effective and user-friendly 
computer dialogue systems. Taking into account a 
pragmatic perspective of dialogue and based on speech 
acts, multi-agent theory and dialogue games, Dialogue 
Membrane Systems have arisen as an attempt to compute 
speech acts by means of MS. Considering membranes 
as agents, and domains as a personal background 
and linguistic competence, the application to dialogue 
is almost natural, and simple from the formal point of 
view. For examples of this application see Bel-Enguix 
and Jimenez-Lopez (2006b). 



NEPs - Networks of Evolutionary Processors 

Networks of Evolutionary Processors (NEPs) are a new 
computing mechanism directly inspired by the behavior 
of cell populations. Every cell is described by a set of 
words (DNA) evolving by mutations, which are represented 
by operations on these words. At the end of the 
process, only the cells with correct strings will survive. 
In spite of the biological inspiration, the architecture 
of the system is directly related to the Connection 
Machine (Hillis, 1985) and the Logic Flow paradigm 
(Errico et al., 1994). Moreover, the global framework 
for the development of NEPs has to be completed with 
the biological background of DNA computing (Paun 
et al., 1998), membrane computing (Paun, 2000) and, 
especially, with grammar systems (Csuhaj-Varju et 
al., 1994), which share with NEPs the idea of several 
devices working together and exchanging results. 
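
The following toy network sketches the idea of processors that mutate their own words and then exchange results. It is a simplified illustration, not the formal definition from the cited literature: the `Processor` class, the filter predicates, the complete-graph topology and the single substitution rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

MutationRule = Callable[[str], Set[str]]


@dataclass
class Processor:
    rules: List[MutationRule]          # mutations applied to the local words
    accepts: Callable[[str], bool]     # input filter (assumed predicate)
    emits: Callable[[str], bool]       # output filter (assumed predicate)
    words: Set[str] = field(default_factory=set)


def substitute(a: str, b: str) -> MutationRule:
    """Point mutation: replace one occurrence of symbol a by symbol b."""
    def rule(w: str) -> Set[str]:
        return {w[:i] + b + w[i + 1:] for i, c in enumerate(w) if c == a}
    return rule


def evolve(net: List[Processor]) -> None:
    """Evolution step: every processor mutates its own words."""
    for p in net:
        mutated: Set[str] = set()
        for w in p.words:
            results = set().union(*(r(w) for r in p.rules))
            mutated |= results or {w}   # words no rule applies to are kept
        p.words = mutated


def communicate(net: List[Processor]) -> None:
    """Communication step on a complete graph; unaccepted words are lost."""
    outgoing = [{w for w in p.words if p.emits(w)} for p in net]
    for p, out in zip(net, outgoing):
        p.words -= out
    for i, p in enumerate(net):
        for j, out in enumerate(outgoing):
            if i != j:
                p.words |= {w for w in out if p.accepts(w)}


net = [
    Processor([substitute("a", "b")], accepts=lambda w: False,
              emits=lambda w: "a" not in w, words={"aa"}),
    Processor([], accepts=lambda w: "a" not in w, emits=lambda w: False),
]
for _ in range(2):
    evolve(net)
    communicate(net)
```

After two rounds the first processor has rewritten "aa" into "bb" and passed it on, so the second processor holds the only surviving word.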

The first precedents of NEPs as generating devices 
can be found in Csuhaj-Varju & Salomaa (1997) and 
Csuhaj-Varju & Mitrana (2000). The topic was introduced 
in Castellanos et al. (2003) and Martin-Vide et 
al. (2003), and further developed in Castellanos et al. 
(2005) and Csuhaj-Varju et al. (2005). 

With this background and these theoretical connections, 
it is easy to understand how NEPs can be described 
as agential bio-inspired context-sensitive systems. 
Many disciplines are in need of this type of model, 
which is able to support a biological framework in a 
collaborative environment. The conjunction of these 
features allows the system to be applied to a number of 
areas beyond generation and recognition in formal 
language theory. NLP is one of the fields with a lack 
of biological models and with a clear suitability for 
agential approaches. 

NEPs have significant intrinsic multi-agent capabilities 
together with the environmental adaptability 
that is typical of bio-inspired models. Some of the 
characteristics of the NEP architecture are the following: 
modularization, contextualization and redefinition of 
agent capabilities, synchronization, evolvability and 
learnability. 

Inside the construct, every agent is autonomous, 
specialized, context-interactive and learning-capable. 

As regards the functioning of NEPs, two 
main features deserve to be highlighted: emergence 
and parallelism. 
Because of those features, NEPs seem to be a 
suitable model for tackling natural languages. One 
of the main problems of natural language is that it is 
generated in the brain, and there is a lack of knowledge 
of the mental processes the mind undergoes to bring 
about a sentence. While awaiting new advances in 
neuroscience, we have to use the models that seem to fit 
NLP best. Modularity has proved to be an important 
idea in a wide range of fields: cognitive science, 
computer science and, of course, NLP. NEPs provide 
a suitable theoretical framework for the formalization of 
modularity in NLP. 

Another chief problem for the formalization and 
processing of natural language is its changing nature. 
Not only words, but also rules, meaning and phonemes 
can take different shapes during the process of computation. 
Formal models based on mathematical language 
lack the flexibility needed to describe natural language. 
Biological models seem better suited to this task, since 
biological entities share with languages the concept 
of "evolution". From this perspective, NEPs offer 
enough flexibility to model any change at any moment 
in any part of the system. Besides, as a bio-inspired 
method of computation, they have the capability of 
simulating natural evolution in a highly pertinent and 
specialized way. 

Some linguistic disciplines, such as pragmatics or semantics, 
are context-driven areas, where the same utterance 
has different meanings in different contexts. To model 
such variation, a system with a good definition of 
environment is needed. NEPs offer a way 
to approach formal semantics and formal pragmatics 
from a natural computing perspective. 

Finally, the multimodal approach to communication, 
where not just production, but also gestures, vision and 
supra-segmental features of sounds have to be tackled, 
calls for a parallel way of processing. NEPs allow 
modules to work in parallel. The autonomy of each 
of the processors and the possible miscoordination 
between them can also account for several 
problems of speech. 

Examples of NEPs applications to NLP can be found 
in Bel-Enguix and Jimenez-Lopez (2005c, 2006a). 



FUTURE TRENDS 

Three general formalisms for dealing with NLP by 
means of biological methods have been introduced, 
focusing on the formal definition of several frameworks 
that adapt models coming from the area of bio-inspired 
computation to NLP needs. The main trends for the 
future focus on the implementation of these models in 
order to test their computational advantages over classical 
models of NLP without biological inspiration. 



CONCLUSION 

The coincidences between several structures of lan- 
guage and biology allow us, in the field of NLP, to 
take advantage of the bio-inspired models formalized 
by theoretical computer science. Moreover, the multi- 
agent capabilities of some of these models make them a 
suitable tool for simulating the processes of generation 
and recognition in natural language. 

Biological methods coming from computer science 
can be very useful in the field of natural language, since 
they provide simple, flexible and intuitive tools for 
describing natural languages and make their 
implementation in NLP systems easier. 

This research provides an integrative path for 
biology, computer science and NLP, three branches 
of human knowledge that have to work together in the 
development of new systems of communication for 
the future global society. 
KEY TERMS 



Membrane Systems: In a membrane system multi- 
sets of objects are placed in the compartments defined 
by the membrane structure that delimits the system from 
its environment. Each membrane identifies a region, 
the space between it and all directly inner membranes. 
Objects evolve by means of reaction rules associated 
with compartments, and applied in a maximally parallel, 
nondeterministic manner. Objects can pass through 
membranes, and membranes can change their permeability, 
dissolve and divide. 
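
A single evolution step in one region can be sketched as multiset rewriting. This is a deliberately reduced illustration: it models one compartment only, ignores membrane permeability, dissolution and division, and replaces the nondeterministic choice of rules by a fixed rule order, which is our own simplification. The rule format and the example rules are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# A rule consumes one multiset of objects and produces another.
Rule = Tuple[Counter, Counter]


def step(region: Counter, rules: List[Rule]) -> Counter:
    """Apply rules until no rule fits the objects that are still unused."""
    remaining = Counter(region)   # objects not yet consumed in this step
    produced = Counter()          # products become available only next step
    applied = True
    while applied:
        applied = False
        for lhs, rhs in rules:
            # Rules with an empty left-hand side are skipped to guarantee termination.
            if lhs and all(remaining[obj] >= n for obj, n in lhs.items()):
                remaining -= lhs
                produced += rhs
                applied = True
    return remaining + produced


rules = [
    (Counter("a"), Counter("bb")),   # a -> bb
    (Counter("b"), Counter("c")),    # b -> c
]
after_one = step(Counter("aab"), rules)
after_two = step(after_one, rules)
```

In the first step both copies of `a` and the single `b` are rewritten at once, while the newly produced `b` objects have to wait for the second step; this is the sense in which the step is maximally parallel.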

Multi-Agent System: A system composed of a set 
of computational agents that perform local problem 
solving and cooperatively interact to solve a single 
problem (or reach a goal) that is difficult for an 
individual agent to solve (achieve). 

Mutations: Several types of transformations in a 
single string. 

Natural Computing: Research field that deals with 
computational techniques inspired by nature and natural 
systems. This type of computing includes evolutionary 
algorithms, neural networks, molecular computing and 
quantum computing. 

Neural Network: Interconnected group of artificial 
neurons that uses a mathematical or a computational 
model for information processing based on a connec- 
tionist approach to computation. It involves a network 
of simple processing elements that can exhibit complex 
global behaviour. 

Splicing: Operation which consists of splitting up 
two strings in an arbitrary way and sticking the left 
side of the first one to the right side of the second one 
(direct splicing), and the left side of the second one to 
the right side of the first one (inverse splicing). 
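
The definition of splicing can be written directly as a small function. The sketch below works on word sequences; the function name, the explicit cut positions and the two example phrases are assumptions chosen for illustration.

```python
from typing import List, Tuple


def splice(x: List[str], y: List[str], i: int, j: int) -> Tuple[List[str], List[str]]:
    """Cut x at position i and y at position j, then recombine the halves."""
    direct = x[:i] + y[j:]    # left side of the first + right side of the second
    inverse = y[:j] + x[i:]   # left side of the second + right side of the first
    return direct, inverse


x = ["the", "cat", "sleeps"]
y = ["a", "dog", "barks", "loudly"]
direct, inverse = splice(x, y, 2, 2)
```

With both cuts placed after the subject, direct splicing gives "the cat barks loudly" and inverse splicing gives "a dog sleeps".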



Grammar Systems Theory: A consolidated and 
active branch in the field of formal languages that 
provides syntactic models for describing multi-agent 
systems at the symbolic level using tools from formal 
languages and grammars. 
INTRODUCTION 

Natural language understanding and assessment is 
a subset of natural language processing (NLP). The 
primary purpose of natural language understanding 
algorithms is to convert written or spoken human lan- 
guage into representations that can be manipulated by 
computer programs. Complex learning environments 
such as intelligent tutoring systems (ITSs) often depend 
on natural language understanding for fast and accurate 
interpretation of human language so that the system can 
respond intelligently in natural language. These ITSs 
function by interpreting the meaning of student input, 
assessing the extent to which it manifests learning, and 
generating suitable feedback to the learner. To operate 
effectively, systems need to be fast enough to operate in 
the real time environments of ITSs. Delays in feedback 
caused by computational processing run the risk of frus- 
trating the user and leading to lower engagement with 
the system. At the same time, the accuracy of assessing 
student input is critical because inaccurate feedback 
can potentially compromise learning and lower the 
student's motivation and metacognitive awareness of 
the learning goals of the system (Millis et al., 2007). 
As such, student input in ITSs requires an assessment 
approach that is fast enough to operate in real time but 
accurate enough to provide appropriate evaluation. 

One of the ways in which ITSs with natural lan- 
guage understanding verify student input is through 
matching. In some cases, the match is between the user 
input and a pre-selected stored answer to a question, 
solution to a problem, misconception, or other form of 
benchmark response. In other cases, the system evaluates 
the degree to which the student input varies from 
a complex representation or a dynamically computed 
structure. The computation of matches and similarity 
metrics is limited by the fidelity and flexibility of the 
computational linguistics modules. 

The major challenge with assessing natural language 
input is that it is relatively unconstrained and rarely 
follows brittle rules in its computation of spelling, syntax, 
and semantics (McCarthy et al., 2007). Researchers 
who have developed tutorial dialogue systems in natural 
language have explored the accuracy of matching 
students' written input to targeted knowledge. Examples 
of these systems are AutoTutor and Why-Atlas, which 
tutor students on Newtonian physics (Graesser, Olney, 
Haynes, & Chipman, 2005; VanLehn, Graesser, et al., 
2007), and the iSTART system, which helps students 
read text at deeper levels (McNamara, Levinstein, 
& Boonthum, 2004). Systems such as these have 
typically relied on statistical representations, such as 
latent semantic analysis (LSA; Landauer, McNamara, 
Dennis, & Kintsch, 2007) and content word overlap 
metrics (McNamara, Boonthum, et al., 2007). Indeed, 
such statistical and word overlap algorithms can boast 
much success. However, over short dialogue exchanges 
(such as those in ITSs), the accuracy of interpretation 
can be seriously compromised without a deeper level 
of lexico-syntactic textual assessment (McCarthy et al., 
2007). Such a lexico-syntactic approach, entailment 
evaluation, is presented in this chapter. The approach 
incorporates deeper natural language processing solutions 
for ITSs with natural language exchanges while 
remaining sufficiently fast to provide real time assessment 
of user input. 



BACKGROUND 

Entailment evaluations help in the assessment of the 
appropriateness of student responses during ITS ex- 
changes. Entailment can be distinguished from three 
similar terms (implicature, paraphrase, and elaboration), 
all of which are also important for assessment in ITS 
environments (McCarthy et al., 2007). 

The term entailment is often associated with the 
highly similar concept of implicature. The distinction 
is that entailment is reserved for linguistic-based inferences 
that are closely tied to explicit words, syntactic 
constructions, and formal semantics, as opposed to the 
knowledge-based implied referents and references, for 
which the term implicature is more appropriate (McCarthy 
et al., 2007). Implicature corresponds to the 
controlled knowledge-based elaborative inferences 
defined by Kintsch (1993) or to knowledge-based inferences 
defined in the inference taxonomies in discourse 
psychology (Graesser, Singer, & Trabasso, 1994). 

The terms paraphrase and elaboration also need 
to be distinguished from entailment. A paraphrase is a 
reasonable restatement of the text. Thus, a paraphrase 
is a form of entailment, yet an entailment is not necessarily 
a paraphrase. This asymmetric relation can be 
understood if we consider that John went to the store 
is entailed by (but not a paraphrase of) John drove to 
the store to buy supplies. The term elaboration refers 
to information that is generated inferentially or associatively 
in response to the text being analyzed, but without 
the systematic and sometimes formal constraints of 
entailment, implicature, or paraphrase. Examples of 
each term are provided below for the sentence John 
drove to the store to buy supplies. 

Entailment: John went to the store. 
(Explicit, logical implication based on the text) 

Implicature: John bought some supplies. 
(Implicit, reasonable assumption from the text, although 
not explicitly stated in the text) 

Paraphrase: He took his car to the store to get things 
that he wanted. 
(Reasonable re-statement of all and only the critical 
information in the text) 

Elaboration: He could have borrowed stuff. 
(Reasonable reaction to the text) 

Evaluating entailment is generally referred to as the 
task of recognizing textual entailment (RTE; Dagan, 
Glickman, & Magnini, 2005). Specifically, it is the 
task of deciding, given two text fragments, whether the 
meaning of one text can be logically inferred from the other. When 
it can, the evaluation is expressed as T (the entailing 
text) entails H (the entailed hypothesis). For example, 
a text (from the RTE data) of Eyeing the huge market 
potential, currently led by Google, Yahoo took over 
search company Overture Services Inc last year would 
entail a hypothesis of Yahoo bought Overture. The task 
of recognizing entailment is relevant to a large number 
of applications, including machine translation, question 
answering, and information retrieval. 

The task of textual entailment has been a priority 
in investigations of information retrieval (Monz & 
de Rijke, 2001) and automated language processing 
(Pazienza, Pennacchiotti, & Zanzotto, 2005). In related 
work, Moldovan and Rus (2001) analyzed how to use 
unification and matching to address the answer correctness 
problem. Similar to entailment, answer correctness 
is the task of deciding whether candidate answers 
logically imply an ideal answer to a question. 



THE LEXICO-SYNTACTIC ENTAILMENT 
APPROACH 

A complete solution to the textual entailment challenge 
requires linguistic information, reasoning, and world 
knowledge (Rus, McCarthy, McNamara, & Graesser, 
in press). This chapter focuses on the role of linguis- 
tic information in making entailment decisions. The 
overall goal is to produce a light (i.e. computationally 
inexpensive), but accurate solution that could be used 
in interactive systems such as ITSs. Solutions that rely 
on processing-intensive deep representations (e.g., 
frame semantics and reasoning) and large structured 
repositories of information (e.g., ResearchCyc) are 
impractical for interactive tasks because they result in 
lengthy response times, causing user dissatisfaction. 

One solution for recognizing textual entailment is 
based on subsumption. In general, an object X subsumes 
an object Y if and only if X is more general than or 
identical to Y. Applied to textual entailment, subsumption 
translates as follows: hypothesis H is entailed by 
T if and only if T subsumes H. The solution has two 
phases: (I) map both T and H into graph structures and 
(II) perform a subsumption operation between the T-graph 
and H-graph. An entailment score, entail(T,H), is 
computed, quantifying the degree to which the T-graph 
subsumes the H-graph. 

In phase I, the two text fragments involved in a 
textual entailment decision are initially mapped onto 
a graph representation. The graph representation 
employed is based on the dependency-graph formalisms 
of Mel'cuk (1998). The mapping relies on information 
from syntactic parse trees. A phrase-based parser is used 
to derive the dependencies. Although a dependency 
parser may be adopted, our particular research agenda 
required partial phrase parsers for other tasks such as 
computing cohesion metrics. Having both a phrase-based 
and a dependency parser integrated in the system would 
have led to a heavier, less interactive system. A parse 
tree groups words into phrases and organizes these 
phrases into hierarchical tree structures from which 
syntactic dependencies among concepts can be detected. 
The system uses Charniak's (2000) parser to obtain 
parse trees and Magerman's (1994) head-detection 
rules to obtain the head of each phrase. A dependency 
tree is generated by linking the head of each phrase 
to its modifiers in a systematic mapping process. The 
dependency tree encodes exclusively local dependencies 
(head-modifiers), as opposed to long-distance 
(remote) dependencies, such as the remote subject 
relation between bombers and enter in the sentence The 
bombers managed to enter the embassy compounds. 
Thus, in this stage, the dependency tree is transformed 
into a dependency graph by generating remote dependencies 
between content words. Remote dependencies 
are computed by a naive-Bayes functional tagger (Rus 
& Desai, 2005). An example of a dependency graph 
is shown in Figure 1 for the sentence The two objects 
will cover the same horizontal distance. For instance, 
there is a subject (subj) dependency relation between 
objects and cover. 

In phase II, the textual entailment problem (i.e., each 
T and H pair) is mapped onto a specific case of graph 
isomorphism called subsumption (also known as 
containment). Isomorphism in graph theory addresses the 
problem of testing whether two graphs are the same. 

A graph G = (V, E) consists of a set of nodes or 
vertices V and a set of edges E. Graphs can be used to 
model the linguistic information embedded in a sentence: 
vertices represent concepts (e.g., bombers, joint venture) 
and edges represent syntactic relations among concepts 
(e.g., the edge labeled subj connects the verb cover to 
its subject objects in Figure 1). The Text (T) entails the 
Hypothesis (H) if and only if the hypothesis graph is 
subsumed (or contained) by the text graph. 
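
A minimal sketch of this containment test, assuming a graph is stored as a set of vertices plus a set of labelled (head, relation, modifier) edges. The class and function names are illustrative. Only the subj edge between cover and objects is taken from the text; the obj label is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Edge = Tuple[str, str, str]   # (head, relation label, modifier)


@dataclass(frozen=True)
class Graph:
    vertices: FrozenSet[str]
    edges: FrozenSet[Edge]


def contains(text: Graph, hypothesis: Graph) -> bool:
    """Strict test: every vertex and labelled edge of H must occur in T."""
    return (hypothesis.vertices <= text.vertices
            and hypothesis.edges <= text.edges)


text_graph = Graph(
    vertices=frozenset({"objects", "cover", "distance"}),
    edges=frozenset({("cover", "subj", "objects"),
                     ("cover", "obj", "distance")}),   # "obj" is an assumed label
)
good_hypothesis = Graph(
    vertices=frozenset({"objects", "cover"}),
    edges=frozenset({("cover", "subj", "objects")}),
)
bad_hypothesis = Graph(
    vertices=frozenset({"distance", "cover"}),
    edges=frozenset({("cover", "subj", "distance")}),
)
```

The first hypothesis is contained in the text graph, while the second is not, because its subj edge has no counterpart in the text even though both of its vertices do.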

The subsumption algorithm for textual entailment 
(Rus et al., in press) has three major steps: (1) find an 
isomorphism between V H (set of vertices of the Hy- 
pothesis graph) and V T ; (2) check whether the labeled 
edges in H, E H , have correspondents in E T ; and (3) 
compute score. In step 1, for each vertex V H , a cor- 
respondent V T node is sought. If a vertex in H does 
not have a direct correspondent in T, a thesaurus is 
used to find all possible synonyms for vertices. Step 2 
takes each relation in H and checks its presence in T. 
The checking is augmented with relation equivalences 
among linguistic phenomena such as possessives and 
linking verbs (e.g. be, have). For instance, tall man 
would be equivalent to man is tall. A normalized score 
for vertices and edge mapping is then computed. The 
score for the entire entailment is the sum of each in- 
dividual vertex and edge matching score. Finally, the 
score must account for negation. The approach handles 
both explicit and implicit negation. Explicit negation 
is indicated by particles such as no, not, neither ... nor 
and the shortened form n't. Implicit negation is present 
in text via deeper lexico-semantic relations among 
linguistic expressions. The most obvious example is the 
antonymy relation among words, which is retrieved from 
WordNet (Miller, 1995). Negation is accommodated in 
the score after making the entailment decision for the 
Text-Hypothesis pair (without negation). If any one of 
the text fragments is negated, the decision is reversed, 
but if both are negated the decision is retained (double 
negation), and so forth. 

Figure 1. An example of a dependency graph 
The two objects will cover the same horizontal distance. 
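
The three steps and the negation rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Entailer implementation: the function names, the equal weighting of the vertex and edge scores, the 0.5 decision threshold, and the tiny synonym table standing in for a thesaurus are all assumptions made for the example.

```python
# Hypothetical sketch of graph-subsumption scoring with a negation flip.
# Graphs are given as a set of word vertices and a set of labeled edges
# (head, relation, dependent). Weights and threshold are assumptions.

NEGATION_MARKERS = {"no", "not", "neither", "nor", "n't"}


def map_vertices(text_vertices, hyp_vertices, synonyms):
    """Step 1: map each hypothesis vertex to a text vertex, directly or via a synonym."""
    mapping = {}
    for h in hyp_vertices:
        if h in text_vertices:
            mapping[h] = h
            continue
        for s in sorted(synonyms.get(h, ())):
            if s in text_vertices:
                mapping[h] = s
                break
    return mapping


def is_negated(vertices):
    """Explicit negation only: a negation particle occurs among the words."""
    return any(v in NEGATION_MARKERS for v in vertices)


def entailment_decision(text_v, text_e, hyp_v, hyp_e, synonyms=None, threshold=0.5):
    synonyms = synonyms or {}
    t_content = set(text_v) - NEGATION_MARKERS
    h_content = set(hyp_v) - NEGATION_MARKERS

    mapping = map_vertices(t_content, h_content, synonyms)
    vertex_score = len(mapping) / len(h_content) if h_content else 1.0

    # Step 2: every hypothesis edge should have a correspondent in the text.
    h_edges = [e for e in hyp_e if e[0] in h_content and e[2] in h_content]
    matched = sum(
        1 for head, rel, dep in h_edges
        if (mapping.get(head), rel, mapping.get(dep)) in text_e
    )
    edge_score = matched / len(h_edges) if h_edges else 1.0

    # Step 3: combine the normalized scores (equal weights are an assumption).
    score = 0.5 * vertex_score + 0.5 * edge_score
    decision = score >= threshold

    # Negation is applied after the decision: exactly one negated fragment
    # reverses it, two negated fragments leave it unchanged.
    if is_negated(text_v) != is_negated(hyp_v):
        decision = not decision
    return score, decision
```

For instance, with a text graph for "objects cover distance" and a hypothesis that uses a synonym of "objects", the sketch returns a positive decision, and adding "not" to only one of the two fragments reverses it.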

Entailment for Intelligent Tutoring 
Systems 

The problem of evaluating student input in ITSs with 
natural language understanding is modeled here as a 
textual entailment problem. Results of this approach 
are shown on data sets from two ITSs: AutoTutor and 
iSTART. Data from the AutoTutor experiments involve 
college students learning Newtonian physics, whereas 
data from iSTART involve adolescent and college stu- 
dents constructing explanations about science texts. 

AutoTutor 

AutoTutor (autotutor.org) teaches topics such as New- 
tonian physics, computer literacy, and critical thinking 
by holding a dialogue in natural language with the 
student. The system presents deep-reasoning questions 
to the student that call for explanations or other elabo- 
rate answers. AutoTutor has a list of anticipated good 
answers (or expectations) and a list of misconceptions 
associated with each main question. AutoTutor guides 
the student in articulating the expectations through a 
number of dialogue moves and adaptively responds to 
the student by giving short feedback on the quality of 
student contributions. 

To understand how the entailment approach helps to 
assess the appropriateness of student responses in Au- 
toTutor, consider the following AutoTutor problem: 

Suppose a runner is running in a straight line at constant 
speed, and the runner throws a pumpkin straight up. 
Where will the pumpkin land? Explain why. 

An expectation for this problem is The object will 
continue to move at the same horizontal velocity as 
the person when it is thrown. A real student answer is 
The pumpkin and the runner have the same horizontal 
velocity before and after release. The expert judgment 
of this response was very good. Such expectation/stu- 
dent-input (E-S) pairs can be viewed as an entailment 
pair of Text-Hypothesis. The task is to find the truth 
value of the student answer based on the true fact 
encoded in the expectation. Rus and Graesser (2006) 
examined how the lexico-syntactic system described 
in the previous section performed on a test set of 125 
E-S pairs collected from a sample of AutoTutor tuto- 
rial dialogues. The lexico-syntactic approach provided 
the best accuracy (69%), whereas a Latent Semantic 
Analysis (LSA; Landauer et al., 2007) approach yielded 
an accuracy of 60%. Such a result illustrates the value 
of augmenting AutoTutor with lexico-syntactic natural 
language understanding. 

iSTART (Interactive Strategy Trainer for 
Active Reading and Thinking) 

The primary goal of iSTART (istartreading.com) is 
to help high school and college students learn to use 
reading comprehension strategies that support deeper 
understanding. iSTART's design combines the power 
of self-explanation in facilitating deep learning (McNamara 
et al., 2004) with content-sensitive, interactive 
strategy training. The iSTART system helps students 
learn to self-explain using a variety of reading strate- 
gies (e.g., rewording the text, or paraphrasing; or 
elaborating on the text by linking textual content to 
what the reader already knows). The final stage of the 
iSTART process requires students to self-explain sen- 
tences from two short passages. Scaffolded feedback 
is provided to the students based on the quality of the 
student responses. 

The entailment evaluation has been used in two 
iSTART studies. In Rus et al. (2007), a corpus of 
iSTART self-explanation responses was evaluated by 
an array of textual evaluation measures. The results 
demonstrated that the entailment approach was the most 
powerful distinguishing index of the self-explanation 
categories (Entailer: F(1, 1228) = 25.05, p < .001; LSA: 
F(1, 1228) = 2.98, p > .01). In McCarthy et al. (2007), 
iSTART self-explanations were hand-coded for degree 
of entailment, paraphrase, versus elaboration. Once 
again, the entailment evaluation proved to be a more 
powerful predictor of these categories than traditional 
measures: for entailment, the Entailer was a significant 
predictor (t = 9.61, p < .001) and LSA was a marginal 
predictor (t = -1.90, p = .061); for elaboration and for 
paraphrase the Entailer was again a significant predictor 
(t = -7.98, p < .001; t = 5.62, p < .001, respectively), 
whereas LSA results were not significant. 



FUTURE TRENDS 

While the results of the entailment evaluation have 
been encouraging, a variety of developments of the 
approach are underway. For example, there are plans to 
weight words by their specificity and to learn syntactic 
patterns or transformations that lead to similar mean- 
ings. The current negation detection algorithm will be 
extended to assess plausible implicit forms of negation 
in words such as denied, denies, without, ruled out. A 
second extension addresses issues of relative opposites: 
knowing that an object is not hot does not entail that 
the object is cold (i.e., it could simply be warm). 



CONCLUSION 

Recognizing and assessing textual entailment is a 
prominent and challenging task in the fields of Natural 
Language Processing and Artificial Intelligence. This 
chapter presented a lexico-syntactic approach to the task 
of evaluating entailment. The approach is light, using 
minimal knowledge resources, yet it has delivered high 
performance in evaluations of three data sets involving 
natural language interactions in ITSs. The entailment 
approach is a promising step in achieving the goal of 
fast and effective evaluation of student contributions 
in short text exchanges, which is needed to provide 
optimal feedback and responses to student learners. 



KEY TERMS 

Dependency: A binary relation between two words in 
a sentence, whose label indicates the syntactic relation 
between the two words. 

Entailment: The task of deciding whether one text 
fragment logically or semantically implies another text 
fragment. 

Expectation: A stored (generally ideal) answer to 
a problem, against which input is evaluated; concept 
used in ITSs. 

Graph Subsumption: A specific example of graph 
isomorphism. Isomorphism exists when two graphs are 
equivalent. Subsumption can be viewed as subgraph 
isomorphism. 

Intelligent Tutoring System: Interactive, feedback- 
based computer systems designed to help students learn 
various topics. 

Latent Semantic Analysis: A statistical technique 
for human language understanding based on words that 
co-occur in documents of large corpora. 

Natural Language Processing: The science of 
capturing the meaning of human language in compu- 
tational representations and algorithms. 

Natural Language Understanding and Assess- 
ment: An NLP subset focusing on evaluating natural 
language input in intelligent tutoring systems. 

Syntactic Parsing: The process of discovering the 
underlying structure of sentences. 



INTRODUCTION 

Almost all autonomous robots need to navigate. We 
define navigation as do Franz & Mallot (2000): "Navi- 
gation is the process of determining and maintaining 
a course or trajectory to a goal location" (p. 134). We 
allow that this definition may be more restrictive than 
some readers are used to - it does not for example include 
problems like obstacle avoidance and position tracking 
- but it suits our purposes here. 

Most algorithms published in the robotics literature 
localise in order to navigate (see e.g. Leonard & Durrant-Whyte 
(1991a)). That is, they determine their own 
location and the position of the goal in some suitable 
coordinate system. This approach is problematic for 
several reasons. Localisation requires a map of avail- 
able landmarks (i.e. a list of landmark locations in 
some suitable coordinate system) and a description 
of those landmarks. In early work, the human opera- 
tor provided the robot with a map of its environment. 
Researchers have recently, though, developed simulta- 
neous localisation and mapping (SLAM) algorithms 
which allow robots to learn environmental maps while 
navigating (Leonard & Durrant-Whyte (1991b)). Of 
course, autonomous SLAM algorithms must choose 
which landmarks to map and sense these landmarks from 
a variety of different positions and orientations. Given 
a map, the robot has to associate sensed landmarks with 
those on the map. This data association problem is 
difficult in cluttered real-world environments and is an 
area of active research. 

We describe in this chapter an alternative approach 
to navigation called visual homing which makes no ex- 
plicit attempt to localise and thus requires no landmark 
map. There are broadly two types of visual homing 
algorithms: feature-based and image-based. The feature- 
based algorithms, as the name implies, attempt to extract 
the same features from multiple images and use the change 
in the appearance of corresponding features to navigate. 
Feature correspondence is - like data association - a 
difficult, open problem in real-world environments. 
We argue that image-based homing algorithms, which 
provide navigation information based on whole-image 
comparisons, are more suitable for real-world environments 
in contemporary robotics. 



BACKGROUND 

Visual homing algorithms make no attempt to localise in 
order to navigate. No map is therefore required. Instead, an 
image I_S (usually called a snapshot for historical reasons) 
is captured at a goal location S = (x_S, y_S). Note that 
though S is defined as a point on a plane, most homing 
algorithms can be easily extended to three dimensions 
(see e.g. Zeil et al. (2003)). When a homing robot seeks 
to return to S from a nearby position C = (x_C, y_C), it takes 
an image I_C and compares it with I_S. The home vector H 
= S - C is inferred from the disparity between I_S and I_C 
(vectors are in upper case and bold in this work). The 
robot's orientation at C and S is often different; if this 
is the case, image disparity is meaningful only if I_C is 
rotated to account for this difference. Visual homing 
algorithms differ in how this disparity is computed. 

Visual homing is an iterative process. The home vector 
H is frequently inaccurate, leading the robot closer to 
the goal position but not directly to it. If H does not 
take the robot to the goal, another image I_C is taken at 
the robot's new position and the process is repeated. 
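
The iterative nature of homing can be illustrated with a small simulation. The sketch below is hypothetical: the image comparison is replaced by a stand-in estimator (`noisy_estimator`) that returns the true home vector rotated by a fixed angular error, and the robot is assumed to move by the full estimated vector at each step. Neither choice comes from the algorithms reviewed in this chapter.

```python
import math


def noisy_estimator(goal, angle_error_rad):
    """Stand-in for comparing the current image with the snapshot: returns the
    true home vector S - C rotated by a fixed error (an assumption for the demo)."""
    def estimate(position):
        dx, dy = goal[0] - position[0], goal[1] - position[1]
        c, s = math.cos(angle_error_rad), math.sin(angle_error_rad)
        return (c * dx - s * dy, s * dx + c * dy)
    return estimate


def home(start, estimate, tolerance=0.05, max_steps=50):
    """Repeat: estimate H at the current position, move along it, re-estimate."""
    position = start
    path = [position]
    for _ in range(max_steps):
        hx, hy = estimate(position)
        if math.hypot(hx, hy) < tolerance:   # close enough to the goal
            break
        position = (position[0] + hx, position[1] + hy)
        path.append(position)
    return path
```

With a 30 degree error in every estimate, a call such as `home((4.0, 3.0), noisy_estimator((0.0, 0.0), math.radians(30)))` still ends near the goal: each inaccurate step shortens the remaining distance, so repeating the process compensates for the error in any single home vector.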

The images I_S and I_C are typically panoramic grayscale 
images. Panoramic images are useful because, 
for a given location (x,y) they contain the same image 
information regardless of the robot's orientation. Most 
researchers use a camera imaging a hemispheric, coni- 
cal or paraboloid mirror to create these images (see e.g. 
Nayar (1997)). 

Some visual homing algorithms extract features 
from I_S and I_C and use these to compute image disparity. 
Alternatively, disparity can be computed from 
entire images, essentially treating each pixel as a viable 
feature. Both feature-based and image-based visual 
homing algorithms are discussed below. 



FEATURE-BASED VISUAL HOMING 

Feature-based visual homing methods segment I_S and 
I_C into features and background (the feature extraction 
problem). Each identified feature in the snapshot is 
then usually paired with one feature in I_C (the correspondence 
problem). The home vector is inferred from 
- depending on the algorithm - the change in the bearing 
and/or apparent size of the paired features. Generally, 
in order for feature-based homing algorithms to work 
properly, they must reliably solve the feature extraction 
and correspondence problems. 

The Snapshot Model (Cartwright & Collett (1983)) 
- the first visual homing algorithm to appear in the literature 
and the source of the term "snapshot" to describe the goal 
image - matches each snapshot feature with the current 
feature closest in bearing (after both images are rotated 
to the same external compass orientation). Features in 
(Cartwright & Collett (1983)) were black cylinders in 
an otherwise empty environment. Two unit vectors, 
one radial and the other tangential, are associated with 
each feature pair. The radial vector is parallel to the 
bearing of the snapshot feature; the tangential vector is 
perpendicular to the radial vector. The direction of the 
radial vector is chosen to move the agent so as to reduce 
the discrepancy in apparent size between paired features. 
The direction of the tangential vector is chosen to move 
the agent so as to reduce the discrepancy in bearing be- 
tween paired features. The radial and tangential vectors 
for all feature pairs are averaged to produce a homing 
vector. The Snapshot Model was devised to explain the 
behaviour of nest-seeking honeybees but has inspired 
several robotic visual homing algorithms. 

One such algorithm is the Average Landmark Vector 
(ALV) Model (Moller et al. (2001)). The ALV Model, 
like the Snapshot Model, extracts features from both I_C 
and I_S. The ALV Model, though, does not explicitly solve 
the correspondence problem. Instead, given features 
extracted from I_S, the algorithm computes and stores a unit 
vector ALV_S in the direction of the mean bearing to all 
features as seen from S. At C, the algorithm extracts 
features from I_C and computes their mean bearing, 
encoded in the unit vector ALV_C. The home vector H 
is defined as ALV_C - ALV_S. Figure 1 illustrates home 
vector computation for a simple environment with four 
easily discernible landmarks. 
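
The ALV computation can be sketched as follows. This is a minimal illustration under stated assumptions: landmark positions are known to the simulation only so that bearings can be generated, all bearings are expressed in a common compass-aligned frame, and the landmark vector is taken as the plain average of the unit bearing vectors without renormalising it (with landmarks arranged symmetrically around S the average is the zero vector, which cannot be normalised). The function names are hypothetical.

```python
import math


def average_landmark_vector(position, landmarks):
    """Average of the unit vectors pointing from `position` to each landmark."""
    sum_x = sum_y = 0.0
    for lx, ly in landmarks:
        dx, dy = lx - position[0], ly - position[1]
        distance = math.hypot(dx, dy)
        sum_x += dx / distance
        sum_y += dy / distance
    n = len(landmarks)
    return (sum_x / n, sum_y / n)


def alv_home_vector(alv_current, alv_snapshot):
    """Home vector H = ALV_C - ALV_S; no feature correspondence is needed."""
    return (alv_current[0] - alv_snapshot[0], alv_current[1] - alv_snapshot[1])
```

For four landmarks at the corners of a square with S at its centre and C displaced to the left of S, the resulting home vector points to the right, that is, back towards S. Only the direction is meaningful in this sketch; the length of H is not a distance estimate.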

Several other interesting feature-based homing algorithms 
can be found in the literature. Unfortunately, 
space constraints prevent us from reviewing them here. 
Two algorithms of note are: visual homing by "surfing 
the epipoles" (Basri et al. (1998)) and the Proportional 
Vector Model (Lambrinos et al. (2000)). 

The Snapshot and ALV Models were tested by their 
creators in environments in which features contrasted 
highly with background and so were easy to extract. 
How are feature extraction and correspondence solved 
in real-world cluttered environments? One method is 
described in Gourichon et al. (2002). The authors use 
images converted to the HSV (Hue-Saturation-Value) 
colour space, which is reported to be more resilient to 
illumination change than RGB. Features are defined 
as image regions of approximately equal colour (identi- 
fied using a computationally expensive region-growing 
technique). Potential feature pairs are scored on their 
difference in average hue, average saturation, average 
intensity and bearing. The algorithm searches for a set 
of pairings which maximise the sum of individual match 
scores. The pairing scheme requires O(n^2) pair-score 
computations (where n is the number of features). The 
algorithm is sometimes fooled by features with similar 
colours (specifically, pairing a blue chair in the snapshot 
image with a blue door in the current image). Gourichon 
et al. did not explore environments with changing light- 
ing conditions. 

Several other feature extraction and correspondence 
algorithms appear in the literature; see e.g. 
Rizzi et al. (2001), Lehrer & Bianco (2000) and Gaussier 
et al. (2000). Many of these suffer from some of the 
same problems as the algorithm of Gourichon et al. 
described above. The appearance of several compet- 
ing feature extraction and correspondence algorithms 
in recent publications indicates that these are open 
and difficult problems; this is why we are advocating 
image-based homing in this chapter. 



Figure 1. Illustration of Average Landmark Vector 
computation. See Section titled "Feature-based Visual 
Homing" for details 



IMAGE-BASED VISUAL HOMING 

Feature-based visual homing algorithms require consistent 
feature extraction and correspondence over a variety of 
viewing positions. Both of these are still open prob- 
lems in computer vision. Existing solutions are often 
computationally intensive. Image-based visual hom- 
ing algorithms avoid these problems altogether. They 
infer image disparity from entire images; no pixel is 
disregarded. We believe that these algorithms present a 
more viable option for real-world, real-time robotics. 
Three image-based visual homing algorithms have 
been published so far; we describe these below. 

Image Warping 

The image warping algorithm (Franz et al. (1998)) 
asks the following question: When the robot is at C 
in some unknown orientation, what change in orientation 
and position is required to transform I_C into I_S? 
The robot needs to know the distance to all imaged 
objects in I_S to answer this question precisely. Not 
having this information, the image warping algorithm 
makes the assumption that all objects are at an equal 
(though unknown) distance from S. The algorithm 
searches for the values of position and orientation 
change which minimise the mean-square error between 
a transformed I_C and I_S. Since the mean-square error 
function is rife with local minima, the authors resort 
to a brute force search over all permissible values of 
position and orientation change. 

Unlikely as the equal distance assumption is, the 
algorithm frequently results in quite accurate values 
for H. Unlike most visual homing schemes, image 
warping requires no external compass reference. 
Unfortunately, the brute force search for the homing 
vector and the large number of transformations of I_C 
carried out during this search make image warping 
quite computationally expensive. 

Homing with Optic Flow Techniques 

When an imaging system moves from S to C, the image 
of a particular point in space moves from I_S(x, y) to 
I_C(x', y'). This movement is called optic flow and (x - x', 
y - y') is the so-called pixel displacement vector. Vardy 
& Moller (2005) demonstrate that the home vector H 
can be inferred from a single displacement vector so 
long as the navigating robot is constrained to move on 
a single plane. Several noisy displacement vectors can 
be combined to estimate H. 

Vardy & Moller (2005) describe a number of methods, 
adapted from the optic flow literature, to estimate 
the displacement vector. One of the most successful 
methods - BlockMatch - segments the snapshot image 
into several equal-sized subimages. The algorithm then 
does a brute force search of a subset of I_C to find the 
best match for each subimage. A displacement vector 
is computed from the centre of each subimage to the 
centre of its match pair in I_C. 
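
A block-matching search of this kind can be sketched in plain Python. The sketch is not the BlockMatch implementation of Vardy & Moller: the sum of squared differences as the match cost, the square search window of a given radius, and the absence of horizontal wrap-around for panoramic images are simplifying assumptions, and the function names are hypothetical.

```python
def block_cost(snapshot, sr, sc, current, cr, cc, size):
    """Sum of squared intensity differences between two size x size blocks."""
    return sum(
        (snapshot[sr + i][sc + j] - current[cr + i][cc + j]) ** 2
        for i in range(size)
        for j in range(size)
    )


def block_match(snapshot, current, size, radius):
    """For each snapshot block, brute-force search a window of `current` and
    return [((row, col), (d_row, d_col)), ...] displacement vectors.
    Because blocks have equal size, the corner offset equals the
    centre-to-centre displacement."""
    rows, cols = len(snapshot), len(snapshot[0])
    vectors = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            best = None
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr <= rows - size and 0 <= cc <= cols - size):
                        continue  # candidate block would leave the image
                    cost = block_cost(snapshot, r, c, current, rr, cc, size)
                    if best is None or cost < best[0]:
                        best = (cost, dr, dc)
            vectors.append(((r, c), (best[1], best[2])))
    return vectors
```

The cost of the search grows with the number of blocks, the window area and the block area, which is why the gradient-based alternative discussed next is cheaper.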

A less computationally intensive algorithm estimates 
the displacement vector from the intensity gradient at 
each pixel in I . The intensity gradient at a particular 
pixel can be computed straightforwardly from intensi- 
ties surrounding that pixel. No brute-force search is 
required. 

In comparative tests, Vardy & Moller demonstrated 
that their optic flow based methods perform consistently 
better than image warping in several unadulterated 
indoor environments. A drawback to the optic flow 
homing methods is that the robot is constrained to 
move on a single plane. The authors do not provide 
a way to extend their algorithm to three dimensional 
visual homing. 

Surfing the Difference Surface 

Zeil et al. (2003) describe a property of natural scenes 
which can be exploited for visual homing: as the Euclidean 
distance between S and C increases, the pixel-by-pixel 
root mean square (RMS) difference between I_S 
and I_C increases smoothly and monotonically. Labrosse 
and Mitchell discovered this phenomenon as well; see 
Mitchell & Labrosse (2004). Zeil et al. reported that 
the increase in the RMS signal was discernible from 
noise up to about three meters from S in their outdoor 
test environment; they call this region the catchment 
area. 

RMS, when evaluated at locations in a subset of 
the plane surrounding S, forms a mathematical surface, 
the difference surface. A sample difference surface is 
shown in Figure 2(a) (see caption for details). 
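
The RMS measure and the difference surface are simple to compute. The following sketch assumes grayscale images stored as equally sized lists of rows and a dictionary mapping (x, y) positions to the image captured there. Both representations and the function names are assumptions made for illustration.

```python
import math


def rms_difference(image_a, image_b):
    """Pixel-by-pixel root mean square difference between two equal-sized images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return math.sqrt(total / count)


def difference_surface(snapshot, images_by_position):
    """RMS difference to the snapshot, evaluated at every sampled position."""
    return {
        position: rms_difference(snapshot, image)
        for position, image in images_by_position.items()
    }
```

Under unchanged lighting, the position with the smallest value in the returned dictionary should be the goal location, where the difference is zero.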

Figure 2. Two difference surfaces formed using the RMS image similarity measure. Both the surfaces and 
their contours are shown. In each case, the snapshot I_S was captured at x=150cm, y=150cm in a laboratory 
environment. (a) The snapshot was captured in the same illumination conditions as all other images. Notice 
the global minimum at the goal location and the absence of local minima. (b) Here we use the same snapshot 
image as in (a) but the lighting source has changed in all other images. The global minimum no longer appears 
at the goal location. When different goal locations were used, we observed qualitatively similar disturbances in 
the difference surfaces formed. The images used were taken from a database provided by Andrew Vardy which 
is described in Vardy & Moller (2005). 

Zeil et al. describe a simple algorithm to home using 
the RMS difference surface. Their "Run-Down" algorithm 
directs the robot to move in its current direction 
while periodically sampling the RMS signal. When the 
current sample is greater than the previous, the robot 
is made to stop and turn ninety degrees (clockwise or 
counter-clockwise, it does not matter). It then repeats 
the process in this new direction. The agent stops when 
the RMS signal falls below a pre-determined threshold. 
We have explored a biologically inspired difference 
surface homing method which was more successful 
than "Run-Down" in certain situations (Zampoglou 
et al. (2006)). 
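
The Run-Down procedure can be simulated as below. The sketch is an illustration only: the RMS signal is supplied by a caller-provided function `rms_at(x, y)`, which in a real system would compare the current image with the snapshot, and the step length, threshold and clockwise turn are arbitrary choices for the example.

```python
import math


def run_down(start, heading, rms_at, step=0.25, threshold=0.3, max_steps=1000):
    """Move straight while the RMS signal falls; turn 90 degrees when it rises;
    stop once the signal drops below the threshold."""
    x, y = start
    hx, hy = heading                 # unit heading, e.g. (1, 0)
    previous = rms_at(x, y)
    path = [(x, y)]
    for _ in range(max_steps):
        if previous < threshold:
            break
        x, y = x + step * hx, y + step * hy
        path.append((x, y))
        current = rms_at(x, y)
        if current > previous:
            hx, hy = hy, -hx         # turn ninety degrees clockwise
        previous = current
    return path
```

Using the distance to the origin as a synthetic, monotonic stand-in for the RMS signal, `run_down((3.0, 2.0), (1, 0), lambda x, y: math.hypot(x, y))` follows a staircase path that ends within the threshold of the goal. The threshold has to be somewhat larger than the step length, otherwise the agent can circle the goal without ever sampling a value below it.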

Unlike the optic flow methods described in the 
previous section, visual homing by optimising the dif- 
ference surface is easily extensible to three dimensions 
(Zeil et al. (2003)). 

Unfortunately, when lighting conditions change 
between capture of I_S and I_C, the minimum of the 
RMS difference surface often fails to coincide with S, 
making homing impossible (Figure 2(b)). 



FUTURE TRENDS 

No work has yet been published comparing the efficacy 
of the image-based homing algorithms described above. 
This would seem the logical next step for image-based 
homing researchers. As we mentioned in the section 
titled "Surfing the Difference Surface," the difference 
surface is disrupted by changes in lighting between 
captures of I_S and I_C. This problem obviously demands 
a solution and is a focus of our current research. Finally, 
it would be interesting to compare standard map-based 
navigation algorithms with the image-based visual 
homing methods presented here. 



CONCLUSION 

Visual homing algorithms - unlike most of the naviga- 
tion algorithms found in the robotics literature - do not 
require a detailed map of their environment. This is 
because they make no attempt to explicitly infer their 
location with respect to the goal. These algorithms 
instead infer the home vector from the discrepancy be- 
tween a stored snapshot image taken at the goal position 
and an image captured at their current location. 

We reviewed two types of visual homing algorithms: 
feature-based and image-based. We argued that image-based 
algorithms are preferable because they make 
no attempt to solve the tough problems of consistent 
feature extraction and correspondence - solutions to 
which feature-based algorithms demand. Of the three 
image-based algorithms reviewed, image warping is 
probably not practicable due to the computationally 
demanding brute force search required. Work is required 
to determine which of the two remaining image-based 
algorithms is more effective for robot homing in real-world 
environments. 



KEY TERMS 

Catchment Area: The area from which a goal 
location is reachable using a particular navigation 
algorithm. 

Correspondence Problem: The problem of pairing 
an imaged feature extracted from one image with the 
same imaged feature extracted from a second image. 
The images may have been taken from different loca- 
tions, changing the appearance of the features. 

Image-based Visual Homing: Visual homing (see 
definition below) in which the home vector is estimated 
from the whole-image disparity between snapshot and 
current images. No feature extraction or correspondence 
is required. 

Feature Extraction Problem: The problem of 
extracting the same imaged features from two images 
taken from (potentially) different locations. 

Navigation: The process of determining and main- 
taining a course or trajectory to a goal location. 

Optic Flow: The perceived movement of objects 
due to viewer translation and/or rotation. 

Snapshot Image: In the visual homing literature, 
this is the image captured at the goal location. 

Visual Homing: A method of navigating in which 
the relative location of the goal is inferred by compar- 
ing an image taken at the goal with the current image. 
No landmark map is required. 



INTRODUCTION 

Real world optimization problems are often too complex 
to be solved through analytic means. Evolutionary algorithms 
are a class of algorithms that borrow paradigms 
from nature to address them. These are stochastic 
methods of optimization that maintain a population 
of individual solutions, which correspond to points 
in the search space of the problem. These algorithms 
have been immensely popular as they are derivative- 
free techniques, are not as prone to getting trapped in 
local minima, and can be tailored specifically to suit 
any given problem. The performance of evolutionary 
algorithms can be improved further by adding a local 
search component to them. The Nelder-Mead simplex 
algorithm (Nelder & Mead, 1965) is a simple local 
search algorithm that has been routinely applied to 
improve the search process in evolutionary algorithms, 
and such a strategy has met with great success. 

In this article, we provide an overview of the various 
strategies that have been adopted to hybridize two well- 
known evolutionary algorithms - genetic algorithms 
(GA) and particle swarm optimization (PSO). 



BACKGROUND 

Arguably, GAs are among the most common 
population-based approaches for optimization. The 
candidate solutions that these algorithms 
maintain in each generation are called chromosomes. 
GAs carry out the Darwinian operators of selection, 
mutation, and recombination, on these chromosomes, 
to perform their search (Mitchell, 1998). Each genera- 
tion is improved by removing the poorer solutions from 
the population, while retaining the better ones, based 
on a fitness measure. This process is called selection. 
Following selection, a method of recombining solu- 
tions called crossover is applied. Here two (or more) 
parent solutions from the current generation are picked 
randomly for producing offspring to populate the next 
generation of solutions. The offspring chromosomes 
are then probabilistically subject to mutation, which 
is carried out by the addition of small random perturbations. 
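
One generation of such a GA can be sketched for real-valued chromosomes. The details below are assumptions chosen to keep the example short: truncation selection that keeps the better half, uniform crossover between two randomly picked surviving parents, Gaussian mutation applied gene by gene, and survivors being carried over alongside their offspring. The function name `evolve` is hypothetical.

```python
import random


def evolve(population, fitness, rng, mutation_rate=0.2, sigma=0.1):
    """Produce the next generation: selection, crossover, then mutation.
    Expects at least four chromosomes so that two parents can be sampled."""
    # Selection: retain the better half according to the fitness measure.
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]

    offspring = []
    while len(survivors) + len(offspring) < len(population):
        # Crossover: each gene is copied from one of two random parents.
        parent_a, parent_b = rng.sample(survivors, 2)
        child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
        # Mutation: small random perturbations, applied probabilistically.
        child = [
            gene + rng.gauss(0.0, sigma) if rng.random() < mutation_rate else gene
            for gene in child
        ]
        offspring.append(child)
    return survivors + offspring
```

Because the survivors are kept unchanged, the best fitness in the population can never decrease from one generation to the next in this sketch.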

PSO is a more recent approach for optimization 
(Kennedy & Eberhart, 2001). Being modelled after the 
social behavior of organisms such as a flock of birds 
in flight or a school of fish swimming, it is considered 
an evolutionary algorithm only in a loose sense. Each 
solution within the population is called a particle in 
PSO. Each such particle's position in the search space 
is constantly updated within each generation, by the 
addition of the particle's velocity to it. The velocity 
of a particle is then adjusted towards the best position 
encountered in the particle's own history (individual 
best), as well as the best position in the current itera- 
tion (global best). 
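
The position and velocity updates can be sketched as follows. The inertia weight and the two acceleration coefficients are conventional PSO parameters that the text does not specify, so their names and default values here are assumptions, as is the function name `pso_step`.

```python
import random


def pso_step(positions, velocities, personal_best, global_best, rng,
             inertia=0.7, c_personal=1.5, c_global=1.5):
    """One PSO generation: move each particle by its velocity, then pull the
    velocity towards the particle's own best and the global best position."""
    new_positions, new_velocities = [], []
    for x, v, p_best in zip(positions, velocities, personal_best):
        # Position update: add the particle's velocity to its position.
        moved = [xj + vj for xj, vj in zip(x, v)]
        # Velocity update: random-weighted attraction to both best positions.
        adjusted = [
            inertia * vj
            + c_personal * rng.random() * (pbj - xj)
            + c_global * rng.random() * (gbj - xj)
            for xj, vj, pbj, gbj in zip(moved, v, p_best, global_best)
        ]
        new_positions.append(moved)
        new_velocities.append(adjusted)
    return new_positions, new_velocities
```

The caller is expected to re-evaluate the fitness of the new positions and refresh `personal_best` and `global_best` before the next call.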

Since evolutionary algorithms use a population of 
individuals and randomized variational operators, they 
are adept at performing exploratory searches over their 
search spaces. However, when the aim is to produce 
outputs within reasonable time limits, it is important 
to balance this exploration with better exploitation of 
smaller-scale features in the fitness landscape. In the 
latter context, local search algorithms enable single 
solutions to be improved using local information (e.g., 
directional trends in fitness around each solution) and 
take the solution towards the closest maximum fitness. 
Hybrid algorithms that combine the advantages of exploration 
and exploitation constitute a distinct area of 
evolutionary computation research, variously 
called Lamarckian or memetic approaches, of 
which Nelder-Mead hybrids form a significant part. 



NELDER-MEAD SIMPLEX BASED 
HYBRIDIZATION 

The Nelder-Mead Downhill Simplex 
Algorithm 

The Nelder-Mead simplex algorithm is a derivative-free 
local search technique that is capable of moving a cluster 
of solutions in the gradient direction and which, as per 
current research, can be very effectively combined with 
GA and PSO approaches. These hybrid evolutionary 
algorithms have been shown to be very successful in 
continuous optimization problems. 

Figure 1. Various operations in the Nelder-Mead simplex routine 
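
The simplex operations covered in this section (reflection, expansion, contraction and collapse) can be sketched as a single step function for fitness maximisation. The coefficients used below (1 for reflection, 2 for expansion, 0.5 for contraction and collapse) are the commonly used defaults rather than values given in the text, and the acceptance rule for contraction is simplified, so this should be read as an assumption-laden sketch rather than a reference implementation.

```python
def nelder_mead_step(simplex, fitness):
    """Apply one Nelder-Mead step to a list of n+1 points (higher fitness is better)."""
    points = sorted(simplex, key=fitness, reverse=True)   # best first, worst last
    best, worst = points[0], points[-1]
    others = points[:-1]
    n = len(others)
    centroid = [sum(p[j] for p in others) / n for j in range(len(worst))]

    def along(t):
        """Point centroid + t * (centroid - worst)."""
        return [cj + t * (cj - wj) for cj, wj in zip(centroid, worst)]

    reflected = along(1.0)
    f_reflected = fitness(reflected)

    if f_reflected > fitness(best):
        # Reflection beat every point: try expanding further in that direction.
        expanded = along(2.0)
        new_point = expanded if fitness(expanded) > f_reflected else reflected
    elif f_reflected > fitness(others[-1]):
        new_point = reflected
    else:
        # Outward contraction if the reflection beats the worst, inward otherwise.
        contracted = along(0.5) if f_reflected > fitness(worst) else along(-0.5)
        if fitness(contracted) > fitness(worst):
            new_point = contracted
        else:
            # Collapse: move every point halfway towards the best one.
            return [best] + [
                [bj + 0.5 * (pj - bj) for pj, bj in zip(p, best)]
                for p in points[1:]
            ]
    return others + [new_point]
```

In a hybrid scheme, a function like this would be called a fixed number of times on a group of n+1 solutions taken from the evolutionary population. The best point of the simplex is never discarded, so the best fitness cannot get worse from step to step.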

The Nelder-Mead simplex method makes use of a 
construct called a simplex (see Figure 1). When the 
search space is n-dimensional, the simplex consists of 
n+1 solutions, s_i, i = 1, 2, ..., n+1, that are usually 
closely spaced. As shown in the top left of Figure 1, in 
a two-dimensional search plane, a simplex is a triangle. 
The fitness of each solution is considered in each step 
of the Nelder-Mead method, and the worst solution w is 
identified. The centroid, c, of the remaining n points 
is computed and the reflection of w along it determined. 
This reflection yields a new solution r that replaces w 
in the next step, as shown in the top right of Figure 
1. If the solution r produced by this reflection has a 
higher fitness than any other solution in the simplex, 
the simplex is further expanded along the direction of 
r, as shown in the middle left of the figure. On the other 
hand, if r has a low fitness compared to the others, the 
simplex is contracted. Contraction can be either outward 
or inward depending upon whether r is better or worse 
than w. The contraction operations are shown in the 
middle right and bottom left of the figure. If neither 
contraction improves the worst solution in the simplex, 
the best point in the simplex is computed, and a collapse 
is then carried out, and all the points of the simplex are 
moved a little closer towards the best one, as shown in 
the bottom right of the same figure. 

The approaches taken to incorporate a simplex-based 
local search routine within the broad framework of a 
genetic algorithm fall under four different schemes that 
are shown in Figure 2. These are as follows: 

Two-Phase Hybridization 



This is the simplest of all approaches and has been 
applied to GAs (Chelouah & Siarry, 2000, Chelouah 
& Siarry, 2003, Robin, Orzati, Moreno, Homan & 
Bachtold, 2003). In the first phase in this scheme, a GA 
is applied to the optimization problem to explore the 
entire search space until one or more good solutions 






Nelder-Mead Evolutionary Hybrid Algorithms 



are found, which can no longer be improved through 
the random operations of crossover and mutation. The 
Nelder-Mead simplex algorithm is then invoked in the 
second phase to further improve the solutions by allowing 
them to ascend towards their local maxima. In 
another approach, the initial points of the simplex are 
obtained by taking the solution with the best fitness 
given by the GA, and then generating the remaining n 
points around it (Chelouah & Siarry, 2000; Chelouah 
& Siarry, 2003). 
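
The simplex operations described earlier (reflection, expansion, contraction and collapse) and the second-phase start from the best GA solution can be sketched as follows. This is only an illustrative sketch for a maximization problem: the coefficients (1, 2, 0.5), the axis-offset construction in `simplex_around`, and all function names are assumptions, not details taken from the cited papers.

```python
from typing import Callable, List, Sequence

Point = List[float]


def simplex_around(best: Point, step: float) -> List[Point]:
    """Assumed construction: the best point plus one axis-offset point per dimension."""
    simplex = [list(best)]
    for i in range(len(best)):
        p = list(best)
        p[i] += step
        simplex.append(p)
    return simplex


def _along(c: Point, w: Point, t: float) -> Point:
    """Return c + t * (c - w); t = 1 reflects w through c, t = 2 expands."""
    return [ci + t * (ci - wi) for ci, wi in zip(c, w)]


def nelder_mead_step(simplex: Sequence[Point],
                     fitness: Callable[[Point], float]) -> List[Point]:
    """One maximizing Nelder-Mead iteration on a simplex of n+1 points."""
    pts = sorted(simplex, key=fitness, reverse=True)  # best first, worst last
    best, w, rest = pts[0], pts[-1], pts[:-1]
    # Centroid of all points except the worst one.
    c = [sum(p[i] for p in rest) / len(rest) for i in range(len(w))]
    r = _along(c, w, 1.0)                 # reflection of w through the centroid
    fr = fitness(r)
    if fr > fitness(best):                # r beats every point: try expansion
        e = _along(c, w, 2.0)
        return rest + [e if fitness(e) > fr else r]
    if fr > fitness(rest[-1]):            # r beats the second-worst: keep it
        return rest + [r]
    if fr > fitness(w):                   # outward contraction (r better than w)
        k = _along(c, w, 0.5)
        accept = fitness(k) >= fr
    else:                                 # inward contraction (r worse than w)
        k = _along(c, w, -0.5)
        accept = fitness(k) > fitness(w)
    if accept:
        return rest + [k]
    # Collapse: move every point halfway towards the best one.
    return [[b + 0.5 * (x - b) for b, x in zip(best, p)] for p in pts]
```

In a two-phase hybrid, `simplex_around` would be seeded with the best solution returned by the GA, and `nelder_mead_step` would then be iterated until the simplex stops improving.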

Serial Hybridization 

In this scheme, the solutions in each generation are 
subject to the usual operators of the main evolutionary 
algorithm as well as one or more steps of the 
Nelder-Mead simplex method. It has been successfully 
applied to hybridize GAs (Renders & Flasse, 1998; 
Yang & Douglas, 1998; Durand & Alliot, 1999; Guo 
& Shouyi, 2003; Trabia, 2004). This method has also 
been used in conjunction with PSO by Das et al. (Das, 
Koduru, Welch, Gui, Cochran, Wareing & Babin, 2006; 
Koduru, Welch & Das, 2007). In each generation, following 
the position and velocity updates, the population is 
clustered into distinct clusters of n+1 solutions each, 
and a few steps of the Nelder-Mead algorithm are applied 
separately to each cluster. The Nelder-Mead step is 
applied a fixed number of times per generation. 

The serial hybridization scheme has also been successfully 
implemented within a multi-objective optimization 
framework (Koduru, Das & Welch, 2007). Instead 
of fitness, a metric called fuzzy dominance is applied 
to discriminate between the n+1 solutions within a 
simplex. A solution that is not dominated by any other 
is assigned a fuzzy dominance of zero. The poorer a 
solution is, the higher the fuzzy dominance value it is 
assigned. 

Parallel Hybridization 

Such hybridization approaches assemble the offspring 
generation from the parent generation in two parallel 
tracks. The standard evolutionary algorithm operators 
are used to generate some of the offspring, while oth- 
ers are generated using the simplex algorithm. This 
strategy is applied to hybridize GAs (Yen, Liao, Lee 
& Randolph, 1998; Koduru, Das, Welch & Roe, 2004; 
Koduru, Das, Welch, Roe & Lopez-Dee, 2005). 

Figure 2. Four different GA-simplex hybridization strategies 

In these 
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approaches, the best n+1 solutions (called elites) of each 
generation are picked to be improved further, using the 
Nelder-Mead simplex method. In another strategy, a 
probabilistic variant of the Nelder-Mead approach is 
used, where the amount of contraction and/or expansion 
of the simplex is determined randomly, but within 
specific limits (Yen, Liao, Lee & Randolph, 1998). 
The approaches taken in (Koduru, Das, Welch & Roe, 
2004) and (Koduru, Das, Welch, Roe & Lopez-Dee, 
2005) are multi-objective implementations that make 
use of the fuzzy dominance metric discussed earlier 
to identify the best and worst solutions. In order to 
preserve solution diversity within the population, the 
collapse operation is never used; when the need for one 
arises, the Nelder-Mead routine is terminated instead 
for that generation. 

This scheme has also been used with PSO (Fan, Liang 
& Zahara, 2004; Zahara, Fan & Tsai, 2005). As before, 
only the best n+1 points of the population are picked to 
undergo improvement using the Nelder-Mead simplex 
method. The remaining solutions in each generation 
are obtained using standard PSO position and velocity 
updates. 

Implicit Hybridization 

Here, the Nelder-Mead simplex algorithm is not applied 
directly. Instead, the approach is buried implicitly within 
one of the evolutionary algorithm's generic operators. 
One simple technique in GAs is the multi-parent 
simplex-based crossover (Renders & Bersini, 1994). This 
method applies a single Nelder-Mead step to produce 
a new offspring. Novel crossover techniques have also 
been suggested (Bersini, 2002). In another method, each 
simplex is encoded as a chromosome and the algorithm 
uses a specially devised multi-parent crossover within 
the GA (Hedar & Fukushima, 2003). 

Das et al. (Das, Koduru, Welch, Gui, Cochran, 
Wareing & Babin, 2006) use implicit hybridization in 
PSO by adding a term to the velocity of each particle 
that allows the latter to reorient its trajectory towards 
the gradient direction sensed by the Nelder-Mead simplex 
(from the worst point towards the centroid). 
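
One way to picture such an extra velocity term is the sketch below. It is not the formulation of the cited paper: the weights `w`, `c1`, `c2`, `c3`, the random scaling of the simplex term, and both function names are assumptions chosen only to illustrate adding a worst-to-centroid direction to a standard PSO velocity update.

```python
import random
from typing import Callable, List, Sequence

Point = List[float]


def simplex_direction(simplex: Sequence[Point],
                      fitness: Callable[[Point], float]) -> Point:
    """Vector from the worst simplex point to the centroid of the others."""
    pts = sorted(simplex, key=fitness)        # worst first (maximization)
    worst, rest = pts[0], pts[1:]
    centroid = [sum(p[i] for p in rest) / len(rest) for i in range(len(worst))]
    return [ci - wi for ci, wi in zip(centroid, worst)]


def hybrid_velocity(v: Point, x: Point, pbest: Point, gbest: Point,
                    direction: Point, w: float = 0.7, c1: float = 1.5,
                    c2: float = 1.5, c3: float = 0.5,
                    rng: random.Random = random.Random(0)) -> Point:
    """Standard PSO velocity update plus an assumed simplex-direction term."""
    r1, r2, r3 = rng.random(), rng.random(), rng.random()
    return [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi) + c3 * r3 * di
            for vi, xi, pb, gb, di in zip(v, x, pbest, gbest, direction)]
```

With `c3 = 0` the update reduces to the usual inertia, cognitive and social terms, so the simplex influence can be tuned independently.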



FUTURE TRENDS 

Most practical problems in engineering are inherently 
multi-objective in nature. Consequently, multi-objective 
evolutionary optimization is a relatively new, emerging 
direction of evolutionary computation research. Perhaps 
the only attempts at incorporating Nelder-Mead simplex 
as an additional operator within a GA have been 
reported by Koduru, Das & Welch (cf. Koduru, Das, 
Welch & Roe, 2004). Clearly more research is required 
in this direction, and as multi-objective algorithms 
become more common, Nelder-Mead strategies will 
be investigated more vigorously. 

PSO is a new technique for evolutionary optimization. 
Research into PSO-based hybrid algorithms has 
only recently begun to make its appearance. A few 
limited approaches have been suggested to hybridize 
PSO with Nelder-Mead simplex by Das et al. (Das, 
Koduru, Welch, Gui, Cochran, Wareing & Babin, 2006) 
and Zahara et al. (Zahara, Fan & Tsai, 2005). The 
method suggested in (Koduru, Das & Welch, 2007) is, to 
the best of the author's knowledge, the only attempt 
at producing a multi-objective PSO hybrid algorithm. 
Here again, further investigation is necessary. 

Although research into these evolutionary hybrid 
algorithms is over a decade old, with several good 
approaches having been suggested, there is no clear 
consensus about which approach is best suited for any 
given application. More research in this direction is 
warranted to obtain further insights into the performance 
of these algorithms. 



CONCLUSION 

In the literature on evolutionary optimization, many 
effective approaches have been proposed to hybridize 
GAs with Nelder-Mead simplex. More recently, 
researchers have begun implementing similar ideas 
within PSO also. A few papers on multi-objective 
hybrid approaches have been published. However, a 
formal framework to categorize all these approaches 
has so far been lacking. This chapter surveys the various 
methods and proposes a way to organize them into 
four distinct categories. 

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KEY TERMS 

Dominance: A relationship between solutions in 
a multi-objective optimization problem. A solution 
dominates another if and only if it is equal to or better 
than the latter in all the objectives and strictly better 
in at least one. 

Evolutionary Algorithm: A class of probabilistic 
algorithms that are based upon biological metaphors 
such as Darwinian evolution, and widely used in op- 
timization. 

Exploration: A strategy that samples the fitness 
landscape extensively to obtain good regions. 

Exploitation: A greedy strategy that seeks to im- 
prove one or more solutions to an optimization problem 
to take it to a maximum in its vicinity. 

Fitness: A measure to determine the goodness of 
a solution to an optimization problem. When a single 
objective is to be maximized, the fitness is either 
equal to the objective or a monotonically increasing 
function of it. 

Fitness Landscape: A representation of the search 
space of an optimization problem that brings out the 
differences in the fitness of the solutions, such that those 
with good fitness are "higher". Optimal solutions are 
the maxima of the fitness landscape. 



Generation: A term used in evolutionary algo- 
rithms that corresponds to an iteration of the outermost 
loop. 

Local Search: A search algorithm to carry out 
exploitation. 

Multi-Objective Optimization: An optimization 
problem involving more than a single objective func- 
tion. In such a setting, it is not easy to discriminate 
between good and bad solutions, as a solution that is 
better than another in one objective may be poorer in 
another. Without any loss of generality, each objec- 
tive function can be considered to be one involving 
maximization. 

Population-Based Algorithm: An algorithm 
that maintains an entire set of candidate solutions 
for an optimization problem. 

Search Space: Set of all possible solutions for any 
given optimization problem, in which one can usually 
define a neighborhood of any solution. 
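
The Dominance entry above can be made concrete with a small check. The sketch assumes, as the Multi-Objective Optimization entry does, that every objective is to be maximized; the function names are illustrative only.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def non_dominated(solutions: Sequence[Objectives]) -> List[Objectives]:
    """Solutions that no other solution in the set dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]
```

Two solutions that each win on a different objective do not dominate one another, which is why a single fitness ranking is not directly available in the multi-objective case.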
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INTRODUCTION 

Neural networks have been used in a number of robotic 
applications (Das & Kar, 2006; Fierro & Lewis, 1998), 
including both manipulators and mobile robots. A typical 
approach is to use neural networks for nonlinear 
system modelling, including for instance the learning 
of forward and inverse models of a plant, noise cancellation, 
and other forms of nonlinear control (Fierro & 
Lewis, 1998). 

An alternative approach is to solve a particular 
problem by designing a specialized neural network 
architecture and/or learning rule (Sutton & Barto, 1981). 
It is clear that biological brains, though exhibiting a 
certain degree of homogeneity, rely on many specialized 
circuits designed to solve particular problems. 

We are interested in understanding how animals 
are able to solve complex problems such as learning 
to navigate in an unknown environment, with the aim 
of applying what is learned of biology to the control 
of robots (Chang & Gaudiano, 1998; Martinez-Marin, 
2007; Montes-Gonzalez, Santos-Reyes & Rios- 
Figueroa, 2006). 

In particular, this article presents a neural architecture 
that integrates a kinematical adaptive neuro-controller 
for trajectory tracking and an obstacle avoidance adaptive 
neuro-controller for nonholonomic mobile robots. The 
kinematical adaptive neuro-controller is a real-time, 
unsupervised neural network that learns to control a 
nonholonomic mobile robot in a nonstationary environment. 
It is termed the Self-Organization Direction Mapping 
Network (SODMN), and it combines associative learning 
and Vector Associative Map (VAM) learning to generate 
transformations between spatial and velocity 
coordinates (Garcia-Cordova, Guerrero-Gonzalez & 
Garcia-Marin, 2007). The transformations are learned 
in an unsupervised training phase, during which the 
robot moves as a result of randomly selected wheel 
velocities. The obstacle avoidance adaptive neuro-controller 
is a neural network that learns to control 
avoidance behaviours in a mobile robot based on a 
form of animal learning known as operant conditioning. 
Learning, which requires no supervision, takes place as 
the robot moves around an environment cluttered with 
obstacles. The neural network requires no knowledge 
of the geometry of the robot or of the quality, number, 
or configuration of the robot's sensors. The efficacy of 
the proposed neural architecture is tested experimentally 
on a differentially driven mobile robot. 



BACKGROUND 

Several heuristic approaches based on neural networks 
(NNs) have been proposed for identification and adap- 
tive control of nonlinear dynamic systems (Fierro & 
Lewis, 1998; Pardo-Ayala & Angulo-Bahon, 2007). 

In wheeled mobile robots (WMR), the trajectory-tracking 
problem with exponential convergence has 
been solved theoretically using time-varying state 
feedback based on the backstepping technique (Ping & 
Nijmeijer, 1997; Das & Kar, 2006). Dynamic feedback 
linearization has been used for trajectory tracking and 
posture stabilization of mobile robot systems in chained 
form (Oriolo, Luca & Vendittelli, 2002). 

The study of autonomous behaviour has become an 
active research area in the field of robotics. Even the 
simplest organisms are capable of behavioural feats un- 
imaginable for the most sophisticated machines. When 
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an animal has to operate in an unknown environment 
it must somehow learn to predict the consequences 
of its own actions. Biological organisms are a clear 
example that this sort of learning is possible in spite 
of what, from an engineering standpoint, seem to be 
insurmountable difficulties: noisy sensors, unknown 
kinematics and dynamics, nonstationary statistics, and 
so on. A related form of learning is known as operant 
conditioning (Grossberg, 1971). Chang and Gaudiano 
(1998) introduce a neural network for obstacle avoid- 
ance that is based on a model of classical and operant 
conditioning. 

Psychologists have identified classical and operant 
conditioning as two primary forms of learning that 
enable animals to acquire the causal structure of their 
environment. In the classical conditioning paradigm, 
learning occurs by repeated association of a Condi- 
tioned Stimulus (CS), which normally has no particular 
significance for an animal, with an Unconditioned 
Stimulus (UCS), which has significance for an animal 
and always gives rise to an Unconditioned Response 
(UCR). The response that comes to be elicited by the 
CS after classical conditioning is known as the Con- 
ditioned Response (CR) (Grossberg & Levine, 1987). 
Hence, classical conditioning is the putative learning 
process that enables animals to recognize informative 
stimuli in the environment. 

In the case of operant conditioning, an animal learns 
the consequences of its actions. More specifically, the 
animal learns to exhibit more frequently a behaviour 
that has led to reward in the past, and to exhibit less 
frequently a behaviour that led to punishment. 

In the field of neural networks research, it is often 
suggested that neural networks based on associative 
learning laws can model the mechanisms of classical 
conditioning, while neural networks based on rein- 
forcement learning laws can model the mechanisms of 
operant conditioning (Chang & Gaudiano, 1998). 

Reinforcement learning is used to acquire navigation 
skills for autonomous vehicles, and it updates both 
the vehicle model and the optimal behaviour at the same 
time (Galindo, Gonzalez & Fernandez-Madrigal, 2006; 
Lamiraux & Laumond, 2001; Galindo, Fernandez-Madrigal 
& Gonzalez, 2007). 

In this article, we propose a neurobiologically in- 
spired neural architecture to show how an organism, 
in this case a robot, can learn without supervision to 
recognize simple stimuli in its environment and to as- 
sociate them with different actions. 



ARCHITECTURE OF THE NEURAL 
CONTROL SYSTEM 

Figure 1(a) illustrates our proposed neural architecture. 
Trajectory tracking control in the absence of obstacles is 
implemented by the SODMN, while a biologically inspired 
neural network implements the obstacle avoidance 
behaviour. 

Self-Organization Direction Mapping 
Network (SODMN) 

The transformation of spatial directions to wheel angular 
velocities is expressed as a linear mapping and 
is shown in Fig. 1(b). The spatial error is computed to 
get a spatial direction vector (DVs). The DVs is transformed 
by the direction mapping network elements $V_{ik}$ 
into the corresponding motor direction vector (DVm). In 
addition, a set of tonically active inhibitory cells, 
which receive broad-based inputs that determine the 
context of a motor action, was implemented as a context 
field. The context field selects the $V_{ik}$ elements based 
on the wheel angular velocity configuration. 

A speed-control GO signal acts as a non-specific 
multiplicative gate and controls the movement's overall 
speed. The GO signal is an input from a decision centre 
in the brain, and starts at zero before movement and 
then grows smoothly to a positive value as the move- 
ment develops. During the learning, the GO signal is 
inactive. 

Activities of cells of the DVs and DVm are represented 
in the neural network by quantities $(S_1, S_2, \ldots, S_m)$ 
and $(R_1, R_2, \ldots, R_n)$, respectively. The direction 
mapping is formed with a field of cells with activities 
$V_{ik}$. Each $V_{ik}$ cell receives the complete set of spatial 
inputs $S_j$, $j = 1, \ldots, m$, but connects to only one $R_i$ cell. 
The direction mapping cells ($V \in \mathbb{R}^{n \times k}$) compute a 
difference of activity between the spatial and motor direction 
vectors via feedback from DVm. During learning, this 
difference drives the adjustment of the weights. During 
performance, the difference drives DVm activity to the 
value encoded in the learned mapping. 

A context field cell pauses when it recognizes a 
particular velocity state (i.e., a velocity configuration) 
on its inputs, and thereby disinhibits its target cells. 
The target cells (direction mapping cells) are com- 
pletely shut off when their context cells are active (see 
Fig. 1(b)). Each context field cell projects to a set of 
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Figure 1. (a) Neural architecture for reactive and adaptive navigation of a mobile robot, (b) Self-organization 
direction mapping network for the trajectory tracking of a mobile robot. 
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direction mapping cells, one for each velocity vector 
component. Each velocity vector component has a set 
of direction mapping cells associated with it, one for 
each context. A cell is "off" for a compact region of the 
velocity space. It is assumed for simplicity that only 
one context field cell turns "off" at a time. The centre 
context field cell is "off" when the angular velocities 
are in the centre region of the velocity space. The "off" 
context cell enables a subset of direction mapping cells 
through the inhibition variable $c_k$, while "on" context 
cells disable the other subsets. 

Learning is obtained by decreasing weights in 
proportion to the product of the presynaptic and postsynaptic 
activities (Gaudiano & Grossberg, 1991). The 
training is done by generating random movements, and 
by using the resulting angular velocities and observed 
spatial velocities of the mobile robot as training vectors 
to the direction mapping network. 
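
The idea of learning a linear direction mapping from random movements can be illustrated with a deliberately simplified sketch. It is not the SODMN itself: it uses a single context (no context field), a plain delta-rule update in which each weight decreases in proportion to the product of its input and the cell's difference activity, and hypothetical robot dimensions (`WHEEL_RADIUS`, `AXLE_LENGTH`), learning rate and function names.

```python
import random
from typing import List

WHEEL_RADIUS = 0.1   # metres, hypothetical value
AXLE_LENGTH = 0.4    # metres, hypothetical value


def spatial_velocity(wl: float, wr: float) -> List[float]:
    """Forward and turning velocity of a differential-drive robot."""
    v = WHEEL_RADIUS * (wr + wl) / 2.0
    omega = WHEEL_RADIUS * (wr - wl) / AXLE_LENGTH
    return [v, omega]


def train_direction_mapping(steps: int = 20000, rate: float = 2.0,
                            seed: int = 0) -> List[List[float]]:
    """Learn V such that V applied to (v, omega) approximates (wl, wr)."""
    rng = random.Random(seed)
    V = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        # Random movement: pick wheel velocities, observe the spatial velocity.
        R = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        S = spatial_velocity(R[0], R[1])
        for i in range(2):
            # Difference between the mapped spatial input and the motor activity.
            d = sum(V[i][j] * S[j] for j in range(2)) - R[i]
            for j in range(2):
                # Decrease the weight in proportion to input times difference.
                V[i][j] -= rate * d * S[j]
    return V
```

For the dimensions assumed here, the learned matrix should approach the inverse kinematics of the platform, so that a desired spatial direction can be converted directly into left and right wheel velocities.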

Neural Network for the Avoidance 
Behaviour (NNAB) 

Grossberg proposed a model of classical and operant 
conditioning, which was designed to account for a 
variety of behavioural data on learning in vertebrates 
(Grossberg, 1971; Grossberg & Levine, 1987). Our 
implementation is based on Grossberg's conditioning 
circuit, follows closely that of Grossberg & 
Levine (1987) and Chang & Gaudiano (1998), and is 
shown in Figure 2. 

In this model the sensory cues (both CSs and UCS) 
are stored in Short Term Memory (STM) within the 
population labelled S, which includes competitive 
interactions to ensure that the most salient cues are 
contrast enhanced and stored in STM while less salient 
cues are suppressed. The population S is modelled as a 
recurrent competitive field in a simplified discrete-time 
version, which removes the inherent noise and efficiently 
normalizes and contrast-enhances the ultrasound 
sensor activations. In the present model, the CS nodes 
correspond to activation from the robot's ultrasound 
sensors. In the network, $I_i$ represents a sensor value 
that codes proximal objects with large values and 
distal objects with small values. The network requires 
no knowledge of the geometry of the mobile robot or 
the quality, number, or distribution of sensors over the 
robot's body. 
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Fig. 2. Neural Network for the avoidance behaviour 
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The drive node D corresponds to the Reward/ 
Punishment component of operant conditioning (an 
animal/robot learns the consequences of its own ac- 
tions). Learning can only occur when the drive node is 
active. Activation of drive node D is determined by the 
weighted sum of all the CS inputs, plus the UCS input, 
which is presumed to have a large, fixed connection 
strength. The drive node D is active when the robot 
collides with an obstacle. The unconditioned stimulus 
(UCS) in this case therefore corresponds to a collision detected 
by the mobile robot. The activation of the drive node 
and of the sensory nodes converges upon the popula- 
tion of polyvalent cells P. Polyvalent cells require the 
convergence of two types of inputs in order to become 
active. In particular, each polyvalent cell receives input 
from only one sensory node, and all polyvalent cells 
also receive input from the drive node D. 

Finally, the neurons ($x_{mi}$) represent the conditioned 
or unconditioned response and are thus connected 
to the motor system. The motor population consists of 
nodes (i.e., neurons) encoding desired angular veloci- 
ties of avoidance. When driving the robot, activation 
is distributed as a Gaussian centred on the desired 
angular velocity of avoidance. The use of a Gaussian 
leads to smooth transitions in angular velocity even 
with few nodes. 

The output of the angular velocity population is 
decomposed by the SODMN into left and right wheel 
angular velocities. A gain term can be used to specify 
the maximum possible velocity. In the NNAB, the proximity 
sensors initially do not propagate activity to 
the motor population because the initial weights are 
small or zero. The robot is trained by allowing it to 
make random movements in a cluttered environment. 
Whenever the robot collides with an obstacle during 
one of these movements (or comes very close to it), the 
nodes corresponding to the largest (closest) proximity 
sensor measurements just prior to the collision will 
be active. Activation of the drive node D allows two 
different kinds of learning to take place: the learning 
that couples sensory nodes (infrared or ultrasound) 
with the drive node (the collision), and the learning 
of the angular velocity pattern that existed just before 
the collision. 

The first type of learning follows an associative 
learning law with decay. The primary purpose of this 
learning scheme is to ensure that learning occurs only 
for those CS nodes that were active within some time 
window prior to the collision (UCS). The second type 
of learning, which is also of an associative type but 
inhibitory in nature, is used to map the sensor activa- 
tions to the angular velocity map. By using an inhibi- 
tory learning law, the polyvalent cell corresponding 
to each sensory node learns to generate a pattern of 
inhibition that matches the activity profile active at 
the time of collision. 

Once learning has occurred, the activation of the 
angular velocity map is given by two components (see 
Figure 3). An excitatory component, which is gener- 
ated directly by the sensory system, reflects the angular 
velocity required to reach a given target in the absence 
of obstacles. The second, inhibitory component, gener- 
ated by the conditioning model in response to sensed 
obstacles, moves the robot away from the obstacles 
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as a result of the activation of sensory signals in the 
conditioning circuit. 

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Figure 3. The positive Gaussian distribution represents the angular velocity without obstacles and the negative 
distribution represents activation from the conditioning circuit. The summation represents the angular velocity that will 
be used to drive the mobile robot. 
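
The combination of an excitatory and an inhibitory activation profile over the angular velocity map can be sketched as follows. The node spacing, the Gaussian width `sigma`, the `inhibition_gain` and the function names are all assumptions made for illustration; in the architecture described here the inhibitory profile is learned by the conditioning circuit rather than placed by hand.

```python
import math
from typing import List, Optional


def gaussian_profile(nodes: List[float], centre: float,
                     sigma: float, gain: float = 1.0) -> List[float]:
    """Activation of each angular velocity node for a Gaussian centred on `centre`."""
    return [gain * math.exp(-((w - centre) ** 2) / (2.0 * sigma ** 2))
            for w in nodes]


def select_angular_velocity(nodes: List[float], target_w: float,
                            obstacle_w: Optional[float] = None,
                            sigma: float = 0.3,
                            inhibition_gain: float = 1.5) -> float:
    """Sum the excitatory and inhibitory profiles and pick the most active node."""
    excitation = gaussian_profile(nodes, target_w, sigma)
    if obstacle_w is None:
        inhibition = [0.0] * len(nodes)
    else:
        inhibition = gaussian_profile(nodes, obstacle_w, sigma, inhibition_gain)
    total = [e - i for e, i in zip(excitation, inhibition)]
    winner = max(range(len(nodes)), key=lambda k: total[k])
    return nodes[winner]
```

Without an inhibitory profile the selected velocity is simply the node nearest the target direction; when an inhibitory profile overlaps the target, the peak of the summed activity shifts away from it.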



EXPERIMENTAL RESULTS 

The proposed control algorithm is implemented on 
a mobile robot from the Polytechnic University of 
Cartagena (UPCT) named "CHAMAN". The platform 
has two driving wheels (in the rear) mounted on 
the same axis and two passive supporting wheels (in 
front) of free orientation. The two driving wheels are 
independently driven by two DC motors to achieve 
the motion and orientation. 

High-level control algorithms (SODMN and NNAB) 
are written in VC++ and run with a sampling time of 
10 ms on a remote server (a Pentium IV processor). 
The lower level control layer is in charge of the execution 
of the high-level velocity commands. It consists 
of a Texas Instruments TMS320C6701 Digital Signal 
Processor (DSP). 

Figure 4 shows approach behaviours and the tracking 
of a trajectory by the mobile robot with respect to 
the reference trajectory. 

Figure 5 illustrates the mobile robot's performance 
in the presence of several obstacles. The mobile robot 
starts from the initial position labelled X and reaches a 
desired position. During the movements, whenever the 
mobile robot is approaching an obstacle, the inhibitory 
profile from the conditioning circuit (NNAB) changes 
the selected angular velocity and makes the mobile 
robot turn away from the obstacle. 


FUTURE TRENDS 

The trend in robot control systems is to understand 
and imitate the way that biological 
systems learn and evolve to solve complex problems 
in unknown environments. Simple animals (e.g., crabs, 
insects and scorpions) are studied to formalize 
robust neural models for robot locomotion 
systems. In humans, neural activity decoded from 
the cortical system tends to be applied to 
robotic prostheses for the control of movement. Neural 
networks and other biomimetic techniques, with an 
emphasis on navigation and control, are used to operate 
in real time with only minimal assumptions about the 
robots or the environment, and can learn, if needed, 
with little or no external supervision. 

The neural control system proposed in this article 
can be applied to underwater applications. In this case, 
sonar sensors will replace ultrasound sensors. The proposed 
neural architecture learns to carry out reactive 
and adaptive navigation in nonstationary environments. 
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Figure 4. Adaptive control by the SODMN. a) Approach behaviours. The symbol X indicates the start of the 
mobile robot and T indicates the desired reach, b) Tracking control of a desired trajectory, c) Real-time track- 
ing performance. 
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Figure 5. Trajectory followed by the mobile robot in presence of obstacles using the NNAB 
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CONCLUSION 

In this article, we have implemented a neural architecture 
for trajectory tracking and avoidance behaviours 
of a mobile robot. A biologically inspired neural network 
for spatial reaching and tracking has been developed. 
This neural network is implemented as a kinematical 
adaptive neuro-controller. The SODMN uses a context 
field for learning the direction mapping between spatial 
and angular velocity coordinates. The performance of 
this neural network has been successfully demonstrated 
in experimental results with the trajectory tracking and 
reaching of a mobile robot. The obstacle avoidance 
behaviours were implemented by a neural network that 
is based on a form of animal learning known as operant 
conditioning. The efficacy of the proposed neural network 
for avoidance behaviours was tested experimentally 
on a differentially driven mobile robot. 
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KEY TERMS 

Artificial Neural Network: A network of many 
simple processors ("units" or "neurons") that imitates 
a biological neural network. The units are connected 
by unidirectional communication channels, which 
carry numeric data. Neural networks can be trained 
to find nonlinear relationships in data, and are used in 
applications such as robotics, speech recognition, 
signal processing and medical diagnosis. 

Classical Conditioning: A form of associative 
learning that was first demonstrated by Ivan Pavlov. It is 
a type of learning in which a stimulus acquires 
the capacity to evoke a response that was originally 
evoked by another stimulus. 

Conditioned Response (CR): If the conditioned 
stimulus and the unconditioned stimulus are repeatedly 
paired, eventually the two stimuli become associated 
and the organism begins to produce a behavioral re- 
sponse to the conditioned stimulus. Then, the condi- 
tioned response is the learned response to the previously 
neutral stimulus. 



Conditioned Stimulus (CS): A previously 
neutral stimulus that, after becoming associated with 
the unconditioned stimulus, eventually comes to trigger 
a conditioned response. The neutral stimulus could be 
any event that does not result in an overt behavioral 
response from the organism under investigation. 

Operant Conditioning: The term "operant" refers 
to how an organism operates on the environment, and 
hence operant conditioning comes from how we respond 
to what is presented to us in our environment. 
Thus, operant conditioning is a form of associative 
learning through which an animal learns about the 
consequences of its behaviour. 

Unconditioned Response (UCR): The unlearned 
response that occurs naturally in response to the 
unconditioned stimulus. 

Unconditioned Stimulus (UCS): A stimulus that 
unconditionally, naturally, and automatically triggers 
an innate, often reflexive, response. 
For example, when you smell one 
of your favourite foods, you may immediately feel very 
hungry. In this example, the smell of the food is the 
unconditioned stimulus. 
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INTRODUCTION 

According to the World Health Organization 
(http://www.who.int/cancer/en), cancer is a leading cause of 
death worldwide. Of a total of 58 million deaths in 
2005, cancer accounted for 7.6 million (or 13%). 
The main types of cancer contributing to overall 
cancer mortality are i) lung (1.3 million deaths/year), 
ii) stomach (almost 1 million deaths/year), iii) liver 
(662,000 deaths/year), iv) colon (655,000 deaths/year) 
and v) breast (502,000 deaths/year). Among men the 
most frequent cancer types worldwide are (in order of 
number of global deaths): lung, stomach, liver, colorectal, 
oesophagus and prostate, while among women 
(in order of number of global deaths) they are: breast, 
lung, stomach, colorectal and cervical. 

Technological advancements in recent years are 
enabling the collection of large amounts of cancer-related 
data. In particular, in the field of Bioinformatics, 
high-throughput microarray gene experiments are 
possible, leading to an information explosion. This 
requires the development of data mining procedures 
that speed up the process of scientific discovery and 
the in-depth understanding of the internal structure 
of the data. This is crucial for the non-trivial process 
of identifying valid, novel, potentially useful, and 
ultimately understandable patterns in data (Fayyad, 
Piatetsky-Shapiro & Smyth, 1996). Researchers need to 
understand their data rapidly and with greater ease. In 
general, objects under study are described in terms of 
collections of heterogeneous properties. It is typical for 
medical data to be composed of properties represented 
by nominal, ordinal or real-valued variables (scalar), 
as well as by others of a more complex nature, like 
images, time-series, etc. In addition, the information 
comes with different degrees of precision, uncertainty 
and information completeness (missing data is quite 
common). 

Classical data mining and analysis methods are 
sometimes difficult to use; the output of many procedures 
may be large and time consuming to analyze, 
and often their interpretation requires special expertise. 
Moreover, some methods are based on assumptions 
about the data which limit their application, especially 
for the purpose of exploration, comparison, hypothesis 
formation, etc., typical of the first stages of scientific 
investigation. This makes graphical representation 
directly appealing. Humans perceive most information 
through vision, in large quantities and at very high 
input rates. The human brain is extremely well qualified 
for the fast understanding of complex visual patterns, 
and still outperforms the computer. Several reasons 
make Virtual Reality (VR) a suitable paradigm: i) it is 
flexible (it allows the choice of different representation 
models to better suit human perception preferences), 
ii) it allows immersion (the user can navigate inside the 
data, and interact with the objects in the world), iii) it 
creates a living experience (the user is not merely a 
passive observer, but an actor in the world) and iv) it 
is broad and deep (the user may see the VR world as 
a whole, and/or concentrate on specific details of the 
world). Of no less importance is the fact that in order 
to interact with a virtual world, only minimal skills 
are required. 

Visualization techniques may be very useful for 
medical decision support in the oncology area. In 
this paper unsupervised neural networks are used for 
constructing VR spaces for visual data mining of gene 
expression cancer data. Three datasets are used in the 
paper, representative of three of the most important 
types of cancer in modern medicine: liver, stomach 
and lung. The data sets are composed of samples from 
normal and tumor tissues, described in terms of tens of 
thousands of variables, which are the corresponding 
gene expression intensities measured in microarray ex- 
periments. Despite the very high dimensionality of the 
studied patterns, high quality visual representations in 
the form of structure-preserving VR spaces are obtained 
using SAMANN neural networks, which enable the 
differentiation of cancerous and noncancerous tissues. 
The same networks could be used as nonlinear feature 
generators in a preprocessing step for other data min- 
ing procedures. 



NEURAL NETWORKS FOR THE 
CONSTRUCTION OF VIRTUAL REALITY 
SPACES 

VR spaces for the visual representation of information 
systems (Pawlak, 1991) and relational structures were 
introduced in (Valdes, 2002a) (Valdes, 2003). A VR 
space is a tuple $\Upsilon = \langle \underline{O}, G, B, \Re^m, g_o, l, g_r, b, r \rangle$, where 
$\underline{O}$ is a relational structure ($\underline{O} = \langle O, \Gamma^v \rangle$, where $O$ is a finite 
set of objects and $\Gamma^v$ is a set of relations); $G$ is a 
non-empty set of geometries representing the different 
objects and relations; $B$ is a non-empty set of behaviors 
of the objects in the virtual world; and $\Re^m \subset \mathbb{R}^m$ is a metric 
space of dimension $m$ (Euclidean or not) which will 
be the actual VR geometric space. The other elements 
are mappings: $g_o : O \to G$, $l : O \to \Re^m$, $g_r : \Gamma^v \to G$ 
and $b : O \to B$. 

The typical desiderata for the visual representation 
of data and knowledge can be formulated in terms of 
minimizing information loss, maximizing structure 
preservation, maximizing class separability, or their 
combination, which leads to single or multi-objective 
optimization problems. In many cases, these concepts 
can be expressed deterministically using continuous 
functions with well defined partial derivatives. This 
is the realm of classical optimization where there is a 
plethora of methods with well known properties. In the 
case of heterogeneous information the situation is more 
complex and other techniques are required (Valdes, 
2002b) (Valdes, 2004) (Valdes & Barton, 2005). In the 
unsupervised case, the function f mapping the original 
space to the VR (geometric) space $\Re^m$ can be constructed 
so as to maximize some metric/non-metric structure 
preservation criterion, as is typical in multidimensional 
scaling (Borg & Lingoes, 1987), or to minimize some 
error measure of information loss (Sammon, 1969). A 
typical error measure is: 

$$\text{Sammon Error} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - \zeta_{ij})^2}{\delta_{ij}}$$

where $\delta_{ij}$ is a dissimilarity measure between two objects 
i, j in the original space, and $\zeta_{ij}$ is another dissimilarity 
measure defined on their images in the VR space (the 
images of i, j under the mapping). Typical dissimilarity measures 
for $\delta_{ij}$ are the Euclidean distance or the dissimilarity 
based on Gower's similarity coefficient (Gower, 1971). 
The Euclidean distance is the usual measure for $\zeta_{ij}$ in 
the VR space. 
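
As an illustration of the error measure above, the following minimal Python sketch computes the Sammon error for a small set of objects and their images. The function names (`euclidean`, `sammon_error`) and the toy data are assumptions introduced here and are not taken from the original study; the sketch also assumes that all objects are distinct, so that no original dissimilarity is zero.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_error(original, projected, dissimilarity=euclidean):
    """Sammon error between objects and their images.

    original  -- list of vectors in the original space
    projected -- list of their images in the low-dimensional space
    The dissimilarity in the projected space is always Euclidean here.
    """
    pairs = list(combinations(range(len(original)), 2))  # all i < j
    deltas = [dissimilarity(original[i], original[j]) for i, j in pairs]
    zetas = [euclidean(projected[i], projected[j]) for i, j in pairs]
    weighted = sum((d - z) ** 2 / d for d, z in zip(deltas, zetas))
    return weighted / sum(deltas)


# Toy data (hypothetical): three 3D points that lie in a plane.
points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
exact_2d = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]    # distances preserved
shrunk_2d = [(0.0, 0.0), (1.5, 2.0), (3.0, 4.0)]   # all distances halved

print(sammon_error(points_3d, exact_2d))   # 0.0
print(sammon_error(points_3d, shrunk_2d))  # 0.25
```

A projection that preserves every pairwise distance yields an error of zero, and halving every distance yields 0.25, which gives a feel for the scale of the values reported later.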

Usually, the mappings f obtained using approaches 
of this kind are implicit because the images of the ob- 
jects in the new space are computed directly. However, 
a functional representation of f is highly desirable, 
especially in cases where more samples are expected 
a posteriori and need to be placed within the space. 
With an implicit representation, the space has to be 
computed every time that a new sample is added to 
the set, whereas with an explicit representation, the 
mapping can be computed directly. As long as the 
incoming objects can be considered as belonging to 
the same population of samples used for constructing 
the mapping function, the space does not need to be 
recomputed. Neural networks are natural candidates for 
constructing explicit representations due to their general 
universal approximation property. If proper training 
methods are used, neural networks can learn structure 
preserving mappings of high dimensional samples into 
lower dimensional spaces suitable for visualization 
(2D, 3D). If visualization is not a requirement, spaces 
of smaller dimension than the original can be used as 
new features for noise reduction or other data mining 
methods. Such an example is the SAMANN network. 
This is a feedforward network and its architecture 
consists of an input layer with as many neurons as 
descriptor attributes, an output layer with as many 
neurons as the dimension of the VR space and one 
or more hidden layers. The classical way of training 
the SAMANN network is described in (Mao & Jain, 
1995). It consists of a gradient descent method where 
the derivatives of the Sammon error are computed in 
a similar way to the classical backpropagation algorithm. 
Unlike the backpropagation algorithm, 
the training is unsupervised and the weights can only 
be updated after a pair of examples is presented to 
the network. 
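
The pair-wise update can be made concrete with a deliberately simplified sketch. It is not the algorithm of Mao & Jain (1995): a single linear layer stands in for the multi-layer network, there are no hidden layers, no momentum and no sinusoidal or hyperbolic tangent activations, and the function names and toy data are assumptions made for illustration only. What it does show is that each weight update is driven by one pair of samples and by the gradient of that pair's contribution to the Sammon error.

```python
import math
import random
from itertools import combinations


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project(W, x):
    """Explicit mapping: here just a linear map y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sammon_error(W, data):
    pairs = list(combinations(data, 2))
    deltas = [dist(a, b) for a, b in pairs]
    zetas = [dist(project(W, a), project(W, b)) for a, b in pairs]
    return sum((d - z) ** 2 / d for d, z in zip(deltas, zetas)) / sum(deltas)


def train(data, out_dim=2, lr=0.05, epochs=200, seed=0):
    """Gradient descent on the Sammon error, one pair of samples at a time."""
    rng = random.Random(seed)
    in_dim = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    pairs = list(combinations(data, 2))
    for _ in range(epochs):
        rng.shuffle(pairs)
        for a, b in pairs:
            diff = [x - y for x, y in zip(a, b)]
            dy = project(W, diff)  # linear map, so W a - W b = W (a - b)
            delta = math.sqrt(sum(v * v for v in diff))
            zeta = max(math.sqrt(sum(v * v for v in dy)), 1e-12)
            # d/dW of (delta - zeta)^2 / delta is
            # -2 (delta - zeta) / (delta * zeta) * outer(dy, diff)
            coef = lr * 2.0 * (delta - zeta) / (delta * zeta)
            for k in range(out_dim):
                for m in range(in_dim):
                    W[k][m] += coef * dy[k] * diff[m]
    return W


# Hypothetical toy data: five 3D samples lying close to a plane.
samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1), (0.0, 1.0, -0.1),
           (1.0, 1.0, 0.0), (2.0, 1.0, 0.1)]
W_init = train(samples, epochs=0)   # untrained weights, same seed
W_fit = train(samples)
print(sammon_error(W_init, samples), sammon_error(W_fit, samples))
```

Because the trained mapping is explicit, a new sample can be placed in the low-dimensional space with a single call to `project`, without recomputing the whole space.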



CANCER DATA SETS DESCRIPTION 

Three microarray gene expression cancer databases 
were selected. They are representative of some of the 
leading causes of cancer death in the world and share 
the typical features of this kind of data: a small number 
of samples (in the order of tens), described in terms of 
a very large number of attributes (in the order of tens 
of thousands). 

Liver Cancer Data 

We used the same data as in (Lam, Wu, Vega, Miller, 
Spitsbergen, Tong, Zhan, Govindarajan, Lee, Mathavan, 
Murthy, Buhler, Liu & Gong, 2006), where zebrafish 
liver tumors were analyzed and compared with human 
liver tumors. The database 
(http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=2220) 
contains 20 samples (10 normal, 10 tumor), with 16,512 
attributes. First, liver tumors in zebrafish were generated 
by treating them with carcinogens. Then, the expression 
profiles of zebrafish liver tumors were compared 
with those of zebrafish normal liver tissues using a 
Wilcoxon rank-sum test. As a result of this comparison, 
a zebrafish liver tumor differentially expressed gene set 
consisting of 2,315 gene features was obtained. This 
data set was used for comparison with human tumors. 
The results suggest that the molecular similarities 
between zebrafish and human liver tumors are greater 
than the molecular similarities between other types of 
tumors (stomach, lung and prostate). 

Stomach Cancer Data 

We used the same data as in (Hippo, Taniguchi, 
Tsutsumi, Machida, Chong, Fukayama, Kodama & 
Aburatani, 2002), where a study of genes that are dif- 
ferentially expressed in cancerous and noncancerous 
human gastric tissues was performed. The database 
(http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=1210) 
contains 30 samples (22 
tumor, 8 normal) that were analyzed by oligonucleotide 
microarray, obtaining the expression profiles 
for 6,936 genes (7,129 attributes). Using the 6,272 
genes that passed a prefilter procedure, cancerous and 
noncancerous tissues were successfully distinguished 
with a two-dimensional hierarchical clustering using 
Pearson's correlation. However, the clustering results 
used most of the genes on the array. To identify the 
genes that were differentially expressed between cancerous 
and noncancerous tissues, a Mann-Whitney U test 
was applied to the data. As a result of this analysis, 
162 and 129 genes showed a higher expression in 
cancerous and noncancerous tissues, respectively. In 
addition, several genes associated with lymph node 
metastasis and histological classification (intestinal, 
diffuse) were identified. 

Lung Cancer Data 

We used the same data as in (Spira, Beane, Pinto-Plata, 
Kadar, Liu, Shah, Celli & Brody, 2004), where gene 
expression was compared for severely emphysematous 
lung tissue (from smokers at lung volume reduction 
surgery) and normal or mildly emphysematous lung 
tissue (from smokers undergoing resection of pulmonary 
nodules). The database 
(http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=737) 
contains 30 samples (18 severe emphysema, 12 mild 
or no emphysema), with 22,283 attributes. Genes with 
large detection P-values were filtered out, leading to 
a data set with 9,336 genes, which were used for subsequent 
analysis. Nine classification algorithms were 
used to identify a group of genes whose expression in 
the lung distinguished severe emphysema from mild or 
no emphysema. First, model selection was performed 
for every algorithm by leave-one-out cross-validation, 
and the gene list corresponding to the best model was 
saved. The genes reported by at least four classification 
algorithms (102 genes) were chosen for further analy- 
sis. With these genes, a two-dimensional hierarchical 
clustering using Pearson's correlation was performed 
that distinguished between severe emphysema and mild 
or no emphysema. Other genes were also identified 
that may be causally involved in the pathogenesis of 
the emphysema. 



EXPERIMENTAL SETTINGS 

Data Preprocessing 

For stomach and lung data, each gene was scaled to 
mean zero and standard deviation one (original data 
were not normalized). For liver data, no transformation 
was performed (original data were log2 ratios). 
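
The per-gene scaling described for the stomach and lung data can be sketched as follows. The function name and the tiny data matrix are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def standardize_genes(samples):
    """Scale every gene (column) to mean zero and standard deviation one.

    samples -- list of rows, one row per tissue sample,
               each row holding one intensity per gene.
    """
    scaled_columns = []
    for column in zip(*samples):
        mu, sigma = mean(column), pstdev(column)
        # A gene with constant intensity carries no information; map it to 0.
        scaled_columns.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical intensities: 3 samples x 2 genes.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_genes(raw)
print(scaled)
```

After this step every gene contributes on the same scale to the Euclidean distances used as dissimilarities in the original space.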

Model Training 

For every data set, SAMANN networks were con- 
structed to map the original data to a 3D VR space. The 
Euclidean distance was the dissimilarity measure used 
for both the original and the VR spaces. The activation 
functions used were sinusoidal for the first hidden layer 
and hyperbolic tangent for the rest. A collection of 
models was obtained by varying some of the network 
controlling parameters: number of units in the first 
hidden layer (two different values), weight ranges in 
the first hidden layer (three different values), learning 
rates (three different values), momentum (three differ- 
ent values), number of pairs presented to the network 
at every iteration (three different values), number of 
iterations (three different values) and random seeds 
(four different values), for a total of 1,944 SAMANN 
networks for every data set. 
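
The total of 1,944 networks follows from multiplying the number of values tried for each parameter (2 x 3 x 3 x 3 x 3 x 3 x 4). The short sketch below enumerates such a grid; only the number of values per parameter comes from the text, while the parameter names and the concrete values are placeholders invented for illustration.

```python
from itertools import product

# Placeholder values: only the COUNT of values per parameter is from the text.
grid = {
    "hidden_units": [10, 20],             # two values
    "weight_range": [0.1, 0.5, 1.0],      # three values
    "learning_rate": [0.01, 0.05, 0.1],   # three values
    "momentum": [0.0, 0.5, 0.9],          # three values
    "pairs_per_iteration": [1, 5, 10],    # three values
    "iterations": [100, 500, 1000],       # three values
    "random_seed": [0, 1, 2, 3],          # four values
}

configurations = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(configurations))  # 1944
```

Each configuration is independent of the others, which is what makes this kind of experiment easy to distribute over a pool of machines.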

Computing Environment 

All of the experiments were conducted on a Condor 
pool (http://www.cs.wisc.edu/condor) located at the 
Institute for Information Technology, National Research 
Council Canada. 



RESULTS 

For every data set, we constructed the histograms of 
the Sammon error for the obtained networks. All of the 
empirical distributions were positively skewed (with 
the mode on the lower error side), which is 
desirable. In addition, the general error ranges were 
small. Table 1 presents some statistics of the experiments: 
minimum, maximum, mean and standard 
deviation for the best (i.e., with smallest Sammon error) 
1,000 networks. 

Clearly, it is impossible to represent a VR space 
on printed media (navigation, interaction, and world 
changes are all lost). Therefore, very simple geometries 
were used for objects and only snapshots of the virtual 
worlds are presented. Figures 1, 2 and 3 show the 
VR spaces corresponding to the best networks for the 
liver, stomach and lung cancer data sets respectively. 
Although the mapping was generated from an unsupervised 
perspective (i.e., without using the class labels), 
objects from different classes are represented differently 
in the VR space for comparison purposes. Transparent 
membranes wrap the corresponding classes, so that 
the degree of class overlapping can be easily seen. In 
addition, this makes it possible to look for particular samples with 
ambiguous diagnostic decisions. 

The low values of the Sammon error indicate that 
the spaces preserved most of the distance structure of 
the data and therefore give a good idea of the distribution 
in the original spaces. The three virtual spaces 
are clearly polarized with two distribution modes, each 
one corresponding to a different class. Note, however, 
that classes are more clearly differentiated for the liver 
and stomach data sets than for the lung data set, where 
a certain level of overlapping exists. The reason for this 
may be that mild and no emphysema were considered 
members of the same class (see above). 

The advantage of using SAMANN networks is 
that, since the mapping f between the original and the 
virtual space is explicit, a new sample can be easily 
transformed and visualized in the virtual space. Since 
the distance between any two objects is an indication 
of their dissimilarity, the new point is more likely to 
belong to the same class as its nearest neighbors. In the 
same way, outliers can be readily identified, although 
they may result from the space deformation inevitably 
introduced by the dimensionality reduction. 
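
The nearest-neighbor reasoning in the paragraph above can be sketched in a few lines. The coordinates, labels, function names and the choice of k = 3 are all hypothetical; the sketch only assumes that samples have already been placed in a 3D VR space by an explicit mapping.

```python
import math
from collections import Counter


def nearest_neighbors(point, placed, k=3):
    """Return the k placed samples closest to `point` in the VR space.

    placed -- list of (coordinates, class_label) tuples
    """
    return sorted(placed, key=lambda item: math.dist(point, item[0]))[:k]


def likely_class(point, placed, k=3):
    """Majority class among the k nearest neighbors of `point`."""
    labels = [label for _, label in nearest_neighbors(point, placed, k)]
    return Counter(labels).most_common(1)[0][0]


# Hypothetical 3D positions of already mapped samples.
placed_samples = [
    ((0.0, 0.0, 0.0), "normal"),
    ((0.1, 0.0, 0.0), "normal"),
    ((1.0, 1.0, 1.0), "tumor"),
    ((1.1, 1.0, 1.0), "tumor"),
    ((0.9, 1.0, 1.0), "tumor"),
]

new_sample = (0.05, 0.1, 0.0)  # image of a new sample under the mapping
print(likely_class(new_sample, placed_samples))  # normal
```

Such a vote is only a visual aid: as noted above, distances in the reduced space are deformed to some extent, so a sample far from every class is a candidate outlier rather than a confirmed one.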



Table 1. Statistics of the best 1,000 SAMANN networks obtained 

                   Sammon Error 
Data Set           Minimum     Maximum     Mean        Std. Dev. 
Liver Cancer       0.039905    0.055640    0.049857    0.003621 
Stomach Cancer     0.062950    0.077452    0.072862    0.003346 
Lung Cancer        0.079242    0.107842    0.094693    0.006978 

CONCLUSION 

High quality virtual reality spaces for visual data mining 
of typical examples of gene expression cancer data 
were obtained using unsupervised structure-preserving 
neural networks in a distributed computing data 
mining (grid) environment. These results show that 
a few nonlinear features can effectively capture the 
similarity structure of the data and also provide a good 
differentiation between the cancer and normal classes. 
A similar study can be found in (Valdes, Romero & 
Gonzalez, 2007). 

However, in cases where the descriptor attributes 
are not directly related to class structure or where there 
are many noisy or irrelevant attributes the situation may 
not be as clear. In these cases, feature subset selection 
and other data mining procedures could be considered 
in a preprocessing stage. 

Figure 1. VR space of the liver cancer data set (Sammon 
error = 0.039905, best out of 1,944 experiments). Dark 
spheres: normal, Light spheres: cancerous samples. 

Figure 2. VR space of the stomach cancer data set 
(Sammon error = 0.062950, best out of 1,944 experiments). 
Dark spheres: normal, Light spheres: cancerous 
samples. 



Figure 3. VR space of the lung cancer data set (Sammon 
error = 0.079242, best out of 1,944 experiments). Dark 
spheres: severe emphysema, Light spheres: mild or no 
emphysema. The boundary between the classes in the 
VR space seems to be a low curvature surface. 
KEY TERMS 

Artificial Neural Networks: Interconnected group 
of simple units (neurons) that, as a function of the 
connections between the units and the parameters, can 
compute complex behaviors and find nonlinear rela- 
tionships in data. They are used in applications such as 
robotics, signal processing, or medical diagnosis. 

Backpropagation Algorithm: Algorithm to com- 
pute the gradient with respect to the weights, used for 
the training of some types of artificial neural networks. 
It was first described by P. Werbos in 1974, and further 
developed by D.E. Rumelhart, G.E. Hinton and R.J. 
Williams in 1986. 

Condor: Specialized workload management 
system for compute-intensive jobs in a distributed 
computing environment, developed at the University 
of Wisconsin-Madison (http://www.cs.wisc.edu/condor). 
It provides a job queuing mechanism, resource 
monitoring and management, scheduling policy, and 
priority scheme. 

Data Mining: Nontrivial extraction of implicit, 
previously unknown and potentially useful information 
from data. Typically, analytical methods and tools are 
applied to data with the aim of identifying patterns, 
relationships or obtaining databases for tasks such as 
classification, prediction, estimation or clustering. 

Gene Expression: Process by which the inherit- 
able information which comprises a gene, such as the 
DNA sequence, is made manifest as a physical and 
biologically functional gene product, such as protein 
or RNA. 

SAMANN Neural Networks: Unsupervised 
feedforward neural networks for data projection. The 
classical way of training SAMANN networks was 
described by J. Mao and A.K. Jain in 1995. It consists 
of a gradient descent method where the derivatives of 
the Sammon error are computed in a similar way to 
the backpropagation algorithm. 



Sammon Error: Error function to maximize structure 
preservation in projected data. It is defined as 

$$\text{Sammon Error} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(\delta_{ij} - \zeta_{ij})^2}{\delta_{ij}}$$

where $\delta_{ij}$ and $\zeta_{ij}$ are dissimilarity measures between 
two objects i, j in the original and projected space, 
respectively. 

Virtual Reality: Technology which allows the user 
to interact with a computer-simulated environment. 
Most current virtual reality environments are mainly 
visual experiences, displayed either on a computer 
screen or through special stereoscopic displays. Some 
advanced haptic systems include tactile information. 
INTRODUCTION 

Processes in sport, such as motions or games, are influenced 
by communication, interaction, adaptation, and spontaneous 
decisions. On the one hand, therefore, those 
processes are often fuzzy and unpredictable and so 
have not yet been dealt with extensively. On the other 
hand, most of those processes are roughly determined 
in their structure by intention, rules, and context conditions 
and so can be classified by means of information patterns 
deduced from data models of the processes. 

Self-organizing neural networks of the Kohonen 
Feature Map (KFM) type help to classify information 
patterns, either by mapping whole processes to corresponding 
neurons (see Perl & Lames, 2000; McGarry 
& Perl, 2004) or by mapping process steps to neurons, 
which can then be connected by trajectories that can 
be taken as process patterns for further analyses (see 
examples below). In any case, the dimension of the 
original data (i.e. the number of contained attributes) 
is reduced to the dimension of the representing neuron 
(normally 2 or 3), which makes the data much easier to deal 
with. 

Additionally, extensions of the KFM-approach are 
introduced, which are able to flexibly adjust the net to 
dynamically changing training situations. Moreover, 
those extensions allow for simulating adaptation pro- 
cesses like learning or tactical behaviour. 

Finally, a current project is introduced, where tacti- 
cal processes in soccer are analysed under the aspect 
of simulation-based optimization. 



BACKGROUND 

A major problem in analysing complex processes in 
sport, such as motions or games, is often the reduction of 
available data to useful information. Two examples shall 
make plain what the particular problems in sport are: 

In Motor Analysis, a lot of data regarding positions, 
angles, speed, or acceleration of articulations can 
be recorded automatically by means of markers and 
high speed digital cameras. The problem is that those 
recorded data show a high degree of redundancy and 
inherent correlation: a leg consisting of thigh, lower 
leg, foot, and the articulations hip joint, knee, and ankle 
obviously has only a comparatively small range of possible 
movements due to natural restrictions. Therefore 
the share of characteristic motion data is comparatively 
small as well. Classification can help to deduce that 
relevant information from recorded data by mapping 
them to representative types or patterns. 

In Game Analysis, during the last 5 years or so an 
increasing number of approaches have been developed 
which enable automatic recording of position data. 
Based on the video time precision of 25 frames per 
second, 9,315,000 x-y-z coordinate values from 22 players 
and the ball can be taken from a 90-minute soccer 
game (25 frames per second × 5,400 seconds × 23 objects 
× 3 coordinates). Obviously, the amount of data has to be reduced 
and focused on the major tactical patterns of the 
teams. Similar to what coaches do, the collection 
of players' positions can be reduced to constellations 
of tactical groups which interact like super-players and 
therefore enable a computer-aided game analysis 
based on pattern analysis. 

As is demonstrated in the following, neural network-based 
pattern analysis can support the handling 
of those problems. 

MAIN FOCUS OF THE CHAPTER 

Artificial Neural Networks 

Current developments in the fields of Soft Computing 
and/or Computational Intelligence demonstrate how 
information patterns can be taken from data collections 
by means of fuzziness, similarity and learning, for which 
the approach of Artificial Neural Networks gives an 
impressive example. In particular, self-organizing 
neural networks of type KFM (Kohonen Feature Map) 
play an important role in aggregating input data to 
clusters or types by means of a self-organized similarity 
analysis (Kohonen, 1995). 

Net-Based Process Analysis 

Processes can be mapped to attribute vectors - in a 
game, for example, by recording the positions of the 
players - which then can be learned by neurons. There 
is, of course, a certain loss of precision if replacing an 
attribute vector by a representing neuron, the entry 
of which is similar but normally not identical to that 
attribute vector. Nevertheless, there are two major 
advantages of the way a KFM maps input data to cor- 
responding neurons: 

1. The number of objects is dramatically reduced 
if the representing neurons are used instead of 
the original attribute vectors: a 2-dimensional 
20x20 neuron matrix contains 400 neurons, 
while a 10-dimensional vector space with only 
10 different values per attribute already contains 
10^10 = 10,000,000,000 vectors. 

2. The dimension of input data is reduced to the 
dimension of the network (i.e. normally 2 or at 
most 3). This makes it possible, for example, to map 
time series of high-dimensional attribute vectors 
to trajectories of neurons that can easily be 
presented graphically. 



There are three ways of gaining information from 
data by means of Artificial Neural Networks of KFM- 
type: 

1. Neurons represent classes of similar data and so 
define types of information patterns. 

2. Clusters of neurons represent time-static classes of 
similar information patterns and so build structures 
of information patterns. 

3. Trajectories of neurons represent time-dynamic 
sequences of information patterns and so build 
2-dimensional mappings of time-dependent pro- 
cesses. Moreover, trajectories themselves build 
patterns and therefore can be input to a network 
for classifying their similarities - which is ex- 
tremely helpful not least in motor analysis or in 
game analysis. 

There are a large number of successful applications 
that demonstrate how those neural networks can be used 
for that pattern analysis (see Perl & Dauscher, 2006). 
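
The mapping of a time series to a trajectory of neurons, as described above, can be sketched with a very small self-organizing map. This is a generic textbook-style Kohonen map written for illustration, not the software used in the cited studies; the grid size, learning-rate schedule, neighborhood function and synthetic "process" data are all assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )


def train_kfm(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a rows x cols Kohonen map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighborhood
        for x in data:
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def trajectory(weights, series):
    """Map each step of a time series to its neuron: a 2-dimensional trajectory."""
    return [best_matching_unit(weights, x) for x in series]


# Synthetic 3-attribute periodic process standing in for recorded motion data.
series = [[math.sin(0.3 * t), math.cos(0.3 * t), math.sin(0.6 * t)]
          for t in range(40)]
net = train_kfm(series)
path = trajectory(net, series)
print(path[:5])
```

Each attribute vector, whatever its dimension, is replaced by a pair of grid coordinates, so the resulting list of positions can be drawn directly on the 2-dimensional map.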

Example "Gait Analysis": Reduction of 
Redundancy and Dimensionality 

In gait analysis, data from articulations such as 
hip joint, knee and ankle can be recorded automatically 
using markers and so build a time series of n-dimensional 
attribute vectors, with which a net can be trained. 
The result is that each of those n-dimensional vectors is 
mapped to a 2-dimensional neuron of the net, i.e. the 
dimension is reduced from n to 2. Corresponding to the 
original time series, the neurons can be connected by a 
trajectory, which represents the original n-dimensional 
process through a 2-dimensional trajectory, thereby 
enabling a much easier similarity analysis (Perl, 
2004; Schollhorn, 2004). Moreover, net-based analysis 
shows that, by avoiding redundancy, the dimension 
of the original data can also be reduced without losing 
relevant information (see Figure 1). 

Figure 1. Two trajectories of the same gait process, using 20 attribute values (left) and 10 attribute values 
(right), respectively. The high degree of similarity suggests that the missing 10 values are redundant and can 
be neglected. 

Example "Ergometer Rowing": Inter- and 
Intra-Individual Process-Analysis 

With the same approach that was used for gait analysis, 
the process of rowing was analyzed under the aspect of 
inter-individual similarity and intra-individual stability. 
Obviously, there is a great similarity on the set of all 
trajectories (see Figure 2). 

However, the trajectories of rower A are perfectly 
similar to each other, demonstrating a high stability, 
while those of rower B are less so. Experience 
with rowing patterns shows that net-based analysis 
of rowing trajectories is very sensitive and helps to 
detect even small instabilities which otherwise could 
not have been detected from video frames or original 
time series of data vectors (see Perl & Baca, 2003). 

Example "Tactics in Games": 
Constellation Analysis 

In a more complex way, trajectories can improve the 
transparency of the tactical behaviour of players or 
even a team (net-based volleyball analysis: Jager, Perl 
& Schollhorn, 2007). A collection of player positions, 
i.e. a constellation, can be represented by a vector 
of position coordinates, with which a net can then be 
trained. Figure 3 shows two instances of the same trained 
volleyball net, with small squares representing activated 
constellations and marked areas representing major 
constellation types. Obviously, the teams represented 
by the left and the right net activate quite different 
types of constellation. Moreover, the moves between 
the constellations, i.e. the edges and/or trajectories, 
are quite different, too: the left team moves between 
the areas, while the right team more or less selects an 
area and then adjusts its constellation. 

In a game like volleyball, i.e. with separated teams, 
it is comparatively easy to deduce tactical ideas from 
those trajectory patterns. Some first results could also be 
obtained from handball, where net-based analysis was helpful 
for detecting successful offence processes (net-based 
handball analysis: Pfeiffer & Perl, 2006; net-based soccer 
analysis: Lees et al., 2003; Leser, 2006). Based on 
those results, a project is currently being run which deals with 
simulation-based tactics optimisation in soccer. First 
results are encouraging. They were shown as a video 
representation at the famous documenta exhibition of 
fine arts in Kassel, Germany, in 2007. 

Dynamic Extensions of KFM-Type Neural 
Networks 

Self-organizing maps of KFM type are very helpful 
for analyzing dynamic processes. They fail, however, 
if learning or other process dynamics are parts of the 
processes to be trained. This is due to the fact that the 
learning procedure of a KFM is externally controlled, 
resulting in a network that works like a tool, without 
being able to change with or adapt to changing process 
types or contexts. 

Figure 2. Trajectories of the rowing process of two rowers A and B, one stroke per graphic 

Figure 3. Two examples of a net trained with constellations, where the marked squares represent frequent 
constellations and the marked areas represent major types of constellations. 

One successful approach that improves the dynam- 
ics of the learning process is that of the Dynamically 
Controlled Network (DyCoN: Perl, 2002 a/b), which 
is a KFM derivative that is able to learn continuously. 
The idea is that each neuron contains an individual 
adaptive learning model based on the Performance 
Potential Metamodel (PerPot: Perl, 2002 a; learning 
strategies: Perl & Weber, 2004). 

While DyCoN helps to analyse dynamic learning 
processes, a different type of neural network is necessary 
for simulating those learning processes in order 
to eventually schedule and optimize them 
individually. One important point was to dynamically 
adapt the capacity of the network to the requirements 
of the learning process. This was done by integrating 
the concept of Growing Neural Gas (GNG: Fritzke, 
1997), where, briefly put, the number and positions 
of neurons vary over time with the changing 
information flow from training, thereby adapting the 
network size and topology to the training amount and 
content. The result is the Dynamically Controlled Neural 
Gas (DyCoNG), the concept of which completes the 
combination of DyCoN and GNG with specific "quality 
neurons" that reflect the information-theoretical quality 
of information and therefore can measure the originality 
of a recorded activity (Perl et al., 2006). Based on 
the assumption that there is a strong correspondence 
between the "quality" of a neuron and the originality 
of the represented type of activity, the network's reaction 
to an input stimulus (i.e. generating a new connected 
or unconnected quality neuron, or not) indicates 
an evaluation of the originality of the corresponding 
activity. According to the two tasks "analysis of creativity 
learning" and "simulation of creativity learning", 
two major results could be obtained: 

The DyCoN model was used for analyzing the learning 
profiles, which were fed in as patterns and then 
recognized as members of clusters or types 
of learning behaviour. It was remarkable that the net 
could detect a number of significantly different types 
of learning behaviour, which in practice is useful for 
individually adjusting the training to the athletes (Perl et 
al., 2006). 

The DyCoNG-model was used for learning profile 
simulation, with the original activity- and rating-data 
as input and learning profiles as output. The learning 
profiles resulting from DyCoNG-training could also 
be separated into types which qualitatively correspond 
to those from DyCoN-analysis. This at least gives an 
idea of how to manage the above mentioned individual 
adaptation by means of net-based simulation. 

In a first approach, net-based originality analysis 
has successfully been used in the case of handball: in a 
case study dealing with data from the Handball World 
Championship 2007 in Germany, offence activities of 
high originality could be detected with a remarkably 
high agreement with experts' evaluation. Moreover, a 
degree of originality per team and game could be measured, 
resulting in team-specific originality profiles that 
characterize increasing and decreasing playing qualities 
during the tournament. Currently, a similar project is being 
run with soccer, where in a first attempt the final of the 
World Championship 2006 is analyzed. 



FUTURE TRENDS 

The two major ideas for planned future work are to 
expand net-based simulation of originality to associative 
behaviour and to analyse the effects of virtually 
generated "creative" activities in simulated games. 

Net-Based Simulation of Associative 
Behaviour 

In a simplified way, behaviour can be understood as 
recognizing the behavioural context, such as environment 
or situation, followed by a context-oriented selection of 
a best fitting activity. In the case of convergent behaviour 
this selection is more or less rule-based and determined. 
In the case of divergent or creative behaviour the selection 
has a certain undetermined degree of freedom, i.e. 
spontaneous "jumps" are possible from a first-priority 
activity to associated ones. Mapped to neural networks, 
where activities can be thought to be connected to 
neurons, this means a "jump" from the input-corresponding 
neuron to a different one, located either in 
a neighbouring cluster or as an isolated quality neuron 
(see Figure 4). Such an associative network could help 
to improve the simulation of "creative" behaviour, 
based on a specific creativity potential that describes 
frequency, maximal distance, or neuron similarity of 
those associative jumps. 

Figure 4. Net with clusters (marked by slim lines), associative 
"jumps" between clusters (bold dotted lines), 
and generated quality neuron (bold line) 

Improvement of Tactical Process 
Patterns 

The idea of optimizing strategies by means of simulation
was developed in the early 1980s for games
like tennis or badminton, where the player's abilities
and tactics can be characterized in a simplified way by
two matrices: the action-dependent transfer of situations
can be measured by a transfer frequency matrix,
while the situation-dependent success of actions can
be measured by an action success matrix. Based on
those two matrices for both players, a game can be
simulated stochastically regarding its main process
structures. Moreover, modifying the entries of the
matrices - i.e. changing tactical aspects or technical
skills - can help to improve tactical patterns by
means of simulation.
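
The two-matrix idea can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for the illustration and not taken from the original work: the two situations, the numbers, the function names, and the simplification that an action is identified with the situation it creates for the opponent. Under these assumptions, FREQ plays the role of the transfer frequency matrix and SUCCESS that of the action success matrix.

```python
import random

# Hypothetical two-situation game: 0 = "baseline", 1 = "net".
# FREQ[p][s][t]: relative frequency with which player p, being in
# situation s, plays an action that puts the opponent into situation t
# (assumed stand-in for the transfer frequency matrix).
FREQ = [
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.7, 0.3], [0.4, 0.6]],
]
# SUCCESS[p][s][t]: probability that this action is played without error
# (assumed stand-in for the action success matrix).
SUCCESS = [
    [[0.85, 0.75], [0.80, 0.80]],
    [[0.85, 0.75], [0.80, 0.80]],
]


def simulate_rally(freq, success, rng, server, start=0, max_strokes=1000):
    """Play one rally; return the index of the winner (None if cut off)."""
    player, situation = server, start
    for _ in range(max_strokes):
        targets = list(range(len(freq[player][situation])))
        # Tactics: sample the next situation from the transfer frequencies.
        target = rng.choices(targets, weights=freq[player][situation])[0]
        # Technique: the action fails with probability 1 - success.
        if rng.random() >= success[player][situation][target]:
            return 1 - player
        player, situation = 1 - player, target
    return None


def win_rate(freq, success, n_rallies=20000, seed=1):
    """Estimate the share of rallies won by player 0 (service alternates)."""
    rng = random.Random(seed)
    wins = [0, 0]
    for k in range(n_rallies):
        winner = simulate_rally(freq, success, rng, server=k % 2)
        if winner is not None:
            wins[winner] += 1
    return wins[0] / (wins[0] + wins[1])


def with_better_technique(success, player, gain):
    """Return a copy of the success matrices with one player's skills raised."""
    improved = [[row[:] for row in matrix] for matrix in success]
    improved[player] = [[min(1.0, p + gain) for p in row]
                        for row in improved[player]]
    return improved
```

Comparing win_rate(FREQ, SUCCESS) with win_rate(FREQ, with_better_technique(SUCCESS, 0, 0.05)) shows how a modified matrix entry changes the simulated outcome; changing rows of FREQ instead would correspond to a tactical modification.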

Although soccer is much more complex than tennis
or badminton, the same idea can be used if the
complexity is reduced by introducing "super-players",
as we do in a current project: groups of players, e.g.
representing offence or defence, are combined into corresponding
data objects, which are characterized by
constellations of player positions. The interactions of
the single players are then reduced to the interactions
of the constellations or super-players, which makes
it much easier to map the processes to networks for
tactical analysis. The intended aim is to derive those
characteristic matrices as well as information about
creativity from the network in order to simulate games
and improve tactical process patterns: as indicated in
Figure 5, a recorded original activity (white dot on the
net) could be replaced by an apparently better or more
creative one (white circle above the white dot), which
in the simulation changes the corresponding constellation
and the resulting process and its success.



Figure 5. Steps of net-based analysis and simulation of games like soccer: replacing players by positions and
positions by constellations; analysing constellations by means of networks; simulative modification of tactical
patterns; analysing simulated games in order to improve tactical and creative behaviour.

CONCLUSION

Net-based analysis of processes in sport is a difficult
and challenging task because of the fuzziness and the
indeterminism of athletes' behaviour and interaction.
The result of about 30 years of work in this area is that 
a lot of problems could be solved methodically. The 
bottleneck, however, was the recording of data and the 
transfer to information. Meanwhile, data from biome- 
chanical, physiological, and medical applications can 
be recorded automatically, and even in games like soc- 
cer automatic position recording has become possible. 
Therefore the problem has changed from "how to get 
data" to "how to transfer data to information". 

The presented net-based approaches show how this 
problem can be handled, opening new perspectives of 
transferring theoretical approaches to practical work. 
KEY TERMS

Cluster: A collection of neurons is called a cluster
if they are similar and locally neighboured. Due to the
topology-preserving property of KFM training, classes
of similar training vectors are mapped to clusters of
neighboured neurons.

DyCoN: A DyCoN is a KFM-type network in which
each neuron contains an individual PerPot-based
self-control of its activation radius and learning rate. The
DyCoN concept enables continuous learning and
therefore supports continuous training and testing,
training in phases and with generated data, online adaptation
during tests and analyses, and flexible adaptation
to new information patterns (Perl, 2002 a). (Note that
DyCoN is used commercially. Therefore, technical
details cannot be published but are kept confidential by
DyCoS GmbH (www.dycos.net)).



DyCoNG: The concept of DyCoNG combines 
the concepts of DyCoN and GNG and completes it 
by dynamically generating "quality" neurons in order 
to represent relevant and rare information during the 
training process (Perl et al., 2006). 

GNG: A GNG is a network without a fixed neuron
topology, which is able to generate new neurons on
demand. Therefore a GNG is able to dynamically
adapt its neuron structure to the amount and structure of
the trained information (Fritzke, 1997).

Information Pattern: An information pattern is a
structure of information units such as a vector or matrix
of numbers, a stream of video frames, or a distribution
of probabilities.

KFM: A KFM consists of a (normally 2-dimensional)
matrix of neurons, each of which contains a
vector of attributes. Two neurons are called similar if the
(Euclidean) distance of their attribute vectors is below
a given threshold. Two neurons are called neighboured
if they are next to each other regarding the given net
topology (see Kohonen, 1995).

PerPot: PerPot is a model of dynamic adaptation,
where an input flow feeds an internal strain potential
as well as an internal response potential, from which
an output potential is fed by specifically delayed flows.
Since the strain flow is negative and the response flow
is positive, resulting in an oscillating stabilizing adaptation,
the model is called antagonistic (Perl, 2002 a).

Test: In a test, an attribute vector is fed to the 
network to determine its type - i.e. the neuron it is 
corresponding to. 

Training: During the training, attribute vectors are 
fed to the network and mapped to the corresponding 
neuron the entry of which is most similar to that of 
the attribute vector. After the training, the space of 
training attribute vectors is (more or less) completely 
represented by the neurons of the network - meaning 
that every training attribute vector belongs to a neuron 
the entry of which it is most similar to. 

Type: The collection of attribute vectors that, after 
training, is represented by a neuron is called its type. 
Also the representing neuron can be called the type. 
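
The terms Test, Type, similar and neighboured can be made concrete with a small sketch. It assumes an already trained net whose neuron entries are simply given; the 2 x 2 grid, the numbers, the 4-neighbourhood and all function names are hypothetical, and no actual KFM or DyCoN training is performed.

```python
import math

# Hypothetical trained 2 x 2 net: grid position -> attribute vector (entry).
NEURONS = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
# Hypothetical attribute vectors presented to the net.
VECTORS = [[0.1, 0.2], [0.9, 0.1], [0.8, 0.9], [0.2, 0.1]]


def euclidean(u, v):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_matching_neuron(neurons, vector):
    """Test: return the position of the neuron whose entry is most similar."""
    return min(neurons, key=lambda pos: euclidean(neurons[pos], vector))


def types_after_training(neurons, vectors):
    """Type: collect, per neuron, the attribute vectors it represents."""
    types = {pos: [] for pos in neurons}
    for v in vectors:
        types[best_matching_neuron(neurons, v)].append(v)
    return types


def are_similar(neurons, p, q, threshold):
    """Two neurons are similar if their entries are closer than the threshold."""
    return euclidean(neurons[p], neurons[q]) < threshold


def are_neighboured(p, q):
    """Neighboured on the grid (assumption: 4-neighbourhood of a 2-D grid)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1
```

A cluster in the sense defined above would then be a set of grid positions that are pairwise connected through are_neighboured and whose entries satisfy are_similar for the chosen threshold.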
INTRODUCTION

All neural networks, both natural and artificial, are 
characterized by two kinds of dynamics. The first one 
is concerned with what we would call "learning dynam- 
ics", in fact the sequential (discrete time) dynamics of 
the choice of synaptic weights. The second one is the 
intrinsic dynamics of the neural network viewed as a 
dynamical system after the weights have been estab- 
lished via learning. Regarding the second dynamics, 
the emergent computational capabilities of a recurrent 
neural network can be achieved provided it has many 
equilibria. The network task is achieved provided it 
approaches these equilibria. But the dynamical system 
has a dynamics induced a posteriori by the learning 
process that had established the synaptic weights. It is 
not compulsory that this a posteriori dynamics should 
have the required properties, hence they have to be 
checked separately. 

The standard stability properties (Lyapunov, asymptotic
and exponential stability) are defined for a single
equilibrium. Their counterparts for several equilibria
are: mutability, global asymptotics, gradient behavior.
For the definitions of these general concepts the reader
is referred to Gelig et al. (1978) and Leonov et al. (1992).

In the last decades, the number of applications of recurrent
neural networks has increased; they have been designed
for classification, identification and complex image,
visual and spatio-temporal processing in fields such as
engineering, chemistry, biology and medicine (see, for
instance: Fortuna et al., 2001; Fink, 2004; Atencia et
al., 2004; Iwahori et al., 2005; Maurer et al., 2005;
Guirguis & Ghoneimy, 2007). All these applications
are mainly based on the existence of several equilibria
for such networks, which requires the "good behavior"
properties discussed above.

Another aspect of the qualitative analysis is the
so-called synchronization problem, when an external
stimulus, in most cases periodic or almost periodic, has
to be tracked (Gelig, 1982; Danciu, 2002). This problem
is, from the mathematical point of view, nothing
more than existence, uniqueness and global stability of
forced oscillations.

In the last decades the neural network dynamics
models have been modified once more by introducing
transmission delays. The standard model of
a Hopfield-type network with delay, as considered in
(Gopalsamy & He, 1994), is

$$\frac{du_i}{dt} = -a_i u_i(t) + \sum_{j=1}^{n} w_{ij}\, g_j\big(u_j(t - \tau_{ij})\big) + I_i, \qquad i = 1, \ldots, n \qquad (1)$$

The present paper aims at a general presentation,
with both research and educational purposes, of the
three topics mentioned previously.



BACKGROUND 

Dynamical systems with several equilibria occur in 
such fields of science and technology as electrical 
machines, chemical reactions, economics, biology and, 
last but not least, neural networks. 

For systems with several equilibria the usual local 
concepts of stability are not sufficient for an adequate 
description. The so-called "global phase portrait" may 
contain both stable and unstable equilibria: each of 
them may be characterized separately since stability is 
a local concept dealing with a specific trajectory. But 
global concepts are also required for a better system 
description and this is particularly true for the case of 
the neural networks. Indeed, the neural networks may 
be viewed as interconnections of simple computing 
elements whose computational capability is increased 
by interconnection ("emergent collective capacities" 



Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. 



Neural Networks and Equilibria, Synchronization, and Time Lags 



- to cite Hopfield). This is due to the nonlinear char- 
acteristics leading to the existence of several stable 
equilibria. The network achieves its computing goal if
no self-sustained oscillations are present and it always
reaches some steady state (equilibrium) among a finite
(though large) number of such states.

This behavior is most suitably described by the
concepts arising from the papers of Kalman (1957)
and Moser (1967). The latter relies on the following
remark concerning the rather general nonlinear
autonomous system

$$\dot{x} = -f(x), \qquad x \in \mathbb{R}^n \qquad (2)$$

where $f(x) = \mathrm{grad}\, G(x)$ and $G : \mathbb{R}^n \to \mathbb{R}$ is such that
the number of its critical points is finite and it is radially
unbounded, i.e. $\lim_{|x| \to \infty} G(x) = \infty$. Under these assumptions
any solution of (2) approaches asymptotically one
of the equilibria (which is also a critical point of G,
where its gradient, i.e. f, vanishes). Obviously the best
limit behavior of a neural network would be like this
- naturally called gradient like behavior. Nevertheless
there are other properties that are also important while
weaker; in the following we shall discuss some of
them.

The mathematical object will be in the following
the system of ordinary differential equations

$$\dot{x} = f(x, t) \qquad (3)$$

and we shall first define some basic notions.

Definition 1 a) Any constant solution of (3) is called
equilibrium; the set of equilibria E is called stationary
set.

b) A solution of (3) is called convergent if it
approaches asymptotically some equilibrium:

$$\lim_{t \to \infty} x(t) = c \in E. \qquad (4)$$

A solution is called quasi-convergent if it approaches
asymptotically the stationary set:

$$\lim_{t \to \infty} d(x(t), E) = 0, \qquad (5)$$

with d(z, M) being the distance (in the usual sense)
from the point z to the set M.

c) System (3) is called monostable (strictly mutable)
if every bounded solution is convergent (in the above
sense); it is called quasi-monostable if every bounded
solution is quasi-convergent.

d) System (3) is called gradient-like if every solution is
convergent; it is called quasi-gradient-like (has global
asymptotics) if every solution is quasi-convergent.

Note that convergence is a solution property while
monostability and the gradient property are associated with
systems. For autonomous (time invariant) systems of
the form (2) the following Lyapunov type results are
available.

Lemma 1 Consider system (2) and assume existence of
a continuous function $V : \mathbb{R}^n \to \mathbb{R}$ that is nonincreasing
along any of its solutions. If, additionally, a solution x(t)
bounded on $\mathbb{R}_+$ for which there exists some $\tau > 0$
such that $V(x(\tau)) = V(x(0))$ is an equilibrium, then the
system is quasi-monostable.

Lemma 2 If the assumptions of Lemma 1 hold and, 
additionally, V(x) is radially unbounded then system 
(2) is quasi-gradient like. 

Lemma 3 If the assumptions of Lemma 2 hold and the 
set E is discrete (i.e. it consists of isolated points only) 
then system (2) is gradient-like. 
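
A one-dimensional example may help to see what gradient-like behavior means for (2). The sketch below is only an illustration under stated assumptions: the double-well function G(x) = (x^2 - 1)^2 / 4 is chosen here (it is radially unbounded and has the three critical points -1, 0, 1), the explicit Euler scheme and its step size are arbitrary choices, and the function names are hypothetical.

```python
def G(x):
    """Assumed potential: radially unbounded, critical points -1, 0 and 1."""
    return 0.25 * (x * x - 1.0) ** 2


def grad_G(x):
    """f(x) = grad G(x) for the potential above."""
    return x ** 3 - x


def integrate(x0, dt=0.01, steps=5000):
    """Explicit Euler integration of x' = -grad G(x) starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - dt * grad_G(xs[-1]))
    return xs
```

Starting from 2.0 the computed solution settles at the equilibrium 1, starting from -0.3 at the equilibrium -1, and starting exactly at the unstable equilibrium 0 it stays there; along each computed trajectory G plays the role of the nonincreasing function V of the lemmas.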



DYNAMICS ISSUES OF RECURRENT 
NEURAL NETWORKS 

Neural Networks as Systems with 
Several Equilibria 

It has been already mentioned that the emergent com- 
putational capacities of the neural networks are ensured 
by: a) nonlinear behavior of the neural cells; b) their 
connectivity. These two properties define the neural 
networks as dynamical systems with many equilibria 
whose performance depends on the (high) number of 
these equilibria and on the gradient like property of 
the network. 

On the other hand, the standard recurrent neural
networks (Bidirectional Associative Memory (Kosko,
1988), Hopfield (1982), cellular (Chua & Yang, 1988),
Cohen-Grossberg (1983)), which contain internal feedback
loops and thus have a propensity for instability,
possess some "natural" (i.e. associated in a natural way)
Lyapunov function that allows obtaining the required
qualitative properties (Rasvan, 1998).

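One of the most general models of neural networks
that has a natural Lyapunov function is the Cohen-Grossberg
model described by

$$\dot{x}_i = a_i(x_i)\Big[ b_i(x_i) - \sum_{j=1}^{n} c_{ij}\, d_j(x_j) \Big], \qquad i = 1, \ldots, n, \qquad (6)$$

with $c_{ij} = c_{ji}$; this model may be written as

$$\dot{x} = -A(x)\, \mathrm{grad}\, V(x) \qquad (7)$$

where A(x) is a diagonal matrix with the entries

$$A_{ij}(x) = \frac{a_i(x_i)}{d_i'(x_i)}\, \delta_{ij} \qquad (8)$$

and $V : \mathbb{R}^n \to \mathbb{R}$ is defined by

$$V(x) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}\, d_i(x_i)\, d_j(x_j) - \sum_{i=1}^{n} \int_0^{x_i} b_i(\lambda)\, d_i'(\lambda)\, d\lambda \qquad (9)$$

The presence of A(x) makes system (7) a pseudo-gradient
system - compare to (2).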

The properties of the associated Lyapunov function
(9) will give sufficient conditions in order to obtain
the required qualitative behaviors for the system. The
derivative function of (9) is:

$$W(x) = -\sum_{i=1}^{n} a_i(x_i)\, d_i'(x_i) \Big[ b_i(x_i) - \sum_{j=1}^{n} c_{ij}\, d_j(x_j) \Big]^2 \le 0 \qquad (10)$$

One can see that the inequality (10) holds provided
$a_i(\lambda) > 0$ and $d_i(\lambda)$ are monotone nondecreasing. If
additionally $d_i(\lambda)$ are strictly increasing, then the set
where W = 0 consists of equilibria only. The system
is then quasi-gradient like, i.e. every solution approaches
asymptotically the stationary set.
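
The role of the Lyapunov function (9) can be checked numerically on a small instance of (6). The sketch below is an illustration under assumptions that are not part of the original text: two neurons, a_i = 1, b_i(x_i) = -x_i, d_j = tanh, the symmetric matrix C given in the code, and explicit Euler integration. For this choice the integral in (9) has the closed form x tanh(x) - ln cosh(x) per neuron.

```python
import math

# Assumed symmetric interconnection matrix (c_ij = c_ji).
C = [[0.0, -2.0], [-2.0, 0.0]]


def rhs(x):
    """Right-hand side of (6) with a_i = 1, b_i(x_i) = -x_i, d_j = tanh."""
    n = len(x)
    return [-x[i] - sum(C[i][j] * math.tanh(x[j]) for j in range(n))
            for i in range(n)]


def V(x):
    """Lyapunov function (9) evaluated for the choices made above."""
    n = len(x)
    d = [math.tanh(v) for v in x]
    quadratic = 0.5 * sum(C[i][j] * d[i] * d[j]
                          for i in range(n) for j in range(n))
    # -int_0^x b(l) d'(l) dl with b(l) = -l and d = tanh
    integral = sum(v * math.tanh(v) - math.log(math.cosh(v)) for v in x)
    return quadratic + integral


def trajectory(x0, dt=0.01, steps=4000):
    """Explicit Euler integration of the network starting from x0."""
    xs = [list(x0)]
    for _ in range(steps):
        x = xs[-1]
        dx = rhs(x)
        xs.append([x[i] + dt * dx[i] for i in range(len(x))])
    return xs
```

Evaluating V along trajectory([0.5, -0.2]) gives a nonincreasing sequence, and the final state is a point with x_1 = x_2 = s where s = 2 tanh(s), i.e. one of the equilibria of this small network.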

Consider now a model of artificial neural network
implemented by electrical circuits:

$$R_i C_i \frac{dv_i}{dt} = -v_i + \sum_{j=1}^{n} \frac{R_i}{R_{ij}} \big( \varphi_j(v_j) - v_i \big) + R_i I_i, \qquad i = 1, \ldots, n \qquad (11)$$

with $\varphi_i(\cdot)$ being sigmoidal. Since sigmoidal functions
are subject to sector restrictions and global Lipschitz
inequalities, it was only natural to try to improve the
stability conditions using the Lyapunov functions suggested
by the Popov frequency domain inequalities and
the Yakubovich-Kalman-Popov lemma. For instance,
in (Danciu & Rasvan, 2000) a rather general system
with several sector restricted nonlinearities was considered
and the Lyapunov function was constructed
in a rational way starting from an improved
frequency domain stability inequality of Popov type
with PI multiplier.

In the case of (11) this rather involved approach
gives a gradient like behavior provided the symmetry
condition $R_{ij} = R_{ji}$ is observed.

Time Delays in Neural Networks

We shall consider here the model (1). Since in the time
delay case we do not (yet) have at our disposal an instrument like
the Lyapunov like lemmas given in BACKGROUND,
we have to restrict ourselves to the analysis of the
stability of a particular equilibrium.

If $\bar{u}_i$, $i = 1, \ldots, n$ is some equilibrium of (1) and if the
deviations $z_i = u_i - \bar{u}_i$ are considered, the system
in deviations is obtained

$$\frac{dz_i}{dt} = -a_i z_i(t) - \sum_{j=1}^{n} w_{ij}\, \varphi_j\big(z_j(t - \tau_{ij})\big), \qquad i = 1, \ldots, n \qquad (12)$$

with $\varphi_j(z_j) = g_j(\bar{u}_j) - g_j(\bar{u}_j + z_j)$. As known, if
$g_j : \mathbb{R} \to \mathbb{R}$ satisfy the usual sigmoid conditions, i.e.
$g_j(0) = 0$, monotonically increasing and globally Lipschitz
- that is

$$0 \le \frac{g_j(\sigma_1) - g_j(\sigma_2)}{\sigma_1 - \sigma_2} \le L_j, \qquad \forall \sigma_1 \ne \sigma_2, \qquad (13)$$

then $\varphi_j$ defined above are such. With the usual notations
of the field, let $z_t(\cdot) = z(t + \cdot)$ denote the state of (12) at
t; the state space will be considered $C(-r, 0; \mathbb{R}^n)$ with
$r = \max_{i,j} \tau_{ij}$, the space of continuous $\mathbb{R}^n$-valued mappings
defined on [-r, 0] with the usual norm of the uniform
convergence. One considers the Lyapunov-Krasovskii
functional (the analogue of the Lyapunov function of
the delayless case) suggested by (Nishimura & Kitamura,
1969), $V : C \to \mathbb{R}_+$, as

$$V(z_t) = \sum_{i=1}^{n} \Big\{ \frac{1}{2} \pi_i z_i^2(0) + \lambda_i \int_0^{z_i(0)} \varphi_i(\theta)\, d\theta + \sum_{j=1}^{n} \int_{-\tau_{ij}}^{0} \big[ \rho_{ij} z_j^2(\theta) + \delta_{ij} \varphi_j^2(z_j(\theta)) \big]\, d\theta \Big\} \qquad (14)$$

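with $\pi_i > 0$, $\lambda_i > 0$, $\rho_{ij} > 0$, $\delta_{ij} > 0$ some free parameters.
Considering this functional along the solutions of (12)
and differentiating it with respect to t we may find the
so-called derivative functional $W : C \to \mathbb{R}$ as below

$$W(z_t) = \sum_{i=1}^{n} \Big\{ -a_i \pi_i z_i^2(0) - \lambda_i a_i \varphi_i(z_i(0))\, z_i(0) - \big[ \pi_i z_i(0) + \lambda_i \varphi_i(z_i(0)) \big] \sum_{j=1}^{n} w_{ij}\, \varphi_j(z_j(-\tau_{ij})) + \sum_{j=1}^{n} \big[ \rho_{ij} z_j^2(0) + \delta_{ij} \varphi_j^2(z_j(0)) - \rho_{ij} z_j^2(-\tau_{ij}) - \delta_{ij} \varphi_j^2(z_j(-\tau_{ij})) \big] \Big\} \qquad (15)$$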

The problem of the sign for W gives the following
choice of the free parameters in (14) (Danciu &
Rasvan, 2007):

(16) [conditions on the free parameters $\pi_i$, $\lambda_i$, $\rho_{ij}$, $\delta_{ij}$, including the definition of the quantities $\alpha_i$ used in Theorem 3]



The application of the standard stability theorems
for time delay systems (Hale & Verduyn Lunel, 1993)
will give asymptotic stability of the equilibrium $z = 0$
($u = \bar{u}$). The mathematical result reads as follows.

Theorem 3: Consider system (12) with $a_i > 0$ and $w_{ij}$
such that it is possible to choose $\rho_{ij} > 0$ and $\delta_{ij} > 0$ in
order to satisfy $\alpha_i > 0$ with $\alpha_i$ defined in (16). Then the
equilibrium is globally asymptotically stable.
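
The behavior described by the theorem can be observed in a simple simulation of the system in deviations (12). The sketch below is only an illustration: the parameter values, the choice of tanh as the nonlinearity, the constant initial function, the explicit Euler scheme and all names are assumptions. The parameters are chosen so that each a_i exceeds the sum of |w_ij| over j (with L_j = 1), a conservative diagonal dominance choice made for this example; it is not the condition (16).

```python
import math


def simulate_delayed_network(a, w, tau, z0, dt=0.01, t_end=100.0,
                             phi=math.tanh):
    """Explicit Euler integration of
    dz_i/dt = -a_i z_i(t) - sum_j w_ij phi(z_j(t - tau_ij)),
    with the constant initial function z0 on [-r, 0]."""
    n = len(a)
    lag = [[int(round(tau[i][j] / dt)) for j in range(n)] for i in range(n)]
    max_lag = max(max(row) for row in lag)
    # history[k] is the state at step k; the first max_lag + 1 entries
    # represent the constant initial function.
    history = [list(z0) for _ in range(max_lag + 1)]
    for _ in range(int(round(t_end / dt))):
        now = history[-1]
        new = []
        for i in range(n):
            feedback = sum(w[i][j] * phi(history[-1 - lag[i][j]][j])
                           for j in range(n))
            new.append(now[i] + dt * (-a[i] * now[i] - feedback))
        history.append(new)
    return history


# Hypothetical parameters: a_i > sum_j |w_ij| in every row.
A = [1.0, 1.2]
W = [[0.3, -0.5], [0.4, 0.2]]
TAU = [[0.5, 1.0], [0.8, 0.3]]
Z0 = [2.0, -1.5]
```

With these values, simulate_delayed_network(A, W, TAU, Z0) returns a history whose last state is close to the zero equilibrium, in spite of the different delays in the four connections.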

Synchronization Problems 

From this point of view the qualitative behavior of
the network is nothing more than behavior under
time varying stimuli. This is particularly true for the
modeling of rhythmic activities in the nervous system
(Kopell, 2000) or the synchronization of the oscillatory
responses (Konig & Schillen, 1991). Both rhythmicity
and synchronization suggest some recurrence and
this implies coefficients and stimuli being periodic or
almost periodic. The model with time varying stimulus
has the form

$$\frac{du_i}{dt} = -a_i u_i(t) - \sum_{j=1}^{n} w_{ij}\, f_j\big(u_j(t - \tau_{ij})\big) + c_i(t), \qquad i = 1, \ldots, n \qquad (17)$$

under the same assumptions as previously, with the
functions $f_j : \mathbb{R} \to [-1, 1]$ being sigmoidal and therefore
globally Lipschitz. The forcing stimuli $c_i(t)$ are
periodic or almost periodic and the main mathematical
problem is to find conditions on the systems to ensure
existence and exponential stability of a unique global
(i.e. defined on $\mathbb{R}$) solution which has the features of
a limit regime, i.e. not defined by initial conditions and
of the same type as the stimulus - periodic or almost
periodic respectively. This is an "almost linear behavior"
for reasons that are obvious. The approach to be
taken in this problem is to obtain some estimates of
the system's solutions, which finally give information
about the system's convergence and ultimate boundedness.
Next we have to apply a fixed-point theorem and we use
the theorems of Halanay (Halanay, 1967) on invariant
manifolds for flows on Banach spaces (see (Danciu,
2002) for details and simulation results).

We give below a theorem based on the application
of the Lyapunov functional (14) but restricted to be only
quadratic in the state variables ($\lambda_i = 0$, $\delta_{ij} = 0$),

$$V(u_t) = \sum_{i=1}^{n} \Big[ \frac{1}{2} \pi_i u_i^2(0) + \sum_{j=1}^{n} \rho_{ij} \int_{-\tau_{ij}}^{0} u_j^2(\theta)\, d\theta \Big] \qquad (18)$$

with $\pi_i > 0$, $\rho_{ij} > 0$, $i, j = 1, \ldots, n$. We may state

Theorem 2 Assume that $a_i > 0$, $L_i > 0$ and $w_{ij}$ are such
that the derivative functional corresponding to $c_i(t) \equiv 0$
in (17), namely

$$W(u_t) = \sum_{i=1}^{n} \Big[ -a_i \pi_i u_i^2(0) - \pi_i u_i(0) \sum_{j=1}^{n} w_{ij}\, f_j(u_j(-\tau_{ij})) + \sum_{j=1}^{n} \rho_{ij} \big( u_j^2(0) - u_j^2(-\tau_{ij}) \big) \Big] \qquad (19)$$

is negative definite with a quadratic upper bound. Then
the system (17) has a unique global solution $\bar{u}_i(t)$, $i = 1, \ldots, n$
which is bounded on $\mathbb{R}$ and exponentially stable.
Moreover, this solution is periodic or almost periodic
according to the character of $c_i(t)$ - periodic or almost
periodic respectively.
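
The "limit regime" character of the forced solution can be illustrated on a scalar instance of (17). The following sketch rests on assumptions made only for the illustration: one neuron, f = tanh, the stimulus c(t) = sin(2 pi t / T), the parameter values (with a > |w|), a constant initial function and explicit Euler integration; the function name is hypothetical.

```python
import math


def forced_response(u0, a=1.0, w=0.5, tau=0.5, period=2.0,
                    dt=0.01, t_end=60.0):
    """Explicit Euler integration of
    du/dt = -a u(t) - w tanh(u(t - tau)) + sin(2 pi t / period),
    with the constant initial function u0 on [-tau, 0]."""
    lag = int(round(tau / dt))
    u = [u0] * (lag + 1)
    for k in range(int(round(t_end / dt))):
        stimulus = math.sin(2.0 * math.pi * k * dt / period)
        delayed = u[-1 - lag]
        u.append(u[-1] + dt * (-a * u[-1] - w * math.tanh(delayed) + stimulus))
    return u
```

Runs started from u0 = 2.0 and u0 = -2.0 end up on the same curve, and that curve repeats itself after one stimulus period (200 steps for the default values), which is the numerical counterpart of a unique, exponentially stable periodic solution that does not depend on the initial conditions.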



FUTURE TRENDS 

Supposing the field of AI has its own dynamics, the 
neural networks and their structures will evolve in or- 
der to improve the imitative behavior i.e. more of the 
"natural" intelligence will be transferred to AI. Con- 
sequently, science and technology will deal with new 
structures of various physical natures having multiple 
equilibria. At least the following qualitative behaviors 
will remain under study: stability-like properties (dichotomy,
gradient behavior and so on), synchronization
(forced oscillations, almost linear behavior, chaos 
control) and complex dynamics (including chaotic 
behavior). 



CONCLUSIONS 

Our experience with neural network dynamics shows
that the most important study is to obtain conditions
for gradient or quasi-gradient like behavior. Besides the
comparison method of (Popov, 1979), which requires
relaxation of the condition of the identical dynamics
of all neurons, the most popular tool remains the Lyapunov
method.

If the Lyapunov like lemmas given in BACKGROUND
were available in the time delay case,
then improved Lyapunov functionals remaining constant
on the set of equilibria could ensure a gradient
like behavior.
KEY TERMS

Asymptotic Stability: The solution x(t) of (3) is
called asymptotically stable if it is Lyapunov stable (see
below) and, moreover, there exists $\delta_0 > 0$ such that if
$|x_0 - x(t_0)| < \delta_0$ then $\lim_{t \to \infty} |x(t; t_0, x_0) - x(t)| = 0$.

Fixed Point Theorem: If f(x) is some function of
real variable with real values, the values such that f(x) =
x are called the fixed points of the mapping. In general,
if $f : X \to X$ is a mapping from the metric space X
into itself, the fixed points of this mapping are
defined as above. A fixed point theorem is a theorem
showing under which conditions some mapping has a
fixed point in the corresponding metric space.

Frequency Domain Stability Inequality of Popov:
Consider a feedback structure containing a linear dynamical
block with the transfer function H(s) and a
nonlinear function $\varphi(\sigma)$ subject to the sector condition
$0 < \varphi(\sigma)\sigma < k\sigma^2$. The Popov inequality ensures absolute
stability, i.e. global asymptotic stability of the zero
equilibrium for all nonlinear functions satisfying the
above inequality, and reads as follows: there exists
some $\beta$ such that

$$\frac{1}{k} + \mathrm{Re}\, (1 + j\omega\beta) H(j\omega) > 0, \qquad \forall \omega$$



Global Stability: An equilibrium is globally (asymptotically)
stable if it is the unique equilibrium of the
dynamical system and the property holds globally (its
domain of attraction is the entire state space).

Lyapunov Function: State scalar function defined 
on the state space of a system in order to obtain some 
qualitative properties - stability of equilibria, oscillatory
behavior etc. - using a single function instead of several
i.e. system's state trajectories. A Lyapunov function is
usually positive definite and, along system's trajectories, 
is at least nonincreasing. The definite sign condition 
may also be relaxed for the generalized Lyapunov func- 
tions in the LaSalle sense. The basic physical model 
for the Lyapunov function is system's energy - a state 
function that is nonincreasing along the state trajectory 
being at the same time positive definite. The strength 
of the Lyapunov function is exactly its independence 
of the physical concepts since writing down the stored 
energy of a system is not an easy job except possibly 
such standard cases as mechanical systems or electrical 
circuits. The energy like concepts may be nevertheless 
inspiring when "guessing" a Lyapunov function. In the 
infinite dimensional cases e.g. time delay or propaga- 
tion systems, the Lyapunov function is replaced by a 
Lyapunov functional defined on the infinite dimensional 
state space. 

Oscillations (Self-Sustained and Forced): Type 
of steady state behavior when the state trajectories, 
while remaining bounded, never reach an equilibrium 
but their deviations from this equilibrium keep sign 
changing. Usually an oscillation is viewed as having 
some recurrent properties, being either periodic or 
almost periodic. When the system is autonomous i.e. 
free of external oscillatory signals while nevertheless 
displaying an oscillatory behavior which is sustained 
by non-oscillatory internal factors of the system, it is 
said that this system displays self-sustained oscillations 
(the term belongs to Mandelstamm and Andronov). 
When the system is non-autonomous and subject to 
external oscillatory signals (stimuli), the limit regime 
that occurs is called forced oscillation. 

Phase Portrait: Term borrowed from the Poincare 
theory of the phase (space) plane where this portrait 
is better defined. Its extension to higher order systems 
is mainly informal, based on geometric arguments. 
By phase portrait it is understood the total of state 
trajectories as limit regimes (equilibria, recurrent mo- 
tions, limit sets) and standard trajectories e.g. defined 
by initial conditions. 



Recurrent Neural Network (RNN): Neural net- 
works which display feedback interconnections among 
their units (neurons). Due to these cyclic connections 
RNNs are nonlinear dynamical systems with very rich 
spatial and temporal behaviors: stable and unstable 
fixed points, limit cycles and chaotic behavior. These 
behaviors make them suitable for modeling certain 
cognitive functions such as associative memory, unsu- 
pervised learning, self-organizing maps and temporal 
reasoning. 

Synchronization: Interaction phenomenon among 
coupled subsystems of a system resulting in some 
ordering of their evolution. Its maximal stage is the 
complete synchronization of the subsystems' periods 
resulting in a periodic evolution of the state of the 
entire system. When a system is externally forced by 
an oscillatory signal, synchronization means a limit 
regime of the entire state, which has the same waveform 
as the forcing signal (periodic with the same period if 
the forcing signal is periodic or almost periodic if the 
forcing signal is such). 

Stability: Qualitative property of the solution of
a system with the significance of the limitation of the
perturbations effect on the considered solution viewed
as basic. Among all kinds of stability (bounded input/
bounded output, Lagrange stability, Birkhoff stability,
input-to-state stability) the stability in the sense of Lyapunov
- with respect to the initial conditions, viewed
as incorporating the effect of short-period perturbations
- is the most widely used; it means that sufficiently small
deviations in the initial condition (state) will result in
arbitrarily small deviations in the current state at all
following moments. Rigorously, the basic solution
x(t) of (3) is called stable in the sense of Lyapunov
if, for any $\varepsilon > 0$ arbitrarily small and any $t_0 \in \mathbb{R}$ there
exists some $\delta(\varepsilon, t_0) > 0$ sufficiently small such that if
$|x_0 - x(t_0)| < \delta(\varepsilon, t_0)$, then $|x(t; t_0, x_0) - x(t)| < \varepsilon$ for all
$t \ge t_0$. If in the above definition $\delta$ is independent of the
initial moment $t_0$ the stability is called uniform; from the
practical point of view, this is the more important
notion of stability. It is also a necessary condition
for uniform asymptotic stability (see above).
INTRODUCTION

Power quality (PQ) event detection and classification
is gaining importance due to the worldwide use of delicate
electronic devices. Events such as lightning, large switching
loads, non-linear load stresses, inadequate or incorrect
wiring and grounding, or accidents involving electric
lines can create problems for sensitive equipment if
it is designed to operate within narrow voltage limits,
or if it does not incorporate the capability of filtering
fluctuations in the electrical supply (Gerek et al., 2006;
Moreno et al., 2006).

The solution for a PQ problem implies the acquisition
and monitoring of long data records from the
energy distribution system, along with an automated
detection and classification strategy which allows
identifying the cause of these voltage anomalies. Signal
processing tools have been widely used for this purpose,
and are mainly based on spectral analysis and wavelet
transforms. These second-order methods, the most
familiar to the scientific community, are based on the
independence of the spectral components and evolution
of the spectrum in the time domain. Other tools
are threshold-based algorithms, linear classifiers and
Bayesian networks. The goal of the signal processing
analysis is to get a feature vector from the data record
under study, which constitutes the input to the computational
intelligence module, which has the task of classification.
Some recent works bring a different strategy,
based on higher-order statistics (HOS), in dealing with
the analysis of transients within PQ analysis (Gerek
et al., 2006; Moreno et al., 2006) and other fields of
Science (De la Rosa et al., 2004, 2005, 2007).



Without perturbation, the 50-Hz voltage
waveform exhibits a Gaussian behaviour. Deviations
from Gaussianity can be detected and characterized
via HOS. Non-Gaussian processes need third and
fourth order statistical characterization in order to be
recognized. In other words, second-order moments
and cumulants may not be able to differentiate
non-Gaussian events. The situation described matches
the problem of differentiating between a transient of
long duration named fault (within a signal period), and
a short duration transient (25 per cent of a cycle). The
latter could also bring the 50-Hz voltage to zero instantly
and generally affects the sinusoid dramatically. On the
contrary, the long-duration transient could be considered
as a modulating signal (the 50-Hz signal is the carrier).
These transients are intrinsically non-stationary, so a
battery of observations (sample registers) is necessary
to obtain a reliable characterization.

The main contribution of this work consists of the
application of higher-order central cumulants to characterize
PQ events, along with the use of a competitive
layer as the classification tool. Results reveal that two
different clusters, associated with the two types of transients,
can be recognized in the 2D graph. The successful results
convey the idea that the underlying physical processes
associated with the analyzed transients generate different
types of deviations from the typical effects that noise
causes in the 50-Hz sinusoid voltage waveform.

The paper is organized as follows: the section on higher-order cumulants summarizes the main equations of the cumulants used in the paper. Then, we recall the competitive layer's foundations, along with the Kohonen learning rule. The experiment is then described, and the conclusions are drawn.
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HIGHER-ORDER CUMULANTS 

High-order statistics, known as cumulants, are used to infer new properties about the data of non-Gaussian processes (Mendel, 1991; Nikias & Mendel, 2003). The relationship among the cumulants of r stochastic signals, $\{x_i\}$, and their moments of order p, $p \le r$, can be calculated by using the Leonov-Shiryayev formula (Nandi, 1999; Nikias & Mendel, 2003). For an rth-order stationary random process {x(t)}, the rth-order cumulant is defined as the joint rth-order cumulant of the random variables $x(t), x(t+\tau_1), \ldots, x(t+\tau_{r-1})$,

$$C_{r,x}(\tau_1, \tau_2, \ldots, \tau_{r-1}) = \mathrm{Cum}[x(t), x(t+\tau_1), \ldots, x(t+\tau_{r-1})]. \tag{1}$$

Considering $\tau_1 = \tau_2 = \tau_3 = 0$ in Eq. (1), we have some particular cases for a zero-mean process:

$$\gamma_{2,x} = E\{x^2(t)\} = C_{2,x}(0), \tag{2a}$$

$$\gamma_{3,x} = E\{x^3(t)\} = C_{3,x}(0,0), \tag{2b}$$

$$\gamma_{4,x} = E\{x^4(t)\} - 3(\gamma_{2,x})^2 = C_{4,x}(0,0,0). \tag{2c}$$

Eqs. (2) are measurements of the variance, skewness and kurtosis of the statistical distribution, in terms of the cumulants at zero lags. We will use and refer to normalized quantities because they are shift and scale invariant.



COMPETITIVE LAYERS

The neurons in a competitive layer distribute themselves to recognize frequently presented input vectors. The competitive transfer function accepts a net input vector p for a layer (each neuron competes to respond to p) and returns outputs of 0 for all neurons except for the winner, which is associated with the most positive element of the net input. For zero bias, the neuron whose weight vector is closest to the input vector has the least negative net input and, therefore, wins the competition to output a 1.

The winning neuron moves closer to the input after the input has been presented. The weights of the winning neuron are adjusted with the Kohonen learning rule. If, for example, the ith neuron wins, the elements of the ith row of the input weight matrix (IW) are adjusted as shown in Eq. (3):

$${}_i\mathbf{IW}^{1,1}(q) = {}_i\mathbf{IW}^{1,1}(q-1) + \alpha\,[\mathbf{p}(q) - {}_i\mathbf{IW}^{1,1}(q-1)], \tag{3}$$

where p is the input vector, q is the time instant, and $\alpha$ is the learning rate. The Kohonen rule allows the weights of a neuron to learn an input vector, so it is useful in recognition applications. The winning neuron is more likely to win the competition the next time a similar vector is presented. As more and more inputs are presented, each neuron in the layer closest to a group of input vectors soon adjusts its weight vector toward those inputs. Eventually, if there are enough neurons, every cluster of similar input vectors will have a neuron that outputs "1" when a vector in the cluster is presented.
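The competition and the update of Eq. (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the net input is taken to be the negative Euclidean distance with zero bias, the function names (`net_input`, `winner`, `train`), the learning rate of 0.1, the random presentation order and the sample vectors are all hypothetical.

```python
import random

def net_input(w, p):
    """Net input of a zero-bias competitive neuron: negative Euclidean distance."""
    return -sum((wi - pi) ** 2 for wi, pi in zip(w, p)) ** 0.5

def winner(weights, p):
    """Index of the neuron with the most positive (least negative) net input."""
    return max(range(len(weights)), key=lambda i: net_input(weights[i], p))

def train(weights, inputs, alpha=0.1, epochs=20, seed=0):
    """Kohonen rule: only the winning row moves, by alpha * (p - row)."""
    rng = random.Random(seed)            # fixed seed, for repeatability only
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for p in rng.sample(inputs, len(inputs)):   # shuffled presentation
            i = winner(weights, p)
            weights[i] = [wi + alpha * (pi - wi) for wi, pi in zip(weights[i], p)]
    return weights

# Hypothetical 2-D feature vectors forming two groups (not data from the paper).
data = [(0.9, 0.1), (1.0, 0.0), (1.1, 0.1), (0.1, 0.9), (0.0, 1.0), (0.1, 1.1)]
start = [(0.55, 0.55), (0.55, 0.55)]     # both neurons start at the midpoint
trained = train(start, data)
labels = [winner(trained, p) for p in data]
```

With zero bias a neuron that never wins is never updated; the sketch relies on the two groups lying on opposite sides of the starting point, which is an assumption about the data rather than a general guarantee.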



EXPERIMENTAL RESULTS 

The aim is to differentiate between two classes of PQ events, named long-duration and short-duration. The experiment comprises two stages. The feature extraction stage is based on the computation of cumulants: the coordinates of each feature vector are the maximum and the minimum of the 4th-order central cumulant sequence. The classification stage then applies the competitive layer to the feature vectors. We use a two-neuron competitive layer, which receives two-dimensional input feature vectors during network training.

We analyze 16 real-life registers of 1000 points each during the feature extraction stage. Before the computation of the cumulants, two pre-processing actions are performed on the sample signals. First, they are normalized because they exhibit voltage levels of very different magnitudes. Secondly, a high-pass digital filter (5th-order Butterworth model with a characteristic frequency of 150 Hz) eliminates the low-frequency components, which are not the targets of the experiment. This filtering also enhances the non-Gaussian characteristics of the signals, which are reflected in the higher-order cumulants. Fig. 1 shows the comparison of the two types of events.

[Figure 1. Analysis of a long-duration transient and of a short-duration transient. For each event, the panels show the signal against time (s) and the 2nd-, 3rd- and 4th-order sliding cumulants against the number of segment.]

After pre-processing, a battery of sliding central cumulants (2nd-, 3rd- and 4th-order) is calculated. Each cumulant is computed over 50 points; this window length has been chosen to be neither so long that it covers the whole signal nor too short. The algorithm calculates the three central cumulants over 50 points and then jumps to the following starting point; as a consequence, we have 98 per cent overlapping sliding windows (49/50 = 0.98). Each computation over a window (called a segment) outputs three cumulants.
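The sliding computation described above, together with the (maximum, minimum) feature vector of the 4th-order sequence, can be sketched as follows. This is a minimal illustration, assuming biased sample estimates of the central moments, a step of one sample between windows, and no normalization of the cumulants; the function names are hypothetical and do not come from the paper.

```python
def central_cumulants(segment):
    """2nd-, 3rd- and 4th-order central cumulants at zero lags (biased estimates)."""
    n = len(segment)
    mean = sum(segment) / n
    centred = [v - mean for v in segment]
    m2 = sum(v ** 2 for v in centred) / n
    m3 = sum(v ** 3 for v in centred) / n
    m4 = sum(v ** 4 for v in centred) / n
    return m2, m3, m4 - 3.0 * m2 ** 2    # Eqs. (2a), (2b), (2c)

def sliding_cumulants(signal, window=50, step=1):
    """One (c2, c3, c4) triple per segment; step=1 gives 49/50 overlap."""
    last_start = len(signal) - window
    return [central_cumulants(signal[k:k + window])
            for k in range(0, last_start + 1, step)]

def feature_vector(signal, window=50):
    """2-D feature: (maximum, minimum) of the 4th-order cumulant sequence."""
    c4 = [c[2] for c in sliding_cumulants(signal, window)]
    return max(c4), min(c4)
```

For a 1000-point register and a 50-point window, a one-sample step yields 951 segments under these assumptions.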

The signal processing analysis indicates that the 2nd-order cumulant sequence (the variance) clearly reveals the presence of an event. Both types of transients exhibit an increasing variance in the neighbourhood of the PQ event, with the same shape and only one maximum. The magnitude of this maximum is the only available feature that can be used to distinguish different events from the second-order point of view.

In the classification stage, the two-dimensional representation (2-D feature vectors) yields clearly separable 2-D graphs for the 4th-order cumulants. The 3rd-order diagrams do not show clearly distinct clusters because the maxima and minima are similar. It may be possible to differentiate PQ events from the 3rd-order perspective if we consider more features in the input vector (perhaps 3-D feature vectors), such as the number of extrema (maxima and minima) and the order in which the maxima and minima appear as time increases.

The sliding 4th-order cumulants exhibit clear differences, not only in the shape of the time-domain graphs but also in the location of the minima, which suggests a clustering of the points in the 2-D feature space. Fig. 2 shows an example of a 4th-order cumulant sequence comparison for the two types of transients. For each sample register (data record), the sliding 4th-order cumulant sequence is calculated (as in Fig. 2), and its maximum and minimum are detected and taken as the coordinates of a point in the feature space.

Figure 2. Comparison of 4th-order cumulant sequences for two types of transients

Fig. 3 presents the results of the training stage, using the Kohonen rule. The horizontal (vertical) axis corresponds to the maxima (minima) values. Each cross in the diagram corresponds to an input vector, and the circles indicate the final location of the weight vectors (after learning) of the two neurons of the competitive layer. Before training, both weight vectors pointed to the asterisk, which is the initializing point (the midpoint of the input intervals).

Figure 3. Competitive layer training results over 20 epochs. Upper cluster: short-duration PQ events. Lower cluster: long-duration events.

The separation between classes (inter-class distance) is well defined, and both types of PQ events are clustered. The correct configuration of the clusters is corroborated during the simulation of the neural network, in which we obtained an approximate classification accuracy of 97 per cent. During the simulation, new signals (randomly selected from our database) were processed using this methodology. The accuracy of the classification results increases with the number of data. To evaluate the confidence of these statistics, a significance test has been conducted; it indicates that the number of measurements is statistically adequate.



CONCLUSION

In this paper we have proposed an automatic method to detect and classify two PQ transients, named short- and long-duration. The method comprises two stages. The first includes pre-processing (normalizing and filtering) and outputs the 2-D feature vectors, whose coordinates correspond to the maximum and minimum of the central cumulants. The second stage uses a neural network to classify the signals into two clusters. This stage differs in nature from the one used in (Gerek et al., 2006), which consists of quadratic classifiers. The configuration of the clusters is assessed during the simulation of the network, in which we have obtained acceptable classification accuracy.
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KEY TERMS 

Artificial Neural Networks: A network of many 
simple processors ("units" or "neurons") that imitates 
a biological neural network. The units are connected 
by unidirectional communication channels, which 
carry numeric data. Neural networks can be trained 
to find nonlinear relationships in data, and are used 
in applications such as robotics, speech recognition, 
signal processing or medical diagnosis. 

Cluster: A set of occurrences that share similar characteristics, extracted from previously analyzed signals.

Cumulants: Statistics that characterize a probability distribution. A distribution with given cumulants can be approximated through the Edgeworth series.

Competitive Layer: The neurons in a competitive 
layer distribute themselves to recognize frequently 
presented input vectors. 

HOS: Higher-Order Statistics; the set of statistics of order higher than 2. Their main advantage is noise rejection for symmetrically distributed processes.

Power Quality: The branch of research which studies the techniques for the assessment of the quality of electricity.

Transient: A signal which vanishes with time and usually has a short duration. Transients are very common in industrial applications and may occur either in a repeatable fashion or as random impulses.
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INTRODUCTION 

Biometrics offers potential for automatic personal identification and verification. Unlike other means of personal verification, biometric means are not based on the possession of an object (such as a card) or the knowledge of some information (such as a password). There is considerable interest in biometric authentication based on automatic signature verification (ASV) systems because ASV has been shown to be superior to many other biometric authentication techniques, e.g. fingerprints or retinal patterns, which are reliable but much more intrusive and expensive. An ASV system is a system capable of efficiently deciding whether a signature is genuine or forged. Numerous pattern recognition methods have been applied to signature verification. Among the methods that have been proposed for pattern recognition in ASV, two broad categories can be identified: memory-based methods and parameter-based methods such as neural networks. The major approaches to ASV systems are the template matching approach, spectrum approach, spectrum analysis approach, neural networks approach, cognitive approach and fractal approach.

This article reviews the ASV techniques and approaches that have so far been proposed in the literature. An attempt is made to describe important techniques, especially those involving ANNs, and to assess their performance based on the published literature. The article also discusses possible future areas for research on ASV.



BACKGROUND 

As any human production, handwriting is subject to many variations from very diverse origins: historical, geographic, ethnic, social, psychological, etc. (Bouletreau, 1998). ASV is a difficult problem because signature samples from the same person are similar but not identical. In addition, a person's signature often changes radically during their lifetime (Hou, 2004). Although these factors can affect a given instance of a person's writing, writing style develops as the writer learns to write, as do consistencies which are typically retained (Guo, 1997). One of the methods used by expert document examiners is to try to exploit these consistencies and identify those which are both stable and difficult to imitate. In general, ASV systems can be categorized into two kinds: on-line and off-line systems. In on-line systems, electronic devices capture the dynamics of the signature, which makes it possible to record more information about the signing process and improves system performance. In off-line approaches, this dynamic information is lost and only a static image is available, which makes it quite difficult to define effective global or local features for verification.

Three different types of forgeries are usually taken into account in ASV systems: random forgeries, produced without knowing either the name of the signer or the shape of his signature; simple forgeries, produced knowing the name of the signer but without having an example of his signature; and skilled forgeries, produced by people who, looking at an original instance of the signature, attempt to imitate it as closely as possible. The problem of signature verification becomes more difficult when passing from random to simple and skilled forgeries, the latter being so difficult a task that even human beings make errors in several cases. It is worth pointing out that several systems proposed up to now, while performing reasonably well on a single category of forgeries, decrease in performance when working with all the categories simultaneously, and generally this decrement is bigger than one would expect (Abuhaiba, 2007; Ferrer, 2005).
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Numerous pattern recognition methods have been 
applied to signature verification (Plamondon, 1989). 
Among the methods that have been proposed for pattern recognition, two broad categories can be identified: memory-based techniques, in which incoming patterns are matched to a (usually large) dictionary of templates, and parameter-based methods, in which pre-processed patterns are sent to a trainable classifier such as a neural network (Lippmann, 1987). Memory-based recognition methods require a large memory space to store the templates, while a neural network is a parameter-based approach which requires only a small amount of memory to store the connection weights among neurons. Mighell et al. (Mighell, 1989) were apparently the first to apply NNs to off-line signature classification. Sabourin and Drouhard (Sabourin, 1992) presented a method based on directional probability density functions together with a backpropagation neural network (BPN) to detect random forgeries. Qi and Hunt (Qi, 1996) used global and grid features with a simple Euclidean distance classifier. Sansone and Vento (Sansone, 2000) proposed a sequential three-stage multi-expert system, in which the first expert eliminates random and simple forgeries, the second isolates skilled forgeries, and the third gives the final decision by combining the decisions of the previous stages together with reliability estimations. Baltzakis and Papamarkos (Baltzakis, 2001) developed a two-stage neural network, in which the first stage gets the decisions from neural networks and Euclidean distance classifiers supplied by the global, grid and texture features, and the second combines the four decisions using a radial basis function (RBF) neural network.



MAIN FOCUS OF THE CHAPTER 

As mentioned above, the major approaches to signature verification systems are the template matching approach, spectrum approach, spectrum analysis approach, neural networks approach, cognitive approach and fractal approach. Rigid template matching, the simplest and earliest approach to pattern recognition, can successfully distinguish random forgeries from genuine signatures, but cannot detect skilled forgeries effectively. The statistical approach, including HMMs, Bayesian methods and so on, can detect random forgeries as well as skilled forgeries. The structural approach shows good performance when distinguishing genuine signatures from forgeries, but it may yield a combinatorial explosion of possibilities to be investigated, demanding large training sets and very large computational effort. The spectrum analysis approach can be applied to different languages, including English and Chinese. Moreover, it can be applied to either on-line or off-line verification systems.

The neural networks approach offers several advantages, such as unified approaches for feature extraction and classification, and flexible procedures for finding good, moderately nonlinear solutions. It also shows reasonable performance in both on-line and off-line signature verification.

Neural Networks on ASV 

Multi-layer perceptron (MLP) neural networks are among the most commonly used classifiers for pattern recognition problems. Despite their advantages, they suffer from some serious limitations that make their use impractical for some problems. The first limitation is the size of the neural network: very large neural networks are very difficult to train, and as the amount of training data increases this difficulty becomes a serious obstacle. The second difficulty is that the geometry and size of the network, the training method used and the training parameters depend substantially on the amount of training data. Also, in order to specify the structure and size of the neural network, it is necessary to know a priori the number of classes that the network will have to deal with. Unfortunately, for a practical ASV system, a priori knowledge about the number of signatures and the number of signature owners is not available (Baltzakis, 2001).

For the BPN case, a learning law is used to modify 
weight values based on an output error signal propagated 
back through the network. From random initial values, 
the weights are changed according to this learning law 
that uses a learning rate and a smoothing rate which 
sometimes allows a faster convergence of the training 
phase. The training phase is critical, especially when 
the data to be classified are not clearly distinguishable 
and when there are not enough examples to conduct 
training. In this case, the training phase can be very 
long and it may even be impossible to obtain an accept- 
able performance. Usually a criterion for stopping the 
training phase is defined. After that, several rejection 
methods are evaluated to improve the decision taken by 
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this kind of classifier. Finally, the number of neurons 
in the hidden layer of the BPN is adjusted in order to 
increase the global performance of the first stage of 
the ASV (Drouhard, 1996).
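As an illustration of the learning law described above, a common textbook form of the backpropagation weight update is sketched below. It assumes that the "smoothing rate" corresponds to the usual momentum coefficient; the symbols $\eta$ (learning rate), $\mu$ (smoothing rate), $E$ (output error) and $w_{ij}$ (a network weight) are introduced here for illustration only and are not taken from the cited works.

```latex
\Delta w_{ij}(t) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \mu \, \Delta w_{ij}(t-1),
\qquad
w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)
```

With $\mu = 0$ this reduces to plain gradient descent; a positive $\mu$ reuses part of the previous step, which is how such a term can speed up convergence when successive gradients point in similar directions.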

An interesting aspect of BPN is that, during the learning process, the hidden layers build an internal representation of the inputs that is useful to produce the output (Looney, 1997). (Fleming, 1990) used a two-stage NN with the same number of neurons in the input and output layers, and fewer units in the hidden layer. This forces the network to encode the inputs in a lower-dimensional space, retaining most of the relevant information in a way equivalent to the Principal Component Analysis (PCA) method. This class of networks is known as compression networks. An important property of compression networks is that they can act as auto-associative or content-addressable memories (Kohonen, 1977; Valentin, 1994). This means that these networks are able to acceptably reconstruct a degraded pattern when a noisy input is given, or to complete an incomplete input pattern (O'Toole, 1993). The quality of the results will depend on the number of hidden units of the compression network.

On the other hand, Syntactic NNs can model stochas- 
tic and non-stochastic grammars. Learning is therefore 
a process of grammatical inference and recognition a 
process of parsing. Note that this has great generality; 
by varying the grammar we can encompass a wide range 
of pattern recognition models. The stochastic nets are 
properly probabilistic and are powerful discrimina- 
tors; the non-stochastic are less powerful, but have 
straightforward silicon implementation with existing 
technology. Learning in syntactic nets may proceed 
supervised or unsupervised (Lucas, 1990). 

Combined Classifiers Approaches 

(Baltzakis, 2001) presents a different technique for off-line signature recognition and verification. The proposed system addresses the above-mentioned BPN problems by reducing the training computation time (each neural network corresponds to only one signature owner) and the size of the neural networks used (the feature set is split into three different groups: global features, grid features and texture features). For each of these feature sets, a special two-stage perceptron OCON (one-class-one-network) classification structure has been implemented. In the first stage, the classifier combines the decision results of the neural networks and the Euclidean distance obtained using the three feature sets. The results of the first-stage classifier feed a second-stage radial basis function (RBF) neural network structure, which makes the final decision.

To effectively verify skilled forgeries, a fuzzy neural 
network named Pseudo Outer-Product based Fuzzy 
Neural Network (POPFNN) is integrated into the sig- 
nature verification system described in (Zhou, 1996). 
As a hybrid of fuzzy systems and neural networks, the 
POPFNN possesses many advantages such as high 
computational capability and learning ability when 
compared against other techniques used in signature 
verification systems. As hybrid intelligent systems, 
fuzzy NNs possess the advantages of both NNs and 
fuzzy rule-based systems and are particularly powerful 
in handling complex, non-linear and imprecise prob- 
lems such as ASV. Besides, the membership functions 
and fuzzy rules identified in the POPFNN give more 
transparency to the decision making process. These 
advantages make the proposed fuzzy neural network 
driven signature verification system particularly power- 
ful and robust even in dealing with skilled forgeries. In 
(Zhou, 1996), POPFNN operates in two fundamental 
modes, the learning mode and the classification mode. 
In the learning mode, a collection of training signature 
samples is used to train POPFNN. Feature vectors ex- 
tracted from the training signature samples are utilized 
to initialize and adjust the parameters of POPFNN, 
including membership functions, fuzzy rules, and 
weights of the links. In the classification mode, POPFNN 
performs pure classification without self-modification. 
Feature vectors extracted from the unknown signatures 
are fed into POPFNN and the corresponding outputs 
are obtained at the output layer of POPFNN. 

(Bromley, 1994) presents an algorithm based on a novel NN, called a "Siamese" neural network. This network has two input fields to compare two patterns and one output whose state value corresponds to the similarity between the two patterns. During training, the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Training was carried out using a modified version of BP. All weights could be learnt, but the two sub-networks were constrained to have identical weights.
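The weight-sharing idea can be illustrated with a small sketch. It is not the architecture of (Bromley, 1994): the single tanh layer, the use of cosine similarity as the joining measure, the function names and the numeric values are all assumptions made only to show that both inputs pass through the same weights before being compared.

```python
import math

def embed(x, weights):
    """Shared sub-network: one tanh layer (hypothetical architecture)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def cosine_similarity(u, v):
    """Joining measure assumed here: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def siamese_output(x1, x2, weights):
    """Both inputs go through the SAME weights, then the features are compared."""
    return cosine_similarity(embed(x1, weights), embed(x2, weights))

# Hypothetical shared weights and signature feature vectors.
shared = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
sig_a = [1.0, 2.0, 0.5]
sig_b = [0.9, 2.1, 0.4]
similarity = siamese_output(sig_a, sig_b, shared)
```

Because the weights are shared, the output is symmetric in its two inputs, and a signature compared with itself gives the maximum similarity of 1.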
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FUTURE TRENDS 

Notwithstanding the enormous work carried out in the field of signature verification, several questions still remain unresolved. New solutions to these problems will determine the conditions under which the signature verification systems of the near future will be developed. The selection of the most suitable set of features for a signer is one of the relevant open questions, and the use of new approaches for classification is still an open problem. Genetic algorithms (GA) have been recently used for this purpose (Xuhua, 1996). Another promising area of research concerns multi-expert verification, which combines hard (Dimauro, 1997) and soft (Plamondon, 1992) decisions, based on parallel (Qi, 1995), serial (Cardot, 1991) or hybrid strategies (Cordella, 2000).

In the framework of a handwritten text recognition application, (Heutte, 2004) developed a multi-agent system able to manage the interaction between different contextual levels of handwriting interpretation. The EMAC (Hernoux, 1999) environment has been specified from the constraints imposed by their handwriting interpretation system. That work presents this platform as an aid to implementing specific collaboration or cooperation schemes between agents, which bring out new trends in the automatic reading of handwritten texts and could be implemented for automatic signature verification systems.

(Balkrishana, 2007) recently presented a Colour Code Algorithm for signature recognition, a task generally carried out by human operators. The algorithm therefore simulates human behavior, to achieve perfection and skill through AI. The logic that decides the extent of validity of the signature must implement Artificial Intelligence. Pattern recognition is the science that concerns the description or classification of measurements, usually based on an underlying model. In future, the system can be configured using Neural Networks and a Fuzzy Rule base, where on-line training of recognition is possible.

A list of companies involved in the production of signature verification systems is given in (Kalenova, 2004), along with a short description of the products available. Although signature verification is not one of the safest biometric solutions, its use in business practice is still justified, primarily because the signature is a de facto means of confirming a person's identity and will therefore provide a far less disruptive migration to an advanced technology than any other biometric can. Thus, signature verification has a very promising future.



CONCLUSION 

Automatic signature verification is a very attractive problem for researchers. This article presents a review of approaches to Automatic Signature Verification using Neural Networks. The main aspects related to the training process are discussed. Although some approaches report False Reject Rates and False Acceptance Rates ranging from 2% to 5%, system developers cannot compare their results due to the lack of a widely accepted protocol for experimental tests, as well as the absence of large, public signature databases. A useful bibliography is also provided for interested readers.
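For readers unfamiliar with the two error rates mentioned above, the following sketch shows how they are typically computed from verification scores. It assumes that a higher score means "more likely genuine" and that a signature is accepted when its score reaches a threshold; the function name, the scores and the threshold are hypothetical and are not results from any cited system.

```python
def error_rates(genuine_scores, forgery_scores, threshold):
    """Return (FRR, FAR) for a verifier that accepts when score >= threshold."""
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in forgery_scores if s >= threshold)
    frr = false_rejects / len(genuine_scores)   # False Reject Rate
    far = false_accepts / len(forgery_scores)   # False Acceptance Rate
    return frr, far

# Hypothetical scores: one genuine signature is rejected, one forgery accepted.
genuine = [0.90, 0.80, 0.40, 0.95]
forgeries = [0.10, 0.60, 0.30, 0.20]
frr, far = error_rates(genuine, forgeries, threshold=0.5)
```

Raising the threshold lowers the False Acceptance Rate at the cost of a higher False Reject Rate, which is one reason results are hard to compare without a shared test protocol.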
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KEY TERMS 

Agent-Based Model: A specific individual-based computational model for computer simulation, closely related to complex systems, the Monte Carlo method, multi-agent systems and evolutionary programming. The idea is to construct computational devices (agents with some properties) and then simulate them in parallel to model real phenomena.

Automatic Signature Verification: A procedure that determines whether a handwritten signature is genuine or a forgery when a person requests identity verification.

Backpropagation Algorithm: Learning algorithm 
of ANNs, based on minimising the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 



Feature Selection: The technique, commonly used 
in machine learning, of selecting a subset of relevant fea- 
tures for building robust learning models. Its objective 
is three-fold: improving the prediction performance of 
the predictors, providing faster and more cost-effective 
predictors, and providing a better understanding of the 
underlying process that generated the data. 

Fuzzy Logic: Derived from fuzzy set theory dealing 
with reasoning that is approximate rather than precisely 
deduced from classical predicate logic. It can be thought 
of as the application side of fuzzy set theory dealing 
with well thought out real world expert values for a 
complex problem. 

Genetic Algorithms: A genetic algorithm is a technique used for searching or programming. It is used in computing to find exact or approximate solutions to optimization and search problems of various types, and as a component of evolutionary computation. Genetic algorithms are based on biological events: they mimic biological evolution.

Principal Component Analysis: A technique used to reduce multidimensional data sets to lower dimensions for analysis. PCA involves the computation of the eigenvalue decomposition of the covariance matrix of a data set, usually after mean centering the data for each attribute.
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INTRODUCTION 

Computational Intelligence (CI) consists of an evolving collection of methodologies often inspired by nature (Bonissone, Chen, Goebel & Khedkar, 1999; Fogel, 1999; Pedrycz, 1998). Two popular methodologies of CI are neural networks and fuzzy systems.

Lately, a unification was proposed in CI, at a "data 
level", based on lattice theory (Kaburlasos, 2006). 
More specifically, it was shown that several types of 
data including vectors of (fuzzy) numbers, (fuzzy) sets, 
1D/2D (real) functions, graphs/trees, (strings of) sym- 
bols, etc. are partially(lattice)-ordered. In conclusion, a 
unified cross-fertilization was proposed for knowledge 
representation and modeling based on lattice theory with 
emphasis on clustering, classification, and regression 
applications (Kaburlasos, 2006). 

Of particular interest in practice is the totally-ordered lattice $(R,\le)$ of real numbers, which has emerged historically from the conventional measurement process of successive comparisons. It is known that $(R,\le)$ gives rise to a hierarchy of lattices including the lattice $(F,\le)$ of fuzzy interval numbers, or FINs for short (Papadakis & Kaburlasos, 2007).

This article shows extensions of two popular neural networks, i.e. fuzzy-ARTMAP (Carpenter, Grossberg, Markuzon, Reynolds & Rosen, 1992) and self-organizing map (Kohonen, 1995), as well as an extension of conventional fuzzy inference systems (Mamdani & Assilian, 1975), based on FINs. Advantages of the aforementioned extensions include both a capacity to rigorously deal with nonnumeric input data and a capacity to introduce tunable nonlinearities. Rule induction is yet another advantage.



BACKGROUND 

Lattice theory has been compiled by Birkhoff (Birkhoff,
1967). This section summarizes selected results regarding
a Cartesian product lattice $(L,\leq) = (L_1,\leq_1) \times \dots \times (L_N,\leq_N)$
of constituent lattices $(L_i,\leq_i)$, $i=1,\dots,N$.

Given an isomorphic function $\theta_i: (L_i,\leq_i) \to (L_i,\leq_i)^{\partial}$
in a constituent lattice $(L_i,\leq_i)$, $i=1,\dots,N$, where
$(L_i,\leq_i)^{\partial} = (L_i,\leq_i^{\partial})$ denotes the dual (lattice) of lattice $(L_i,\leq_i)$,
then an isomorphic function $\theta: (L,\leq) \to (L,\leq)^{\partial}$ is given
by $\theta(x_1,\dots,x_N) = (\theta_1(x_1),\dots,\theta_N(x_N))$.

Given a positive valuation function $v_i: (L_i,\leq_i) \to R$
in a constituent lattice $(L_i,\leq_i)$, $i=1,\dots,N$, then a positive
valuation $v: (L,\leq) \to R$ is given by
$v(x_1,\dots,x_N) = v_1(x_1) + \dots + v_N(x_N)$.

It is well-known that a positive valuation $v_i: (L_i,\leq_i) \to R$
in a lattice $(L_i,\leq_i)$ implies a metric function
$d_i: L_i \times L_i \to R_0^+$ given by $d_i(a,b) = v_i(a \vee b) - v_i(a \wedge b)$.

Minkowski metrics $d_p$ in the product lattice
$(L,\leq) = (L_1,\leq_1) \times \dots \times (L_N,\leq_N)$ are given by

$d_p(x,y) = \left[ d_1^p(x_1,y_1) + \dots + d_N^p(x_N,y_N) \right]^{1/p}$,

where $x = (x_1,\dots,x_N)$, $y = (y_1,\dots,y_N)$, $p \in R$.

An interval $[a,b]$ in a lattice $(L,\leq)$ is defined as the
set $[a,b] = \{x \in L: a \leq x \leq b,\ a,b \in L\}$. Let $\tau(L)$ denote the set
of intervals in a lattice $(L,\leq)$. It turns out that $(\tau(L),\leq)$
is a lattice, ordered by set inclusion.



Definition 1. The size $Z_p: \tau(L) \to R_0^+$ of a lattice
$(L,\leq)$ interval $[a,b] \in \tau(L)$, with respect to a
positive valuation $v: (L,\leq) \to R$, is defined as
$Z_p([a,b]) = d_p(a,b)$.
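
As an illustration of the results summarized above, the following minimal Python sketch computes the metric induced by a positive valuation and the resulting Minkowski metric. It assumes that every constituent lattice is the chain of real numbers (so join is `max` and meet is `min`); the function names `valuation_metric` and `minkowski` are illustrative only and do not come from the cited literature.

```python
def valuation_metric(v, a, b):
    """d(a,b) = v(a join b) - v(a meet b) on the chain of real numbers."""
    return v(max(a, b)) - v(min(a, b))


def minkowski(x, y, valuations, p=1.0):
    """d_p(x,y) = [sum_i d_i(x_i,y_i)^p]^(1/p) on a product of N real chains."""
    total = sum(valuation_metric(v, a, b) ** p
                for v, a, b in zip(valuations, x, y))
    return total ** (1.0 / p)


def identity(t):
    return t


def cube(t):
    # Strictly increasing, hence a positive valuation on the reals.
    return t ** 3


# With v(x) = x in both dimensions and p = 2 the Euclidean distance is recovered.
print(minkowski((1.0, 2.0), (4.0, 6.0), [identity, identity], p=2))
# A nonlinear valuation in the second dimension changes the geometry.
print(minkowski((1.0, 2.0), (4.0, 0.0), [identity, cube], p=1))
```

Swapping the valuation in one dimension is the simplest way to see how a "tunable nonlinearity" enters the distance computation.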



NEURAL/FUZZY COMPUTING BASED 
ON LATTICE THEORY 

This section delineates modified extensions to a hierarchy
of lattices stemming from the totally ordered lattice
$(R,\leq)$ of real numbers. Then, it details the relevance of
novel mathematical tools. Next, based on the previous
mathematical tools, this section presents extensions
of ART/SOM/FIS. Finally, it discusses comparative
advantages.

Modified Extensions in a Hierarchy of 
Lattices 

Consider the product lattice $(\Delta,\leq) = (R \times R, \leq^{\partial} \times \leq) =
(R \times R, \geq \times \leq)$ of generalized intervals. A generalized
interval (element in $\Delta$) will be denoted by $[a,b]$ and
will be called positive (negative) for $a \leq b$ ($a > b$). The
set of positive (negative) generalized intervals will be
denoted by $\Delta_+$ ($\Delta_-$). We remark that the set of positive
generalized intervals is isomorphic to the set of
conventional intervals in the set R of real numbers.

A decreasing function $\theta_R: R \to R$ is an isomorphic
function $\theta_R: (R,\leq) \to (R,\leq)^{\partial}$; furthermore, a strictly
increasing function $v_R: R \to R$ is a positive valuation
$v_R: (R,\leq) \to R$. Hence, function $v_\Delta: (\Delta,\leq) \to R$ given by
$v_\Delta([a,b]) = v_R(\theta_R(a)) + v_R(b)$ is a positive valuation in
lattice $(\Delta,\leq)$. There follows a metric function
$d_\Delta: \Delta \times \Delta \to R_0^+$ given by
$d_\Delta([a,b],[c,d]) = [v_R(\theta_R(a \wedge c)) - v_R(\theta_R(a \vee c))] + [v_R(b \vee d) - v_R(b \wedge d)]$;
in particular, for $\theta_R(x) = -x$ and $v_R(x) = x$ it follows
$d_\Delta([a,b],[c,d]) = |a-c| + |b-d|$. Choosing
parametric functions $\theta_R(.)$ and $v_R(.)$ there follow tunable
nonlinearities in lattice $(R,\leq)$. Moreover, note that
$\Delta$ is a real linear space with

addition defined as [a,b] + [c,d] = [a+c,b+d], 

and 

multiplication (by a real k) defined as k[a,b] = 

[ka,kb]. 

It turns out that $\Delta_+$ (as well as $\Delta_-$) is a cone in linear
space $\Delta$. Recall that a subset C of a linear space is
called a cone if for all $x \in C$ and $\lambda > 0$, we have $\lambda x \in C$.
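
The metric between generalized intervals described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a generalized interval is modelled as a pair `(a, b)`, and the default choices `theta(t) = -t` and `v(t) = t` are the linear special case mentioned in the text; the name `d_delta` is hypothetical.

```python
def d_delta(x, y, theta=lambda t: -t, v=lambda t: t):
    """Metric between generalized intervals x = [a,b] and y = [c,d].

    theta is assumed decreasing and v strictly increasing, as in the text.
    """
    (a, b), (c, d) = x, y
    # First coordinate is ordered by the dual order, hence theta is applied.
    left = v(theta(min(a, c))) - v(theta(max(a, c)))
    # Second coordinate uses the conventional order of real numbers.
    right = v(max(b, d)) - v(min(b, d))
    return left + right


# Linear case: the distance reduces to |a-c| + |b-d|.
print(d_delta((1, 3), (2, 5)))
# A nonlinear valuation (here v(t) = t**3) gives a tunable nonlinearity.
print(d_delta((1, 3), (2, 5), v=lambda t: t ** 3))
```

In practice the parameters of `theta` and `v` would be tuned on data; here they are fixed for clarity.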

Definition 2. A generalized interval number (GIN) is
a function $f: (0,1] \to \Delta$.

Let G denote the set of GINs. It follows that $(G,\leq)$ is
a lattice, in particular $(G,\leq)$ is the Cartesian product of
lattices $(\Delta,\leq)$. Moreover, G is a real linear space with

addition defined as $(G_1 + G_2)(h) = G_1(h) + G_2(h)$,
$h \in (0,1]$, and

multiplication (by a real k) defined as $(kG)(h) =
kG(h)$, $h \in (0,1]$.



We remark that the cardinality of set G equals
$\aleph_1^{\aleph_1} = (2^{\aleph_0})^{\aleph_1} = 2^{\aleph_0 \aleph_1} = 2^{\aleph_1} = \aleph_2 > \aleph_1$,
where $\aleph_1$ is the cardinality of the set R of real numbers.

Proposition 3. Consider metric(s) $d_\Delta: \Delta \times \Delta \to R_0^+$ in
lattice $(\Delta,\leq)$. Let $G_1,G_2 \in (G,\leq)$. Assuming that
the following integral exists, a metric function
$d_G: G \times G \to R_0^+$ is given by

$d_G(G_1,G_2) = \int_0^1 d_\Delta(G_1(h),G_2(h))\,dh$.



Our interest here focuses on the sublattice $(F,\leq)$ of
lattice $(G,\leq)$, namely the sublattice of fuzzy interval
numbers (FINs). A FIN is defined rigorously as follows.

Definition 4. A fuzzy interval number (FIN) F is a GIN
such that either (1) both $F(h) \in \Delta_+$ and
$h_1 \leq h_2 \Rightarrow F(h_1) \geq F(h_2)$, for all $h \in (0,1]$ (positive FIN) or (2)
there is a positive FIN P such that $F(h) = -P(h)$,
for all $h \in (0,1]$ (negative FIN).

Let $F_+$ ($F_-$) denote the set of positive (negative) FINs.
Note that both $F_+ \cup F_- = F$ and $F_+ \cap F_- = \emptyset$ hold.
Furthermore, $F_+$ ($F_-$) is a cone with cardinality $\aleph_1$ (Kaburlasos
& Kehagias, 2006). The previous mathematical analysis
may potentially produce useful techniques based on
lattice vector theory (Vulikh, 1967). A positive FIN
will simply be called "FIN". A FIN may admit different
interpretations including a (fuzzy) number, an interval,
and a cumulative distribution function.
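
To make the integral metric between FINs concrete, the sketch below approximates it numerically. It rests on several assumptions that are not in the original text: a FIN is modelled as a Python function mapping a level `h` in (0,1] to an interval `(a, b)`, triangular fuzzy numbers are used as example FINs, the linear interval metric `|a-c| + |b-d|` is used at each level, and the integral is approximated by a midpoint rule. All function names are illustrative.

```python
def triangular_fin(left, peak, right):
    """Positive FIN of a triangular fuzzy number: h -> interval at level h."""
    return lambda h: (left + h * (peak - left), right - h * (right - peak))


def d_interval(x, y):
    """Linear-case metric between two intervals: |a-c| + |b-d|."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])


def d_fin(f1, f2, levels=1000):
    """Midpoint-rule approximation of the integral of d_interval over (0,1]."""
    step = 1.0 / levels
    return step * sum(d_interval(f1((k + 0.5) * step), f2((k + 0.5) * step))
                      for k in range(levels))


f1 = triangular_fin(0.0, 1.0, 2.0)
f2 = triangular_fin(1.0, 2.0, 3.0)
# f2 is f1 shifted by 1, so both interval ends differ by 1 at every level.
print(d_fin(f1, f2))
```

Because the intervals shrink as `h` grows, the same representation also covers crisp intervals (constant in `h`) and single numbers (degenerate intervals).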

Relevance of Novel Mathematical Tools 

A fundamental mathematical result in fuzzy set theory
is the "resolution identity theorem", which states that
a fuzzy set can, equivalently, be represented either by
its membership function or by its α-cuts (Zadeh, 1975).
The aforementioned theorem has been given little
attention in practice to date. However, some authors
have capitalized on it by designing effective as well as
efficient fuzzy inference systems (FIS) involving fuzzy
numbers whose α-cuts are conventional closed intervals
(Uehara & Fujise, 1993; Uehara & Hirota, 1998).

This work builds on the abovementioned mathematical
result as follows. In the first place, we drop the
possibilistic interpretation of a membership function. Then,
we consider the corresponding "α-cuts representation".
Next, we consider the metric cone $F_+^N$ of (positive) FINs.
In conclusion, we propose extensions of established
neural/fuzzy algorithms, including ART (adaptive
resonance theory), SOM (self-organizing map), and FIS
(fuzzy inference systems), in $F_+^N$ (Kaburlasos, 2007).
A novelty of this work is an improved mathematical
notation, which emphasizes relevance with the
aforementioned "resolution identity theorem".

An Extension of Fuzzy-ARTMAP 

A fuzzy-ARTMAP extension, namely fuzzy lattice 
reasoning (FLR), is presented in this section based 
on a similarity measure (function) defined in the fol- 
lowing. 

Definition 5. A similarity measure in a set S is a function
$\mu: S \times S \to (0,1]$, which satisfies the following
conditions.

(S1) $\mu(a,b) = 1 \Leftrightarrow a = b$.

(S2) $\mu(a,b) = \mu(b,a)$.

(S3) $\frac{1}{\mu(a,b)} + \frac{1}{\mu(x,x)} \leq \frac{1}{\mu(a,x)} + \frac{1}{\mu(x,b)}$.

A similarity measure is defined based on a metric 
function next. 

Proposition 6. If function $d: S \times S \to R_0^+$ is a metric
then function $\mu: S \times S \to (0,1]$ given by $\mu(a,b) =
1/[1+d(a,b)]$ is a similarity measure.
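
The construction in Proposition 6 can be checked numerically. The following sketch is an illustration, not a proof: it assumes the metric `d(a,b) = |a-b|` on a handful of real numbers and verifies the three similarity conditions on every triple; the helper name `similarity_from_metric` is hypothetical.

```python
import itertools


def similarity_from_metric(d):
    """Return mu(a,b) = 1 / (1 + d(a,b)) for a given metric d."""
    return lambda a, b: 1.0 / (1.0 + d(a, b))


def absolute_difference(a, b):
    return abs(a - b)


mu = similarity_from_metric(absolute_difference)
points = [-2.0, 0.0, 0.5, 3.0]

for a, b, x in itertools.product(points, repeat=3):
    # (S1) similarity equals 1 exactly when the arguments coincide.
    assert (mu(a, b) == 1.0) == (a == b)
    # (S2) symmetry.
    assert mu(a, b) == mu(b, a)
    # (S3) the triangle inequality of d, expressed through reciprocals of mu.
    assert 1 / mu(a, b) + 1 / mu(x, x) <= 1 / mu(a, x) + 1 / mu(x, b) + 1e-12

print(mu(0.0, 3.0))
```

The small tolerance in the last check only absorbs floating-point rounding.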

FLR for Training 

FLR-0: A set $RB = \{(u_1,C_1),\dots,(u_L,C_L)\}$ is given, where
$u_l \in F_+^N$ and $C_l \in C$, $l=1,\dots,L$ is a class label in the
finite set C.

FLR-1: Present the next input pair $(x_i,K_i) \in F_+^N \times C$,
$i=1,\dots,n$ to the initially "set" RB.

FLR-2: If no more pairs are "set" in RB then store
input pair $(x_i,K_i)$ in the RB; $L \leftarrow L+1$; goto step
FLR-1.
Else, compute the similarity $\mu(x_i,u_l)$ of input $x_i \in F_+^N$
with a "set" element $u_l \in F_+^N$, $l=1,\dots,L$ in RB.

FLR-3: Competition among the "set" pairs in the RB:
Winner is pair $(u_J,C_J)$ such that $J = \arg\max_{l \in \{1,\dots,L\}} \mu(x_i,u_l)$.
In case of multiple winners, choose the one with the
smallest size $Z_p(u_J)$.

FLR-4: Assimilation Condition: Both (1) size $Z_p(x_i \vee u_J)$
is less than a user-defined threshold size $Z_{crit}$, and
(2) $K_i = C_J$.

FLR-5: If the Assimilation Condition is not satisfied
then "reset" the winner pair $(u_J,C_J)$; goto step
FLR-2.
Else, replace the winner $u_J$ by the join-interval
$x_i \vee u_J$; goto step FLR-1.

The corresponding testing phase is carried out by
winner-take-all competition based on the similarity
measure function $\mu(.,.)$.
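
The FLR training and testing steps can be sketched as follows. This is a simplified illustration under explicit assumptions, not the authors' implementation: inputs are N-dimensional hyperboxes (lists of `(low, high)` pairs, i.e. FINs that are constant over all levels), the linear metric `|a-c| + |b-d|` is summed over dimensions, the size of a box is the sum of its side lengths, and all function names are hypothetical.

```python
def join(x, u):
    """Smallest box containing both boxes (lattice join, per dimension)."""
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(x, u)]


def size(u):
    """Size of a box: sum of side lengths (assumes v(t) = t, p = 1)."""
    return sum(b - a for a, b in u)


def mu(x, u):
    """Similarity 1 / (1 + d) with d the summed interval metric."""
    dist = sum(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(x, u))
    return 1.0 / (1.0 + dist)


def flr_train(data, z_crit):
    rules = []                                  # rule base of [box, label]
    for x, label in data:                       # FLR-1: next input pair
        active = set(range(len(rules)))         # all rules start as "set"
        while True:
            if not active:                      # FLR-2: nothing left, store
                rules.append([x, label])
                break
            # FLR-3: most similar rule wins; ties go to the smaller box.
            winner = max(active,
                         key=lambda l: (mu(x, rules[l][0]), -size(rules[l][0])))
            joined = join(x, rules[winner][0])
            # FLR-4/5: assimilate if the join is small and labels agree.
            if size(joined) < z_crit and rules[winner][1] == label:
                rules[winner][0] = joined
                break
            active.discard(winner)              # "reset" the winner
    return rules


def flr_classify(rules, x):
    """Testing phase: winner-take-all on the similarity measure."""
    return max(rules, key=lambda rule: mu(x, rule[0]))[1]


training = [([(0.0, 0.0)], "A"), ([(0.2, 0.2)], "A"), ([(1.0, 1.0)], "B"),
            ([(0.1, 0.1)], "A"), ([(1.1, 1.1)], "B")]
rule_base = flr_train(training, z_crit=0.5)
print(rule_base)
print(flr_classify(rule_base, [(0.9, 0.9)]))
```

Each learned box together with its label can be read as a rule "if the input falls in this box then the class is ...", which is the rule-induction capacity mentioned in the text.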

An Extension of SOM 

A straightforward SOM extension, namely granular
SOM (grSOM), is presented in this section in cone
$F_+^N$.

grSOM for Training 

GR-0: The user defines the size L of a L×L grid of
neurons. Each neuron can store both an N-dimensional
FIN $W_{i,j} \in F_+^N$, $i,j \in \{1,\dots,L\}$ and a class
label $C_{i,j} \in C$, where C is a finite set. Initially all
neurons are uncommitted.

GR-1: Memorize the first training data pair $(x_1,K_1) \in
F_+^N \times C$ by committing, randomly, a neuron in the
L×L grid.

Repeat the following steps a user-defined number
$N_{epochs}$ of epochs.

GR-2: For each training datum $(x_k,K_k) \in F_+^N \times C$, $k=1,\dots,n$
"reset" all L×L grid neurons. Then carry out the
following computations.

GR-3: Calculate the Minkowski metric distance
$d_1(x_k,W_{i,j})$ between $x_k$ and committed neurons $W_{i,j}$,
$i,j \in \{1,\dots,L\}$.

GR-4: Competition among the "set" (and, committed)
neurons in the L×L grid: Winner is neuron (I,J)
whose weight $W_{I,J}$ is the nearest to $x_k$, that is
$(I,J) = \arg\min_{i,j \in \{1,\dots,L\}} d_1(x_k,W_{i,j})$.

GR-5: Assimilation Condition: Both (1) Vector $W_{i,j}$ is
in the neighborhood of vector $W_{I,J}$ on the L×L
grid, and (2) $C_{I,J} = K_k$.

GR-6: If the Assimilation Condition is satisfied then
compute a new value $W'_{i,j}$ as

$W'_{i,j} = \left[ 1 - \frac{h(k)}{1+d_1(W_{I,J},W_{i,j})} \right] W_{i,j} + \frac{h(k)}{1+d_1(W_{I,J},W_{i,j})}\, x_k$.

Else, "reset" the winner (I,J); goto GR-4.

GR-7: If all the L×L neurons are "reset" then commit an
uncommitted neuron from the grid, and memorize
the current training datum $(x_k,K_k)$.
If there are no more uncommitted neurons then
increase L by one.

The corresponding testing phase is carried out by
winner-take-all competition based on the Minkowski
metric $d_1(.,.)$.

An Extension of FIS 

The basic idea towards novel FIS analysis and design
is to employ a similarity measure function $\mu(X,A_i) =
1/[1+d(X,A_i)]$, where $X,A_i \in F_+^N$, as a fuzzy membership
function regarding a rule $R_i: A_i \to C_i$, where $A_i \in
F_+^N$, $C_i \in F_+^M$, $i=1,\dots,L$ (Kaburlasos & Kehagias, 2007).
Advantages are presented in the following.

Comparative Advantages 

First, an important advantage of the mathematical tools
above is that the proposed ART/SOM/FIS extensions
can handle, in any combination, numeric and/or nonnumeric
data; the latter include fuzzy numbers, intervals,
and cumulative distribution functions.

Second, we can employ parametric decreasing
(increasing) functions $\theta_R: R \to R$ ($v_R: R \to R$) in a data
dimension, where the function parameters can be
estimated/tuned optimally towards improving
performance.

Third, the proposed ART/SOM/FIS extensions can 
induce descriptive decision-making knowledge (i.e. 
rules) from the training data. 

Fourth, regarding the FLR, note that a similarity
measure function $\mu(.,.)$ can effectively replace an inclusion
measure function $\sigma(.,.)$. Recall that the latter (function)
had replaced both of fuzzy-ARTMAP's Choice
(Weber) function and Match function (Kaburlasos &
Petridis, 2000; Kaburlasos, Athanasiadis & Mitkas,
2007). The reason behind the aforementioned "effective"
replacement is that an inclusion measure $\sigma(A,B)$,
or $\sigma(B,A)$, considers mainly one of $A,B \in F_+^N$; whereas,
a similarity measure $\mu(A,B)$ considers both $A,B \in F_+^N$
based on their corresponding metric distance.

Fifth, regarding the proposed SOM extension, note
that this work carries out computations in the cone $F_+$ of
FINs for faster data processing compared to a previous
version of grSOM (Kaburlasos & Papadakis, 2006).

Sixth, regarding the proposed FIS, novel advantages 
include a capacity to generalize beyond a fuzzy rule's 
support. The latter implies, potentially, an alleviation 
of the "curse of dimensionality" problem regarding 
the number of rules. 



FUTURE TRENDS 

Data processing of FINs by multilayer perceptrons is
straightforward, as described in (Kaburlasos &
Christoforidis, 2006), and it will be pursued in future work.



CONCLUSION 

This article has presented novel mathematical tools for
unified analysis and design of neural/fuzzy systems.
We built on fuzzy set theory's "resolution identity
theorem". Nevertheless, in the first place, we dropped
the possibilistic interpretation of a membership function.
Then, we considered the corresponding "α-cuts
representation". Our interest focused on fuzzy interval
numbers, or FINs for short, which can represent (fuzzy)
numbers, intervals, and cumulative distribution functions.
Based on lattice theory, we showed that the space
of FINs is a metric cone. In conclusion, this work
opens up the possibility of designing FIN-to-FIN maps
implementable on neural/fuzzy architectures, including
tunable nonlinearities.

KEY TERMS

ART: ART stands for Adaptive Resonance Theory,
a biologically inspired neural paradigm originally
intended for clustering binary patterns. An analog pattern
version of ART, namely fuzzy-ART, is applicable in the
unit hypercube. The corresponding neural network for
classification is called fuzzy-ARTMAP.

Dual (Lattice): Given a lattice $(L,\leq)$, its dual lattice,
symbolically $(L,\leq)^{\partial}$ or $(L,\leq^{\partial}) = (L,\geq)$, is a lattice
with the inverse order relation ($\geq$).

FIS: FIS stands for Fuzzy Inference System,
an architecture for reasoning involving fuzzy sets
(typically fuzzy numbers) based on fuzzy logic.

Isomorphic (Function): Given two lattices $(L_1,\leq_1)$
and $(L_2,\leq_2)$, an isomorphic function is a bijective
(one-to-one) function $\varphi: (L_1,\leq_1) \to (L_2,\leq_2)$ such that
$x \leq_1 y \Leftrightarrow \varphi(x) \leq_2 \varphi(y)$.

Lattice: A lattice is a poset $(L,\leq)$ any two of whose
elements have both a greatest lower bound (g.l.b.),
denoted by $x \wedge y$, and a least upper bound (l.u.b.),
denoted by $x \vee y$.

Poset: A partially ordered set (or, poset, for short)
is a pair $(P,\leq)$, where P is a set and $\leq$ is an order
relation on P. The latter (relation) by definition satisfies
(1) $x \leq x$, (2) $x \leq y$ and $y \leq x \Rightarrow x = y$, and
(3) $x \leq y$ and $y \leq z \Rightarrow x \leq z$.

Positive Valuation (Function): Given a lattice
$(L,\leq)$, a positive valuation is a function $v: (L,\leq) \to R$,
which satisfies both $v(x)+v(y) = v(x \wedge y)+v(x \vee y)$ and
$x < y \Rightarrow v(x) < v(y)$.




Rule Induction: Process of learning, from cases 
or instances, if-then rule relationships that consist of 
an antecedent (if-part, defining the preconditions or 
coverage of the rule) and a consequent (then-part, 
stating a classification, prediction, or other expres- 
sion of a property that holds for cases defined in the 
antecedent). 

SOM: SOM stands for Self-Organizing Map. That 
is a biologically inspired neural paradigm for clustering 
analog patterns. SOM is often used for visualization of 
nonlinear relations of multi-dimensional data. 

Sublattice: A sublattice $(S,\leq)$ of a lattice $(L,\leq)$
is another lattice such that both $S \subseteq L$ and
$x,y \in S \Rightarrow x \wedge y,\ x \vee y \in S$.

INTRODUCTION

The Self-Organizing Map (Kohonen, 1997) is an effective
and very popular tool for data clustering and
visualization. With this method, the input samples are
projected into a low-dimensional space while preserving
their topology. The samples are described by a set of
features. The input space is generally a high-dimensional
space $R^d$. 2D or 3D maps are very often used for
visualization in a low-dimensional space (2 or 3).

For many applications, usually in psychology, biology,
genetics, image and signal processing, such a vector
description is not available; only pair-wise dissimilarity
data is provided. For instance, applications in text
mining or DNA exploration are very important in this
field, and the observations are usually described through
their proximities expressed by the "Levenshtein", or
"String Edit", distances (Levenshtein, 1966). The first
approach consists of the transformation of a dissimilarity
matrix into a true Euclidean distance matrix. A
straightforward strategy is to use "Multidimensional
Scaling" techniques (Borg & Groenen, 1997) to provide
a feature space, so that the initial vector SOM algorithm
can be naturally used. If this transformation involves
great distortions, the initial vector model for SOM is
no longer valid, and the analysis of dissimilarity data
requires specific techniques (Jain & Dubes, 1988; Van
Cutsem, 1994); the Dissimilarity Self-Organizing Map
(DSOM) is a new one.

Consequently, adaptation of the Self-Organizing
Map (SOM) to dissimilarity data is of growing interest.
During the last decade, different propositions
emerged to extend the vector SOM model to pair-wise
dissimilarity data. The main motivation is to cope with
large proximity databases for data mining. In this article,
we present a new adaptation of the SOM algorithm,
which is compared with two existing ones.

BACKGROUND

Basically, there are two main approaches to the SOM
extension dealing with dissimilarity data. The first
one uses a probabilistic framework, as for example
in Graepel & Obermayer (1999), where a topographic
mapping of proximity is derived by simulated annealing.
The second approach directly uses the initial SOM
framework and adapts its two usual steps (affectation,
representation) to dissimilarity data, as for example
in Kohonen & Somervuo (1998, 2002), in El Golli,
Conan-Guez & Rossi (2004), and in Ambroise &
Govaert (1996).

Our work is inspired by this last approach and we 
have compared our proposal to the algorithms proposed 
by Kohonen (Kohonen & Somervuo, 1998) (Kohonen 
& Somervuo, 2002) and by El Golli et al. (El Golli, 
Conan-Guez & Rossi, 2004). Three metrics for quality 
estimate (quantization and neighborhood) are used for 
comparison. Numerical experiments on artificial and 
real data show the quality of the algorithm. The strong 
point of the proposed algorithm comes from a more 
accurate prototype estimate which is one of the most 
difficult parts of Dissimilarity SOM algorithms. 

The major difficulty of the DSOM is the constraint
on the output data representation. For the (vector) SOM
algorithm, there is a latent data model for each output
prototype (a spherical distribution of which the prototype
is the barycentre). For DSOM, there is no data model
for each output prototype. One referent observation is
explicitly associated to each output prototype instead
of tuning it by barycentre processing. This referent
is usually chosen among the input observations at the
end of an optimization process. Consequently, several
prototypes can unfortunately share the same referent,
and these collisions produce great distortions in the
output map. To avoid this difficulty, we propose here
an implicit referent for each prototype, which is adapted
during training iterations. So there is no collision during
the learning phase and, consequently, the projection quality
is greatly enhanced.



ADAPTATION OF SOM FOR 
DISSIMILARITY DATA 

This article presents a new DSOM algorithm for dis- 
similarity data. We will first present DSOM algorithms 
which have been directly derived from the initial SOM 
framework. In the next parts, we will present in detail 
our proposed algorithm and some experiments to show 
its effectiveness in comparison with the other DSOM 
algorithms. 

Description of DSOM Algorithms 

Basically, the starting point of the DSOM algorithm is
the "batch" algorithm of the initial vector SOM. Let
us recall this "batch" algorithm. At each iteration, the
entire dataset is presented. We consider a dataset X of N
observations, $X = \{o_i,\ i = 1..N\}$. The SOM is configured
with C nodes (neurons) a priori interconnected on the
output map, where $\delta(c,l)$ is the distance between the
nodes c and l. At iteration t, each node is represented
by a prototype $\omega_c$ in the input space. After an
initialization step, an affectation step and a representation
step are sequentially processed at each iteration. The
role of the former is to assign to each observation $o_i$
the best matching unit $\omega_{c^*}$, according to the Euclidean
distance. The affectation function is:

$c^* = \arg\min_{c} \left[ d^2(o_i, \omega_c) \right]$   (1)

Thus, a partition of the whole dataset is realized.
In the latter, the prototype $\omega_c$ is adjusted to represent
each partition $X_c$ as well as possible. This prototype is
computed as the weighted average of the input samples.
The weights are evaluated through the neighborhood
function $h_T(.)$, which is a non-increasing function of the
distance on the map and is controlled by a radius parameter
$T(t)$ decreasing with time. At the end, the prototype $\omega_c$
is the gravity centre of the partition $X_c$.

This representation step cannot be directly transposed
to dissimilarity data. An alternative implementation
is to approximate these C prototypes by referent
observations belonging to the initial dataset X. This
step then becomes very time-consuming: all the input
observations are candidates and must be evaluated. Some
strategies to reduce the computation time have been
proposed (Conan-Guez, Rossi & El Golli, 2006).

Let us denote by $D = [d_{ij}]$, $i,j = 1..N$, the dissimilarity
data. These dissimilarities describe a non-metric space.
However, for all the DSOM algorithms, we consider
symmetric dissimilarities.

For the DSOM proposed by Kohonen, each prototype
will be represented by one referent observation,
$\omega_c = o_{r(c)}$. During the initialisation step, C observations
in the input dataset are randomly assigned to the prototypes.
For the affectation step, the affectation function
simply uses the input dissimilarity data. Each observation
is assigned to the nearest prototype:

$f(i) = \arg\min_{c} \left( d_{i\,r(c)} \right) = \arg\min_{c} \left( d(o_i, o_{r(c)}) \right)$   (2)

For the representation step, a new observation $o_{r(c)}$
is assigned to the prototype $\omega_c$ by minimizing the
following cost function:

$r(c) = \arg\min_{j} \left( E(c,j) \right)$   (3)

where $E(c,j)$ is the weighted local distortion if $o_j$ is the
referent of the prototype $\omega_c$:

$E(c,j) = \sum_{i} h_T(\delta(c,f(i)))\, d^2(o_i,o_j)$   (4)

The global cost function which is then minimized
is the global distortion over all the prototypes:

$E_g = \sum_{c} E(c,r(c))$   (5)



For the representation step, different variants are 
possible. The neighborhood function in Eq. (4) can be 
simply integrated on the neighborhood of the prototype 
(the search is realized over the union of the partitions 
inside an output neighborhood) and not on the weighted 
dissimilarities. It is the "set Mean search". Also, the 
exponent '2' in Eq. (4) can be omitted: it is the "set 
Median search". 
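
One batch iteration of a referent-based DSOM of this kind (affectation by Eq. (2), then referent search by Eqs. (3) and (4)) can be sketched as below. Several elements are assumptions made only for the example: the map is a one-dimensional chain of nodes with map distance `|c - l|`, the neighborhood function is a Gaussian kernel, and the function names are hypothetical. The parameter `gamma` is the exponent applied to the dissimilarities (2 as in Eq. (4), or 1 when the exponent is omitted).

```python
import math


def neighbourhood(delta, radius):
    """Assumed Gaussian kernel; any non-increasing function would do."""
    return math.exp(-delta ** 2 / (2.0 * radius ** 2))


def affectation(D, referents):
    """Eq. (2): assign each observation to the node with the nearest referent."""
    nodes = range(len(referents))
    return [min(nodes, key=lambda c: D[i][referents[c]]) for i in range(len(D))]


def representation(D, f, n_nodes, radius, gamma=2):
    """Eqs. (3)-(4): for each node pick the observation minimizing E(c, j)."""
    n = len(D)
    referents = []
    for c in range(n_nodes):
        def cost(j, c=c):
            return sum(neighbourhood(abs(c - f[i]), radius) * D[i][j] ** gamma
                       for i in range(n))
        referents.append(min(range(n), key=cost))
    return referents


# Toy data: two well-separated groups on a line, dissimilarity = |difference|.
coords = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
D = [[abs(a - b) for b in coords] for a in coords]

f = affectation(D, referents=[0, 3])
print(f)
print(representation(D, f, n_nodes=2, radius=0.2))
```

Note that the referent search visits every observation for every node, which is why the text describes this step as time-consuming.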

Different prototypes can share the same referent
(collision) when the search for the referent observations
is limited to the input observations. There is then an
ambiguity for the affectation step in the next iteration.
This is the major difficulty of this approach. In some
applications, for instance for symbol string organization,
it is possible to search the "Median" or "Mean"
outside the initial set: the referents in DSOM are not
necessarily represented by elements belonging to the
input space. But this optimization is an NP-hard problem.
See Martinez, Juan & Casacuberta (2001) for a
comparison of different strategies.

El Golli, Conan-Guez & Rossi (2004) propose a slightly
different approach. The theoretical
interest of this approach is the possibility to represent a
prototype by more than one referent observation (q > 1).
This makes it possible to take into account a more complex
latent data structure (a multimodal distribution, for instance) for
each partition. Unfortunately, in practice it is difficult to
choose the number (q) of referents per prototype, and the
optimization step becomes even more time-consuming.
Let us describe here the algorithm for q = 1.

For the affectation step, a distance between an observation
and a prototype is defined in Eq. (6). When the
neighborhood is decreasing, this distance converges
towards the initial dissimilarity. The representation step
is the same as previously. At convergence, these two
algorithms are similar.



$d^T(o_i,\omega_c) = \sum_{l} h_T(\delta(c,l))\, d^2(o_i,o_{r(l)})$   (6)

Ambroise & Govaert (1996) propose a different
approach inspired by the SEM (Stochastic Expectation
Maximization) algorithm. The representation step is
a "set Median search". The assignation step uses a
stochastic process to affect each observation to the
prototypes by a multinomial distribution (the proportions
depend on the neighbourhood function and the
affectation to the prototypes).

Description of the Proposed DSOM
Algorithm

As explained previously, the difficulty is the representation
step, due to the lack of a data model. The set of referent
candidates is finite, and distortions occur if several
prototypes share the same referent. To overcome this
situation, we propose an implicit representation step.
Let us remark that, during training and until convergence, a
referent observation is only used to define the distance
between a prototype and an input observation. So, we
will define a proximity measure $D^T(o_i,\omega_c)$ without
explicit referent to the prototype $\omega_c$. The representation
phase will simply adapt this proximity considering the
new partition of the observations and the update of the
neighborhood function. This simple implementation
comes at a cost: it is necessary to define a data model
based on latent Euclidean assumptions.

Let us consider a set X of vector samples, $X = \{x_i,\
i = 1..N,\ x_i \in R^d\}$. Let g be the gravity centre of X and
$I(X)$ its inertia. All the samples have the same uniform
weight (1/N). The inertia with respect to any observation e is
then defined and decomposed thanks to the Huygens
theorem:

$I(X,e) = \frac{1}{N} \sum_{x_i \in X} d^2(e,x_i) = d^2(g,e) + I(X)$   (7)

Moreover, $I(X)$ can be computed by considering all
the distances $d(x_i,x_j)$:

$I(X) = \frac{1}{N^2} \sum_{x_i \in X} \sum_{x_j \in X,\, j>i} d^2(x_i,x_j)$   (8)

Thus, with the Euclidean hypothesis, there is no need to
know the gravity centre g for computing the distance
of any observation e to this gravity centre: $d^2(g,e) =
I(X,e) - I(X)$.
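
The Huygens decomposition can be verified on a small example: the squared distance from an observation to the gravity centre is recovered from pair-wise squared distances only, without ever computing the centre. The sketch below assumes uniform weights and Euclidean data; the function names are illustrative.

```python
def inertia(D2):
    """Eq. (8): inertia from the pair-wise squared distances D2[i][j]."""
    n = len(D2)
    return sum(D2[i][j] for i in range(n) for j in range(i + 1, n)) / n ** 2


def squared_distance_to_centre(D2, e_row):
    """Eq. (7) rearranged: d^2(g,e) = I(X,e) - I(X).

    e_row[i] holds the squared distance between e and sample i.
    """
    return sum(e_row) / len(e_row) - inertia(D2)


# Four corners of a square; the gravity centre is (1, 1).
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
D2 = [[(ax - bx) ** 2 + (ay - by) ** 2 for (bx, by) in points]
      for (ax, ay) in points]

# e is the first corner: its squared distance to (1, 1) is 1 + 1 = 2.
print(squared_distance_to_centre(D2, D2[0]))
```

This identity is what allows a prototype to remain implicit when only dissimilarities are available.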

We apply this principle to dissimilarity data. The
input data is denoted $o_i$ instead of $x_i$ for vector data. The
same formula is generalized to non-uniformly weighted
observations. Let us consider one partition $X_c$ associated
to the prototype $\omega_c$ after the affectation step. The
proximity between an observation $o_i$ and the prototype
$\omega_c$ is then defined by using the weights $m_{j/c}$ for each
observation $o_j$ given the prototype $\omega_c$. The inertia $I(X_c)$
is evaluated over all the weighted dissimilarities:

$D^T(o_i,\omega_c) = I(X_c,o_i) - I(X_c) = \sum_{o_j \in X} m_{j/c}\, d^2(o_i,o_j) - I(X_c)$, with $m_{j/c} = \frac{h_T(\delta(c,f(j)))}{\sum_{l} h_T(\delta(c,f(l)))}$   (9)

$I(X_c) = \sum_{o_i \in X} \sum_{o_j \in X,\, j>i} m_{i/c}\, m_{j/c}\, d^2(o_i,o_j)$   (10)

Therefore, the algorithm is the following:

Initialization step: Start from an initial partition $X_c$,
$c = 1..C$, obtained for instance by an affectation from an
initial random set of referent observations.

Representation step: For all prototypes $\omega_c$ and
observations $o_i$, compute the weights $m_{i/c}$ in Eq.
(9) and the inertia $I(X_c)$ in Eq. (10); update the
neighborhood function for the next iteration.

Affectation step: Affect each observation to a
prototype $\omega_c$ according to the minimum distance
in Eq. (9):

$f(i) = \arg\min_{c} \left[ D^T(o_i,\omega_c) \right]$
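
The representation and affectation steps above can be sketched as one function operating directly on a dissimilarity matrix. This is an illustrative sketch under assumptions, not the authors' code: the map is a one-dimensional chain of nodes, the neighborhood function is a Gaussian kernel, and the names `weights`, `weighted_inertia`, `implicit_distance` and `dsom_iteration` are hypothetical.

```python
import math


def neighbourhood(delta, radius):
    """Assumed Gaussian kernel on the map distance."""
    return math.exp(-delta ** 2 / (2.0 * radius ** 2))


def weights(f, c, radius):
    """Normalized weights of all observations for node c (see Eq. (9))."""
    raw = [neighbourhood(abs(c - node), radius) for node in f]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_inertia(D, m, gamma=2):
    """Eq. (10): inertia from weighted pair-wise dissimilarities."""
    n = len(D)
    return sum(m[i] * m[j] * D[i][j] ** gamma
               for i in range(n) for j in range(i + 1, n))


def implicit_distance(D, m, i, gamma=2):
    """Eq. (9): proximity of observation i to the implicit prototype."""
    return (sum(m[j] * D[i][j] ** gamma for j in range(len(D)))
            - weighted_inertia(D, m, gamma))


def dsom_iteration(D, f, n_nodes, radius, gamma=2):
    """Representation step followed by the affectation step."""
    node_weights = [weights(f, c, radius) for c in range(n_nodes)]
    return [min(range(n_nodes),
                key=lambda c: implicit_distance(D, node_weights[c], i, gamma))
            for i in range(len(D))]


coords = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
D = [[abs(a - b) for b in coords] for a in coords]

# Start from a deliberately noisy partition and apply one iteration.
print(dsom_iteration(D, [0, 0, 1, 0, 1, 1], n_nodes=2, radius=0.5))
```

With Euclidean data and `gamma=2`, `implicit_distance` equals the squared distance to the weighted gravity centre, so no referent observation is needed and no two nodes can collide on the same referent.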



The representation step and affectation step are 
sequentially computed up to convergence. The training 
parameters for the decreasing neighborhood function 
follow the usual recommendations for SOM algorithms : 
fast, then slow decrease (http://www.cis.hut.fi/projects/ 
somtoolbox/documentation/) . 

After convergence, if necessary for visualization of
the final map, a referent observation can be associated
to each prototype according to a "set Mean search"
(or set Median) or a "Mean search" (or Median), for
instance.

In the following, we will compare three DSOM algorithms,
respectively called DSOM(K), DSOM(EG) and DSOM
for our proposal. To compare the "set Mean" and "set
Median" approaches for the three algorithms, $d^2(o_i,o_j)$
will be substituted by $d^\gamma(o_i,o_j)$: "set Median"
corresponds to $\gamma = 1$ and "set Mean" to $\gamma = 2$. Different
power values $\gamma$ will also be tested. Other transformations
may be applied to a dissimilarity matrix to transform
it into a distance matrix, such as adding a constant, or
combining both (Joly & le Calve, 1994). The "adding
constant" method produces great distortions in the
initial dissimilarity data, which our experiments confirm.
The "power" method gives better results.

Concerning the computation time, these DSOM
algorithms are equivalent, but the reasons differ. For
DSOM(K) and DSOM(EG), the representation step
is the most time-consuming one due to the optimization
for each referent. With our proposal, this optimization
is implicit, but this step remains time-consuming
because of the computation of the weights $m_{j/c}$ and
inertia $I(X_c)$.

Methodology Description of the 
Experiment 

To evaluate the three DSOM algorithms, two metrics will
be used. The first one is the classical quantization error
($E_g$). The second one concerns topology preservation.
Among existing criteria, we have chosen two measures,
given in Eq. (11), which are compatible with dissimilarity
data: the "trustworthiness" ($M_1$) and the "continuity"
($M_2$) (Venna & Kaski, 2001). The trustworthiness
relates to the error caused by new observations that appear in
an output neighborhood while they are not in the input
neighborhood; conversely for the continuity. $M_1$
and $M_2$ are evaluated as a function of the number (k) of
nearest neighbors and normalized between 0 and
1. For visualization, according to Venna & Kaski, the
trustworthiness is more important than the continuity.
The larger $M_1(k)$ and $M_2(k)$ are, the better the
projection quality is. We also compute the integrated
$M_i(k)$ up to a neighborhood containing 10% of the whole
sample set: these values ($\bar{M}_i$) measure the quality of the
local topology preservation.




$M_1(k) = 1 - \frac{2}{Nk(2N-3k-1)} \sum_{i=1}^{N} \sum_{o_j \in U_k(o_i)} \left( r(o_i,o_j) - k \right)$

$M_2(k) = 1 - \frac{2}{Nk(2N-3k-1)} \sum_{i=1}^{N} \sum_{o_j \in V_k(o_i)} \left( \hat{r}(o_i,o_j) - k \right)$   (11)

With $C_k(o_i)$, $\hat{C}_k(o_i)$ the sets of the k first neighbors of $o_i$
in the input space, output space;

$U_k(o_i) = \{ o_j \mid o_j \in \hat{C}_k(o_i) \wedge o_j \notin C_k(o_i) \}$

$V_k(o_i) = \{ o_j \mid o_j \in C_k(o_i) \wedge o_j \notin \hat{C}_k(o_i) \}$

$r(o_i,o_j)$, $\hat{r}(o_i,o_j)$ the ranks of $o_j$ in the neighbourhood of
$o_i$ in the input space, output space.
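
Both measures only need ranks, so they can be computed from two distance matrices: one in the input space and one on the output map. The sketch below is an illustration under assumptions: ties in distances are broken by index order, `k` is assumed smaller than half the number of observations (so that the normalization is valid), and the function names are hypothetical.

```python
def ranks(D, i):
    """rank[j] = position (1 = nearest) of observation j among neighbours of i."""
    order = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])
    return {j: position + 1 for position, j in enumerate(order)}


def trustworthiness(D_in, D_out, k):
    """M_1(k): penalize output neighbours that are not input neighbours."""
    n = len(D_in)
    penalty = 0
    for i in range(n):
        r_in, r_out = ranks(D_in, i), ranks(D_out, i)
        penalty += sum(r_in[j] - k for j in r_in
                       if r_out[j] <= k and r_in[j] > k)
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


def continuity(D_in, D_out, k):
    """M_2(k): same computation with the roles of the two spaces swapped."""
    return trustworthiness(D_out, D_in, k)


def line_distances(coords):
    return [[abs(a - b) for b in coords] for a in coords]


# Input positions and an output projection that swaps the two middle points.
D_in = line_distances([0.0, 1.0, 2.5, 4.5])
D_out = line_distances([0.0, 2.5, 1.0, 4.5])

print(trustworthiness(D_in, D_in, k=1), continuity(D_in, D_in, k=1))
print(trustworthiness(D_in, D_out, k=1), continuity(D_in, D_out, k=1))
```

A perfect projection yields 1 for both measures; the swapped projection is penalized on both.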

Three databases are used. The first one is an artificial
dataset: 100 uniform samples in $R^2$; the dissimilarity
data is the exact Euclidean distance, and the configuration
parameter $\gamma$ is set to 2. The second one is the "Chicken
Silhouette" (http://algoval.essex.ac.uk:8080/data/se- 
quence/chicken/chicken.tgz). This data consists of 446 
samples (binary images of chicken parts) categorized in 
5 classes. The distance matrix is calculated according to 
"AngleCostFunction" (Barbara Spillmann, 2004) based 
on the local orientation of the sample contours. The 
third dataset is larger. It is extracted from the SCOWL 
word lists (http://wordlist.sourceforge.net/). After some 
reduction of plural and possessive forms from a small 
English dictionary, the dataset consists of 2000 words. 
The Levenshtein distance (Levenshtein, 1966) is then 
used to calculate the pair-wise dissimilarities. 
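
For reference, a pair-wise dissimilarity matrix of this kind can be built with the classical dynamic-programming formulation of the Levenshtein distance. The sketch below uses unit costs for insertion, deletion and substitution; the word list is a made-up example and is not taken from the SCOWL data.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]


def dissimilarity_matrix(words):
    """Symmetric matrix of pair-wise Levenshtein distances."""
    return [[levenshtein(a, b) for b in words] for a in words]


words = ["map", "maps", "cap", "neuron"]
for row in dissimilarity_matrix(words):
    print(row)
```

The resulting matrix is symmetric with a zero diagonal, which matches the symmetric dissimilarities assumed by the DSOM algorithms.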



Results 

On the artificial dataset, the performances of the three 
algorithms are very similar (Table 1). With a vector 
SOM, the results are identical. The map is a hexagonal 
one with a grid of 5x5 neurons. 

As expected, the behaviour of the three algorithms
differs on the real datasets. With the "Chicken" database,
the map is a hexagonal one with a grid of 7x7
neurons. DSOM presents the best topology preservation
according to $M_1(k)$ and $M_2(k)$ (Fig. 1.a), and the
best compromise between quantization and topology
preservation (Table 2). When varying y, we observe an
evolution of these criteria. Each algorithm
exhibits a different value for the optimal power y: y
= 1 for DSOM(K), y = 1.5 for DSOM(EG), y = 3 for
DSOM. However, y = 1 can be considered the best
compromise for the three algorithms and will be used



Table 1. Comparison of the quantization quality (E) and topology preservation ($\bar{M}_1$, $\bar{M}_2$)

Artificial, y = 2    DSOM(K)    DSOM(EG)    DSOM
E                    0.0063     0.0067      0.0063
$\bar{M}_1$          0.9892     0.9848      0.9855
$\bar{M}_2$          0.9791     0.9777      0.9804



Table 2. Comparison of the quantization quality (E) and topology preservation ($\bar{M}_1$, $\bar{M}_2$)

Chicken, y = 1       DSOM(K)    DSOM(EG)    DSOM
E                    11.7183    12.0817     11.7966
$\bar{M}_1$          0.8923     0.9040      0.9360
$\bar{M}_2$          0.8320     0.8083      0.8880
Figure 1. (a) Chicken database: evolution of $M_1(k)$ and $M_2(k)$ with y = 1; (b) SCOWL database: evolution of $\bar{M}_1$
and $\bar{M}_2$ for different values of the power y

Figure 2. Chicken database: prototypes of the neurons for DSOM. Each color corresponds to one of five classes
of chicken parts: wing, back, drumstick, thigh and back, and breast.

Figure 3. SCOWL database: part of the final map. At the end, the referents are assigned with a "set median
search". For the particularity of referent 117, see the text.

to present the results. Figure 2 shows the prototypes of
all the nodes for DSOM. Neighboring nodes have
similar prototypes. The map is organized to respect, as
well as possible, the clustering of the data into the 5 classes.

For the third dataset, a hexagonal map
with a grid of 12x12 neurons is used. The conclusions are
the same. Fig. 1.b presents the evolution of the
integrated $M_i(k)$ ($\bar{M}_i$). The values are higher for DSOM
and also less sensitive to the value of y. Figure 3
illustrates the central part of the map for y = 1, where the
organization of the referents by word length
is evident. In this figure, only referent 117 ("present")
does not belong to its partition. On the whole map, this is
the case for 5 of 144 referents (3.5%). For DSOM(K)
and DSOM(EG), the results are 23.4% and 99.7%
respectively. From these characteristics, we also observe
a higher effectiveness of the proposed algorithm, which
is mainly due to the implicit reference.



FUTURE TRENDS 

The proposed algorithm is based on the computation
of a "pseudo" gravity centre for each prototype. This
computation is justified by assuming the existence of
a latent Euclidean space, which means that the dissimilarity
data must be isometric to an L2 norm. In practice, this
requirement is very seldom strictly satisfied, and an
approximation is often sufficient. Therefore, to completely
validate this new DSOM, it is necessary to test it with
other data types and with larger databases having a
"ground truth". The data organization is interpreted after
projection onto the final map, and the neighbourhood in
the output map must reveal the main latent properties
of the observations, which must be in agreement with
the "ground truth".

CONCLUSION

This article presents a new effective algorithm for
DSOM. According to the criteria of trustworthiness and
continuity, this DSOM presents good topology preservation.
The main reason for this improvement lies
in the representation step, where it is possible to
continuously adapt the referent of each prototype, as
with the vector model. To achieve this, we use an implicit
reference during the representation step thanks to the
Huygens theorem. Even if the Euclidean assumptions
are not exactly verified in practice, the distortions due
to this mismatch are in fact less important than
those caused by the collision effect, which is a difficult
problem for the classical DSOM algorithms. This
effectiveness is shown in this article by the better
performance of the proposed algorithm compared to
the other ones.
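The role of the Huygens theorem can be illustrated with a short sketch. Under the latent Euclidean assumption, the squared distance from an observation to a weighted gravity centre can be obtained from pair-wise dissimilarities alone, without ever building the centre explicitly. The code below applies the standard weighted König-Huygens identity; the function name and the use of neighbourhood weights are assumptions of this sketch and it is not necessarily the exact update rule of the proposed algorithm.

```python
import math


def sq_dist_to_centroid(d, i, weights):
    """Squared distance from observation i to the implicit weighted centroid.

    d       : square matrix of pair-wise dissimilarities (assumed Euclidean)
    weights : non-negative weight of every observation in the centroid

    Uses ||x_i - g||^2 = (1/W) sum_j w_j d_ij^2
                         - (1/(2 W^2)) sum_j sum_k w_j w_k d_jk^2
    """
    n = len(d)
    total = sum(weights)
    first = sum(weights[j] * d[i][j] ** 2 for j in range(n)) / total
    inertia = sum(weights[j] * weights[k] * d[j][k] ** 2
                  for j in range(n) for k in range(n)) / (2 * total ** 2)
    return first - inertia


# Four corners of a square: the centroid is (1, 1), at squared distance 2.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
d = [[math.dist(p, q) for q in points] for p in points]
corner_to_centre = sq_dist_to_centroid(d, 0, [1, 1, 1, 1])
```

When the dissimilarities are not exactly Euclidean the result is only an approximation and may even become slightly negative, which is the kind of distortion discussed above.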
KEY TERMS

Affectation Step: The part of a learning iteration
in which each observation is assigned to the nearest prototype
according to a predefined distance.

Dissimilarity Data: Data in which all we know
about the observations are pair-wise dissimilarities.

Dissimilarity SOM: A SOM where all observations
are described by a dissimilarity matrix.

Prototype: Referent of a node (neuron) on the
map.

Quantization Error: Error which appears when an
observation is represented by a prototype.

Representation Step: The part of a learning iteration
in which each prototype is adapted to represent
its assigned observations well.

Self-Organizing Map (SOM): A subtype of artificial
neural networks. It is trained using unsupervised
learning to produce a low-dimensional representation of
the training samples while preserving the topological
properties of the input space.

SOM Batch Algorithm: A version of SOM in which,
at each iteration, all observations are available and used
for computation.

Topology Preservation: Preservation of the
neighbourhood relation of the observations in the
output space. It means that observations which are
neighbours in the input space should be projected onto
neighbouring nodes.

INTRODUCTION

Many Intelligent Tutoring Systems (ITSs) aim to help
students become better readers. The computational
challenges involved are (1) to assess the students'
natural language inputs and (2) to provide appropriate
feedback and guide students through the ITS curriculum.
To overcome both challenges, the following
non-structural Natural Language Processing (NLP)
techniques have been explored, and the first two are
already in use: word-matching (WM), latent semantic
analysis (LSA; Landauer, Foltz, & Laham, 1998), and
topic models (TM; Steyvers & Griffiths, 2007).

This article describes these NLP techniques, the
iSTART (Interactive Strategy Trainer for Active Reading and
Thinking; McNamara, Levinstein, & Boonthum, 2004)
intelligent tutor and the related Reading Strategies
Assessment Tool (R-SAT; Magliano et al., 2006), and how
these NLP techniques can be used in assessing students'
input in iSTART and R-SAT. This article also discusses
other related NLP techniques which are used in other
applications and may be of use in assessment tools
or intelligent tutoring systems.



BACKGROUND

Interpreting text is critical for intelligent tutoring systems
(ITSs) that are designed to interact meaningfully
with, and adapt to, the users' input. Different ITSs use
different Natural Language Processing (NLP) techniques
in their systems. NLP systems may be structural,
i.e., focused on grammar and logic, or non-structural,
i.e., focused on words and statistics. This article deals
with the latter.

Examples of the structural approach include
ExtrAns (Extracting Answers from technical texts
question-answering system; Molla et al., 2003), which
uses minimal logical forms (MLF; that is, the form
of first order predicates) to represent both texts and
questions, and C-Rater (Leacock & Chodorow, 2003),
which scores short-answer questions by analyzing the
conceptual information of an answer with respect to the
given question. Turning to the non-structural approach,
AutoTutor (Graesser et al., 2000) uses LSA to analyze
the student's input against expected sets of answers,
and CIRCSIM-Tutor (Kim et al., 1989) uses a word-matching
technique to evaluate students' short answers.
The systems considered more fully below, iSTART
(McNamara et al., 2004) and R-SAT (Magliano et al.,
2006), use both word-matching and LSA in assessing the
quality of students' self-explanations. Topic models
(TM) were explored in both systems, but have not yet
been integrated.



MAIN FOCUS OF THE CHAPTER 

This article presents three non-structural NLP techniques
(WM, LSA, and TM) which are currently used
or being explored in reading strategies assessment
and training applications, particularly iSTART and
R-SAT.

Word Matching 

Word matching is a simple and intuitive way to estimate
the nature of an explanation. There are two ways to
compare words from the reader's input (either answers
or explanations) against benchmarks (collections of
words that represent a unit of text or an ideal answer): (1)
literal word matching and (2) Soundex matching.

Literal word matching: Words are compared
character by character, and if there is a match of sufficient
length then we call this a literal match. An
alternative is to count words that have the same stem
(e.g., indexer and indexing) as matching. If a word is
short, a complete match may be required to reduce the
number of false positives.
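A minimal sketch of literal matching is given below. It treats "a match of sufficient length" as a shared prefix and falls back to an exact comparison for short words; the threshold `min_prefix` and the prefix interpretation are assumptions made for illustration, since the text does not specify them.

```python
def literal_match(word, benchmark, min_prefix=5):
    """Return True if the two words share a sufficiently long common prefix.

    Words shorter than min_prefix must match completely, which limits
    false positives on short words.
    """
    a, b = word.lower(), benchmark.lower()
    if min(len(a), len(b)) < min_prefix:
        return a == b
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common >= min_prefix


def count_matches(input_words, benchmark_words, min_prefix=5):
    """Number of input words that literally match at least one benchmark word."""
    return sum(any(literal_match(w, b, min_prefix) for b in benchmark_words)
               for w in input_words)
```

With the default threshold, "indexer" and "indexing" match through the shared prefix "index", while "cat" and "cab" do not.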

Soundex matching: This algorithm compensates
for misspellings by mapping similar characters to the
same Soundex symbol (Christian, 1998). Words are
transformed to their Soundex code by retaining the first
character, dropping the vowels, and then converting
other characters into Soundex symbols: 1 for b, p; 2 for
f, v; 3 for c, k, s; etc. Sometimes only one consecutive
occurrence of the same symbol is retained. There are
many variants of this algorithm designed to reduce the
number of false positives (e.g., Philips, 1990). As in
literal matching, short words may require a full Soundex
match, while for longer words the first n Soundex
symbols may suffice.
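The following sketch shows the general idea using the classic American Soundex letter groups, which are coarser than the variant mapping quoted above; the code table, the four-character code length and the helper `soundex_match` are therefore assumptions of this example rather than the exact algorithm used in iSTART.

```python
# Classic Soundex consonant groups (assumption: the tutoring systems use a variant).
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word, length=4):
    """Keep the first letter, drop vowels, encode consonants, collapse repeats."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = [word[0].upper()]
    previous = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded.append(code)
        if ch not in "hw":  # h and w do not separate identical codes
            previous = code
    return "".join(encoded)[:length].ljust(length, "0")


def soundex_match(a, b, n=4):
    """Compare only the first n Soundex symbols of the two words."""
    return soundex(a)[:n] == soundex(b)[:n]
```

A misspelling such as "recieve" receives the same code as "receive", so the two words are counted as a match.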

Word-matching is also used in other applications,
such as CIRCSIM-Tutor (Kim et al., 1989) on short-answer
questions and the Short Essay Grading System
(Ventura et al., 2004) on questions with ideal expert
answers.

Latent Semantic Analysis (LSA) 

Latent Semantic Analysis (LSA; Landauer, Foltz, &
Laham, 1998) uses statistical computation to extract
and represent the meaning of words. Meanings are
represented in terms of their similarity to other words
in a large corpus of documents. LSA begins by finding
the frequency of terms used and the number of co-occurrences
in each document throughout the corpus and
then uses a powerful mathematical transformation to
find deeper meanings and relations between words.



When measuring the similarity between text-objects,
LSA's accuracy improves with the size of the objects,
so it provides the most benefit in finding similarity
between two documents; because it does not take word
order into account, short documents may not receive the
full benefit. The details for constructing an LSA corpus
matrix are in Landauer & Dumais (1997). Briefly, the
steps are: (1) select a corpus; (2) create a term-document-frequency
(TDF) matrix; (3) apply Singular Value
Decomposition (SVD; Press et al., 1986) to the TDF
matrix to decompose it into three matrices (L x S x R,
where S is a scaling matrix). The leftmost matrix (L),
reduced to an optimal number of dimensions,
becomes the LSA matrix of that corpus. The optimal
size is usually in the range of 300-400 dimensions.
Hence, the LSA matrix dimensions become N x D, where
N is the number of unique words in the entire corpus
and D is the optimal dimension (reduced from the total
number of documents in the entire corpus).
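Step (3) can be written compactly as follows. This is a sketch of the usual truncated SVD; the symbols X, M, r and the subscript D on the truncated matrices are notation introduced here for illustration.

```latex
% X: term-document-frequency matrix, N terms by M documents
X = L \, S \, R^{\top}, \qquad
L \in \mathbb{R}^{N \times r}, \quad
S = \operatorname{diag}(s_1 \ge s_2 \ge \dots \ge s_r \ge 0), \quad
R \in \mathbb{R}^{M \times r}

% Keeping only the D largest singular values (D is about 300-400):
X \approx L_D \, S_D \, R_D^{\top}, \qquad
L_D \in \mathbb{R}^{N \times D}

% Row i of L_D is the D-dimensional term vector of word i.
```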

The similarity of terms (or words) is computed by
comparing two rows, each representing a term vector.
This is done by taking the cosine of the two term vectors.
To find the similarity of sentences or documents,
(1) for each document, create a document vector using
the sum of the term vectors of all the terms appearing
in the document and (2) calculate the cosine between the
two document vectors. Cosine values range from -1 to +1,
where +1 means highly similar.
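The two steps can be sketched as follows, assuming the term vectors are already available as rows of the LSA matrix. The dictionary `term_vectors` and its tiny two-dimensional values are purely hypothetical; a real LSA space has several hundred dimensions.

```python
import math


def document_vector(words, term_vectors):
    """Sum the term vectors of all known words in a document."""
    dims = len(next(iter(term_vectors.values())))
    total = [0.0] * dims
    for word in words:
        vec = term_vectors.get(word)
        if vec is None:
            continue  # words outside the LSA space are ignored
        for d, value in enumerate(vec):
            total[d] += value
    return total


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy LSA space.
term_vectors = {"heart": [0.9, 0.1], "blood": [0.8, 0.2], "rock": [0.1, 0.9]}
benchmark = document_vector(["heart", "blood"], term_vectors)
answer = document_vector(["blood", "rock"], term_vectors)
similarity = cosine(benchmark, answer)
```

In a tutoring setting, `benchmark` would be built from a target sentence or ideal answer and `answer` from the student's input.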

To use LSA in tutoring systems, a set of benchmarks
is created and compared with the trainee's
input. Example benchmarks are the current target
sentence, previous sentences, and the ideal answer.
A high cosine value between the current sentence
benchmark and the reader's input would indicate that
the reader understood the sentence and was able to
paraphrase what was read. To provide appropriate
feedback, a number of cosines are computed (one for
each benchmark). Various statistical methods, such as
discriminant analysis and regression analysis, are used
to construct the feedback formula. McNamara et al.
(2007) describe various ways that LSA can be used to
evaluate the reader's explanations: either LSA alone
or a combination of LSA with WM. Their final conclusion
is that a fully automated (i.e., with less hand-crafted
benchmark construction), combined system produces
better results.

There are a number of other intelligent tutoring
systems that use LSA in their feedback system, for
example, Summary Street (Steinhart, 2001), AutoTutor
(Graesser et al., 2000), and Tutoring System
(Lemaire, 1999).

Topic Models 

The Topic Models approach (TM; Steyvers & Griffiths, 
2007) applies a probabilistic model to find a relationship 
between terms and documents in terms of topics. A 
document is considered to be generated probabilistically 
from a number of topics where each topic consists of a 
number of terms, each given a probability of selection 
if that topic is used. By using a TM matrix, the prob- 
ability that a certain topic was used in the creation of 
a given document is estimated. If two documents are 
similar, the estimates of the topics within these docu- 
ments should be similar. TM is similar to LSA, except 
that a term-document frequency matrix is factored into 
two matrices instead of three: one is the probabilities of 
terms belonging to the topics (the TM matrix), the other 
the probabilities of topics belonging to the documents. 
The Topic Modeling Toolbox (Steyvers & Griffiths,
2007) can be used to construct a TM matrix.

To measure the similarity between documents,
the Kullback Leibler distance (KL-distance; Steyvers
& Griffiths, 2007) is recommended, rather than the
cosine measure (which can also be used). Using TM
in a tutoring system is similar to using LSA: a
set of benchmarks is defined and the reader's input is
compared against each benchmark. The only difference
is the use of the KL-distance instead of the LSA cosine value.
The preliminary results of investigating TM in place
of LSA (Boonthum, Levinstein, & McNamara, 2006)
indicate that TM is as good as LSA alone (in terms of the
correlation between computerized scores and human rating
scores), but a little lower than a combined system
using both WM and LSA. This suggests that TM
should be further investigated in combination with
WM or LSA or both.
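As an illustration, the KL-distance between the topic distributions of two documents can be sketched as below. The smoothing constant `eps` and the symmetrized variant are assumptions added for this example (a symmetric form is convenient when neither document plays the role of the "true" distribution); the topic proportions shown are made up.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_i p_i * log(p_i / q_i), smoothed to avoid division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p), so the measure is order-independent."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical topic proportions of a benchmark and of a student's input.
benchmark_topics = [0.7, 0.2, 0.1]
input_topics = [0.6, 0.3, 0.1]
distance = symmetric_kl(benchmark_topics, input_topics)
```

Unlike the cosine, smaller values indicate more similar documents, and identical topic distributions give a distance of zero.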

TM is mostly used in document clustering (grouping
documents based on relevancy or similar topics; Buntine
et al., 2005), data mining (Tuulos & Tirri, 2004), and
search engines (Perkio et al., 2004). A variation on the TM
of Steyvers & Griffiths (2007) is Probabilistic Latent
Semantic Analysis (PLSA; Hofmann, 2001), which
models each document as generated from a number of
hidden topics, where each topic has its features defined
as the conditional probabilities of word occurrences
in that topic.



iSTART and R-SAT Applications

iSTART (Interactive Strategy Trainer for Active Reading
and Thinking) is a web-based, automated tutor
designed to help students become better readers using
multi-media technology. It provides adolescent to
college-aged students with a program of self-explanation
and reading strategy training (McNamara et al.,
2004) called Self-Explanation Reading Training, or
SERT (see McNamara et al., 2004). iSTART consists
of three modules: Introduction (description of SERT 
and reading strategies), Demonstration (illustration of 
how these reading strategies can be used), and Practice 
(hands-on practice of these reading strategies). In the 
Practice module, students practice using reading strat- 
egies by typing self-explanations of sentences. The 
system evaluates each explanation and then provides 
appropriate feedback to the student. If the explana- 
tion is irrelevant or too short compared to the given 
sentence and passage, the student is required to add 
more information. Otherwise, the feedback is based 
on the level of its overall quality. 

The computational challenge is to provide appropri- 
ate feedback to the students about their explanations. 
Doing so requires capturing some sense of both the 
meaning and quality of their explanation. A combination
of word-matching and LSA provided better
results than either technique separately, as measured
by the correlation between the computerized scores
produced by the NLP techniques and the human rating
scores (McNamara, Boonthum, Levinstein,
& Millis, 2007).

R-SAT (Reading Strategy Assessment Tool; Magliano
et al., 2007) is an automated web-based reading assessment
tool designed to measure readers' comprehension
and spontaneous use of reading strategies. R-SAT
is similar to the iSTART Practice module in the sense
that it presents passages to the reader one sentence at
a time and asks for the reader's input. The difference
is that, instead of an explanation, R-SAT asks either
an indirect question ("What are your thoughts regarding your
understanding of the sentence in the context of the
passage?") or a direct question (e.g., "Why did the
miller want to marry the girl?") at pre-selected target
sentences. The answers to the indirect questions are 
evaluated on how they are related to the given sentence 
and passage; the answers to the direct questions are 
assessed by comparing them to ideal answers. 
The problem is to analyze the answers and generate
a set of scores for overall comprehension and
strategy usage. Ultimately, these scores can be used
as a pre-assessment for iSTART, allowing the trainer
to individualize the iSTART curriculum based on the
reader's needs. R-SAT was initially designed to use
word-matching, LSA, and other techniques beyond
LSA. However, during the course of development,
word-matching alone was found to produce better results
than LSA, whether LSA was used alone or in
combination with word-matching.



FUTURE TRENDS

These three NLP techniques (WM, LSA, and TM) are
used in the ongoing research on assessing and improving
comprehension skills via reading strategies in the
R-SAT and iSTART projects. WM and LSA have been
extensively investigated for iSTART and to some extent
in R-SAT. The lack of success of LSA compared
to the simpler WM in R-SAT is somewhat surprising
and may be due to particular features of the algorithms
used or to the variety of text genres used in R-SAT.
Future work is planned with modified algorithms and
with genre-specific LSA spaces substituted for the general
space now used. In addition, TM needs further exploration,
especially in its use with small units of text, where
the recommended Kullback Leibler distance has not
proven particularly effective.



CONCLUSION

The purpose of this article is to describe three NLP
techniques and how they can be used in assessment tools
and intelligent tutoring systems. For iSTART to teach
reading strategies effectively, it must be able to deliver
valid feedback on the quality of the explanations that a
reader produces, and therefore the system must understand,
at least to some extent, the explanation. Of course,
automating natural language understanding has been
extremely challenging, especially for non-restrictive
content domains like explaining a freely-entered text.
Algorithms such as LSA open up a number of possibilities
to systems such as iSTART: in essence, LSA provides
a 'simple' algorithm that allows tutoring systems to
provide appropriate feedback to students (see Landauer
et al., 2007). The results presented in Boonthum et
al. (2006) show that the topic model similarly offers a
wealth of possibilities in natural language processing.
For R-SAT to measure a reader's comprehension and
reading skills accurately, like iSTART it must also be
able to understand, to some extent, what a reader says,
especially when he/she is asked to describe their current
thoughts. Although LSA is a good candidate, simple
word matching against various benchmarks seems adequate
to provide satisfactory results, especially when
aggregated over several explanations (see Magliano et
al., 2006). It has also been demonstrated that a combination
of techniques produces better results than using one
technique on its own.

KEY TERMS

Intelligent Tutoring System (ITS): Also called
Intelligent Computer-Aided Instruction (ICAI), a
personal training assistant that captures the subject
matter and teaching expertise and individualizes the
curriculum to meet each learner's needs in order to master
the subject matter. Its main goal is to provide the benefits
of one-on-one instruction: lessons are conducted
at the learner's own pace; practice is interactive so
the learner can improve weaker skills; real-time
question answering clarifies the learner's doubts or
misunderstandings; and the curriculum is individualized
based on the learner's needs.

Kullback Leibler Distance (KL-distance): A
natural distance function from a "true" probability
distribution to a "target" probability distribution. It can
be interpreted as the expected extra message-length per
datum due to using a code based on the wrong (target)
distribution compared to using a code based on the
true distribution.

Latent Semantic Analysis (LSA): A natural language
processing technique that analyses relationships
between a set of documents and terms within these
documents. LSA was created in 1990 for information
retrieval and is sometimes called latent semantic
indexing (LSI).

LSA Cosine: A measurement of the relation between
two vector-units. A unit can be as small as a word or as
large as an entire document. It is computed as
the dot product of two vectors divided by the product
of their lengths, where each vector is a
representation of a unit (word, sentence, paragraph, or
whole document).

Probabilistic Latent Semantic Analysis (PLSA):
A statistical technique for the analysis of two-mode
and co-occurrence data, which has applications in
information retrieval and filtering, natural language
processing, machine learning from text, and related
areas. PLSA evolved from LSA but focuses more on
the relationship of topics within documents.

Protocols: Any verbal input that students or readers
produce during a session. This can be a set of explanations
or answers to direct questions.



Self-Explanation and Reading Strategy Trainer
(SERT): A pedagogy that uses five strategies to help students
become better readers. The reading strategies include
(1) comprehension monitoring, being aware of one's
own understanding of the text; (2) paraphrasing, or
restating the text in different words; (3) elaboration,
using prior knowledge or experiences to understand
the text (domain-specific knowledge-based inferences)
or using common sense or logic to understand the text
(general knowledge-based inferences); (4) prediction,
predicting what the text will say next; and (5) bridging,
understanding the relation between separate sentences
of the text.

Word Matching (WM): A simple way to compare 
words. Literal match is done by comparing character by 
character, while Soundex match transforms each word 
into a Soundex code, similar to phonetic spelling. 
INTRODUCTION

The verification of identity is becoming a crucial factor
in our hugely interconnected society. Questions such as
"Is she really who she claims to be?" and "Is this person
authorized to use this facility?" are routinely posed
in a variety of scenarios ranging from issuing a driver's
license to gaining entry into a country. The need
for reliable user authentication techniques has increased
in the wake of heightened concerns about security and
rapid advancements in networking, communication, and
mobility. Biometrics, described as the science
of recognizing an individual based on his or her
physical or behavioural traits, is beginning to gain
acceptance as a legitimate method for determining
an individual's identity. Nowadays, biometric systems
have been deployed in various commercial, civilian,
and forensic applications as a means of establishing
identity.

In particular, this work presents a non-cooperative
identification system based on facial biometrics.



BACKGROUND 

How do biological measurements qualify as being bio- 
metric? Any human physiological and/or behavioural 
characteristic can be used as a biometric characteristic 
as long as it satisfies the following requirements (Jain, 
Ross & Prabhakar, 2004): universality, distinctiveness, 
permanence, collectability. 

The choice of biometric identifiers has a major
impact on the performance of the system. This choice
depends greatly on the intended application of the
system. Currently, some of the most widely used biometric
identifiers include fingerprints (Jain, Ross &
Prabhakar, 2004, pp. 43-64), hand geometry (Sanchez-Reillo,
Sanchez-Avila, Gonzalez-Marcos, 2000), iris
(Jain, Ross & Prabhakar, 2004, pp. 103-121), and face (Jain,
Ross & Prabhakar, 2004, pp. 65-86).

Most biometric systems require co-operation on
the part of the users in order to acquire their biometric
data. Face identification, however, does not require such
co-operation, although it can make use of it. This is
therefore its principal advantage over other biometric
systems. Human face identification is an extensively
studied field, since its computational cost has not
turned out to be a drawback and this kind of biometric
identification is increasingly important for
access security in places such as airports, metros, and train
and bus stations. The process of facial identification
incorporates two significant methods: detection (locating an
individual from among a set) and identification (whether
an individual is who s/he claims to be).

Face detection (Young-Bum Sun, Jin-Tae Kim & 
Won-Hyung Lee, 2002) involves locating the human 
face within an image captured by a video camera and 
taking that face and isolating it from the other objects 
captured within the image. 

Identification compares the captured face
with other faces that have been saved and stored in
a database. The basic underlying technology
of facial feature identification involves either
eigenfeatures (facial metrics) or eigenfaces. Within this
type of study a great variety of references can be found
(Discrete Cosine Transform (DCT), Karhunen-Loeve
(KL) Transform, Independent Component Analysis
(ICA), Principal Component Analysis (PCA), etc.). The
greatest advantage of a facial identification system is
its non-cooperative nature, as it is a system which can
work independently of user co-operation.
FACIAL IDENTIFICATION SYSTEM

This article presents the two principal processes associated
with face identification: face detection and face
identification. However, there are also other aspects of
a facial identification system to be taken into account. In
the face detection module, the face is captured
at the moment the camera takes a picture or frame. Image
acquisition can be carried out using RGB images or
infrared (IR) images, among other formats; recently,
thermal images have also been used. The choice of
image format depends on the application, lighting
conditions, location (indoor or outdoor system), and
the degree of security.

In the face identification module, there is a database
containing the information of the users that must be located;
therefore, a supervised classification must be carried
out. The parametrization submodule extracts the user
features, and the classification system generates a
model in order to distinguish the enrolled user or users
from all other persons (see figure 1).

Face Detection 

The challenges associated with face detection can be 
attributed to the following factors: Pose, presence or 
absence of structural components, facial expression, 
occlusion, image orientation, imaging conditions. 

There are many closely related problems with respect
to face detection. Face localization aims to determine
the image position of a single face; this is a simplified
detection problem with the assumption that an input
image contains only one face (Lam & Yan, 1994). The
goal of facial feature detection is the detection of the
presence and location of features, such as eyes, nose,
nostrils, eyebrow, mouth, lips, ears, etc., with the
assumption that there is only one face in an image (Zhiwei
& Qiang, 2006). Face recognition or face identification
compares an input image against a database and reports
a match, if found (Darrell, Gordon, Harville & Woodfill,
2000). The purpose of face authentication is to verify
the claim of the individual's identity in an input image
(Crowley & Berard, 1997), while face tracking methods
continuously estimate the location and possibly the
orientation of a face in an image sequence in real time
(Darrell, Gordon, Harville, & Woodfill, 2000; Zhiwei
& Qiang, 2006) (see figure 2).

Several face detection systems have been introduced
(Ming-Hsuan Yang, David Kriegman & Narendra
Ahuja, 2002; Yang, Ahuja, & Kriegman, 2000). There
are many existing techniques to detect faces based on a
single image. These techniques have been
classified into three categories.

• Knowledge Based System: This approach depends
on using rules about human facial features
to detect faces. Human facial features such as two
eyes that are symmetric to each other, a nose and
mouth, and other distance features represent this
feature set. After detecting features, a verification
process is carried out to reduce false detections.
This approach is good for frontal images, as
shown in figure 3. The difficulty lies in translating
human knowledge into well-defined rules and in
detecting faces in different poses.
Furthermore, the surrounding environment can
also pose a problem. For example, changes in
light sources can add or remove shadows from a
face. Therefore, many variables should be considered
when designing a face detection system.
For these reasons, in a non-cooperative system
this technique suffers from this variability.

Figure 1. Block diagram for a non-cooperative facial identification

Figure 2. Face detection examples in motion picture captures

• Image Based System: In this approach, a pre- 
defined standard face pattern is used to match 
with the segments in the image to determine 
whether they are faces or not. It uses training 
algorithms to classify regions into face or non- 
face classes. Image-based techniques depend on 
multi-resolution window scanning to detect faces, 
so these techniques have high detection rates but 
are slower than the feature-based techniques. 
Eigenfaces (Yang, Ahuja, & Kriegman, 2000) 
and neural networks (Rowley, Baluja & Kanade, 
1998) are examples of image-based techniques. 
This approach has the advantage of being simple 
to implement, but it cannot effectively deal with 
variation in scale, pose and shape (Rein-Lien Hsu 
& Jain, 2002). 

• Feature Based System: This approach depends
on the extraction of facial features that are not
affected by variations in lighting conditions, pose,
and/or other factors. These methods are classified
according to the extracted features. Feature-based
techniques depend on feature derivation and
analysis to gain the required knowledge about
faces. Features may be skin colour, face shape, or
facial features such as eyes, nose, etc. Feature-based
methods are preferred for real-time systems,
where the multi-resolution window scanning used
by image-based methods is not applicable. Human
skin colour is an effective feature for detecting
faces: although different people have different
skin colours, several studies have shown that the
basic difference lies in their intensity rather than
their chrominance. Human faces also have a special
texture that can be used to separate them from
other objects (Bojkovic & Samcovic, 2006). The
facial features method depends on detecting
features of the face.

Figure 3. A typical face image used in knowledge
based methods
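
The remark that skin tones differ mainly in intensity rather than in chrominance can be illustrated with a short sketch. It assumes normalized rg chromaticity as the chrominance representation, which is only one possible choice and is not prescribed by the studies cited above; the function names and the pixel values are hypothetical.

```python
import numpy as np


def rg_chromaticity(rgb):
    """Return the normalized (r, g) chromaticity of one or more RGB pixels.

    Assumption: chrominance is represented as r = R/(R+G+B), g = G/(R+G+B).
    """
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True)
    total = np.where(total == 0, 1.0, total)  # avoid dividing by zero on black pixels
    return (rgb / total)[..., :2]


def intensity(rgb):
    """Return a simple intensity value: the mean of the three channels."""
    return np.asarray(rgb, dtype=float).mean(axis=-1)


# Two hypothetical pixels with the same colour balance but different brightness.
bright_pixel = [200, 150, 120]
dark_pixel = [100, 75, 60]

same_chrominance = np.allclose(rg_chromaticity(bright_pixel),
                               rg_chromaticity(dark_pixel))
```

In this representation the two pixels share the same chromaticity although their intensities differ by a factor of two, which is why a detector working on chrominance alone can be less sensitive to brightness.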

Face Identification in Transform Domain 
Systems 

The detected faces always have variable conditions
(lighting, expression, rotation, translation, etc.), and
therefore the images used for training can differ
from the images delivered by face detection. Feature
or Knowledge Based Systems are at a disadvantage here
because of the wide data variability caused by these
variable conditions. Transform domain systems are
therefore a good choice, because they compact the
information and contribute more discrimination to the
facial identification.

Transform domain analysis is a commonly used
image processing and parameterization technique.
In recent years some work has been done to extract
transform domain features for image identification. Li
et al. extract Fourier range and angle features to identify
the palm-print image (Li, Zhang & Xu, 2002). Lai et al.
use holistic Fourier invariant features to recognize the
facial image (Lai, Yuen & Feng, 2001). Another spectral
feature generated from singular value decomposition
(SVD) is used by some researchers (Chellappa, Wilson
& Sirohey, 1995). However, Tian et al. indicate that this
feature does not contain adequate information for face
recognition (Tian, Tan, Wang & Fang, 2003). Hafed
and Levine (2001) extract discrete cosine transform
(DCT) features for face recognition. They point out
that the DCT comes close to the optimal performance of
the Karhunen-Loeve (KL) transform in facial information
compression, and that its performance is superior to
that of the discrete Fourier transform (FT) and other
conventional transforms. By manually selecting the DCT
frequency bands, their recognition method achieves
a recognition rate similar to that of the Eigenface method
(M. H. Yang, 2002), which is based on the KL transform.
Nevertheless, their method provides no rational
band selection rule or strategy, nor does it outperform
the classic Eigenface method.

In addition, some extended discrimination methods
have been proposed. Zhang et al. (2002) present a dual
Eigenspace method for face recognition. W. Malina
(2001) proposed several new discrimination
principles based on the Fisher criterion. Yang uses
kernel principal component analysis (kernel PCA) for
facial feature extraction and recognition (Bartlett, Movellan
& Sejnowski, 2002), while Bartlett et al. (2002) apply
independent component analysis (ICA) to face
recognition. However, Yang shows that both ICA and
kernel PCA need much more computing time than
PCA. In addition, when the Euclidean distance is used,
there is no significant difference in the classification
performance of PCA and ICA (Bartlett, Movellan &
Sejnowski, 2002). Jing et al. (2003) put forward a classifier
combination method for face recognition. This
paper does not analyze and compare these extended
discrimination methods, but limits itself to a comparison
of major linear discrimination methods including the
Eigenface method, the Fisherface method, DLDA and
discriminant waveletface.
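
As a rough illustration of the linear methods compared here, the Eigenface idea can be sketched as PCA followed by nearest-neighbour matching with the Euclidean distance. This is a minimal sketch under stated assumptions, not the procedure of any of the cited papers: the function names are hypothetical, and the data used with it would normally be face images flattened into row vectors.

```python
import numpy as np


def fit_eigenfaces(train, n_components):
    """Learn the mean vector and the leading principal axes of the training set.

    train: array of shape (n_samples, n_pixels), one flattened image per row.
    """
    mean = train.mean(axis=0)
    # The rows of vt are the principal axes (the "eigenfaces"), sorted by variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(images, mean, components):
    """Project images onto the retained principal axes."""
    return (images - mean) @ components.T


def nearest_neighbour(query_feats, gallery_feats, gallery_labels):
    """Label each query with the label of its closest gallery sample (Euclidean)."""
    diff = query_feats[:, None, :] - gallery_feats[None, :, :]
    distances = np.linalg.norm(diff, axis=-1)
    return gallery_labels[distances.argmin(axis=1)]
```

A typical use is to fit on a labelled gallery, project both gallery and probe images, and call `nearest_neighbour` on the projected features.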

The KL transform is an optimal transform for removing
statistical correlation. Of the discrete transforms, the
DCT most closely approaches the KL transform (Hu,
Worrall, Sadka & Kondoz, 2001). In other words, the DCT
has a strong ability to remove correlation and compress
images. Furthermore, the DCT can be computed by means
of the fast Fourier transform (FFT), while there is no fast
algorithm for the KL transform. Our approach therefore
takes full advantage of these favourable properties of the DCT.
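
The statement that the DCT can be computed through the FFT can be checked with a short sketch. It uses the unnormalized DCT-II, which is an assumption since the text does not say which DCT variant or scaling is meant, and compares the direct definition with a version built on one FFT of the mirrored signal. The function names are hypothetical.

```python
import numpy as np


def dct2_direct(x):
    """Unnormalized DCT-II from its definition:
    X[k] = sum_n x[n] * cos(pi * (2n + 1) * k / (2N))."""
    x = np.asarray(x, dtype=float)
    N = x.size
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (np.cos(np.pi * (2 * n + 1) * k / (2 * N)) * x).sum(axis=1)


def dct2_fft(x):
    """The same DCT-II obtained from one FFT of the even-symmetric extension."""
    x = np.asarray(x, dtype=float)
    N = x.size
    mirrored = np.concatenate([x, x[::-1]])      # length 2N, symmetric signal
    spectrum = np.fft.fft(mirrored)[:N]
    k = np.arange(N)
    # A half-sample phase shift turns the FFT bins into cosine sums.
    return 0.5 * np.real(np.exp(-1j * np.pi * k / (2 * N)) * spectrum)
```

Both functions return the same coefficients up to rounding; the FFT route needs O(N log N) operations instead of O(N^2).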

The following table shows different systems based 
on different methods of face recognition with their cor- 
responding recognition rates. The databases used are 
ORL [ORL Database], Yale [Yale Database], AR-Face 
[AR Database] and FERET [FERET Database]. 



FUTURE TRENDS 

Recently, numerous methods that combine several facial 
features have been proposed to locate or detect faces. 
Most of them use global features such as skin colour, 
size, and shape to find face candidates, and then verify 
these candidates using different local parameterization 
methods. The challenge is to make the captured images
invariant to the conditions (light, shapes, ...) and to
positional changes (rotations, scales, ...). The
creation and development of new methods based on
transform domain systems will provide robust
characteristics for achieving this invariance.

With respect to facial identification, 3D techniques
can be used for this purpose, but the
computational cost is a major disadvantage for real-time
applications. Facial reconstruction with 3D techniques
can obtain more information, from which any feature can
be extracted. Moreover, such a system retains its
non-cooperative quality. In the future, the use of
multi-modal systems with other biometric characteristics
will produce stronger and more robust systems.

Figure 4. Face samples with different conditions (lighting and rotation)



CONCLUSION 

Face recognition is a challenging and interesting 
problem. However, it can also be regarded as part of 
the wider attempt to solve one of the greatest chal- 
lenges to computer vision, that of object recognition. 
In particular, facial identification is becoming a very 
important biometric system in the battle to reduce 
global terrorism. Much research has already been 
carried out in this field, and bearing in mind the threat 
to security which the world is currently facing, there 
will undoubtedly be many more publications on facial
identification in the future.

KEY TERMS

Biometric System : This is a system which identifies 
persons from physical or behavioral characteristics. 
These characteristics are intrinsic to the individuals. 

Face Detection: The act of detecting a face from 
a frame or an image. 

Face Identification: This is a system which cre- 
ates a model from facial features in order to recognize 
persons. 

Independent Component Analysis (ICA): A
computational method for separating a multivariate
signal into additive subcomponents, assuming the
mutual statistical independence of the non-Gaussian
source signals.

Multi-Modal System: Use of different biometric
systems in order to identify or verify persons.

Non-Cooperative Identification System: This is 
a system for identification which does not require the 
collaboration of a user in order to operate. The informa- 
tion for identification is obtained with the permission 
of the user. 

Supervised Classification: Classification system
that generates a model from training samples and then
uses that model to evaluate or test other samples.

Transform Domain System: A system that maps the
information from the original (visible) domain into a
different domain, in which the transformed information
exhibits other properties.







Nonlinear Techniques for Signals 
Characterization 



Jesus Bernardino Alonso Hernandez

University of Las Palmas de Gran Canaria, Spain

Patricia Henriquez Rodriguez

University of Las Palmas de Gran Canaria, Spain



INTRODUCTION 

The field of nonlinear signal characterization and
nonlinear signal processing has attracted a growing
number of researchers in the past three decades. This
comes from the fact that linear techniques have some
limitations in certain areas of signal processing.
Numerous nonlinear techniques have been introduced
to complement the classical linear methods and as an
alternative when the assumption of linearity is
inappropriate. Two of these techniques are higher order
statistics (HOS) and nonlinear dynamics theory (chaos).
They have been widely applied to time series
characterization and analysis in several fields, especially
in biomedical signals.

Both HOS and chaos techniques have had a similar 
evolution. They were first studied around 1900: the 
method of moments (related to HOS) was developed 
by Pearson and in 1890 Henri Poincare found sensitive 
dependence on initial conditions (a symptom of chaos) 
in a particular case of the three-body problem. Both 
approaches were displaced by linear techniques until
around 1960, when Lorenz rediscovered a chaotic system
by chance while he was studying the behaviour
of air masses. Meanwhile, a group of statisticians at
the University of California began to explore the use 
of HOS techniques again. 

However, these techniques were largely ignored until 1980,
when Mendel (Mendel, 1991) developed system
identification techniques based on HOS, and Ruelle (Ruelle,
1979), Packard (Packard, 1980), Takens (Takens,
1981) and Casdagli (Casdagli, 1989) established the methods
to model nonlinear time series through chaos theory.
Only recently, however, has the application of HOS and
chaos to time series become feasible, thanks to the higher
computation capacity of computers and to Digital Signal
Processing (DSP) technology.



This article presents the state of the art of two
nonlinear techniques applied to time series analysis:
higher order statistics and chaos theory. Some
measurements based on HOS and chaos techniques are
described, and the way in which these measurements
characterize different behaviours of a signal is
analysed. The application of nonlinear measurements
permits a more realistic characterization of signals and
is therefore an advance in the development of automatic
systems.



BACKGROUND 

In digital signal processing, estimators are used in order 
to characterize signals and systems. These estimators 
are usually obtained using linear techniques. Their 
mathematical simplicity and the existence of a unifying 
linear systems theory made their computation easy. Fur- 
thermore, linear processing techniques offer satisfactory 
performance for a variety of applications. 

However, linear models and techniques cannot 
solve issues such as nonlinearities due to noise, to the 
production system of the signal, system nonlinearities in 
digital signal acquisition, transmission and perception, 
nonlinearities introduced by the processing method 
and nonlinear dynamics behaviour. Therefore, the ap- 
plication of linear processing techniques leads to less 
realistic characterization of certain systems and signals. 
As a result of the shortcomings of linear techniques, 
analysis procedures are being revised and nonlinear 
techniques are being applied in computing estimators 
and models and in signal characterization to increase 
the possibilities of digital signal processing. 

HOS is a field of statistical signal processing which
has become very popular in the last 25 years. To date
almost all digital signal processing has been based
on second order statistics (autocorrelation function,
power spectrum). HOS use extra information which
can be used to obtain better estimates in noisy situations
and in the presence of nonlinearities.

Chaos theory (nonlinear dynamical theory) studies
long-term unpredictable behaviour in a nonlinear dynamic
system caused by sensitive dependence on initial
conditions. Therefore, irregularities in a signal can be
produced not only by random external input but also by
chaotic behaviour.

Both nonlinear techniques have been used in signals 
characterization and numerous automatic classification 
systems have been developed using HOS and chaos 
features in many fields. Texture classification (Coroyer,
Declercq & Duvaut, 1997), seismic event prediction (Van
Zyl, 2001), fault diagnosis in machine condition
monitoring through vibration signals (Samanta, Al-Balushi
& Al-Araimi, 2006), (Wang & Lin, 2003) and economics
(Hommes & Manzan, 2006) are some examples.

Their application in biomedical signals is espe- 
cially important. Nonlinear features have proven to 
be useful in voice, electrocardiogram (ECG) and 
electroencephalogram (EEG) signals characterization. 
Automatic classification systems between pathologi- 
cal and healthy voices have been implemented using 
nonlinear features (Alonso, de Leon, Alonso, Fer- 
rer, 2001) (Alonso, Diaz-de-Maria, Travieso, Ferrer, 
2005). Nonlinear characteristics have been used in
the detection of electrocardiographic changes through
the ECG signal (Ubeyli & Guler, 2004), in the evaluation
of neurological diseases using the EEG signal (Guler,
Ubeyli & Guler, 2005), (Kannathal, Lim Choo Min,
Rajendra Acharya & Sadasivan, 2005) and in the diagnosis
of phonocardiograms (Shen & Shen, 1997).



NONLINEAR METHODS: CHAOS 
THEORY AND HIGHER ORDER 
STATISTICS APPLIED TO TIME SERIES 

Higher Order Statistics 

Higher order statistics comprise cumulants and their
Fourier transforms, known as polyspectra, which are
extensions of second-order measures (such as the
autocorrelation function and power spectrum). Some
advantages of HOS over second-order statistics are:



1. HOS give amplitude and phase information in 
the spectral domain, whereas second order sta- 
tistics only give amplitude information (Mendel, 
1991) (Nikias & Petropulu, 1993). Therefore, 
non-minimum phase signals and certain types of 
phase coupling (associated with nonlinearities) 
cannot be correctly identified by second-order 
statistics. 

2. HOS are blind to Gaussian processes whereas
correlation is not (Mendel, 1991). Therefore,
cumulants can be used to determine Gaussian
noise levels in a signal, to separate non-Gaussian
signals from Gaussian noise, to estimate harmonic
components, or to increase the signal-to-noise
ratio (SNR) when signals are contaminated
with Gaussian noise.

The second-order measures work properly if the 
signal has a Gaussian probability density function, but 
many real-life signals are non-Gaussian. Therefore, 
HOS are a powerful tool to work with non-Gaussian 
and nonlinear processes. 

Next, some higher order statistics measurements are 
shown and their usefulness in characterizing certain 
nonlinear phenomena is explained. 

Third Order Moment: Skewness 

Skewness is a third order moment and a measure of the
asymmetry of a probability distribution. This measurement
enables us to discriminate among different kinds
of data distribution, as its value varies according to the
asymmetry of the distribution. The skewness of a Normal
distribution is zero (data symmetric about the mean);
positive skewness corresponds to a distribution with a
longer right tail and negative skewness to a distribution
with a longer left tail.

In most cases a normal distribution is assumed, but
data points are not usually perfectly symmetric. Skewness
reflects positive or negative deviations from symmetry
about the mean and gives a more realistic characterization
of a data set.

Fourth Order Moment: Kurtosis 

Kurtosis is a fourth order moment and a measure of
whether the data in a probability distribution are peaked
or flat relative to a Normal distribution. Kurtosis
measures the concentration of the data about the mean:
higher kurtosis means that more of the variance is due to
infrequent extreme deviations.
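
Skewness and kurtosis as described above can be estimated from data with a few lines. This is a minimal sketch assuming the usual standardized sample moments (third and fourth central moments divided by the appropriate power of the variance); with this convention a Normal distribution has skewness 0 and kurtosis 3. The function names are hypothetical.

```python
import numpy as np


def skewness(x):
    """Third central moment divided by the variance raised to the power 3/2."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5


def kurtosis(x):
    """Fourth central moment divided by the squared variance (3 for a Normal)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2


symmetric = [-2, -1, 0, 1, 2]      # symmetric about the mean: zero skewness
right_tailed = [0, 0, 0, 0, 10]    # one large value: longer right tail
```

For the two small data sets, `skewness(symmetric)` is zero and `skewness(right_tailed)` is positive, in line with the sign convention given in the text.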

Higher Order Cumulants 

Higher order moments are natural generalizations of the
autocorrelation, while cumulants (Mendel, 1991) are
nonlinear combinations of moments. The second order
cumulant is the autocorrelation function. Higher order
cumulants can be seen as a measure of the Gaussianity
of a random process, because cumulants of order higher
than two are zero for a Gaussian process.
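
The property that cumulants above second order vanish for a Gaussian process can be illustrated numerically. The sketch below estimates the third order cumulant of a zero-mean stationary series, c3(tau1, tau2) = E[x(n) x(n + tau1) x(n + tau2)], by a time average. The estimator, its name and its restriction to non-negative lags are assumptions made for illustration.

```python
import numpy as np


def third_order_cumulant(x, tau1, tau2):
    """Time-average estimate of E[x(n) x(n+tau1) x(n+tau2)] after mean removal.

    Assumes a stationary series and non-negative integer lags.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size - max(tau1, tau2)   # number of usable products
    return np.mean(x[:n] * x[tau1:tau1 + n] * x[tau2:tau2 + n])


rng = np.random.default_rng(1)
gaussian = rng.normal(size=200_000)           # Gaussian white noise
skewed = rng.exponential(size=200_000)        # non-Gaussian, asymmetric samples

c3_gaussian = third_order_cumulant(gaussian, 0, 0)
c3_skewed = third_order_cumulant(skewed, 0, 0)
c3_noisy = third_order_cumulant(skewed + gaussian, 0, 0)
```

The estimate for the Gaussian series stays close to zero, while the skewed series gives a clearly non-zero value that is hardly changed by adding the Gaussian noise, which is the sense in which HOS are blind to Gaussian processes.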



Bispectrum

The bispectrum is the Fourier transform of the third order
cumulant. The bispectrum of a stationary Gaussian
process with zero mean is equal to zero. The bispectrum
of a signal plus Gaussian noise is the same as that
of the signal, whereas the power spectrum of a signal
plus Gaussian noise is very different from the power
spectrum of the signal alone.

Therefore, through the bispectrum Gaussian noise can
be separated from non-Gaussian noise and signal-to-noise
ratios can be improved.

On the other hand, quadratic phase coupling can
be detected and non-minimum phase systems can be
identified with the bispectrum.

Bicoherence

Closely related to the bispectrum is the third-order
coherence measure, the bicoherence. Bicoherence is
the normalized bispectrum.

Bicoherence is bounded between 0 and 1
and it is used to detect quadratic phase coupling due
to second order nonlinearities. A phase coupling between
a linear combination of the frequency components $\omega_1$
and $\omega_2$ exists if the bicoherence has a value equal to
one for the pair of frequencies $(\omega_1, \omega_2)$.

Chaos Theory

Chaos theory helps us to understand and interpret
the observations from complex deterministic dynamical
systems and it can be used to predict and control time
series (Kantz & Schreiber, 1997). Until the appearance
of chaos theory all irregular behaviour was
interpreted as stochastic behaviour and therefore
unpredictable. Thanks to chaos theory this is not
necessarily true. For example, both stochastic and chaotic
systems have rich broadband power spectra and varying
phase spectra. So, in order to distinguish between
stochastic and chaotic systems, chaos theory is a
powerful new tool.

A deterministic dynamical system describes the time
evolution of a system in some phase space $\Gamma \subseteq \mathbb{R}^m$
(an $m$-dimensional vector space), where a state is specified
by a vector $x \in \mathbb{R}^m$. This evolution can be expressed
by ordinary differential equations (Kantz & Schreiber,
1997):



$$\frac{d}{dt} x(t) = f(t, x(t)), \quad t \in \mathbb{R}$$

or in discrete time $t = n\Delta t$ by maps:

$$x_{n+1} = F(x_n), \quad n \in \mathbb{Z}$$

A sequence of points ($x_n$ or $x(t)$) that solves the
equations of the system is called a trajectory. The initial
conditions are $x_0$ or $x(0)$, respectively. The region of
the phase space to which all trajectories originating in
a range of initial conditions converge after a transition
time is called the attractor. An example of a chaotic
attractor from the Colpitts oscillator (Kennedy, 1994)
is illustrated in Figure 1.

Most of the time we need to characterize nonlinear
systems for which equations and models are unknown.
However, some measurements of the system are known.
There exist some techniques to obtain the phase space
and the attractor from the output signal (embedding
techniques). Certain quantities such as Lyapunov
exponents, correlation dimension and Kolmogorov-Sinai
entropy can then be obtained from the attractor. These
quantities measure the degree of nonlinearity
of the system. These measurements are invariant
under smooth transformations and thus independent of
the embedding procedure.

Figure 1. Attractor from Colpitts oscillator

Embedding Techniques 

Takens' embedding theorem (Takens, 1981) states
that an embedding exists if the dimension ($m$) of the
reconstructed phase space is such that $m > 2D + 1$ ($D$ is the
attractor dimension). There exist two main methods to
reconstruct the attractor from a time series: the method
of delays (Kantz & Schreiber, 1997) and principal
component analysis (Broomhead & King, 1986). The
former method is the most popular: a delay reconstruction
in $m$ dimensions is formed by the vectors $s_n$ given
as (Kantz & Schreiber, 1997),

$$s_n = [s(n), s(n-T), \ldots, s(n-(m-1)T)]$$



where $s(n)$ is the measured scalar signal, $m$ is the
embedding dimension of the reconstructed phase space
and $T$ is the time delay.

Takens' theorem is strictly an existence theorem
and does not suggest how to find the embedding
dimension ($m$) and the time delay ($T$). The first zero of
the autocorrelation function, or the point at which it
decays to $1/e$, has been suggested as a first order
estimator of $T$. The first minimum of the mutual
information function (Fraser & Swinney, 1986) is another
estimator of $T$ that takes into account nonlinear
correlations.

The false neighbours method (Kennel, Brown &
Abarbanel, 1992) and the false strands method have been
proposed to estimate the embedding dimension
($m$). The latter is an improvement of the false
neighbours method.

Chaotic Measurements

In the following paragraphs some chaotic measurements
are described.

Lyapunov Exponents

Lyapunov exponents characterize the rate of separation
of two points in phase space initially separated
by a small distance. There exist as many Lyapunov
exponents as the dimension $m$ of the phase space. The
maximal Lyapunov exponent (MLE) is the largest
one and determines the predictability of a dynamical
system. A positive MLE means divergence of nearby
trajectories, i.e. chaos. For a mathematical description
we refer the reader to (Kantz & Schreiber, 1997).
Several algorithms to compute Lyapunov exponents
from a time series have been implemented (Wolf et al.,
1985), (Rosenstein, Collins & De Luca, 1993), (Kantz,
1994), (Sprott, 2003).

The MLE is useful for characterizing different kinds of
behaviour in a signal or system. A negative MLE is an
indicator of a stable fixed point (a dissipative or
non-conservative system), a positive MLE is an indicator of
irregular (chaotic) behaviour, a zero MLE is an indicator
of a conservative system (such as a harmonic oscillator)
and an infinite MLE is an indicator of noise.

Kolmogorov-Sinai Entropy

Kolmogorov-Sinai (KS) entropy quantifies the loss of
information as a system evolves and it is another
measurement related to the unpredictability of a system. In
a regular and predictable system, $H_{KS} = 0$, i.e. nearby
points remain closely grouped in some other small region
of phase space and there is no change in information.
In a random process $H_{KS} = \infty$, because all
phase space regions become possible after a short time.
In chaotic systems $0 < H_{KS} < \infty$, which indicates that
nearby points in the phase space diverge exponentially.
Therefore, according to their KS entropy values different
types of systems can be characterized: regular, chaotic
and noise systems.
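
The sign convention of the MLE described above can be checked on a system whose equations are known. The sketch below uses the logistic map $x_{n+1} = r x_n (1 - x_n)$, which is not discussed in this article and is chosen only as a convenient assumption, and averages the logarithm of the absolute derivative along a trajectory. The time series algorithms cited above are more involved, because they have no access to the map itself.

```python
import math


def logistic_mle(r, x0=0.3, n_transient=1000, n_iter=100_000):
    """Estimate the maximal Lyapunov exponent of x -> r * x * (1 - x).

    For a one-dimensional map the MLE is the trajectory average of
    log|F'(x_n)|, with F'(x) = r * (1 - 2x).
    """
    x = x0
    for _ in range(n_transient):          # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_iter):
        x = r * x * (1.0 - x)
        slope = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(slope, 1e-300))   # guard against log(0)
    return total / n_iter


mle_fixed_point = logistic_mle(2.5)   # trajectory settles on a stable fixed point
mle_chaotic = logistic_mle(4.0)       # fully chaotic regime
```

With `r = 2.5` the estimate is negative (the trajectory converges to a stable fixed point), whereas with `r = 4.0` it is positive and close to ln 2, the known value for this map.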

Correlation Dimension 

Correlation dimension (Grassberger & Procaccia,
1983) quantifies the complexity of the reconstructed
attractor. It is a geometric measurement of sensitive
dependence on initial conditions, because in chaotic
motion the attractor usually shows a very complicated and
fractal geometry. In a chaotic deterministic system the
correlation dimension converges to a finite value, whereas
in a random process it does not converge to a value.
A maximum likelihood estimator to obtain optimal
values of correlation dimension is the Takens-Theiler
estimator (Theiler, 1988).

Correlation dimension allows us to distinguish a random
process from chaotic motion. A non-integer (fractal)
value of the correlation dimension is usually a symptom
of chaos, whereas an integer value is a symptom
of regular behaviour. Furthermore, the correlation
dimension is an estimate of the number of degrees
of freedom of a system.
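
The method of delays and the correlation dimension can be combined in a short sketch. It builds the delay vectors $[s(n), s(n-T), \ldots, s(n-(m-1)T)]$, computes the correlation sum (the fraction of pairs of vectors closer than a radius) and takes the slope of its logarithm against the logarithm of the radius. The function names, the choice of radii and the plain least-squares fit are assumptions; this is not the Takens-Theiler estimator mentioned above.

```python
import numpy as np


def delay_embed(s, m, T):
    """Return the delay vectors [s(n), s(n-T), ..., s(n-(m-1)T)] as rows."""
    s = np.asarray(s, dtype=float)
    n_vectors = s.size - (m - 1) * T
    columns = [s[(m - 1 - i) * T:(m - 1 - i) * T + n_vectors] for i in range(m)]
    return np.column_stack(columns)


def correlation_sum(vectors, eps):
    """Fraction of distinct pairs of vectors closer than eps (Euclidean norm)."""
    diff = vectors[:, None, :] - vectors[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(vectors), k=1)
    return np.mean(dist[upper] < eps)


def correlation_dimension(vectors, eps_values):
    """Slope of log C(eps) against log eps over the given radii."""
    c = np.array([correlation_sum(vectors, e) for e in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(c), 1)
    return slope


radii = np.logspace(np.log10(0.05), np.log10(0.5), 8)
sine = np.sin(0.1 * np.arange(1000))                 # regular, periodic signal
dim_sine = correlation_dimension(delay_embed(sine, 2, 16), radii)
```

For the sine wave the reconstructed attractor is a closed curve, so the estimate is close to the integer value 1, in line with the interpretation of integer values as regular behaviour. The pairwise distance matrix limits this sketch to a few thousand vectors.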



FUTURE TRENDS 

In automatic recognition systems it is necessary to
characterize data sequences and objects (voice, sounds,
faces, hands, etc.) in order to achieve a well-described
feature space. Having distinguishing features will later
lead to a successful classification process.

However, the task of finding distinguishing features
is not always easy. Nonlinear techniques are novel
resources for characterizing time series and overcoming
certain problems of linear techniques. Proof of this
is the development of several automatic classification
systems using nonlinear features, such as (Alonso, de
Leon, Alonso & Ferrer, 2001), (Alonso, Diaz-de-Maria,
Travieso & Ferrer, 2005), (Ubeyli & Guler, 2004) and
(Guler, Ubeyli & Guler, 2005).



CONCLUSION 

In this article we have presented the state of the art in
two recent nonlinear techniques: higher order statistics
and chaos theory. The main point is that many real-life
signals cannot be adequately modelled by linear
approximation alone. Recently, the development of
packages to compute chaotic (TISEAN package, Hegger,
Kantz & Schreiber, 1999) and HOS (HOSA toolbox for
Matlab) measures from data sets has made the application
of these techniques feasible.

Thanks to these techniques it is now possible to
extract new characteristics previously ignored by linear
analysis. Therefore the use of nonlinear techniques
leads to a more realistic characterization of signals and
systems.

These new approaches to signal analysis provide new
tools for the better characterization of signals, as a
preliminary step towards new, more accurate and more
powerful automatic pattern recognition systems, such as
voice and facial recognition.

KEY TERMS

Attractor: A region in the phase space to which all
trajectories converge after a transition time. It is the
long term behaviour of a dynamical system.

Bicoherence: A normalised version of the
bispectrum. The bicoherence takes values bounded
between 0 and 1, which makes it a convenient measure
for quantifying the phase coupling in a signal.

Chaos: Long-term unpredictable behaviour caused 
by sensitive dependence on initial conditions. 

Cumulants: The kth order cumulant is a function 
of the moments of orders up to and including k. 

HOS: Higher order statistics is a field of statistical 
signal processing that uses more information than au- 
tocorrelation functions and spectrum. It uses moments, 
cumulants and polyspectra. They can be used to get 
better estimates of parameters in noisy situations, or 
to detect nonlinearities in the signals. 

Kolmogorov-Sinai Entropy: Measurement of
information loss per unit of time in phase space.

Lyapunov Exponents: Quantities that characterize
the rate of separation of infinitesimally close trajectories
in a dynamical system. The maximal Lyapunov exponent
(MLE) determines the predictability of a dynamical
system. A positive MLE means a chaotic system.

Polyspectra: The Fourier transforms of cumulants.
The second order polyspectrum is the power spectrum.
Most HOS work on polyspectra focuses attention on
the bispectrum and the trispectrum.

Reconstructed Phase Space: Phase space obtained
from a time series through embedding techniques
such as principal component analysis or the method
of delays.

INTRODUCTION

Researchers currently have a new tool for solving
biomedical problems: microarrays. These devices
support the study and acquisition of information
related to many genes at the same time by means of
a single experiment, providing multiple potential
applications such as mutation detection or
microorganism identification.

Some of the problems that arise when working
with this type of technology are the large volume of
data and the complex technical nomenclature to be
dealt with. These facts imply the need to use several
standards and ontologies when performing this type
of experiment.



BACKGROUND 

Microarrays have been a key element in the
biotechnological revolution of recent years; however, new
problems regarding both data handling and statistical
analysis have arisen due to the vast volume of
information and to the structure of the data used.

The main concern lies in the vast amount of data
to be stored, processed and analysed. Besides, as
microarrays are a new technique, most of the methods,
protocols and standards are still being defined.

Dealing with such an amount of unstructured
information makes it quite unlikely that the descriptors
of the stored concepts, or their units, will be the same
across the different databases that are accessed.
In order to support the vocabulary unification
task, ontologies (Chandrasekaran, 1999) enable
a hierarchical definition of concepts for framing the
schemas of the accessed databases. There are fully
established and widely used ontologies such as the UMLS
medical vocabulary (UMLS, 2006), which holds
information about symptoms and illnesses, or the GO (Gene
Ontology) genomic ontology (Gene Ontology, 2006),
which holds information about the function and the
expression location of the different human genes.

Once the use of ontologies has been established, they
are also quite useful for searching for hidden relationships
among data. Queries in SQL-type (Structured
Query Language) (Beaulieu, 2005) languages
may be posed against an ontology and translated into the
query language of each underlying database. In this
way, by using the ontology, it can be established that
fever is a symptom, and which illnesses present fever
as a symptom.
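
The idea of posing one query against an ontology and translating it for each underlying database can be sketched as follows. Everything in the example is hypothetical: the tiny ontology, the two schemas, the local codes and the query templates are invented for illustration and are not taken from UMLS or from any real database.

```python
import sqlite3

# Hypothetical ontology: a small concept hierarchy plus, for each database,
# the local term it uses for a shared concept.
ONTOLOGY = {
    "is_a": {"fever": "symptom", "influenza": "illness", "malaria": "illness"},
    "local_terms": {
        "db_a": {"fever": "pyrexia"},
        "db_b": {"fever": "FEV"},
    },
}

# Hypothetical query templates, one per underlying database schema.
QUERIES = {
    "db_a": "SELECT disease FROM findings WHERE sign = ?",
    "db_b": "SELECT illness FROM illness_symptom WHERE symptom_code = ?",
}


def make_databases():
    """Create two in-memory databases that name the same concept differently."""
    db_a = sqlite3.connect(":memory:")
    db_a.execute("CREATE TABLE findings (disease TEXT, sign TEXT)")
    db_a.executemany("INSERT INTO findings VALUES (?, ?)",
                     [("influenza", "pyrexia"), ("asthma", "wheezing")])
    db_b = sqlite3.connect(":memory:")
    db_b.execute("CREATE TABLE illness_symptom (illness TEXT, symptom_code TEXT)")
    db_b.executemany("INSERT INTO illness_symptom VALUES (?, ?)",
                     [("malaria", "FEV"), ("influenza", "FEV")])
    return {"db_a": db_a, "db_b": db_b}


def illnesses_with_symptom(concept, databases):
    """Answer 'which illnesses present this symptom?' across all databases."""
    if ONTOLOGY["is_a"].get(concept) != "symptom":
        raise ValueError(f"{concept!r} is not known to the ontology as a symptom")
    found = set()
    for name, conn in databases.items():
        local_term = ONTOLOGY["local_terms"][name][concept]   # vocabulary mapping
        found.update(row[0] for row in conn.execute(QUERIES[name], (local_term,)))
    return sorted(found)
```

Calling `illnesses_with_symptom("fever", make_databases())` first checks in the ontology that fever is a symptom and then merges the answers of both databases, although neither of them stores the word "fever" itself.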

Currently, there are special data formats in medical
science, such as the DICOM standard (Oosterwijk, 2001),
for the storage and transfer of the increasing amount of
medical images that support new imaging modalities.
Nevertheless, typical biomedical images, such as
microarrays or DNA gels, are not currently covered
by DICOM, although their future integration is
foreseeable in forthcoming revisions, as clinical tests
based on these techniques might be increasingly used in
routine medical practice. At the moment, however, the
management of this type of image is quite delicate.



MAIN FOCUS OF THE CHAPTER 

This paper presents a description of the most important
standards and ontologies for working with microarray
experiments; it also tackles the options for integrating
some of these ontologies and standards into an
information system for managing microarrays.

The first standardisation initiatives appeared in 1998.
They were more or less isolated initiatives in which three
standardisation areas could be distinguished: hardware,
fixed material and procedures for the analysis and storage
of study information. Several organisations, such as the
MGED Normalization Working Group (MGED Data,
2006), were created for the standardisation of this
information. The MGED (Microarray Gene Expression
Data) Society is an international organisation devoted to
standardisation and to the exchange of information
related to microarray experiments. Other organisations
worth mentioning are the OMG (Object Management
Group) (OMG, 2006) and the UCL/HGNC (Human
Gene Nomenclature) (HGNC, 2006).

As far as terminologies, vocabularies, nomenclatures
and ontologies are concerned, two should be highlighted:
the MGED Ontology (MGED OWG, 2006), which
describes the experiments and the gene expression data,
and the GO (Gene Ontology Consortium) (Gene Ontology,
2006), which provides controlled vocabularies for
describing the molecular function, the biological process
and the cellular components of gene products. The
UCL/HGNC (Human Gene Nomenclature) (HGNC,
2006), the TaO (TAMBIS Ontology) (TaO, 2006),
RiboWeb (RiboWeb, 2001) and EcoCyc (EcoCyc,
2005) should also be mentioned.

Regarding data exchange standards in the microarray
field, the MicroArray and Gene Expression
Markup Language (MAGE-ML) (MAGE-ML, 2006) is
a language designed for describing and communicating
information about microarray experiments.

Other data exchange standards are the Bioinformatics
Sequence Markup Language (BSML) (BSML, 2006), 
the Gene Expression Markup Language (GeneXML) 
(NCGR, 2006) or the Genome Annotation Markup 
Elements (GAME) (Bioxml, 2006). 

The MGED Group is the standardisation organisation
with the widest scope in the microarray field.
In November 2000 it presented the MIAME standard
(Minimum Information About a Microarray
Experiment) (MIAME, 2006). This standard describes
the minimal information about microarrays that should
be stored either in a database (hereafter, DD.BB.)
used as a public repository or, more generally, in
whatever form enables the unambiguous interpretation
of the experimental results and the repetition of
such experiments.
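
To illustrate how a repository might check a submission against such a minimal-information requirement, the sketch below tests whether a record fills in each of the six parts commonly listed for MIAME (experimental design, array design, samples, hybridizations, measurements and normalisation controls). The dictionary keys and the helper name are hypothetical; MIAME prescribes content, not a concrete data format.

```python
# Hypothetical field names for the six MIAME parts; MIAME itself does not
# define these identifiers.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)


def missing_miame_sections(record):
    """Return the MIAME parts that are absent or empty in `record` (a dict)."""
    return [section for section in MIAME_SECTIONS if not record.get(section)]


if __name__ == "__main__":
    submission = {
        "experimental_design": "time course, 4 points",
        "array_design": "custom cDNA array",
        "samples": ["liver, t=0h", "liver, t=6h"],
        "hybridizations": [],  # present but empty: still counts as missing
    }
    print(missing_miame_sections(submission))
```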

Once the information to be stored has been defined
(MIAME), an object model (expressed in UML) is needed
to describe not only how the data of these experiments
should be expressed, but also the mechanisms for their
exchange, always bearing the MIAME guidelines in mind.
This is precisely what the MAGE-OM (MicroArray Gene
Expression Object Model) (MAGE-OM, 2006) standard
defines.

Figure 1. MGED ontology

This object model was designed to be independent
of any particular implementation, so it can serve as
a map for data structures in platforms such as Java,
Perl or C++. The model has also been translated into
a set of relational tables divided into packages,
following the natural separation of gene expression
data into cases and objects.

At this point, the standards already described define
both the microarray experiment data to be stored
and their object model. A language for exchanging
the data is therefore needed, namely MAGE-ML
(MicroArray Gene Expression Markup Language)
(MAGE-ML, 2006).

MAGE-ML is an XML (XML, 2006) formal language
derived directly from the MAGE-OM object model. It was
designed for describing and communicating
microarray information, and it can be used to describe
microarray-related items such as array designs,
manufacturing information or the structure of
experiments.

A tool named MAGE-stk (MAGE Software
Toolkit) has been developed to simplify the
use of the MAGE-OM standard. It is based on
an Open Source package collection that implements
the MAGE object model (MAGE-OM) in several
programming languages. It makes MAGE-ML easier
to read, simplifies writing MAGE-ML from MAGE-OM,
and provides methods for the full maintenance and
updating of MAGE-OM.
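
The general idea of an object model that can be written to and read back from an XML exchange file can be sketched as follows. This is only an illustration of the pattern: the classes `Experiment` and `Array` and the element and attribute names are hypothetical simplifications, not the real MAGE-OM classes, the MAGE-ML schema or the MAGE-stk API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Array:
    # Hypothetical, heavily simplified stand-in for an array description.
    identifier: str
    manufacturer: str


@dataclass
class Experiment:
    # Hypothetical, heavily simplified stand-in for an experiment description.
    identifier: str
    description: str
    arrays: list = field(default_factory=list)


def to_xml(experiment):
    """Serialise the object model to an XML string (illustrative element names)."""
    root = ET.Element("Experiment", identifier=experiment.identifier)
    ET.SubElement(root, "Description").text = experiment.description
    for array in experiment.arrays:
        ET.SubElement(
            root, "Array",
            identifier=array.identifier,
            manufacturer=array.manufacturer,
        )
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the object model from the XML produced by `to_xml`."""
    root = ET.fromstring(text)
    arrays = [
        Array(e.get("identifier"), e.get("manufacturer"))
        for e in root.findall("Array")
    ]
    return Experiment(root.get("identifier"), root.findtext("Description"), arrays)


if __name__ == "__main__":
    exp = Experiment("E-1", "Heat shock time course", [Array("A-1", "ExampleCorp")])
    print(to_xml(exp))
```

Because the XML is generated from the object model and parsed back into it, two systems that share the model can exchange experiments without agreeing on anything else.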

Once the standards needed for working with microarray
technology have been defined, the next step
is to describe and use several ontologies that
make it possible, as mentioned before, to unify
the different vocabularies in use.

The MGED Ontology (MO) is one of the most
important ontologies for working with microarrays
and, particularly, for using some of the previously
mentioned standards. The main goal of this ontology
is to provide standard terms for the annotation of
microarray experiments; such terms serve not only
for structuring queries about the elements
of the experiments, but also for
unambiguously describing how the experiments were
carried out.



As the ontology-encoded terms will eventually be
placed in MAGE-ML documents, the efforts of the
MAGE group and of the ontology working group should
be coordinated wherever they overlap, so that the
ontology classes and the MAGE classes have the same
names and relationships.

The ontology has been conceived to grow continuously
and thus to meet the need for
descriptive terms arising from emerging applications of
microarrays. At the same time, the part of the ontology
on which software relies should remain stable, to avoid
constantly revising programs in order to track changes in
vocabularies and relationships. Both objectives are met
by establishing the central MGED
ontology, a nucleus of the MGED ontology that
remains constant. The extended MGED ontology is a
second ontology layer that contains all the additional
terms that might be considered (see Figure 1).

The central MGED ontology has been developed to
work with the MAGE 1.0 schema, and it is restricted
to MAGE-OM v1.1. The extended MGED ontology
augments the ontology nucleus with terms that are
beyond the scope of MAGE v1.1.

The Gene Ontology (GO) is another ontology that
should be considered when working with microarrays.
The Gene Ontology Project is a collaborative effort
to meet the need for consistent descriptors
of gene products across different DD.BB. The project
started in 1998 as a collaboration among three DD.BB.
for model organisms: FlyBase, the Saccharomyces
Genome Database (SGD) and the Mouse Genome
Database (MGD). Since then, the GO consortium has
grown to include many more DD.BB., among them some
of the world's biggest repositories for plant, animal and
microbial genomes.

The GO project has developed three controlled
and structured ontologies/vocabularies: biological
processes, molecular functions and cellular
components. In this way, a given gene product can be
located in one or more cellular components, the
biological processes in which it is active can be checked,
and the molecular functions it performs in those processes
can be examined. For instance, the 'cytochrome c'
gene product can be described by the molecular function term
'oxidoreductase activity', by the biological process terms
'oxidative phosphorylation' and 'induction of cell death',
and by the cellular component terms 'mitochondrial
matrix' and 'mitochondrial membrane'.
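
A minimal sketch of how annotations along the three GO vocabularies could be held and queried is given below, using the cytochrome c example. The dictionary layout, the aspect keys and the function names are assumptions made for illustration; they do not reproduce the GO file formats or any GO software interface.

```python
# Illustrative annotation store: gene product -> GO aspect -> list of terms.
# The layout and key names are hypothetical.
GO_ANNOTATIONS = {
    "cytochrome c": {
        "molecular_function": ["oxidoreductase activity"],
        "biological_process": [
            "oxidative phosphorylation",
            "induction of cell death",
        ],
        "cellular_component": [
            "mitochondrial matrix",
            "mitochondrial membrane",
        ],
    },
}


def terms_for(gene_product, aspect, annotations=GO_ANNOTATIONS):
    """Return the terms annotated to `gene_product` under one GO aspect."""
    return annotations.get(gene_product, {}).get(aspect, [])


def products_located_in(component, annotations=GO_ANNOTATIONS):
    """Return the gene products annotated to a given cellular component."""
    return sorted(
        name
        for name, aspects in annotations.items()
        if component in aspects.get("cellular_component", [])
    )


if __name__ == "__main__":
    print(terms_for("cytochrome c", "biological_process"))
    print(products_located_in("mitochondrial matrix"))
```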
ESSENTIAL CHARACTERISTICS OF AN
INFORMATION SYSTEM FOR 
MICROARRAYS MANAGEMENT 

This type of system needs a data integration
architecture that makes it easy to store the vast amount
of information generated by microarray experiments. To
achieve this, the architecture should provide
users with assistants and contextual support for
handling the information. In addition, a Web
architecture will enable the information to be accessed
and managed over an Internet connection
from any place at any time.

For the ontology information to be always up to date
and available to users, the architecture should
provide integrated access to several ontology
servers. To achieve this, it is advisable to
use Web Services in cases such as access to the Gene
Ontology (GO) or to the Biological Imaging Methods
ontology (FBbi), whereas the MGED Ontology would be
accessed directly over the Internet. Besides, so that users
can enter data and consult the stored information more
easily, the system should have an interface that
shows a list of ontology terms and values; this list
would support ontology queries that
cover all the meanings of a given concept.

As the proposed system has to support information
exchange among different researchers, this type
of architecture should use the existing standards for
data storage (MIAME) and for information exchange
(MAGE-OM and MAGE-ML). In the first case, the system
should implement a DD.BB. whose fields fulfil
the MIAME standard; in the second case, the system
will use the MAGE-OM object model so that users can
generate the MAGE-ML information exchange file
whenever they require it.

Lastly, it is also advisable that users be able to
continue using the existing applications to which they
are accustomed, which have been developed by experts
on the subject, usually in the R language. For
that reason, the system should make such applications
available to users. To achieve this, an approach
based on the use of Web services by the architecture
is proposed.

This architecture is currently being developed by
the RNASA/IMEDIR lab group at the University
of A Coruña.



CONCLUSION 

Nowadays there are several tools that enable the analysis
of microarray images; however, as each is software
designed for a specific array type, they offer
few options, they must be
installed on the user's machine, and they can be
installed on only a few operating systems.

Regarding data processing, there are several projects
that include packages for microarray
processing tasks such as normalisation or clustering;
however, with some of these packages the individual
processing tools they contain must be downloaded
before they can be used.

Lastly, there are several types of public DD.BB.
for storing the information of this type of experiment
by means of Web forms. There are also some
stand-alone tools that store the data in a DD.BB.
created on the user's machine, which therefore needs
a DD.BB. management system installed.

Nevertheless, no system has been found that performs
all these steps without requiring software to be installed
or the user to leave the system.

New systems in this area should allow data to be
stored in a MIAME-compliant DD.BB., with the option
of performing the image analysis of the different
microarray experiments and keeping the analysis
results in the system DD.BB. The systems should also
provide several types of processing using the R language
in order to perform data analysis and draw conclusions
from the experiments. The data model of the system should
follow the MAGE-OM standard and offer the resulting
MAGE-ML file of the experiment to the user.
KEY TERMS

MAGE-ML: MicroArray Gene Expression Markup
Language. Formal language designed for describing
and communicating information about microarray
experiments.

MAGE-OM: MicroArray Gene Expression
Object Model. Standard that defines the object model
for gene expression experiments.

MAGE-stk: MAGE Software Toolkit. Open
Source package collection that implements the MAGE
object model (MAGE-OM) in several programming
languages.

MIAME: Minimum Information About a Microarray
Experiment. Standard that specifies the minimal
information needed to describe microarray experiments.

Microarrays: A technology using a high-density
array of nucleic acids, protein, or tissue for simultaneously
examining complex biological interactions, which
are identified by specific location on a slide array. A
scanning microscope detects the bound, labelled sample
and measures the visualized probe to ascertain the
activity of the genes of interest in genotyping, cellular
studies, and expression analysis.

Ontology: In computer science this term refers to
the attempt to formulate an exhaustive and rigorous
conceptual schema for a given domain, with the aim
of making communication and information sharing
among systems easier.

R: Language and programming environment for 
graphic and statistical analysis. 






Ontologies for Education and Learning Design 



Manuel Lama 

University of Santiago de Compostela, Spain 

Eduardo Sanchez 

University of Santiago de Compostela, Spain 



INTRODUCTION 

In recent years, the growth of the Internet has
opened the door to new ways of learning and new
education methodologies. Furthermore, the appearance of
different tools and applications has increased the need
for interoperable as well as reusable learning contents,
teaching resources and educational tools (Wiley, 2000).
Driven by this new environment, several metadata
specifications describing learning resources, such as IEEE
LOM (LTCS, 2002) or Dublin Core (DCMI, 2004),
and learning design processes (Rawlings et al., 2002)
have appeared. In this context, the term learning design
is used to describe the method that enables learners to
achieve learning objectives after a set of activities is
carried out using the resources of an environment. Among
the proposed specifications, IMS Learning Design (IMS,
2003) has emerged as the de facto standard that facilitates
the representation of any learning design that can be based
on a wide range of pedagogical techniques.

The metadata specifications are useful solutions for
describing educational resources in order to favour
interoperability and reuse between learning software
platforms. However, the majority of the metadata
standards focus only on determining the vocabulary that
represents the different aspects of the learning process,
while the meaning of the metadata elements is usually
described in natural language. Although this description
is easy for the learning participants to understand, it
is not appropriate for software programs designed to
process the metadata. To solve this issue, ontologies
(Gomez-Perez, Fernandez-Lopez, and Corcho, 2004)
could be used to describe formally and explicitly the
structure and meaning of the metadata elements; that is,
an ontology would semantically describe the metadata
concepts. Furthermore, both metadata and ontologies
emphasize that their descriptions must be shared (or
standardized) within a given community.



In this paper, we present a short review of the main
ontologies developed in recent years in the Education
field, focusing on the use that authors have made of
the ontologies. As we will show, ontologies solve issues
related to the inconsistencies of using natural
language descriptions and to the consensus needed for
managing the semantics of a given specification.



ONTOLOGIES IN EDUCATION 

In the educational domain a number of ontologies
have been developed by different authors. Thus ontologies
have been developed to describe the learning contents
of technical documents and formalize the semantics
of learning objects; to model the elements required for
the design, analysis, and evaluation of the interaction
between learners in computer-supported cooperative
learning; and to describe the learning design associated
with a unit of learning in which the learning flow is
explicitly declared.

Ontologies in Learning Contents and 
Metadata 

The main purpose of these ontologies is to describe the
contents or features of documents in order to facilitate
their indexing and retrieval by applications. Thus Kabel,
Wielinga, and Hoog (1999) develop three ontologies
that annotate technical documents from a given domain:
these documents are converted into a large collection of
information elements described by a number of attributes
to which values are assigned from the ontologies.
These attributes refer to the subject matter in
the application domain, to structural and representational
properties (paragraphs, sections, etc.) and to the potential
instructional roles of the information elements.
Following this approach, the ontologies represent the
semantics of the documents, enabling them to be indexed
in and retrieved from databases.

Another interesting ontology in this field is proposed
by Brase, Painter and Nejdl (2004). Using an ontology
language such as TRIPLE, this ontology describes the
semantics of the LOM specification, adding formal
axioms and rules to the metadata representation of the
standard. This formal description does not change the
semantics of the LOM specification, but it helps
to define the constraints on LOM fields and makes clear
the meaning and use of these fields, resulting in
easier exchange of LOM metadata between different
applications and contexts.

Ontologies in Collaborative Learning 
Environments 

These ontologies are used to model the interaction
between the learning actors (typically teachers and
students) in collaborative environments. Thus Inaba et
al. (2001) present a collaborative learning
ontology that facilitates the design, analysis, and
evaluation of a collaborative learning session. This ontology
describes the concepts of several well-established learning
theories, defining the semantics of the learning
goal concept and connecting this concept with the
theories, which are organized in a taxonomy. The
authors have used the ontology to help users
design and execute the instructional process
in a collaborative environment (Barros, Verdejo, Read,
& Mizoguchi, 2002).

Ontologies in Learning Design 

These ontologies focus on the semantic description of
learning design models, which define the learning
flow of the activities to be carried out by teachers
and students. The ontologies developed in this field
are based on the IMS Learning Design (IMS LD)
specification, which has emerged as a de facto standard
for defining learning designs. This specification has:
(1) a well-founded conceptual model that declares the
vocabulary and the functional relations between the
concepts of the learning design; (2) an information model
that describes in an informal (natural language) way
the semantics of every concept and relation introduced
in the conceptual model; and (3) a behavioural model
that specifies the constraints imposed on the software
system when a given learning design is executed at
runtime. In other words, the behavioural model defines
the semantics of the IMS LD specification during the
execution phase. Figure 1 depicts the main concepts
of the IMS LD specification.

Knight, Gasevic and Richards (2006) present a
general framework whose purpose is to bridge the gap
between learning designs and the learning objects used
in them. To achieve this, the framework considers
the development of three ontologies that describe the
learning design, the learning objects and the context in
which these objects are used. LOCO is the ontology,
defined in the OWL language (Dean & Schreiber, 2004),
that deals with the description of learning designs. It
represents the semantics specified in IMS LD and,
particularly, in its conceptual model, which means that
LOCO integrates the concepts and relations defined
in the conceptual and information models of the IMS
LD standard, but the semantics expressed in natural
language is not included in the ontology.

Figure 1. Main concepts of the IMS Learning Design
specification (Amorim et al., 2006)

Table 1. Examples of axioms that constrain the semantics of the IMS LD concepts

Design Axiom 1

IMS LD Specification: Page 38 (item 0.2.2): "The time limit specifies that it is completed when a certain amount of time
has passed, relative to the start of the run of the current unit of learning. The time is always counted
relative to the time when the run of the unit-of-learning has been started. Authors have to take care that
the time limits set on role-parts, acts and plays are logical."

Explanation: The value of the attribute time limit of a Method must be greater than the value of the time
limit of any Play. That is, the Play(s) cannot finish after the Method.

Formal Description: $\forall\, m, p, cm, cp \mid m \in \text{Method} \land p \in \text{Play} \land cm \in \text{Complete-Method} \land cp \in \text{Complete-Play} \land \text{play-ref}(p, m) \land \text{complete-unit-of-learning-ref}(cm, m) \land \text{complete-play-ref}(cp, p) \rightarrow \text{time-limit}(cm) > \text{time-limit}(cp)$

Design Axiom 2

IMS LD Specification: Page 90: "The same role can be associated with different activities or environments in different role-
parts, and the same activity or environment can be associated with different roles in different role-
parts. However, the same role may only be referenced once in the same act."

Explanation: For the same Act, the Roles involved in the execution of the Act are disjoint.

Formal Description: $\forall\, a, r, rp \mid a \in \text{Act} \land r \in \text{Role} \land rp \in \text{Role-Part} \land \text{role-part-ref}(rp, a) \land \text{role-ref}(r, rp) \rightarrow \neg\exists\, rp_1 \mid rp_1 \in \text{Role-Part} \land rp_1 \neq rp \land \text{role-part-ref}(rp_1, a) \land \text{role-ref}(r, rp_1)$

Runtime Axiom 1

IMS LD Specification: Page 25 (item 0.2.1): "The create-new attribute indicates whether multiple occurrences of this role
may be created during runtime. When the attribute has the value "not-allowed" then there is always
one and only one instance of the role."

Explanation: If the value of the attribute create-new is "not-allowed", there can be only one instance of the Role
to which it is applied.

Formal Description: $\forall\, r \mid r \in \text{Role} \land \text{create-new}(r) = \text{"not-allowed"} \rightarrow \neg\exists\, r_1 \mid r_1 \in r$

To deal with this issue, Amorim, Lama, Sanchez,
Riera and Vila (2006) propose an ontology, also based
on IMS LD, that incorporates all its semantics by
adding a number of axioms to the conceptual model.
These axioms are extracted from the information model,
where they are expressed as natural language restrictions
on the values of the concept attributes (Table 1). Therefore
this ontology does not modify the IMS LD specification,
but it incorporates all the semantics so that
software programs can work directly from the
representation in the ontology. With this formal specification,
the ontology, which is developed in F-Logic (Kiefer,
Lausen, Wu, 1996) and OWL, has been used to validate
the consistency of units of learning defined in authoring
tools and as a language for knowledge interchange
between agents in a collaborative environment (Riera
et al., 2005).
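
The way such axioms let software validate a unit of learning can be sketched as follows, taking Design Axiom 1 and Design Axiom 2 of Table 1 as examples. The classes and function names are hypothetical simplifications written for this illustration; they are not part of the ontology of Amorim et al., which is expressed in F-Logic and OWL rather than Python.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Play:
    name: str
    time_limit: float  # assumed here to be expressed in seconds


@dataclass
class Method:
    time_limit: float
    plays: list = field(default_factory=list)


def plays_violating_design_axiom_1(method):
    """Design Axiom 1: the Method's time limit must exceed that of every Play."""
    return [p.name for p in method.plays if not method.time_limit > p.time_limit]


def roles_violating_design_axiom_2(role_parts):
    """Design Axiom 2: within one Act, a Role is referenced by one Role-Part only.

    `role_parts` is a list of (role_part_id, role_name) pairs of a single Act.
    """
    counts = Counter(role for _, role in role_parts)
    return sorted(role for role, n in counts.items() if n > 1)


if __name__ == "__main__":
    method = Method(3600, [Play("introduction", 600), Play("exam", 4000)])
    print(plays_violating_design_axiom_1(method))
    act = [("rp1", "learner"), ("rp2", "tutor"), ("rp3", "learner")]
    print(roles_violating_design_axiom_2(act))
```

An authoring tool could run checks of this kind before publishing a unit of learning and report the offending plays or roles to the designer.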



CONCLUSION 

Ontologies in Education are usually developed
following a metadata standard whose intent is to capture the
semantics of a given theory or specification. Most
metadata standards have been modelled in the
XML-Schema language (Thompson, Beech, Maloney,
& Mendelsohn, 2004), which is not expressive enough
to describe the semantics (or meaning) associated with
the elements defined in the metadata. The main
limitations of the XML-Schema language are (Gil &
Ratnakar, 2002) that hierarchical relations between
two or more concepts cannot be explicitly defined, and
that general and formal constraints (or axioms) between
concepts, attributes, and relations cannot be specified.

To overcome these limitations of the XML-Schema
language, the modelling of metadata standards needs to be
enriched so as to describe explicitly and formally the
semantics of their elements. In this way misinterpretations
and errors are avoided when the instances of the concepts
are created. This is the main purpose of the ontologies
developed in the Education field: to favour
interoperability between software programs by representing all
the semantics of the metadata, not only the concepts
and relations expressed in XML-based formats.
KEY TERMS

Collaborative Learning Environment: Software
system designed to support a collaborative learning
experience in which two or more agents pursue the goal
of constructing knowledge based on group discussion
and decision-making processes.

Interoperability: Capability to communicate, 
execute programs, or transfer data among various 
functional units in a manner that requires the user to 
have little or no knowledge of the unique characteristics 
of those units. 

Learning Design: Description of a method enabling
learners to attain certain learning objectives by performing
certain learning activities in a certain order in the
context of a certain learning environment. A learning
design is based on the pedagogical principles of the
designer and on specific domain and context variables
(e.g., designs for mathematics teaching can differ
from designs for language teaching).

Learning Objects: Any reproducible and address- 
able digital or non-digital resource used to perform 
learning activities or support activities. Examples are: 
web pages, text books, text processors, instruments, 
etc. 



Metadata: Information about data, which can be 
used to comprehend, use, and manage data. 

Ontology: Formal and explicit specification of a 
shared conceptualization, where conceptualization 
refers to an abstract model of a concept in the world; 
formal means that the ontology should be machine 
readable; explicit means that the type of concepts and 
the constraints on their use are explicitly defined; and 
shared reflects the notion that an ontology captures 
consensual knowledge accepted by a group. 

Ontology Language: Formal language based on
a logic paradigm that can represent concepts and the
constraints between them. The reasoning capabilities of
the language depend on the paradigm on which the
language is based.
INTRODUCTION

At present, ontologies are considered to be an appropri- 
ate solution to the problem of heterogeneity in data, 
since ontological methods make it possible to reach 
a common understanding of concepts in a particular 
domain. However, utilizing a single ontology is neither 
always possible nor recommendable, given that different 
tasks or different points of view usually require different 
conceptualizations. This can lead to the usage of dif- 
ferent ontologies, although in some cases the different 
ontologies collectively might contain information that 
could be overlapping and possibly even contradictory. 
This, in turn, represents another type of heterogeneity 
that can result in inefficient processing or misinterpreta- 
tion of data, information, and knowledge. 

To address this problem while at the same time
ensuring an appropriate level of interoperability between
heterogeneous systems, it is necessary to find the
correspondences or mappings that exist between the elements
of the (different) ontologies being used. This process
is known as ontology alignment.

This article offers an updated overview of ontology 
alignment, including a detailed explanation of what 
alignment consists of, and how it can be achieved. 
First, ontologies are defined using a fusion of differ- 
ent interpretations. This is followed by a definition of 
the concept of ontology alignment and, using a simple 
example, some of the most commonly used alignment 
techniques are illustrated. Subsequently, a case is made 
for the importance of automating the process of ontology 
alignment, summarizing some of the main alignment 
systems currently in use. Finally, in the context of future 
directions, a discussion is presented of the advantages 



associated with integrating ontology alignment into 
systems that require exchanging information in an 
automatic fashion. 



BACKGROUND 

Towards the end of the 20th and beginning of the 21st
centuries, the term "ontology" (or ontologies) gained
usage in computer science to refer to a research area in
the subfield of artificial intelligence primarily concerned
with the semantics of concepts and with expressive (or
interpretive) processes in computer-based communications.
In this context, there are many definitions of
ontology, and these definitions have evolved over the
years. Gruber offered one of the first definitions of
ontology in 1993, as follows (Gruber, 1993):

"An ontology is an explicit specification of a concep- 
tualization ". 

Gruber's definition became the most frequently
referenced one in the literature, and became the base
or working definition for those working in this area.

At present, ontologies are viewed as a practical 
way to conceptualize information that is expressed in 
electronic format, and are being used in many applica- 
tions including the Semantic Web, e-Commerce, data 
warehouses, or information integration and retrieval. 
The basic idea behind these applications is to use 
ontologies to reach a common level of understanding 
or comprehension within a particular domain (e.g., 
a particular industry, medicine, housing, car repair, 
finances, etc.). 
However, certain systems that encompass a large
number of components associated with different 
domains would generally require the use of different 
ontologies. In such cases, using ontologies would not 
reduce heterogeneity but rather would recast the het- 
erogeneity problem into a different (and higher) frame- 
work wherein the problem becomes one of ontology 
alignment, thereby allowing a more efficient exchange 
of information and knowledge derived from different 
(heterogeneous) data bases, knowledge bases, and the 
knowledge contained in the ontologies themselves. 
In this manner, ontology alignment enhances system 
interoperability. 



ONTOLOGY ALIGNMENT 

Euzenat et al. defined the problem of ontology alignment 
in the following manner (Euzenat et al., 2004): 

"Given two ontologies which describe each a set of 
discrete entities (which can be classes, properties, rules, 
predicates, etc.), find the relationships (e.g. equivalence 
or subsumption) holding between these entities."

The key issue in ontology alignment is finding
which entity in one ontology corresponds (in terms of
meaning) to an entity in one or more other
ontologies. Essentially, one might say that ontology
alignment can be reduced to defining a similarity
measure between entities in different ontologies and
selecting the set of correspondences between entities
of different ontologies with the highest similarity
measures.
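
This two-step view (score every pair of entities, then keep the best-scoring correspondences) can be sketched as below. The similarity measure is a stand-in chosen for brevity, a character-level ratio from Python's standard `difflib` module; the greedy one-to-one selection, the threshold of 0.7 and the function names are likewise assumptions of this illustration rather than a method prescribed by the literature cited here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity measure: character-level ratio between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(entities_1, entities_2, threshold=0.7):
    """Greedily select one-to-one correspondences, highest similarity first."""
    scored = sorted(
        ((similarity(a, b), a, b) for a in entities_1 for b in entities_2),
        reverse=True,
    )
    used_1, used_2, result = set(), set(), []
    for score, a, b in scored:
        if score < threshold:
            break  # all remaining pairs score even lower
        if a not in used_1 and b not in used_2:
            result.append((a, b, round(score, 2)))
            used_1.add(a)
            used_2.add(b)
    return result


if __name__ == "__main__":
    ontology_1 = ["Thing", "Vehicle", "Car", "Plane"]
    ontology_2 = ["Object", "Mean of transport", "Car", "Aeroplane"]
    for a, b, score in align(ontology_1, ontology_2):
        print(a, "-", b, score)
```

With these inputs only Car - Car and Plane - Aeroplane pass the threshold; pairs such as Thing - Object share meaning but not spelling, which is why purely string-based measures are normally combined with other techniques.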

There are different methods to calculate the similarity
measures between entities, and collectively these
methods are known as ontology alignment techniques.
Many of these techniques are derived from other fields
(for instance, discrete mathematics, machine learning,
database design and pattern recognition, among others).
Consequently, some of these techniques attempt to
compare the text strings that describe the entities in the
ontologies (terminology-based ontology alignment),
while others calculate the similarity measures between
entities taking into account the structure of their
corresponding ontologies (structural ontology alignment).
A complete classification of alignment techniques has
been developed by Martinez (Martinez, 2007).

Using a simple example, the following discussion 
illustrates some of the basic ontology alignment tech- 
niques that are currently used. In this example, two 
simple ontologies are examined, as shown in Figure 
1. 

The ontologies shown in Figure 1 describe various
entities in the real world: sets of elements that
share certain characteristics, or classes (e.g., Wing,
Car, Bus, etc.), instances of classes (individuals) and
their relations (e.g., a specific Ferrari F50 belongs to a
specific person, Mark), as well as three different types
of relationships between individuals (isA, partOf and
hasOwner).

Figure 1. An example illustrating the alignment between two ontologies

Table 1. Some examples of ontology alignment techniques

Correspondence: Thing - Object; Vehicle - Mean of transport
Technique used: Language-based terminological technique
Description: A support tool such as a dictionary (e.g., WordNet,
2007) is used to discover that both terms are synonymous.

Correspondence: Car - Car; Ferrari F50 - Ferrari F50
Technique used: Terminological technique based on text strings
Description: The text strings that describe the entities coincide
completely, from which it can be inferred that both entities have
the same or similar semantics.

Correspondence: Plane - Aeroplane
Technique used: Terminological technique based on text strings (suffix)
Description: The first term is a suffix of the second, which would
indicate that a relationship exists between them.

Correspondence: Winged vehicle - Air mean
Technique used: Structural technique
Description: In the first ontology, Winged vehicle is a child class of
Vehicle and the parent class of Plane. In the second, Air mean
is a child class of Mean of transport and the parent class of
Aeroplane. Since Vehicle was shown to be equivalent to
Mean of transport, and Plane refers to the same concept
as Aeroplane, both classes have ascendants and
descendants with the same or similar semantics, indicating
a semantic relationship between them.

Each one of the ontologies presented in this ex- 
ample has its own set of entities organized according 
to a specific taxonomy. The two representations arise 
due to the fact that they correspond to two different 
perspectives or points of view, each associated with 
a different domain. However, some pairs of entities 
can be identified in these ontologies that share the 
same or similar semantics. Thus, it's probable that the 
Plane class in the first ontology and Aeroplane in the 
second ontology refer to the same concept in general 
(in the real world), given that the terms that describe 
them are synonymous terms. Table 1 shows some of 
the pairs of entities of these ontologies among which 
semantic similarities could exist, as would be revealed 
once alignment techniques are applied. The technique 
that is being applied in each case is shown, along with 
a description of the technique itself. 

Ontology Alignment Systems 

Ontology alignment is intended for use in an automated 
fashion for two primary reasons: first, it's a time-con- 
suming, tedious, and occasionally difficult task, and, 
second, its true value is revealed when it is integrated 
into processes that exchange information automatically. 
This has resulted over the past few years in the 
emergence of multiple software tools that have been 
developed by diverse research groups and well-estab- 
lished international organizations, primarily associated 
with the academic community. The tools, designed to 
automatically identify the correspondences that may 
exist between entities of different ontologies, are called 
ontology alignment systems. 

Through the development of these tools, a consid- 
erable number of ontology alignment systems have 
become available. Each one of these systems offers a 
unique set of advantages, disadvantages, and perfor- 
mance characteristics. Table 2 lists the main ontology 
alignment systems that are currently available. 

An ontology alignment system accepts two (or more) 
ontologies as input, and provides, as output, a set of 
correspondences between their elements. This set of 
correspondences is referred to as an alignment. The quality 
of a particular alignment depends on the correctness 
and completeness of the correspondences it has found. 
An alignment system is typically based on several of 
the latest alignment techniques in conjunction with 
its own methods with the aim of obtaining the most 
precise and complete alignment possible. 
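
Correctness and completeness are commonly quantified as precision and recall against a reference alignment. The sketch below assumes such a reference is available; the function name and the sample correspondences are hypothetical and chosen only for illustration.

```python
def evaluate(found, reference):
    """Return (precision, recall) of a found alignment against a reference."""
    found, reference = set(found), set(reference)
    correct = found & reference
    # Precision: share of found correspondences that are correct.
    precision = len(correct) / len(found) if found else 0.0
    # Recall: share of reference correspondences that were found.
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference alignment and system output.
reference = {
    ("Car", "Car"),
    ("Plane", "Aeroplane"),
    ("Vehicle", "Mean of transport"),
    ("Thing", "Object"),
}
found = {("Car", "Car"), ("Plane", "Aeroplane"), ("Bus", "Boat")}

precision, recall = evaluate(found, reference)
```

Here two of the three proposed correspondences are correct, and two of the four expected ones were found.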

Table 2. Ontology alignment systems 

| Name | Developed by | References |
|---|---|---|
| Anchor-PROMPT | Stanford University (USA) | Noy & Musen, 2003 |
| Chimaera | Stanford University (USA) | McGuinness, Fikes, Rice & Wilder, 2000 |
| CMS | School of Electronics and Computer Science & Advanced Knowledge Technologies group (University of Southampton), Hewlett Packard Laboratories (UK) | CMS, 2006; Kalfoglou & Hu, 2005 |
| COMA++/COMA | University of Leipzig (Germany) | COMA, 2006; Aumueller, Do, Massmann & Rahm, 2005; Massmann, Engmann & Rahm, 2006 |
| CtxMatch | University of Trento (Italy) | Zanobini, 2004 |
| GLUE | University of Washington (USA) | Doan, Madhavan, Domingos & Halevy, 2002; Doan, Madhavan, Domingos & Halevy, 2004 |
| Falcon-AO | Southeast University (China) | Jian, Hu, Cheng & Qu, 2005; Hu, Jian, Qu & Wang, 2005; Hu, Zhao & Qu, 2006; Hu, Cheng, Zheng, Zhong & Qu, 2006 |
| FOAM [APFEL, NOM, QOM] | University of Karlsruhe (Germany) | Ehrig & Staab, 2004; Ehrig & Sure, 2005; Ehrig, Staab & Sure, 2005 |
| HCONE-merge | University of Aegean (Greece) | Kotis, Vouros & Padilla, 2004; Kotis, Vouros & Stergiou, 2005; Vouros & Kotis, 2005 |
| H-Match | University of Milan (Italy) | Castano, Ferrara & Montanelli, 2003 |
| LOM | Teknowledge Corporation (Palo Alto, USA) | Li, 2004 |
| MAFRA | Instituto Politecnico do Porto (Portugal) | Maedche, Motik, Silva & Volz, 2002 |
| MapOnto | University of Toronto (Canada), University of Rutgers (USA) | An, Borgida & Mylopoulos, 2005 |
| MetaQuerier | University of Illinois (USA) | Chang, He & Zhang, 2004; Chang, He & Zhang, 2005 |
| MoA | Electronics and Telecommunications Research Institute (Korea) | Jaehong et al., 2005 |
| OLA | INRIA Rhône-Alpes (France), University of Montreal (Canada) | Euzenat, Loup, Touzani & Valtchev, 2004; Euzenat & Valtchev, 2004; Euzenat, Guerin & Valtchev, 2005 |
| OntoBuilder | Technion Israel Institute of Technology (Israel) | Gal, Modica & Jamil, 2004 |
| OntoMerge | Yale University (USA), University of Oregon (USA) | Dou, McDermott & Qi, 2002 |
| Rondo | University of Leipzig (Germany), Microsoft Research (USA) | Melnik, Rahm & Bernstein, 2003 |
| S-Match | University of Trento (Italy) | Giunchiglia, Shvaiko & Yatskevich, 2004 |
| SAMBO | University of Linköping (Sweden) | Lambrix & Tan, 2006 |

FUTURE TRENDS 

At present, there are several ontology alignment 
systems capable of identifying, with acceptable 
efficiency, semantic correspondences that may exist 
between entities associated with different ontologies. 
However, the true potential of ontology alignment 
will be realized when this methodology is integrated 
in processes that require that information between 
different systems be exchanged fully automatically. 
This would be achievable when ontology alignment 
systems become sufficiently powerful to resolve, in 
real time and with minimal error, alignment problems 
in specific domains. 

Once these issues are successfully addressed, it 
will become possible to attain an appropriate level of 
interoperability between heterogeneous systems that 
were previously not exploited jointly, thereby repre- 
senting a high water mark in the field of information 
and communications technologies. Multiple systems 
of different characteristics and origins would thus be 
able to communicate with each other, making it possible 
to reveal new knowledge that could previously have 
remained undiscovered in disjointed information 
systems. This would potentially provide human users 
with a wide range of automated intelligent systems 
and services capable of interrelating with each other 
without external assistance, which in turn would con- 
siderably facilitate one of the most challenging tasks: 
the automatic, efficient, and reliable exploitation of 
large quantities of information. 



CONCLUSION 

In some applications, the use of a single ontology to 
fully describe an entire domain is generally not an 
adequate solution, and it normally becomes necessary 
to use different ontologies. In such cases, the need 
arises to find relationships between the elements of 
the different ontologies, a process known as ontology 
alignment. 

Automation of the ontology alignment process can 
be reasonably achieved, which is precisely why this 
process is especially useful in environments or ap- 
plications that require the automatic interoperability 
between systems. Currently, there are numerous ontol- 
ogy alignment systems available, and most of these are 
the result of academic or basic research. These systems 
can be viewed as software tools capable of finding cor- 
respondences or relationships that may exist between 
the elements of different ontologies. These tools can 
provide rather remarkable results, especially when taking 
into account the fact that they essentially remain 
works in progress, still in the initial development or 
testing phases. 

In the future, it is expected that ontology alignment 
systems will reach acceptable levels of robustness, ef- 
ficiency, and reliability, which would make it possible 
to apply these systems to processes that automatically 
exchange data between different systems that indi- 
vidually utilize different ontologies. These automated 
interactions between systems would not only reduce 
user intervention but would also automate many 
time-consuming, complex, and computationally costly 
tasks that are currently either performed manually or 
not at all. 



ACKNOWLEDGMENT 

This work was partially supported by the Spanish Minis- 
try of Education and Culture (Ref TIN2006-13274) and 
the European Regional Development Funds (ERDF), 
grant (Ref. PIO52048) funded by the Carlos III Health 
Institute, grant (Ref. PGIDIT 05 SIN 10501 PR) from 
the General Directorate of Research of the Xunta de 
Galicia and grant (File 2006/60) from the General Di- 
rectorate of Scientific and Technologic Promotion of 
the Galician University System of the Xunta de Galicia. 
The work of Jose M. Vazquez is supported by an FPU 
grant (Ref AP2005-1415) from the Spanish Ministry 
of Education and Science. 

REFERENCES 

An, Y., Borgida, A., & Mylopoulos, J. (2005). Constructing Complex Semantic Mappings between XML Data and Ontologies. Proceedings of ISWC'05. 

Aumueller, D., Do, H.H., Massmann, S., & Rahm, E. (2005). Schema and ontology matching with COMA++. SIGMOD Conference. 

Castano, S., Ferrara, A., & Montanelli, S. (2003). H-MATCH: an algorithm for dynamically matching ontologies in peer-based systems. Proceedings of the First Workshop on Semantic Web and Databases (SWDB-03), VLDB 03, Berlin, Germany. 

Chang, C., He, B., & Zhang, Z. (2004). MetaQuerier over the Deep Web: Shallow Integration across Holistic Sources. Proceedings of the VLDB Workshop on Information Integration on the Web (VLDB-IIWeb '04), Toronto, Canada. 

Chang, C., He, B., & Zhang, Z. (2005). Towards Large Scale Integration: Building a MetaQuerier over Databases on the Web. Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California. 

COMA Website (2006). URL: http://dbs.uni-leipzig.de/en/Research/coma.html/ 

Crosi Mapping System Website (2006). URL: http://www.aktors.org/crosi/deliverables/summary/cms.html/ 

Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2002). Learning to map between ontologies on the semantic web. Proceedings of the World-Wide Web Conference, Hawaii, USA. 

Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2004). Ontology Matching: A Machine Learning Approach. Staab, S. & Studer, R. (eds.). Handbook on Ontologies in Information Systems, Springer-Verlag, 397-416. 

Dou, D., McDermott, D., & Qi, P. (2002). Ontology translation by ontology merging and automated reasoning. Proceedings of the EKAW2002 Workshop on Ontologies for Multi-Agent Systems. Sigüenza, Spain. 

Ehrig, M., & Staab, S. (2004). QOM - Quick Ontology Mapping. Proceedings of the Third International Semantic Web Conference, LNCS 3298, 683-697. Springer, Hiroshima, Japan. 

Ehrig, M., & Sure, Y. (2005). FOAM - Framework for Ontology Alignment and Mapping - Results of the Ontology Alignment Evaluation Initiative. Proceedings of the Workshop on Integrating Ontologies, 156, 72-76. 

Ehrig, M., Staab, S., & Sure, Y. (2005). Bootstrapping Ontology Alignment Methods with APFEL. Proceedings of the 4th International Semantic Web Conference, ISWC2005, LNCS 3729, 186-200. Springer. 

Euzenat, J., Le Bach, T., Barrasa, J., Bouquet, P., De Bo, J., Dieng, R., Ehrig, R., et al. (2004). State of the art on ontology alignment. Deliverable D2.2.3 v1.2. Knowledge Web. URL: http://knowledgeweb.semanticweb.org/ 

Euzenat, J., & Valtchev, P. (2004). Similarity-based ontology alignment in OWL-Lite. Proceedings of the 16th European Conference on Artificial Intelligence (ECAI), 333-337. Amsterdam, Holland. 

Euzenat, J., Loup, D., Touzani, M., & Valtchev, P. (2004). Ontology alignment with OLA. Proceedings of the 3rd ISWC2004 Workshop on Evaluation of Ontology-based Tools (EON), 59-68, Hiroshima, Japan. 

Euzenat, J., Guerin, P., & Valtchev, P. (2005). OLA in the OAEI 2005 alignment contest. Proceedings of the K-Cap 2005 Workshop on Integrating Ontologies, 97-102, Banff, Canada. 

Gal, A., Modica, G. A., & Jamil, H. M. (2004). OntoBuilder: Fully Automatic Extraction and Consolidation of Ontologies from Web Sources. Proceedings of the ICDE 2004. 

Giunchiglia, F., Shvaiko, P., & Yatskevich, M. (2004). S-Match: An Algorithm and an Implementation of Semantic Matching. Proceedings of ESWS'04. 

Gruber, T. R. (1993). A translation approach to portable ontology specification. Knowledge Acquisition, 5(2), 199-220. 

Hu, W., Jian, N., Qu, Y., & Wang, Y. (2005). GMO: A graph matching for ontologies. Proceedings of the K-CAP Workshop on Integrating Ontologies, 41-48. 

Hu, W., Zhao, Y., & Qu, Y. (2006). Partition-based block matching of large class hierarchies. Proceedings of the 1st Asian Semantic Web Conference (ASWC'06), 72-83. 

Hu, W., Cheng, G., Zheng, D., Zhong, X., & Qu, Y. (2006). The Results of Falcon-AO in the OAEI 2006 Campaign. ISWC Ontology Matching Workshop. Athens, USA. 

Jaehong, K., Jang, M., Young-Guk, H., Joo-Chan, S., & Jo, S. (2005). MoA: OWL ontology merging and alignment tool for the semantic web. Lecture Notes in Computer Science, 3533/2005, 722-731, Springer. 

Jian, N., Hu, W., Cheng, G., & Qu, Y. (2005). Falcon-AO: Aligning Ontologies with Falcon. Proceedings of the K-Cap 2005 Workshop on Integrating Ontologies, 85-91, Banff, Canada. 

Kalfoglou, Y., & Hu, B. (2005). CMS: CROSI Mapping System - Results of the 2005 Ontology Alignment Contest. Proceedings of the K-Cap '05 Integrating Ontologies Workshop, 77-85, Banff, Canada. 

Kotis, K., Vouros, G. A., & Padilla, J. (2004). HCOME: tool-supported methodology for collaboratively devising living ontologies. Semantic Web and Databases. Second International Workshop, SWDB. Toronto, Canada. 

Kotis, K., Vouros, G., & Stergiou, K. (2005). Towards Automatic Merging of Domain Ontologies: The HCONE-merge approach. Elsevier's Journal of Web Semantics (JWS), 4:1, 60-79. 

Lambrix, P., & Tan, H. (2006). SAMBO - A System for Aligning and Merging Biomedical Ontologies. Journal of Web Semantics, Special issue on Semantic Web for the Life Sciences, 4(3), 196-206. 

Li, J. (2004). LOM: A Lexicon-based Ontology Mapping Tool. Proceedings of the Performance Metrics for Intelligent Systems (PerMIS '04). 

Maedche, A., Motik, B., Silva, N., & Volz, R. (2002). MAFRA - A Mapping Framework for Distributed Ontologies. Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management (EKAW). Sigüenza, Spain. 

Martinez, M. (2007). Analysis and comparative study of ontology alignment systems, and development of an ontology alignment system optimized for aligning medical ontologies. Pazos, A., Vazquez, J.M. (dirs.). University of A Coruña. Final project. 

Massmann, S., Engmann, D., & Rahm, E. (2006). COMA++: Results for the Ontology Alignment Contest OAEI 2006. International Workshop on Ontology Matching (5th ISWC-2006), Athens, Georgia, USA. 

Melnik, S., Rahm, E., & Bernstein, P. A. (2003). Rondo: A Programming Platform for Model Management. Proceedings of ACM SIGMOD 2003, San Diego, USA. 

McGuinness, D. L., Fikes, R., Rice, J., & Wilder, S. (2000). An environment for merging and testing large ontologies. Proceedings of the 7th Intl. Conf. on Principles of Knowledge Representation and Reasoning (KR2000). Colorado, USA. 

Noy, N. F., & Musen, M. A. (2003). The PROMPT Suite: Interactive Tools for Ontology Merging and Mapping. International Journal of Human-Computer Studies, 59(6), 983-1024. 

Vouros, G., & Kotis, K. (2005). Extending HCONE-merge by approximating the intended interpretations of concepts iteratively. 2nd European Semantic Web Conference, Heraklion, Crete, Greece. 

WordNet (2007). Cognitive Science Laboratory. Princeton University. URL: http://wordnet.princeton.edu/ 

Zanobini, S. (2004). Improving ctxmatch by means of grammatical and ontological knowledge - in order to handle attributes. Technical Report 554, Department of Information and Communication Technology, University of Trento, Italy. 



KEY TERMS 

Class: A set that contains individuals which share 
certain characteristics. The word concept is sometimes 
used in place of class. Classes are a concrete represen- 
tation of concepts. 

Individual: An object in the domain that we are 
interested in. Individuals are also known as instances 
of classes. 

Interoperability: A state or situation through 
which heterogeneous systems can exchange data and/or 
processes. 

Mapping: A correspondence found during the 
process of ontology alignment. 

Ontology: A formal and explicit specification of a 
shared conceptualization. 

Ontology Alignment: A process that consists of 
finding the semantic relationships that may exist be- 
tween different elements in different ontologies. 

Ontology Alignment System: A software tool 
capable of conducting the alignment of ontologies in 
an automated fashion. 

Ontology Mapping: See ontology alignment. 

Ontology Matching: See ontology alignment. 

Relation: A link between individuals. In the field of 
ontologies, relations are also known as properties. 

INTRODUCTION 

Sometimes the use of a single ontology is not sufficient 
to cover different vocabularies for the same domain, 
and it becomes necessary to use several ontologies in 
order to encompass the entire domain knowledge and its 
various representations. Disciplines where this occurs 
include medical science and biology, as well as many 
of their associated subfields, such as genetics and 
epidemiology. This may be due to a domain's complexity, 
expansiveness, and/or different perspectives of the same 
domain on the part of different groups of users. In such 
cases, it is essential to find relationships that may exist 
between the elements of a specific domain's different 
ontologies, a process known as ontology alignment. 

There are several methods for identifying the rela- 
tionships or correspondences between elements associ- 
ated with different ontologies, and collectively these 
methods are called ontology alignment techniques. 
Many of these techniques stem from other fields of study 
(e.g., matching techniques in discrete mathematics) 
while others have been specifically designed for this 
purpose. The key to successfully aligning ontologies is 
based on the appropriate selection and implementation 
of a set of those ontology alignment techniques best 
suited for a particular alignment problem. 

Ontology alignment is a complex, tedious, and 
time-consuming task, especially when working with 
ontologies of considerable size (containing, for instance, 
thousands of elements or more) that have complex 
relationships between their elements (for example, a 
particular problem domain in medicine). Furthermore, 
the true potential of ontology alignment is realized when 
it is integrated into processes that exchange information 
automatically, thereby providing the framework for 
reaching a suitable level of efficient interoperability 
between heterogeneous systems. Automatic ontology 
alignment has therefore been a topic of major interest 
in recent years, and there has been a surge in software 
tools dedicated to aligning ontologies in either a fully 
or partially automated fashion. Some of these tools, 
generally referred to as ontology alignment systems, have 
been developed by well known and respected research 
centers, including Stanford University and Hewlett 
Packard Laboratories. Updated information regarding the 
currently available ontology alignment systems is given 
in Shvaiko & Euzenat, 2007. 

Each ontology alignment system combines different 
alignment approaches along with its own techniques, 
such that correspondences between the different 
ontologies can be detected in the most complete, precise, 
and efficient manner. Since each system is based on 
its own approximation techniques, different systems 
yield different results, and therefore the quality of the 
results can vary among systems. Most alignment 
systems are oriented toward solving problems of a general 
nature. However, ontologies associated with a single 
domain share certain characteristics that set them apart 
from ontologies associated with other domains, and 
some systems have recently emerged that are designed 
to align ontologies in a specific domain. An example is 
the SAMBO alignment system (Lambrix & Tan, 2006) in the 
biomedical domain. These and other domain-specific 
systems can produce excellent results (when used for 
the domains for which they were designed), but are 
generally not useful when applied to other domains. 

This article presents a classification of the most 
commonly used, recently developed alignment techniques, 
supported by simple examples to illustrate the specific 
techniques underlying different systems. Future 
directions in ontology alignment are also examined. 



BACKGROUND 

The key to ontology alignment is to find those entities 
in one ontology that may correspond to other entities 
in another ontology. Basically, this can be viewed as 
finding a similarity measure between elements (or so- 
called entities) associated with different ontologies, 
and subsequently selecting the set of correspondences 
that produce the strongest measures of similarity. There 
are, however, different ways to compute similarity 
measures; there are various studies dedicated to the 
classification of these techniques (Rahm & Bernstein, 
2001, Euzenat & Valtchev, 2004, Euzenat et al., 2004, 
Shvaiko & Euzenat, 2005). 

Following these classification schemes (especially 
those by Euzenat and Valtchev (Euzenat & Valtchev, 
2004) and by Euzenat et al., 2004), the next section 
introduces an abbreviated classification of the ontology 
alignment techniques that are most commonly utilized 
by current ontology alignment systems. This condensed 
classification is centered on the type of element being 
manipulated by the alignment technique, and complements 
the taxonomy proposed by Rahm and Bernstein (Rahm & 
Bernstein, 2001). For the purpose of clarity and brevity, 
it summarizes only those alignment techniques that 
compare a single element in one ontology with a single 
element in another ontology (known as local alignment 
techniques, as in Euzenat et al., 2004). 



ONTOLOGY ALIGNMENT TECHNIQUES 

Ontology alignment techniques can be classified ac- 
cording to the following (please refer to Figure 1): 

1. Terminological techniques. These calculate the 
similarity between the text strings that describe the 
elements in the ontologies (names, labels, and/or 
comments). There are two types of terminological 
techniques: those based on text strings and those 
based on the language. 

1.1. Terminological techniques based on text strings. 

These techniques compare the structure of text strings, 
which are viewed as sequences of characters. They assume 
that the similarity between two terms increases with the 
similarity between their corresponding text strings, 
without considering the underlying semantics of the 
terms. Thus, applying a technique of this type to the 
terms Apple and Apples would yield a relatively high 
similarity measure, whereas applying it to the terms 
Apple and Orange would yield a lower one, since in the 
second case the text strings are quite different. Using 
these techniques in isolation is usually not recommended; 
it is preferable to combine them with other, more 
powerful alignment techniques. The following example 
illustrates why: a string-based technique would 
erroneously conclude that the terms Cream and Scream 
are highly similar (although their meanings are very 
different), and that the terms Student and Pupil are very 
dissimilar (although the semantic concepts are generally 
the same). Some examples of terminological techniques 
based on text strings are the distance measure proposed 
by Hamming (Hamming, 1950), which counts the number of 
positions at which two text strings of equal length 
differ; the distance measure suggested by Levenshtein 
(Levenshtein, 1966), which is the minimum number of 
operations (insertions, deletions and/or substitutions) 
necessary to transform one text string into another; and 
the Jaro distance measure (Jaro, 1989), which analyzes 
the number and order of the characters that two text 
strings have in common. 

1.2. Terminological techniques based on language. 

These techniques are more complex but more reliable 
than those previously discussed, and do not treat terms 
as simple sequences of characters that are independent 
of one another. Rather, they view terms as groups of 
elements with meaning (lexemes and morphemes, e.g., 
prefixes and suffixes). Their main objective is to 
discover the similarity that may exist between terms 
associated with one concept, even though the terms may 
be formed by very different strings of characters. In 
other words, these techniques attempt to neutralize the 
terminological variations that can affect the terms 
being compared. These techniques, in turn, can be 
classified as intrinsic or extrinsic approaches: 

1.2.1. Intrinsic techniques. These are oriented toward 
detecting the similarity between terms that have 
undergone morphological and syntactical variations (e.g., 
Mean of transport, Mean of transportation, Transportation 
mean), as in the Porter Stemming Algorithm (Porter, 
1980). 

1.2.2. Extrinsic techniques. These consist of utilizing 
external linguistic resources, such as dictionaries 
and thesauri, in order to find the similarity between 
lexical variations of the same term (e.g., Mean of 
transport and Vehicle). Extrinsic techniques rely on the 
fact that there is usually an equivalence relationship 
between synonyms, and a subsumption relationship between 
a hyponym and its hypernym. In this manner, an alignment 
system based on extrinsic terminological techniques would 
presumably be capable of detecting, for instance, an 
equivalence relationship between the terms Leukocyte 
and White blood cell (since they are synonymous) and 
a subsumption relationship between Myocyte and Cell 
(since Myocyte is a type of Cell). Some of the external 
linguistic resources most commonly used by alignment 
systems include WordNet (WordNet, 2007), as an 
English-language resource, and UMLS (National Library of 
Medicine, 2007) in the medical domain. Other extrinsic 
techniques in use include multilingual techniques, 
dedicated to finding relationships between terms written 
in different languages (such as the Spanish word célula 
and its English counterpart, cell) using multilingual 
dictionaries such as EuroWordNet (Vossen, 1997). 
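
The extrinsic approach can be sketched with a tiny hand-written thesaurus standing in for a resource such as WordNet or UMLS. The `SYNONYMS` and `HYPERNYMS` tables and the `extrinsic_relation` function are hypothetical and exist only for this illustration; they do not reflect the interface of any real lexical resource.

```python
# Toy thesaurus (assumed data): synonym pairs and direct hypernyms.
SYNONYMS = {
    frozenset({"leukocyte", "white blood cell"}),
    frozenset({"vehicle", "mean of transport"}),
}
HYPERNYMS = {
    "myocyte": {"cell"},
    "leukocyte": {"cell"},
}


def extrinsic_relation(term_a: str, term_b: str):
    """Return the relation suggested by the thesaurus, or None."""
    a, b = term_a.lower(), term_b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return "equivalent"    # synonyms suggest equivalence
    if b in HYPERNYMS.get(a, set()):
        return "subsumed by"   # a is a kind of b
    if a in HYPERNYMS.get(b, set()):
        return "subsumes"      # b is a kind of a
    return None


synonym_case = extrinsic_relation("Leukocyte", "White blood cell")
hyponym_case = extrinsic_relation("Myocyte", "Cell")
```

A realistic system would also follow chains of hypernyms and handle multi-sense terms, which this sketch deliberately leaves out.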



2. Structural Techniques. In addition to compar- 
ing text strings that describe the entities in each 
ontology, it is frequently useful to compare the 
internal structure of the entities themselves, or the 
relationships that each entity may maintain with 
other entities (external structure comparison). 

Figure 1. Ontology alignment techniques 

2.1. Internal structure comparison techniques. These 
techniques compare internal characteristics of the 
entities, such as the range, cardinality, transitivity, 
and/or symmetry of their properties (attributes and 
relationships). For instance, if in one ontology A there 
is an entity Person with three attributes (birth_date 
of type date; name of type string; and weight of type 
int), and in another ontology B there is an entity 
Human_being with two attributes (date_of_birth of type 
date; and first_name of type string), a technique of this 
type might conclude that there is a certain similarity 
between these two entities, since the types of two of 
the attributes coincide. In this concrete case the 
technique's conclusion would have been correct: Person 
and Human_being refer to the same concept in the real 
world. However, it is easy to find cases in which the 
technique would produce erroneous results. For instance, 
if the entity in ontology B were Car with three 
attributes (registration_date of type date; color of 
type string; and weight of type int), a comparison of 
internal structure might suggest that the entities 
Person and Car were similar, since the ranges of the 
three attributes coincide, although in reality they 
are entities associated with very different semantics. 
Consequently, given that it is frequently possible to 
find multiple entities in an ontology that present 
similar internal characteristics, these techniques tend 
to be used in conjunction with other techniques (such 
as terminological techniques). It is probably wise to 
compare the internal structure during the initial 
alignment stages, in order to filter pairs of entities 
that could be related, and subsequently apply other 
techniques before finally deciding on the overall 
level of similarity. 
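
The Person, Human_being and Car example can be reproduced with a minimal sketch that compares only the attribute types of two entities. The scoring rule (shared types divided by the larger attribute count) is an assumption chosen for illustration; real systems may also weigh cardinality, transitivity and other characteristics.

```python
from collections import Counter


def internal_similarity(attrs_a: dict, attrs_b: dict) -> float:
    """Share of attribute types that two entities have in common."""
    if not attrs_a or not attrs_b:
        return 0.0
    types_a = Counter(attrs_a.values())
    types_b = Counter(attrs_b.values())
    shared = sum((types_a & types_b).values())  # multiset intersection
    return shared / max(len(attrs_a), len(attrs_b))


# Entities from the example, as attribute name -> type.
person = {"birth_date": "date", "name": "string", "weight": "int"}
human_being = {"date_of_birth": "date", "first_name": "string"}
car = {"registration_date": "date", "color": "string", "weight": "int"}

person_vs_human = internal_similarity(person, human_being)
person_vs_car = internal_similarity(person, car)
```

Under this rule Car scores even higher against Person than Human_being does, which is exactly the kind of misleading result that motivates using internal structure only as an early filter.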

2.2. External structure comparison techniques. These 
techniques compute the similarity that may exist 
between entities by considering the position that the 
entities in question occupy within their respective 
ontologies. The underlying principle is that, if two 
entities are similar, then there is likely to be some 
similarity between their adjacent (or neighboring) 
entities. These techniques tend to treat ontologies as 
graphs in which each vertex is an entity in the ontology 
and each edge is a relationship between entities; 
algorithms that are especially designed to work with 
graphs are used to find the relationships between 
elements in the ontologies. As a matter of fact, this 
problem is equivalent to that of solving a graph 
homomorphism problem (Garey, 1979). One of the better 
known techniques for making the external-structure 
comparison is the one used by the Anchor-PROMPT ontology 
alignment system (Noy & Musen, 2000), which is based on 
the idea that if two pairs of entities in the source 
ontologies are similar and there are paths connecting 
them, then the elements in those paths are often 
similar as well. 
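
The neighborhood principle can be illustrated with a deliberately simple sketch: two entities are scored by how many of their neighbors are already known to correspond. This is not the Anchor-PROMPT algorithm; the function, the graph encoding (entity to set of neighbors) and the toy class names are assumptions made for the example.

```python
def neighbour_similarity(entity_a, entity_b, graph_a, graph_b, anchors):
    """Fraction of neighbours that are already matched to each other.

    graph_a / graph_b map each entity to the set of its neighbours;
    anchors maps entities of ontology A to their known match in B.
    """
    neighbours_a = graph_a.get(entity_a, set())
    neighbours_b = graph_b.get(entity_b, set())
    if not neighbours_a or not neighbours_b:
        return 0.0
    matched = sum(1 for n in neighbours_a if anchors.get(n) in neighbours_b)
    return matched / max(len(neighbours_a), len(neighbours_b))


# Hypothetical fragments of two vehicle ontologies.
graph_a = {"Winged vehicle": {"Vehicle", "Plane"}, "Car": {"Vehicle"}}
graph_b = {"Air mean": {"Mean of transport", "Aeroplane"}}
anchors = {"Vehicle": "Mean of transport", "Plane": "Aeroplane"}

winged_vs_air = neighbour_similarity(
    "Winged vehicle", "Air mean", graph_a, graph_b, anchors)
car_vs_air = neighbour_similarity(
    "Car", "Air mean", graph_a, graph_b, anchors)
```

Because the score depends on previously established anchors, a technique of this kind is typically run after terminological techniques have produced an initial set of correspondences.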



3. Extensional techniques. These extensional (or 
extensible) techniques compare the extension 
of the classes of the ontologies: in other 
words, their sets of instances or examples. This is 
useful when the information about the entities to 
be compared is limited but there is additional data 
or information about their examples; alternately, 
they are useful as a means of supporting other 
alignment techniques in order to detect erroneous 
or misleading correspondences. For instance, if an 
ontology contains a class denoted as Human_be- 
ing with two instances, John and Mary, and the 
other ontology contains a class labeled Person 
with the same instances (John and Mary), then it 
could be inferred, by comparing all the instances 
of the ontologies, that the classes are similar. 
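
The instance-based comparison described above can be sketched as a set-overlap measure. The snippet below is a minimal illustration, assuming the Jaccard coefficient as the similarity measure (the text does not name one); the function name `jaccard` is hypothetical.

```python
def jaccard(instances_a, instances_b):
    """Jaccard similarity between two instance sets (0.0 when both are empty)."""
    a, b = set(instances_a), set(instances_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Instances of Human_being in one ontology and of Person in the other.
human_being = {"John", "Mary"}
person = {"John", "Mary"}

similarity = jaccard(human_being, person)  # identical extensions
```

A score close to 1 suggests that the two classes are similar; as the text notes, such evidence is better used to support or reject correspondences proposed by other techniques than on its own.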

4. Semantic techniques. These types of techniques 
attempt to align the elements in the ontologies 
according to their semantic interpretation. The 
general approach is based on deductive methods 
that draw from theoretical models that provide a 
justification for the results that are obtained. Some 
examples include Propositional SATisfiability 
(SAT) techniques and techniques based on Description Logics 
(DL). 

4.1. SAT techniques: the application of SAT techniques 
to the ontology alignment problem consists of translating 
the information associated with pairs of terms 
between which a relationship could exist into a 
propositional formula. The formula would be of the form 
$\mathit{Axioms} \rightarrow \mathit{rel}(\mathit{element}_1, \mathit{element}_2)$, 
where $\mathit{element}_1$ and $\mathit{element}_2$ are the entities in the ontologies that are being 
examined to determine if there is a semantic relation- 
ship between them, and rel is the relationship that exists 
between the entities. Subsequently, the validity of the 
relationship (the aforementioned formula) is evalu- 
ated. The advantage of using SAT techniques is that 
it supports an exhaustive analysis of all the possible 
correspondences as well as the possibility of selecting 
only the major correspondences. 
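
The validity check described in 4.1 can be sketched with a brute-force truth table. This is only an illustration: the propositional variables (`dog`, `mammal`, `animal`) and the axioms are hypothetical, subsumption is encoded as implication, and a real system would hand the negated formula to a SAT solver rather than enumerate every assignment.

```python
from itertools import product


def implies(p, q):
    return (not p) or q


def is_valid(formula, variables):
    """True if formula(assignment) holds for every truth assignment."""
    for values in product([False, True], repeat=len(variables)):
        if not formula(dict(zip(variables, values))):
            return False
    return True


names = ["dog", "mammal", "animal"]


def axioms(v):
    # Hypothetical background knowledge: dog -> mammal, mammal -> animal.
    return implies(v["dog"], v["mammal"]) and implies(v["mammal"], v["animal"])


# Candidate relationship rel(element1, element2): dog is subsumed by animal.
holds = is_valid(lambda v: implies(axioms(v), implies(v["dog"], v["animal"])), names)

# The reverse relationship (animal subsumed by dog) is not entailed by the axioms.
reverse_holds = is_valid(
    lambda v: implies(axioms(v), implies(v["animal"], v["dog"])), names
)
```

Only relationships for which the whole formula is valid are kept as correspondences.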

4.2. Techniques based on DL: the expressivity of 
propositional language used by SAT techniques is 
limited, as they are unable to work with certain types of 
predicates. However, Description Logics provides the 
necessary expressivity to code alignment problems as 
propositional validity problems with greater flexibility. 
For instance, if an ontology contains the classes City, 
Worker and Industrial_city, as a City with more than 
600,000 Workers, and another ontology contains the 
classes Big_town, Inhabitant and Crowded_big_town, 
as a Big_town with more than 500,000 Inhabitants and 
it is established that all Workers are Inhabitants and 
that City is equivalent to Big_town, then a DL-based 
technique could deduce that an Industrial_city is a 
Crowded_big_town. 



FUTURE TRENDS 

Current ontology alignment systems take as input two 
ontologies and, once the alignment process is executed, 
yield as output a set of correspondences between their 
elements. Using up-to-date alignment techniques, this 
process is still very time consuming and computation- 
ally expensive especially in those cases where the input 
ontologies are large. This may not present a challenge in 
cases where the same ontologies are always used, since 
in such cases it would only be necessary to perform the 
alignment once, and subsequently the correspondences 
that have been revealed could be reutilized. 

However, there are applications or contexts where 
it becomes necessary to instantly identify which entity 
in ontology A corresponds with an entity in ontology 
B, without previously "knowing" the ontologies. In 
these cases, current ontology alignment techniques 
are limited, as is the case with the Semantic Web or 
the integration of information from different sources 
that were mutually "unknown" to each other. In these 
types of problems, it is more important to reduce the 
computational time that is necessary to carry out the 
alignment, although the quality of the alignment could 
be somewhat affected. As a result, it is very probable 
that in the next few years the field of ontology align- 
ment will see a major thrust being placed on exploring 
techniques capable of finding correspondences in an 
increasingly shortened amount of time. 

It is also expected that new techniques will emerge 
that will allow the consultation or usage of external 
linguistic resources in a more efficient and powerful 
manner than is now possible. The utilization of external 
resources is essential in alignment problems associated 
with specific domains, although current approaches are 
not capable of achieving optimal usage of these types 
of resources, thereby wasting a significant amount of 
potentially useful information. 



CONCLUSION 

Ontology alignment is an important aspect of practically 
any domain or application area where it is necessary to 
use an ontology. There are various approaches to find- 
ing semantic correspondences that may exist between 
elements of different ontologies, known as ontology 
alignment techniques. This paper has presented a 
condensed classification of those ontology alignment 
techniques that are most commonly used today. 

Clearly, not all alignment techniques are equally 
applicable to any problem. For instance, it is not useful 
to apply an extensional technique to ontologies that 
have no instances. Consequently, a number of factors 
ought to be considered when selecting among different 
alignment techniques for application to a particular 
problem. Among these are the domain to which the 
ontologies belong, the language in which the ontologies 
are expressed, the number and type of elements con- 
tained in the ontologies, etc. And, although a particular 
technique may be applicable to a specific alignment 
problem, there is also the question of errors. As a result, 
it should be stressed that aligning two ontologies is not 
simply the application of an alignment technique in 
an isolated manner: rather, the goal is mainly to find 
the appropriate combination of alignment techniques 
to be applied, such that the strengths of one technique 
can compensate another technique's weaknesses and 
limitations, with the overarching objective of uncov- 
ering an optimal set of correspondences between the 
ontologies of interest. 

KEY TERMS 

Domain: Specific areas of interest (e.g., artworks 
by Picasso) or of knowledge (e.g., medicine, physics, 
etc.). 

Ontology: A formal and explicit specification of a 
shared conceptualization. 

Ontology Alignment: A process that consists of 
finding the semantic relationships that may exist be- 
tween different elements in different ontologies. 

Ontology Alignment System: A software tool 
capable of conducting the alignment of ontologies in 
an automated fashion. 

Ontology Alignment Technique: Method used to 
identify the semantic correspondences that may exist 
between the elements of different ontologies. 

Ontology Entity: An ontology entity represents a 
conceptual element of the domain of discourse. 

Thesaurus: A networked collection of controlled 
vocabulary terms. 

INTRODUCTION 

A genetic algorithm is a global search method based on 
an analogy with natural evolution. Genetic Algorithms 
have demonstrated good performance on difficult problems 
where the function to minimize is complicated. 
In this work we apply this optimization method to 
improve the acoustical properties of Sonic Crystals 
(Martinez-Sala et al., 1995; Kushwaha et al., 1994), 
a kind of structure used in acoustics. 

In the last few years the propagation of acoustic 
waves in heterogeneous materials whose acoustic 
properties vary periodically in space has attracted 
considerable interest. The so-called Sonic Crystals 
are the typical example of this kind of material in the 
range of acoustic frequencies. These systems are 
defined as periodic structures with strong modulation 
of the elastic constants between the scatterers and the 
surrounding material. 

Recently, the strategy to enhance the properties of Sonic 
Crystals has been based on the use of scatterers with 
added acoustical properties. The use of local resonators 
(Liu et al., 2000) or Helmholtz resonators (Hu et al., 
2005) as scatterers has produced very good results. 
Some authors have also built new structures with scatterers 
made of porous material, improving the attenuation 
capability of the Sonic Crystals (Umnova et al., 
2006). However, the use of Sonic Crystals as outdoor 
acoustic barriers requires scatterers made of robust 
and long-lasting materials. For this reason, it is 
interesting to analyze the possibility of optimizing 
the attenuation capability of Sonic Crystals made 
with rigid scatterers of materials like wood, PVC or 
aluminium. The creation of vacancies in a Sonic Crystal 
improves its attenuation capability (Caballero 
et al., 2001). However, there is no general 
rule about how to create vacancies in a Sonic Crystal. 
In fact, similar structures can produce very different 
acoustic fields behind them. 

Because of the complexity of the mathematical functions 
involved in Sonic Crystal calculations, the Genetic Algorithm 
appears as a tool especially suited to this kind of 
problem (Hakanson et al., 2004; Romero-Garcia et 
al., 2006). This procedure can work together with 
Multiple Scattering Theory (MST), a self-consistent 
method for calculating the acoustic pressure including 
all orders of scattering (Chen & Ye, 2001). Given a 
starting Sonic Crystal, the Genetic Algorithm generates 
Quasi Ordered Structure offspring by creating 
vacancies; the offspring are ranked by 
a cost function based on the pressure values at a 
specific point. The pressure scattered by every 
structure analyzed by the Genetic Algorithm is computed 
with a two-dimensional (2D) Multiple Scattering Theory. 
The present work shows an improvement of the 
Genetic Algorithm based on a parallel implementation; 
as a consequence, new and better results are obtained 
in the design of Quasi Ordered Structures made with 
rigid cylinders that attenuate sound in a predetermined 
band of frequencies. 



SONIC CRYSTALS 

Sonic Crystals are arrays of scatterers placed periodically 
in space whose physical properties are different from 
those of the surrounding material. In the low frequency range, 
Sonic Crystals behave as a homogeneous medium 
with an acoustic impedance greater than that of air, 
so they can work as refractive devices. 
Moreover, Sonic Crystals present band gaps, i.e., ranges 
of sound frequencies where the sound propagation 
inside the crystal is forbidden. The presence of these 
band gaps is explained by the well-known Bragg's law. 
The reflections inside the crystal, and consequently the 
position of the gaps depend on the lattice constant, i.e., 
on the geometry of the Sonic Crystals. The existence, 
in periodic media, of an absolute band gap where the 
propagation of sound is forbidden for every incidence 
direction, can have a profound impact on several sci- 
entific and technological disciplines, for example, in 
the design of acoustic filters or acoustic barriers. 

Some studies have shown that there are three 
important parameters for the creation of spectral gaps 
(Economou & Sigalas, 1994). The first is the density 
ratio $\rho_s/\rho_h$ between the densities of the scattering 
material and the host material. The second is the filling 
factor, $ff = V_s/V$, the volume occupied by 
the scattering material relative to the total volume. 
The last parameter is the topology used to design the 
Sonic Crystal. It was demonstrated that the density 
ratio plays an important role in gap creation: Sonic 
Crystals built with high-density scatterers embedded 
in a low-density host material create spectral gaps 
more readily than other configurations. 
Moreover, the optimum value of the filling factor, $ff$, 
for gap creation lies between 10% and 
50%. In this work we use a Sonic Crystal built from 
aluminium cylinders of 2 cm radius as scatterers 
embedded in air (Network topology). Because 
these structures present a high density ratio, and the 
maximum filling factor is $ff = 0.36$, we ensure that our 
structure is well designed for gap creation. We now 
want to find the filling factor and spatial distribution 
of scatterers that give the best acoustical properties. 
The Genetic Algorithm together with the MST is a good 
procedure to achieve this objective. 



COST FUNCTION AND CHROMOSOME 
DESCRIPTION 

The mechanism used by the Genetic Algorithm in this work 
is the creation of vacancies in the starting Sonic Crystal. 
Fig. 1 shows the starting Sonic Crystal and a Quasi 
Ordered Structure offspring generated by the Genetic 
Algorithm by means of the creation of vacancies. Using 
this procedure we can vary the filling factor and, at the 
same time, evaluate different spatial configurations. 
Each Quasi Ordered Structure is considered an 
individual. The chromosome that represents each Quasi 
Ordered Structure is a real vector with values in the 
range [0, 1]. Each coordinate represents the presence or 
absence of a cylinder at a specific position of the lattice 
(beginning with the cylinder at the top left corner of the 
Sonic Crystal and proceeding by columns to the bottom 
right corner; see the starting Sonic Crystal in Figure 1). 
Values in [0, 0.5) mean there is a vacancy; 
values in [0.5, 1] mean there is a cylinder. In 
this work we are interested in maximizing the sound 
attenuation for a predetermined range of frequencies 
not dependent on the lattice constant, at a point located 
behind the crystal. 
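
The chromosome encoding described above can be sketched as follows. This is a minimal illustration that assumes a rectangular grid with a uniform spacing `lattice_constant`; the function name `decode` and the coordinate convention (origin at the top left cylinder, y decreasing downwards) are hypothetical.

```python
def decode(chromosome, n_rows, n_cols, lattice_constant):
    """Return the (x, y) centres of the cylinders encoded by a chromosome.

    Genes are read column by column, starting at the top left corner.
    A gene in [0, 0.5) is a vacancy; a gene in [0.5, 1] is a cylinder.
    """
    if len(chromosome) != n_rows * n_cols:
        raise ValueError("chromosome length must equal n_rows * n_cols")
    positions = []
    for index, gene in enumerate(chromosome):
        if gene >= 0.5:
            col, row = divmod(index, n_rows)
            positions.append((col * lattice_constant, -row * lattice_constant))
    return positions


# 2 x 2 starting crystal: two genes below 0.5 become vacancies.
cylinders = decode([0.9, 0.2, 0.5, 0.49], n_rows=2, n_cols=2, lattice_constant=1.0)
```

Because every gene is a real number, the same vector can be processed by real-valued genetic operators and still be mapped to a binary cylinder/vacancy layout.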

The acoustic attenuation at a point (x, y) for an 
incident frequency $\nu$ is: 

$$\mathrm{Attenuation\ (dB)} = 20 \log \frac{1}{\left| P_{\mathrm{interfered}}(x, y, X_{cyl}, Y_{cyl}, \nu, r_l) \right|} \qquad (1)$$

where the interfered pressure is determined by the 
MST. This pressure depends on the position and on the 
radius of the scatterers and on the incident frequency. 
In equation (1) we can see that for a point (x, y), a 
value of the incident frequency $\nu$ and a value of the cylinder 
radius $r_l$, it is possible to find a configuration of cylinders 
that minimizes $|P_{\mathrm{interfered}}|$, that is, maximizes the 
acoustic attenuation. 

If we are interested in maximizing the sound attenuation 
in a predetermined range of frequencies at 
a point of coordinates (x, y), we have to define a new 
function to minimize in order to achieve 
the maximum acoustic attenuation. To do that, we define 
our cost function J(x) based on the MST (equation (2)). 
Its first term represents the mean pressure in the range 
of frequencies [$\nu_1$, $\nu_N$], where N is the number of 
frequencies considered in this range. In our case, we use N = 13. 
The second term in equation (2) represents the mean 
deviation. The variable under study is $x = (X_{cyl}, Y_{cyl})$, 
a vector that contains the information about the spatial 
configuration of the Quasi Ordered Structure. 

Figure 1. Starting sonic crystals and a possible quasi ordered structures offspring 
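
A cost function of the kind described above (a mean pressure plus a mean deviation over N sampled frequencies) can be sketched as follows. Several points are assumptions, not the authors' definitions: the mean deviation is taken as the mean absolute deviation, the logarithm in equation (1) is read as base 10, the 13 frequencies are evenly spaced, and `pressure_at` is a hypothetical stand-in for the MST solver.

```python
import math


def attenuation_db(pressure_modulus):
    """Attenuation in dB for a given |P_interfered| > 0 (log taken as base 10)."""
    return 20.0 * math.log10(1.0 / pressure_modulus)


def cost(pressure_at, frequencies):
    """Mean pressure over the sampled frequencies plus its mean absolute deviation."""
    pressures = [pressure_at(f) for f in frequencies]
    n = len(pressures)
    mean = sum(pressures) / n
    mean_deviation = sum(abs(p - mean) for p in pressures) / n
    return mean + mean_deviation


# N = 13 frequencies across a 600 Hz band centred at 1700 Hz (50 Hz spacing assumed).
frequencies = [1400 + 50 * k for k in range(13)]

flat = cost(lambda f: 0.5, frequencies)  # constant pressure: no deviation term
uneven = cost(lambda f: 0.25 if f < 1700 else 0.75, frequencies)
```

Penalizing the deviation as well as the mean favours structures that attenuate evenly over the whole band rather than at a single frequency.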



PARALLEL GENETIC ALGORITHM 

A Genetic Algorithm is an optimization technique that 
searches for the solution of an optimization problem 
by imitating the evolutionary mechanisms of species (Goldberg, 
1989). 



In an optimization problem, there is a function to 
optimize (the cost function) and a region in which to 
search (the search space). Every point of the search space has an 
associated value of the function. The different points 
of the search space are the different individuals of 
a population. As in natural genetics, every 
individual is characterized by a chromosome; 
in the optimization problem, this chromosome is made up 
of the coordinates of the point in the search space. 

The cost function value of an individual is to be 
understood as that individual's level of adaptation 
to the environment. 

The evolutionary mechanism, that is, the set of rules for 
changing populations throughout generations, is carried out 
by Genetic Operators. A general Genetic Algorithm 
evolution mechanism can be described as follows. 

From an initial population (randomly generated), the 
next generation is obtained as follows: 

1. Some individuals are selected for the next generation. 
This selection depends on the adaptation 
level (cost function value): individuals 
with a better J(x) value are more likely to 
be selected. 

2. To explore the search space, information is exchanged 
between individuals by crossover, 
which produces a gene exchange between 
chromosomes. The rate of individuals subject to crossover 
is fixed by the crossover probability $P_c$. 

3. Additional exploration of the search space is 
performed by mutation. Some individuals are subject 
to a random variation in their genes. The rate 
of individuals to be mutated is set by the mutation 
probability $P_m$. 

In this general framework, there are several variations 
in Genetic Algorithm implementation: different gene 
codifications, different genetic operator implementations, 
etc. The implementation for the present work has the 
following characteristics: 

1. Real value codification: each gene has a real 
value. The interpretation of the chromosome has 
been detailed in the previous section. 

2. J(x) is not used directly as the cost function. A linear 
'ranking' operation is performed (Back, 1996). 
The ranking operation prevents premature convergence 
of the algorithm by keeping clearly dominant individuals 
from prevailing too soon. 

3. Selection is made by the operator known as 
Stochastic Universal Sampling (SUS) (Baker, 
1987). 

4. For crossover, the intermediate recombination 
operator is used (Mühlenbein et al., 1993). The offspring 
chromosomes ($x'_1$ and $x'_2$) are obtained through 
the following operation on the parent chromosomes ($x_1$ 
and $x_2$): 

$$x'_1 = \alpha_1 x_1 + (1 - \alpha_1)\, x_2; \qquad x'_2 = \alpha_2 x_2 + (1 - \alpha_2)\, x_1; \qquad \alpha_1, \alpha_2 \in [-d, 1+d]$$

Generating $\alpha_1$ and $\alpha_2$ for each gene 
increases search capabilities but has a higher 
computational cost. The implemented Genetic Algorithm 
has been adjusted as follows: $\alpha_1 = \alpha_2$, 
generated once for each chromosome, d = and $P_c$ = 
0.8. 

5. The mutation operation is done with a probability 
$P_m$ = 0.1 and a normal distribution with standard 
deviation set to 20% of the search space range. 
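
The operators listed above can be sketched as follows. This is an illustrative reading rather than the authors' Matlab code. The selection pressure of 2.0, the clipping of mutated genes to [0, 1], the choice to perturb every gene of a mutated individual, and the value d = 0 used in the example are all assumptions.

```python
import random


def linear_ranking(costs, pressure=2.0):
    """Rank-based fitness: the lowest cost gets `pressure`, the highest 2 - pressure."""
    n = len(costs)
    worst_first = sorted(range(n), key=lambda i: costs[i], reverse=True)
    fitness = [0.0] * n
    for rank, i in enumerate(worst_first):
        fitness[i] = 2.0 - pressure + 2.0 * (pressure - 1.0) * rank / (n - 1)
    return fitness


def sus(fitness, n_select, rng):
    """Stochastic Universal Sampling: equally spaced pointers, one random offset."""
    step = sum(fitness) / n_select
    start = rng.uniform(0.0, step)
    chosen, i, cumulative = [], 0, fitness[0]
    for k in range(n_select):
        pointer = start + k * step
        while cumulative <= pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen


def intermediate_recombination(parent1, parent2, d, rng):
    """One alpha per chromosome (alpha_1 = alpha_2), drawn from [-d, 1 + d]."""
    alpha = rng.uniform(-d, 1.0 + d)
    child1 = [alpha * a + (1.0 - alpha) * b for a, b in zip(parent1, parent2)]
    child2 = [alpha * b + (1.0 - alpha) * a for a, b in zip(parent1, parent2)]
    return child1, child2


def mutate(chromosome, p_m, sigma, rng):
    """With probability p_m, add Gaussian noise to each gene and clip to [0, 1]."""
    if rng.random() >= p_m:
        return list(chromosome)
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in chromosome]


rng = random.Random(0)
fitness = linear_ranking([3.0, 1.0, 2.0])   # costs J(x) of three individuals
parents = sus(fitness, 3, rng)              # indices of the selected individuals
child1, child2 = intermediate_recombination([0.0, 1.0], [1.0, 0.0], 0.0, rng)
mutant = mutate(child1, p_m=0.1, sigma=0.2, rng=rng)
```

With the search space range equal to 1, a standard deviation of 20% of the range corresponds to `sigma=0.2`.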

The high computational cost of the Sonic Crystal optimization 
problem leads to very long execution times; for example, 
a standard execution (population of 360 individuals, 250 
generations) takes around 104 hours. Execution time 
has been improved with a parallel 
implementation of the Genetic Algorithm described. 
Several alternatives for parallelization are possible 
(Cantu-Paz, 1995); the selected one is the Master-Slave 
configuration. In this architecture there is one 
processor working as Master, executing the tasks of the 
Genetic Algorithm (ranking, selection, crossover and 
mutation), while the rest evaluate the fitness function of a 
subpopulation (see Fig. 2). 

The Master sends a subpopulation to each Slave, 
which performs the fitness evaluation and returns the results 
to the Master. The Master works synchronously, waiting 
for the fitness values from all Slaves. After receiving 
all fitness values, the Master performs the evolution 
to produce the next generation (the genetic operators are 
executed) and sends the new population to the Slaves 
for fitness evaluation. This type of implementation is 
the simplest and does not change the Genetic Algorithm 
operators or behaviour. The time reduction is significant, 
since the overall time is approximately divided by the number 
of Slaves. For the problem proposed, with 5 Slaves, 
the total execution time is reduced to 21 hours (104/5 ≈ 20.8). 
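
The synchronous Master-Slave evaluation described above can be sketched as follows. The authors used Matlab tooling; this sketch uses a Python thread pool only to stay self-contained, and the helper names are hypothetical. For a CPU-bound cost function, separate processes or machines would be needed to obtain a real speed-up. The population is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor


def split(population, n_slaves):
    """Split the population into at most n_slaves contiguous subpopulations."""
    size = -(-len(population) // n_slaves)  # ceiling division
    return [population[i:i + size] for i in range(0, len(population), size)]


def evaluate_subpopulation(cost_function, subpopulation):
    """Work done by one Slave: evaluate every individual it received."""
    return [cost_function(individual) for individual in subpopulation]


def evaluate_population(population, cost_function, n_slaves):
    """Master side: send subpopulations out, then wait for all results."""
    chunks = split(population, n_slaves)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        futures = [
            pool.submit(evaluate_subpopulation, cost_function, chunk)
            for chunk in chunks
        ]
        results = [future.result() for future in futures]  # synchronous wait
    return [value for chunk_result in results for value in chunk_result]


# Toy cost function (sum of genes) standing in for the MST-based evaluation.
costs = evaluate_population([[1, 2], [3, 4], [5]], sum, n_slaves=2)
```

With the figures quoted in the text, a population of 360 individuals and 5 Slaves gives 72 individuals per Slave in each generation.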

All developments (Genetic Algorithm and Sonic 
Crystal models) have been made in Matlab®; the 
parallelization has been done using the Matlab Distributed 
Computing Toolbox and the Matlab Distributed 
Computing Engine. 



RESULTS 

In this section we present some of our main results. In 
this work we have analyzed frequency ranges 600 Hz wide 
centered at several frequencies (800, 1100, 1300, 1700, 
2000, 2300, 3090 Hz) above the first Bragg peak. In 
Fig. 3 we present the results corresponding to the 
ranges of frequencies centered at 1700 and 3090 Hz, 
respectively. The left-hand side of Fig. 3 shows 
the arrangements of cylinders of the Quasi Ordered Structures 
generated by the design tool described above. The 
right-hand side shows the acoustic attenuation spectra calculated 
by the MST for the starting Sonic Crystal (continuous 
line) and for the optimized Quasi Ordered Structures 
(dashed line). 

Figure 2. Master/slave architecture for parallel genetic 
algorithm 

The goal of this paper has been the creation of attenuation 
peaks in ranges of frequencies independent of the geometry 
of the starting Sonic Crystal, using rigid scatterers. 
As one can see in Fig. 3, the attenuation peak 
in the spectra of the optimized Quasi Ordered 
Structures appears in the chosen frequency range, and 
this peak is absent in the spectra of the starting Sonic 
Crystal. Notice that the acoustic attenuation level of 
the starting Sonic Crystal in the frequency range is 
much lower than that of the Quasi Ordered Structures. 
In some cases the starting Sonic Crystal even produces 
sound reinforcement. Moreover, the total number of 
cylinders in the optimized Quasi Ordered Structures 
is also lower than in the starting Sonic Crystal. In 
our results the number of cylinders ranges between 
36.7% and 60%. 



Figure 3. Optimized Quasi Ordered Structures and their spectra. On the left-hand side the plots present the arrangements 
of cylinders of the optimized Quasi Ordered Structures. On the right-hand side the plots show the acoustic attenuation 
spectra calculated by the MST for the starting Sonic Crystals (continuous line) and for the optimized Quasi 
Ordered Structures (dashed line). (a) Optimization corresponding to the central frequency of 1700 Hz. (b) Optimization 
corresponding to the central frequency of 3000 Hz. 

These results constitute a useful tool for designing acoustic 
barriers based on Sonic Crystals with no need for 
sophisticated scatterers. The technological advantages 
of using Quasi Ordered Structures with rigid cylinders 
as scatterers are: high resistance for outdoor use, 
constructive simplicity and low cost due to the reduction 
in volume of the crystal. 



CONCLUSION 

This work shows an important and successful application 
of a Genetic Algorithm with a parallel implementation. 
Sonic Crystals open the way for innovative 
applications in noise reduction in several interesting 
areas, such as acoustic noise barriers for traffic or general 
devices for controlling noise. The Genetic Algorithm 
provides adequate optimization for such a 
complex problem, and with the parallel implementation 
execution times are drastically reduced. Moreover, this 
method offers the possibility of testing a wide range of 
Sonic Crystal adjustments in a reasonable time. 

KEY TERMS 

Acoustic Attenuation Spectrum: Representa- 
tion of the attenuation contribution of each acoustic 
frequency to a sound. 

Cost Function: Mathematical function to minimize 
in an optimization problem. 

Evolutionary Mechanism: Mechanism guided 
by biological evolution which represents the rules for 
changing populations throughout generations. 

Filling Factor: Volume fraction occupied by the 
scattering material. Defined as $ff = V_s/V$, where $V$ is 
the total volume of the composite and $V_s$ the volume 
of the scattering material. 



Genetic Algorithm: Global search method based 
on an analogy with natural evolution. 

Quasi Ordered Structure: Given a starting Sonic 
Crystal (see Sonic Crystal), a Quasi Ordered Structure 
is the configuration of 
scatterers resulting from the creation of vacancies in the 
Sonic Crystal. 

Search Space: Set of all possible states that the 
problem we want to solve could ever be in. 

Sonic Crystal: Arrays of scatterers placed periodically 
in space whose physical properties are different 
from those of the surrounding material. 

INTRODUCTION 

Particle Swarm Optimization (PSO) is a simple but 
powerful optimization algorithm, introduced by Ken- 
nedy and Eberhart (Kennedy 1995). Its search for 
function optima is inspired by the behavior of flocks 
of birds looking for food. 

Similarly to birds, a set (swarm) of agents (particles) 
fly over the search space, which is coincident with 
the function domain, looking for the points where the 
function value is maximum (or minimum). In doing so, 
each particle's motion obeys two very simple differ- 
ence equations which describe the particle's position 
and velocity update. 

A particle's motion has a strong random compo- 
nent (exploration) and is mostly independent from the 
others'; in fact, the only piece of information which 
is shared among all members of the swarm, or of a 
large neighborhood of each particle, is the point where 
the best value for the function has been found so far. 
Therefore, the search behavior of the swarm can be 
defined as emergent, since no particle is specifically 
programmed to achieve the final collective behavior 
or to play a specific role within the swarm, but just to 
perform a much simpler local task. 

This chapter introduces the basics of the algorithm 
and describes the main features which make it particu- 
larly efficient in solving a large number of problems, 
with particular regard to image analysis and to the modi- 
fications that must be applied to the basic algorithm, in 
order to exploit its most attractive features in a domain 
which is different from function optimization. 



BACKGROUND 

One of the most attractive features of PSO, apart from 
its effectiveness and robustness with respect to local 
minima, is certainly its simplicity, which makes it 
trivial to implement in any programming language. It 
is also very versatile and applicable to a large number 
of optimization problems, virtually to any problem 
defined within a space for which a metric can be defined. 
However, its behavior, which mainly depends 
on the values of three constants, is still far from being 
fully understood. Extensive work (Engelbrecht 2005, 
Clerc 2006, Poli 2007a) has provided very important 
insights into the properties of the algorithm, through 
studies of the dynamic properties of the swarm, 
even if under some restrictive assumptions. 

The model which underlies PSO describes the motion 
of a swarm of particles within the domain of a function, 
usually termed the fitness function as in evolutionary 
algorithms (Eiben 2004, de Jong 2006), seeking its 
optimum. Such a motion is comparable to the random 
motion of a set of independent non-interacting particles 
within a force field generated by two attractors, one of 
which is specific to each particle. 

The basic PSO equations for a generic particle P 
within the swarm are 

$$X_P(t) = X_P(t-1) + v_P(t) \qquad (1)$$

$$v_P(t) = \omega \, v_P(t-1) + C_1 \cdot \mathrm{rand}() \cdot [X_{pbest} - X_P(t-1)] + C_2 \cdot \mathrm{rand}() \cdot [X_{gbest} - X_P(t-1)] \qquad (2)$$

where $v_P$ is the velocity of particle P, $C_1$ and $C_2$ are two 
positive constants, $\omega$ is the so-called inertia weight, 
$X_P$ is the position of particle P, $X_{pbest}$ is the best-fitness 
point reached by P up to time t-1, $X_{gbest}$ is the 
best-fitness point found by the whole swarm, and rand() is a 
random value taken from a uniform distribution in the 
interval [0,1]. 

In its motion, the swarm explores the space effectively, 
usually converging rapidly to the optimum, 
even if its behavior is strongly dependent on the values 
of $\omega$, $C_1$, and $C_2$, which must therefore be set very 
accurately. 
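
The position and velocity updates of equations (1) and (2) can be sketched as a small minimizer. The parameter values (`omega=0.7`, `c1=c2=1.5`), the swarm size, the per-dimension random draws and the immediate update of the swarm best within an iteration are assumptions made for illustration, since the text does not fix them.

```python
import random


def pso(fitness, bounds, n_particles=20, n_iterations=200,
        omega=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over the box `bounds` with the basic PSO update."""
    rng = random.Random(seed)
    dim = len(bounds)
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in x]
    pbest_val = [fitness(p) for p in x]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Equation (2): inertia plus attraction to pbest and gbest.
                v[i][d] = (omega * v[i][d]
                           + c1 * rng.random() * (pbest[i][d] - x[i][d])
                           + c2 * rng.random() * (gbest[d] - x[i][d]))
                # Equation (1): move the particle.
                x[i][d] += v[i][d]
            value = fitness(x[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = x[i][:], value
    return gbest, gbest_val


def sphere(point):
    """Simple test function with its minimum, 0, at the origin."""
    return sum(c * c for c in point)


best, best_value = pso(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

Changing `omega`, `c1` or `c2` noticeably alters how quickly, and whether, the swarm settles, which is the sensitivity the text refers to.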



PARTICLE SWARM OPTIMIZATION AND 
IMAGE ANALYSIS 

Even if much is still to be learned and discovered about 
PSO from a theoretical point of view (Kennedy 2007), 
as regards applications PSO is gaining more and more 
popularity. As reported in (Poli 2007b), a very recent 
in-depth review of the field, searching the IEEE Xplore 
(http://ieeexplore.ieee.org) technical publication 
database by the keyword PSO returns a list of well 
over 1,000 titles, about one third of which deal 
with theoretical aspects. This means that, to date, an 
incomplete list of PSO application papers adds up to 
little less than 1,000. Amazingly, about two thirds of 
them have been published in the last two years. 

Image analysis is one of the fields to which PSO 
is being applied most frequently. As shown by a large 
number of papers in the image processing and computer 
vision literature, image analysis problems can be often 
reformulated as optimization problems, in which an 
objective function, directly derived from the physical 
features of the problem, is either maximized or minimized. 
In most cases, an optimum set of parameters 
which defines the solution is sought using an optimization 
method. For most real-world problems, usually 
severely affected by noise or by the natural variability 
of the instances of the objects which must be detected, 
this is often inevitable, since methods in which closed- 
form solutions are directly applied are not usually 
robust enough with respect to such features. A large 
number of examples of applications of both traditional 
and evolutionary optimization methods including, as 
such, PSO, are reported in the literature. 

In this section we will not consider direct applica- 
tions of PSO as optimizers for an objective function. We 
will focus our attention on applications in which PSO 
is not only a way to 'tune' a more general algorithm 
by adapting it to the specific features of the problem 
at hand, but is directly part of the solution. 

We will first introduce some general considerations 
on image analysis problems, which define the requirements 
imposed by them. This will allow us to reformulate 
some typical classes of problems encountered in 
image analysis, such as object detection and tracking or 
image segmentation, to include PSO, or some adapted 
version of its basic formulation, in the solution. We 
will then briefly show two examples of applications of 
PSO to segmentation and object detection, in which 
the above-mentioned considerations have been taken 
into account. 

PSO for Object Detection and 
Segmentation 

In considering the application of PSO to image analysis 
tasks, one could assume the swarm to fly over the im- 
age to detect points or regions of interest. Therefore, 
the domain of the fitness function becomes the image 
itself. The fitness value to be assigned to each point can 
then be defined as a local function of image intensity 
in a neighborhood of that point, returning high values 
in points where features similar to the ones which are 
sought are found. 
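
A local fitness function of this kind can be sketched as follows. This is only an illustration under stated assumptions: the image is an 8-bit grayscale array stored as nested lists, the feature sought is simply a target mean intensity, and the name `local_fitness` is hypothetical.

```python
def local_fitness(image, row, col, target, radius=1):
    """Fitness in [0, 1]: high when the mean intensity around (row, col)
    is close to `target` (intensities assumed to lie in 0..255)."""
    n_rows, n_cols = len(image), len(image[0])
    values = [
        image[r][c]
        for r in range(max(0, row - radius), min(n_rows, row + radius + 1))
        for c in range(max(0, col - radius), min(n_cols, col + radius + 1))
    ]
    mean = sum(values) / len(values)
    return 1.0 - abs(mean - target) / 255.0


# Dark image with a bright 2 x 2 object in the lower right corner.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
on_object = local_fitness(image, 3, 3, target=255)
on_background = local_fitness(image, 0, 0, target=255)
```

A particle at integer image coordinates would be scored with such a function, so that particles tend to gather where the sought features occur.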

However, more global information must usually 
be extracted in image analysis tasks. In fact, while 
the basic PSO algorithm aims at finding a single opti- 
mum within the fitness landscape under exploration, 
in several image analysis applications more than one 
optimum (multiple objects) are to be found. This situ- 
ation is typical of object recognition tasks, where the 
goal is to identify all possible occurrences of an object 
of interest characterized by a set of specific features. 
Similarly, in region-based segmentation, several regions 
with homogeneous features must be accurately located. 
Such requirements, encountered also in many other 
application areas, have led to the definition of several 
variants of PSO, in which particles are subdivided into 
a predefined number of sub-swarms, based on some 
clustering technique (Kennedy 2000, Veenhuis 2006, 
Passaro 2008), or through speciation (Chow 2004, Bird 
2006, Leong 2006, Yen 2006), to achieve a dynami- 
cal reconfiguration of the swarm and the detection of 
an arbitrary number of regions of interest within the 
search space. 

The velocity update function must also be modi- 
fied in order to let the swarm spread as uniformly as 
possible over a whole area of interest featuring high 
fitness values. Such modifications may include intro- 
ducing repulsive forces between particles, to prevent 
the whole swarm from converging onto the same 
point, and limiting particles' mobility inside a region 
of interest, to keep the swarm compact and in a stable 
configuration. 

We will first show how these ideas can be applied to
two common image analysis problems: region segmen- 
tation and object detection. Then we will show results 
obtained in two real-world problems: the first one was 
proposed as topic for a competition at GECCO 2006, 
and consists of detecting and segmenting as precisely 
as possible large pieces of pasta imaged over a set of 
noisy backgrounds over which also tiny pasta pieces 
are scattered, which must be ignored (see Figure 1). 
The second problem is a sub-task of plate recognition, 
in which the region occupied by a license plate is to 
be located within an image (see Figure 2). Even if the 
two tasks are semantically different, they share some 
common lower-level features, which allow the same 
modifications to basic PSO to be used in both cases, 
with a two-step approach. In the basic step, the im- 
age is explored, to focus on regions where interesting 
features are detected, before a refinement occurs in the 
subsequent step. 

Modified PSO Equations for Image 
Analysis 

In basic PSO, the fitness function is evaluated point by 
point. In analyzing images using PSO, the search space 
being the image, using such a local fitness function 
would make the search extremely sensitive to noise 
and possibly misleading. If fitness evaluation were 
just pixel-based, a meaningless isolated pixel yielding 
high fitness as a result of noise could attract and trap 
the whole swarm into its neighborhood. 

To allow PSO to produce a uniform distribution of 
particles over each region of interest, the basic PSO 
algorithm can be modified in two directions: 

(i) forcing division of the swarm into sub-swarms, able to converge
towards different regions of interest;

(ii) favoring dispersion of the particles all over the regions of
interest.

Using the so-called K-means PSO (Passaro 2008), in 
which clusters of particles form based on their proximity 
within the search space, the former goal can be achieved. 
To achieve the latter, both the fitness function and the 
velocity-update equation must be modified. 

As concerns the fitness function, a local fitness term, 
which evaluates how "interesting" the neighborhood of 
one pixel is, can be added to a punctual fitness function
term, whose value is computed based only on information
carried by the pixel under consideration:

$$\mathit{fitness}(x,y) = \mathit{punctual\_fitness}(x,y) + \mathit{local\_fitness}(x,y)$$

The local_fitness term depends on the number of
particles with high punctual fitness which are neighbors
of the pixel located in (x,y), and is given by:

$$\mathit{local\_fitness} = K_0 \cdot \mathit{number\_of\_neighbors}$$

where number_of_neighbors is the number of particles
within a pre-defined neighborhood of (x,y) and K_0 is
a constant.
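
As an illustration, the combined fitness can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the values of `K0` and `NEIGHBORHOOD`, the Euclidean neighborhood, and the reading of "high punctual fitness" as "greater than zero" are not given in the text, and `punctual_fitness` is a hypothetical callback supplied by the application.

```python
import math

K0 = 2.0            # assumed weight of the local term (K_0 in the text)
NEIGHBORHOOD = 5.0  # assumed neighborhood radius, in pixels


def local_fitness(x, y, particles, punctual_fitness, k0=K0, radius=NEIGHBORHOOD):
    """Return K_0 times the number of well-placed particles near (x, y).

    `particles` is a list of (x, y) positions of the other particles.
    A particle counts as a neighbor if it lies within `radius` of (x, y)
    and its own punctual fitness is positive (an assumed threshold).
    """
    neighbors = sum(
        1
        for (px, py) in particles
        if math.hypot(px - x, py - y) <= radius and punctual_fitness(px, py) > 0
    )
    return k0 * neighbors


def fitness(x, y, particles, punctual_fitness):
    """Punctual term plus local term, as in the equation above."""
    return punctual_fitness(x, y) + local_fitness(x, y, particles, punctual_fitness)
```

With this form, an isolated noisy pixel only earns its punctual term, while a pixel inside a populated region is additionally rewarded for every neighboring particle.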

This way, the particles are attracted towards the 
areas where more pixels meet the punctual require- 
ment, keeping away from isolated noisy pixels. This 
modification enhances the density of particles in the 
most interesting regions. To cover the whole extension 
of these regions, also the basic PSO velocity-update 
equation needs to be modified from (1) to: 

$$v_P^*(t) = v_P(t) + \mathit{repulsion}_P$$

The repulsion term can be expressed as

$$|\mathit{repulsion}(i,j)| = \mathit{REPULSIONRANGE} - |X_i - X_j|$$

where i and j are the particle indices and REPULSIONRANGE
is the maximum distance within
which the particles interact. Values of repulsion(i,j)
are set to 0 for distances between i and j larger than
REPULSIONRANGE. The global repulsion term
repulsion_P for particle P is the average of all repulsion
terms acting on it

$$\mathit{repulsion}_P = \frac{1}{n} \sum_{j=1}^{N} \mathit{repulsion}(P,j)$$

N being the number of particles in the swarm and n
the number of particles within the neighborhood of P
defined by REPULSIONRANGE.
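
The repulsion term can be sketched as follows. The text only specifies the magnitude of each pairwise term, so the direction used here (pointing from particle j towards particle P, i.e. pushing P away), the value of `REPULSION_RANGE`, and the handling of coincident particles are assumptions made for illustration.

```python
import math

REPULSION_RANGE = 10.0  # assumed value of REPULSIONRANGE, in pixels


def repulsion(p, positions, repulsion_range=REPULSION_RANGE):
    """Average repulsion vector acting on particle `p`.

    Each particle j within `repulsion_range` contributes a vector of
    magnitude (repulsion_range - distance), directed away from j
    (assumed direction). The sum is divided by n, the number of
    particles found within range.
    """
    px, py = positions[p]
    sum_x = sum_y = 0.0
    n = 0
    for j, (qx, qy) in enumerate(positions):
        if j == p:
            continue
        dx, dy = px - qx, py - qy
        dist = math.hypot(dx, dy)
        if dist > repulsion_range:
            continue  # repulsion is 0 beyond the interaction range
        if dist == 0.0:
            continue  # direction undefined for coincident particles (assumption: skip)
        magnitude = repulsion_range - dist
        sum_x += magnitude * dx / dist
        sum_y += magnitude * dy / dist
        n += 1
    if n == 0:
        return (0.0, 0.0)
    return (sum_x / n, sum_y / n)


def update_velocity(velocity, rep):
    """Modified update: add the repulsion term to the basic PSO velocity."""
    return (velocity[0] + rep[0], velocity[1] + rep[1])
```

Here `velocity` stands for the result of the basic PSO update (equation (1) in the chapter), which is computed elsewhere.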

Finally, to produce more stable sub-swarms, a 
particle with high punctual and local fitness is allowed 
to stand still with a probability which is linearly de- 
pendent on the particle density in its neighborhood, 
estimated as 

$$P\{v_P(t) = 0\} = \frac{n}{N}$$
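
A minimal sketch of this stand-still rule is given below. The function name `maybe_freeze` and the boolean flag `has_high_fitness` are illustrative; how "high punctual and local fitness" is decided is left to the caller, since the text gives no threshold.

```python
import random


def maybe_freeze(velocity, n_neighbors, swarm_size, has_high_fitness, rng=random):
    """Zero the velocity with probability n/N for a well-placed particle.

    `n_neighbors` is n (particles in the particle's neighborhood) and
    `swarm_size` is N (particles in the whole swarm).
    """
    if has_high_fitness and rng.random() < n_neighbors / swarm_size:
        return (0.0, 0.0)
    return velocity
```

The denser the neighborhood, the more likely a particle is to stay where it is, which keeps established sub-swarms from drifting apart.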

REAL-WORLD EXAMPLES

Pasta Segmentation

In a color-based region segmentation problem, the 
fitness function measures the similarity of the pixel 
color to the expected color of the objects of interest. 
For pasta, it can be expressed as: 

if (|r(x,y) - g(x,y)| < 30 and r(x,y) - b(x,y) > 60)
then punctual_fitness = 30 - |r(x,y) - g(x,y)|
else punctual_fitness = 0

where r(x,y), g(x,y) and b(x,y) are the red, green, and blue 
values, respectively, of the pixel located in (x,y). 
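
The rule above translates directly into code. The sketch below assumes 8-bit RGB values passed as plain integers; the function name is illustrative.

```python
def pasta_punctual_fitness(r, g, b):
    """Punctual fitness of a pixel for the pasta color model.

    A pixel scores only if red and green are close (difference below 30)
    and red exceeds blue by more than 60; the score grows as red and
    green get closer.
    """
    if abs(r - g) < 30 and r - b > 60:
        return 30 - abs(r - g)
    return 0
```

For instance, a yellowish pixel such as (200, 190, 100) scores 20, while a gray pixel such as (200, 200, 200) scores 0.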

Since the goal is to obtain an accurate segmentation, 
up to pixel precision, and given the large number of 
pixels belonging to the objects of interest, PSO cannot 
obviously produce the final solution directly. Instead, it 
can be used in a pre-processing stage preceding a final 
thresholding stage which produces the actual output. 



Following the PSO rules modified as previously 
described, the particles will tend to move towards larger 
pasta regions and stay around there. If one performs a 
number of PSO runs, assigning to each pixel a score 
which is directly proportional to the number of times 
a particle walks through it, the probability of belong- 
ing to a large pasta piece can be estimated for each 
pixel. To better estimate such a probability, avoiding 
bias deriving from the initial particle locations, each 
run should start with a different random initialization 
of the whole swarm. Image regions which eventually 
have high density of high-score pixels correspond to 
pieces of pasta. The final result of this stage, that we 
termed global search, is a preliminary segmentation by 
which the areas where large pieces of pasta are most 
likely to be found are grossly detected. To refine the 
segmentation, an algorithm which is very similar to 
the one used in the previous stage is applied; this time 
the domain where the swarm can move is limited to 
smaller regions surrounding pixel clusters whose score 
was above a threshold in the last phase of the global 
search. The final segmentation is eventually obtained
by thresholding the locally updated scores to obtain
a binary image. Figure 1 shows the results obtained
on one of the images from the image set used in the
competition.

Figure 1. Pasta segmentation. Top: Original image (left) and results of global search (right). Bottom: Results
of local search (left) and final segmentation (right).
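
The score accumulation and final thresholding described in this section can be sketched as follows. The PSO run itself is abstracted as a hypothetical callback `run_swarm`, which takes the initial particle positions and returns every pixel position visited during the run; the number of runs, the swarm size, and the threshold are assumed parameters.

```python
import random


def accumulate_scores(width, height, run_swarm, n_runs=10, n_particles=20, seed=0):
    """Count, over several runs, how often each pixel is visited by a particle.

    Each run starts from a fresh random initialization of the whole swarm,
    to avoid bias from the initial particle locations.
    """
    rng = random.Random(seed)
    scores = [[0] * width for _ in range(height)]
    for _ in range(n_runs):
        start = [(rng.randrange(width), rng.randrange(height))
                 for _ in range(n_particles)]
        for (x, y) in run_swarm(start):
            if 0 <= x < width and 0 <= y < height:
                scores[y][x] += 1
    return scores


def threshold(scores, min_score):
    """Turn the score map into a binary image (1 = likely object pixel)."""
    return [[1 if s >= min_score else 0 for s in row] for row in scores]
```

The same two functions can serve both stages: the global search over the whole image, and the local search restricted to the regions whose score exceeded the threshold.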

Plate Detection 

In the license plate detection problem, the low-level 
feature on which detection is based is the density of 
high-level values of the horizontal gradient, due to the 
presence, in the plate, of symbols or symbol elements, 
which can be encountered when the image is scanned 
row-wise. Since a color image is available, we can use
both color and gradient information, by first considering 
only those pixels which satisfy the typical features of 
plates (black characters on a white background for the 
most recent European standards), and then considering 
gradient information. 

The punctual fitness of a pixel is defined as: 

if (|r(x,y) - g(x,y)| > 30 or |r(x,y) - b(x,y)| > 30 or
    |g(x,y) - b(x,y)| > 30)
    punctual_fitness = 0;
else {
    right_gradient = |intensity(x,y) - intensity(x+1,y)|;
    left_gradient = |intensity(x,y) - intensity(x-1,y)|;
    if (right_gradient > left_gradient)
        punctual_fitness = right_gradient;
    else
        punctual_fitness = left_gradient;
}
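
A runnable Python version of this rule might look as follows. The image representation (a list of rows of (r, g, b) tuples), the definition of intensity as the mean of the three channels, and the clamping of neighbors at the image border are assumptions; the text does not specify them.

```python
def intensity(pixel):
    """Gray-level intensity, assumed here to be the mean of the RGB channels."""
    r, g, b = pixel
    return (r + g + b) / 3.0


def plate_punctual_fitness(image, x, y):
    """Punctual fitness for plate detection at pixel (x, y).

    Colored pixels (channels differing by more than 30) score 0.
    Near-gray pixels score the larger of the left and right
    horizontal intensity differences.
    """
    r, g, b = image[y][x]
    if abs(r - g) > 30 or abs(r - b) > 30 or abs(g - b) > 30:
        return 0.0
    row = image[y]
    here = intensity(row[x])
    # Neighbors are clamped at the image border (assumption).
    right_gradient = abs(here - intensity(row[min(x + 1, len(row) - 1)]))
    left_gradient = abs(here - intensity(row[max(x - 1, 0)]))
    return max(right_gradient, left_gradient)
```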

The basic PSO step is virtually the same as in pasta
segmentation. However, a different overall algorithm is
used. It is likewise divided into a global and a local
exploration stage: the most promising areas are located
first, and the exploration is then refined to determine
whether they actually include a plate.

In the global search, the swarm flies over the image 
until at least one sub-swarm of size greater than a pre- 
fixed threshold (50% of the whole swarm) has formed 
or a given number of iterations has been reached. 
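
The stopping criterion of the global search can be sketched as below. For simplicity, sub-swarms are identified here by chaining particles closer than a threshold `link_distance`, following the Sub-Swarm definition in the Key Terms, rather than by the K-means clustering mentioned earlier; `link_distance` and the function names are assumptions, and the swarm is assumed to be non-empty.

```python
import math


def sub_swarms(positions, link_distance):
    """Group particle indices into sub-swarms.

    Two particles belong to the same sub-swarm if they are connected by a
    chain of particles, each closer than `link_distance` to the next.
    """
    unvisited = set(range(len(positions)))
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(positions[i], positions[j]) < link_distance]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups


def global_search_done(positions, iteration, max_iterations,
                       link_distance, min_fraction=0.5):
    """Stop when a sub-swarm exceeds 50% of the swarm or iterations run out."""
    largest = max(len(g) for g in sub_swarms(positions, link_distance))
    return largest > min_fraction * len(positions) or iteration >= max_iterations
```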

Then a local search is performed within regions 
where sub-swarms of sufficient dimension have formed, 
starting from the region occupied by the largest swarm; 
during this second stage: (i) the search is restricted to 
smaller image regions of interest enclosing the sub- 
swarms, (ii) the search is re-initialized activating a new 
full-size swarm in the region of interest, and (iii) the 
search is run for a pre-set number of iterations. At the 
end of this stage, a new bounding box, containing all 
particles, is computed. If this box has an aspect ratio 
compatible with a license plate, the plate is considered 
to have been found. Otherwise, the swarm is expanded 
along its two dimensions, by forcing low-fitness par- 
ticles to move only horizontally or vertically, in order 
to reach higher-fitness points and, possibly, to let the 
bounding box reach the expected aspect ratio; in case 
of failure, the current region is discarded and the next 
area detected during the global search is explored. 
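
The acceptance test at the end of the local search can be sketched as follows. The expected aspect ratio and its tolerance are illustrative placeholders, since the text does not give the values used; they would depend on the plate standard being targeted.

```python
def bounding_box(particles):
    """Smallest axis-aligned box (x0, y0, x1, y1) containing all particles."""
    xs = [x for x, _ in particles]
    ys = [y for _, y in particles]
    return min(xs), min(ys), max(xs), max(ys)


def looks_like_plate(particles, expected_ratio=4.5, tolerance=1.0):
    """Check whether the swarm's bounding box has a plate-like aspect ratio.

    `expected_ratio` (width / height) and `tolerance` are assumed values.
    """
    x0, y0, x1, y1 = bounding_box(particles)
    width = x1 - x0 + 1
    height = y1 - y0 + 1
    return abs(width / height - expected_ratio) <= tolerance
```

If the test fails, the procedure described above expands the swarm horizontally or vertically and checks again before discarding the region.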

Figure 2 shows the original image, along with the 
results of the global and local search, and the final 
result of the PSO-based algorithm. 

The algorithm is computationally very efficient.
Detecting the plate requires fewer function evaluations
than computing the whole gradient image, which would be
just the very first step in any 'traditional' computer
vision approach. By re-initializing the swarm, in each
frame, in a neighborhood of the region where the plate
was detected in the previous frame, real-time performance
can be achieved in tracking the plate in videos acquired
at 30 frames per second using a standard PC.



Figure 2. License-plate detection; Original image (left) and results of the detection (right) 

The same cannot be said for the pasta segmentation
algorithm if high segmentation accuracy is required 
(about 30 seconds were needed to produce the seg- 
mentation in Figure 1 on a 2.8 GHz PC). However, 
even in that case, if the pieces are just to be grossly 
located, just a few runs of the algorithm are enough to 
achieve the goal. 



FUTURE TRENDS 

Research on PSO and its applications to a wide variety
of fields is booming. Image analysis
is no exception: according to the INSPEC bibliogra- 
phy database, the number of papers which describe 
applications of PSO to such a field has increased by 
almost 50% in the last six months. Results are already 
very encouraging and suggest that much more is to be 
expected in the near future. 



CONCLUSION 

PSO is a versatile and effective optimization technique
whose features can be easily adapted to a vast variety
of problems, in which it can act not only as a
"plain" optimizer but also as a more general, flexible search
paradigm. The applications described in this chapter
have confirmed this, introducing a general framework 
which can be applied, with few changes, to many other 
object detection and recognition problems, as well as 
to other lower-level tasks in computer vision, such as 
image segmentation. 

KEY TERMS

Evolutionary Computation: Collection of tech- 
niques, basically aimed at function optimization but 
applicable to a huge variety of problems, by which 
the optimum of a function (fitness function) is sought 
through iterative refinements, according to rules inspired 
by the laws of natural evolution. 

Fitness Function: In evolutionary computation, the 
objective function which is to be optimized. 

Image Analysis: Collection of techniques by which
high-level information content is extracted from a 
digital image using image processing and computer 
vision techniques. 



Particle Swarm Optimization: Optimization tech- 
nique inspired by the exploratory behavior of animal 
swarms/flocks/herds in search of food. 

Segmentation: In computer vision, a process by 
which an image is subdivided into regions having 
homogeneous visual features. 

Sub-Swarm: In particle swarm optimization, subset 
of a swarm, within which the distance between any par- 
ticle and the closest one is below a pre-set threshold. 

Swarm Intelligence: Collection of techniques, usually
inspired by nature, in which high-level intelligent
behaviors emerge as a result of the interaction among 
a high number of agents which, individually, perform 
apparently trivial, low-level tasks. 

INTRODUCTION



Decision support systems (DSS) are computerized
systems that assist humans in making decisions. Early
versions were designed for executives, but over time 
DSSs were designed for workers at any level in the 
organization (Keen & Morton, 1978; Rockart, 1979). 
Due to increasing costs in providing benefits and ser- 
vices, organizations are forcing workers and consumers 
to take increasing responsibility for insurance, health 
care, and financial planning decisions. Extreme events, 
such as terrorism, pandemics, and natural disasters will 
swamp the capacity of governmental agencies to serve 
their citizenry. Individuals in affected communities must 
turn to local agencies or ad hoc groups for assistance. 
Personal decision support systems (PDSS), consisting 
of databases, model-based expertise, and intelligent in- 
terfaces, along with wireless communications, Internet 
resources, and personal computing, provide sufficient 
resources to assist informed individuals and groups in 
solving problems. 

This article reviews the typical components of 
a DSS and the different types of systems that have 
evolved. The article poses three types of problems 
facing individuals, including routine problem solving, 
immediate survival needs, and long-term evolutionary 
growth. Personal decision support issues of acquiring 
information, processing information, and dissemination 
are outlined. Future trends and research opportunities 
are discussed. 



BACKGROUND 



DSS aid human thinking by accessing information,
integrating this information in some way, structuring
decisions, and optimizing decisions (Sprague &
Carlson, 1982). These benefits are obtained using
three major system features of a DSS, which include
a database, which records knowledge; a model base,
which models or represents expertise and problem-solving;
and an interface, which provides a user with
a means to interact with the other system components
(Sprague, 1980).

Powers (2007) characterized DSS in terms of how
the system provides assistance. Model-driven DSSs
for individuals include spreadsheets. Data-driven DSS,
such as Executive Information Systems (EIS), are
used by organizations and institutions for strategic and
tactical decisions. Communication-driven DSSs can be
seen in groupware, video conferencing, and bulletin
boards. A document-driven DSS, such as provided
by search engines, facilitates document retrieval. A
knowledge-driven DSS would be used to solve specialized
problems and consist of knowledge represented
in terms of rules, procedures, hierarchical frames, or
networks. Most recently, web-based DSSs are found
in browser searching, intranets, and portal use.

Decision support systems are based on the notion
that human reasoning is a rational process, although
this is not always the case particularly when humans are
faced with complexity and stress (Druzdzel & Flynn,
2000). Experts' decisions in real settings have been
shown to demonstrate less quality than linear models
(Hastie & Dawes, 2001). Judgmental heuristics reduce
cognitive load but decrease the quality of decisions.
Characteristics of the DSS components vary in a PDSS
in order to compensate for the type of problems faced
by individuals. In general for a PDSS the databases
are customized, the model bases are organized along
preferential outcomes (e.g., more or less, quantitative),
decisions (e.g., lists and value ordering), and uncertainty
(specific actions resulting in gain considering
constraints and price).



PERSONALIZED DECISION SUPPORT 

This article summarizes three problem types facing 
individuals, including routine problem solving, extreme 
survival needs, and long-term change. The article 
outlines system architecture requirements in terms of 
acquiring and processing of information, interacting 
with this information, and the dissemination of information
and recommendations.

PDSS Problem Types 

The consumer of the 21st century faces numerous routine
problems, such as career choice, self-improvement,
volunteerism, financial planning, retirement, insurance,
consumer purchases, choice of health care physician, and
personal health. PDSS applications can be seen in health
care ranging from point-of-care use of personal digital
assistants (PDAs) to helping patients make decisions on
health care (Crawford, 1997; Pierce, 1998). Routine
problems consist of complex options with short-term
benefits and unknown long-term implications. However,
individuals tend to discount the need to make decisions
and/or believe that institutions and governmental
agencies will impose decisions on them.

A second problem type can be classified as survival. 
Three examples include natural disasters, terrorism, 
and pandemics. Natural disasters, such as hurricanes, 
tornadoes, floods, drought, volcanic eruptions, earth- 
quakes, and meteorite impacts, can also include gradual 
changes brought about by global warming. Radical 
changes could involve results of nuclear winter, the 
shift of the moon's orbit, or pole shifting of the earth's 
magnetic field. PDSS applications involve disaster 
management and attempts to connect satellite map- 
ping technology with government agencies (Hegde, 
Srivastava & Manikiam, 2004). Terrorism provides 
a more recent survival problem brought about by 
racial cleansing, violence between religious groups, 
undermining of governments through corruption and 
assassination, chemical warfare, and destruction of 
neighbourhoods and infrastructure. PDSS applications 
for this problem type have emerged for counter-terrorism
applications (Alward, 2004). Pandemics have always
occurred throughout human history but have taken on
serious implications given technological developments
in genetics. Survival problems cannot be predicted or
fully characterized, and their impact overwhelms the
capacity of a DSS. The value of a PDSS is its proactive
potential by identifying national, state, and local
resources, recommending action, and triggering the 
development of institutional support and awareness 
that did not exist before. 

A third problem type is evolutionary or long-term 
change brought about by a realization that existing 
decision paths may lead to significant consequences.
Awareness of change problems signals a need for people
to make long-term proactive decisions in light of multiple
paths or scenarios (Schellnhuber, Crutzen, Clark,
Claussen, & Held, 2004). Proactive decision-making
enables humans to become aware of and address serious 
consequences of prior decisions by individuals, groups, 
institutions, and governments, as well as the impact of 
technological innovations. However, change problems 
tend to be low priority, require significant resources, 
and they resist consensus due to their apparent intrac- 
tability. Simulations and virtual environments may be 
needed to help citizens interact with potential paths 
(Stanney, 2002). 

Personal Decision Support Architecture 

Early views defined a personal DSS as one which 
focused on a discrete task or decision (Rockart & 
Bullen, 1986). Examples frequently involved group 
support, such as Morton's (1971) DSS which involved 
both marketing and production planning. Keen and 
Hackathorn (1986) identified three main parts of a 
personal DSS to include the interface between machine 
and user, relevant operators (i.e., action verbs, such as 
"help"), and a database. Development of a personal 
DSS requires attention to dialogue, refinement of the 
vocabulary-operators, and evolution of the data struc- 
ture of the database. 

PDSS, as described here, would involve both in- 
dividual and social needs, and thus would be hybrid 
versions of several DSS types (Powers, 2007). A PDSS
would include mathematical and statistical tools (model- 
driven) to calculate and make inferences on numeri- 
cal data. They would retrieve forms and information 
(document-driven) to support decision-making. They 
would use information and data as input to address 
specialized needs (knowledge-driven), such as health 
care, insurance, career options, and travel planning, 
among others. The PDSS would consist of both localized 
(personal computer system) resources and distributed 
(web-driven) sources where information and computing 
may be conducted at other sites. 

The major systems of a PDSS include databases, 
reasoning models, interface, and communication op- 
tions. Each of these four systems can be equated to 
acquiring information, processing this information in 
ways that make it amenable to specialized decision 
modules (e.g., insurance, health-care, travel planning), 
interacting with the information visually, and communicating
or sharing decisions or information with
others (see Figure 1).

Figure 1. PDSS system features

Acquiring Information 

Databases provide a repository for information within 
any DSS. A personalized version of a DSS would com- 
bine local databases, which are developed individually 
for specific needs, with remote integrated databases. 
Initially, these databases would have inconsistent
structures; in the long term, some standardization
of database structure would be required to develop a
personalized integrated database. In addition, ad hoc
browsing tends to characterize individual information 
needs with little regard for organizing this information 
over the long term. 

Processing Information 

One of the powerful features of a DSS is its model 
base. Modeling allows knowledge to be applied across 
problems and facilitates analysis, explanations, and 
advocacy (Druzdzel & Flynn, 2000). A model base 
would include one or more models or representations 
of expertise ranging from highly specialized (e.g., 
resale home value) to more general (e.g., model of a 
learner). Model bases might become object-oriented 
and incorporated into a PDSS like a software plug-in 
as needed. Generic versions of a PDSS might include 
a range of common model components for financial, 
employment, travel, and health needs and provide 
simulations to help a user see the implications of 
decisions. Integrating model bases, as with databases, 
will require some standardization of model structure 
along some common categories. Personal patterns of
reasoning may also be archived to provide speed and
options for new problems.

The most important and the most challenging to 
archive and characterize would be context information, 
an example of unstructured data. A top-down version 
of a system that would increase the structure of the 
context-data would be to categorize specific routine 
contexts, such as financial, health, and college selection.
Extreme survival categories could include natural dis- 
aster and other types of emergencies, crime and terror- 
ism, and pandemics. A bottom-up version of a context 
representation system would be to identify patterns of 
information using semantic webs (Hadrich & Priebe,
2005), and over time a context-map would be built to
characterize particular categories of context. 

Interacting with Information 

Human dialogue with databases and model bases has 
used a visual interface, which has typically featured 
a desktop metaphor. To date users have relied on the 
metaphor presented to them. A customized interface 
could still use a desktop metaphor to organize individual 
problem needs. Other options could be available and
custom-developed, relying either on an inventory
of choices or on some metaphor of the user's choice.
Specialized interfaces could be used depending on the 
problem type (routine, survival, change) to facilitate 
decision-making. Survival needs require that a user not 
be presented with too many choices, but rather accurate 
options to meet an immediate need. These just-in-time 
visual views present just the information and advice as 
needed (Lieberman, 2002). 

Disseminating Information

The dissemination function, involving the communica- 
tion and sharing of information and decision options 
with others, represents a critical system component of a 
PDSS. While routine problems relate to an individual, 
problems of survival and change require collaboration. 
Multi-point sharing of information facilitates decision-making.
As wireless becomes a standard feature
in many technological devices, dissemination and
communication become available to more people. Wireless
may become an antiquated term as it becomes trans- 
parent and common. Information can be posted for 
everyone or particular audiences and can be edited or 
linked to other sources. Much of this information and 
collaboration may become routed through personal 
portals which structure the information for other users 
(Shambaugh, 2007). 



FUTURE TRENDS

Future Design Metaphor

One feature of a DSS includes the retrieval of informa- 
tion so that decisions can be made based on this infor- 
mation and other sources. Decisions are then based on 
existing data or data from the past. Goals of profit and 
cost reduction rely on what-if scenarios and simula- 
tions based on assumptions. The focus of individuals, 
however, is rarely on the past but on the present and the 
near future. Although the future cannot be predicted, 
trends based on past and current data provide a picture 
of where we are in our business, career, or personal life. 
Making decisions on what we want our life to be for 
ourselves, our families, and our communities, and even
"what business are we in?" necessitates a different view,
that of future design, which is not about predicting
the future but rather working towards a future based 
on our intent to continually cycle through rethinking, 
designing, and improving. 

Government and Community 

Responsibility for daily life has always been the do- 
main of the individual and the family. However, the 
historical reality is that daily life has been continuously 
constrained by institutions and governments, and by 
the unseen consequences of technological innovation. 



Much of daily life requires navigating these constraints 
and impacts. However, these tensions can be amelio- 
rated with a move towards taking advantage of personal 
insight and motivation, a belief in taking responsibility 
for our lives and our communities, and designing our 
technological tools for where we want to go, all features 
of a future design stance. 

Research Opportunities 

One avenue for research is to add more structure to 
unstructured data, including information from remote 
sources, locally-developed databases, and context 
information. How might these different sources of 
information be integrated and generalized for use by 
others? How might context be characterized in terms 
of re-usable objects? 

Modeling expertise has been a long-standing chal- 
lenge in AI. Modeling decisions for routine problems, 
those that can be characterized by rules or procedures, 
and use static domain models, have been the most 
successful. But a bigger question beyond What do we 
know? becomes How does the model update itself? 

Decision-making in survival situations will require 
customized model bases developed specifically for 
categories of extreme survival. In these types of situations,
problems are unique, and tools will need to be
developed to see how users' beliefs about uncertainty and
preferences on different outcomes can be visualized 
(Howard & Matheson, 1984). Evolutionary decision- 
making, decisions that impact long-term change, will 
require that model bases evolve from new data. Con- 
tinually re-defining expertise provides opportunities to 
analyze what people do on a daily basis (Gigerenzer, 
Todd, & ABC Research Group, 1999) and how daily, 
routine expertise becomes critical for individuals and 
groups of individuals. 

Furthermore, inquiry could be conducted on how 
informed citizens create new societies, epistemic 
cultures that are themselves creating new bodies of 
knowledge (Cetina, 1999). These new societies could 
be a block of families, an online community of individuals,
or physical neighbourhoods, cities, countries,
or geographic regions. The idea of a PDSS is not
limited to an individual but extends to personalizing human
life through tools that help individuals, neighborhoods, and
cities grow (Longworth, 2006). The conundrum for
researchers and designers is realizing that, in designing
systems that are less logical and more approximations
of the messiness of real life, they may be helping humans
come to understand what it means to be human
(Johnson, 2005).

Another research avenue would study how users 
might determine the user interface, based on personal 
metaphors or specific needs, rather than reacting to a 
standardized metaphor. The study of mental models and 
how humans project meaning from their experience to a 
new experience might provide a new means to think and 
act beyond old rules (Fauconnier & Turner, 2002). Not 
all problems and situations require the same interface, 
particularly as the severity of the problem may require 
a design focused on immediacy and limited choice. 
Continued collaboration between AI researchers who 
study representation and reasoning, and those in Hu- 
man-Computer Interaction (HCI), in which interaction 
is addressed, may lead to intelligent interfaces with 
flexible planning, incorporation of human constraint 
issues (e.g., time, patience, attention, motivation, cog- 
nitive demands), and relevance of context (Lieberman 
& Selker, 2000). Such intelligent interfaces may
appear first in wireless devices, such as PDAs.



CONCLUSION 

Specific skills and responsibilities for living in the 21st
century have been pushed down to consumers by orga- 
nizations and governmental agencies. Individuals now 
require more time to make important decisions related 
to their personal and professional lives. These personal 
decisions add to the growing complexity of human 
living and require time and resources. Technological 
developments in computing, networking, and commu- 
nication provide humans with the capacity for making 
informed decisions. With the prospect of survival threats 
and long-term change, informed groups of citizens can 
initiate proactive priorities in their national, state, and 
local governments to address these potential problems. 
A PDSS with features that enable communication and
collaboration creates a tool to help individuals take 
responsibility for decision-making rather than relying 
on government and institutions. Personalized decision 
support, characterized by access to Internet resources, 
integrated knowledge bases, and personal computing 
and wireless communication, can provide humans with 
information and recommendations to solve problems, 
address emergencies, and enhance life. 



KEY TERMS

Change Problems: A type of problem with long- 
term consequences. 

Decision Support System (DSS): A computerized 
system which assists humans to make decisions. 

Epistemic Cultures: Bodies of knowledge devel- 
oped by individuals with a common need. 

Executive Information System (EIS): A decision 
support system that directly supports management 
decisions. 

Future Design: A means of looking and working 
towards the future rather than predicting the future. 

Personal Decision Support System (PDSS): A
computerized decision support system which acquires
information and organizes the information so that
models of reasoning can produce recommendations
for further information, resources, or action. Another
feature of PDSS is its capacity to openly communicate
organized information or decisions to others.

Personal Portals: A computerized site which provides
a gateway to other sites of individual interest.

Routine Problems: A type of problem faced by 
individuals involving complexity of choices as well 
as short-term and long-term implications. 

Survival Problems: A type of problem characterized
by extreme impacts on individuals and communities.



INTRODUCTION

Agents and Multi-Agent Systems (MAS) have become
increasingly relevant for developing distributed and
dynamic intelligent environments. The ability of software
agents to act somewhat autonomously links them with
living animals and humans, so they seem appropriate
for discussion under nature-inspired computing (Marrow,
2000). This paper presents AGALZ (Autonomous
aGent for monitoring ALZheimer patients) and explains
how this deliberative planning agent has been designed
and implemented. A case study is then presented, with
AGALZ working with complementary agents within
a prototype environment-aware multi-agent system
(ALZ-MAS: ALZheimer Multi-Agent System) (Bajo,
Tapia, De Luis, Rodriguez & Corchado, 2007). The
elderly health care problem is studied, and the possibilities
of Radio Frequency Identification (RFID) (Sokymat,
2006) as a technology for constructing an intelligent
environment and ascertaining patient location, in order
to generate plans and maximize safety, are examined.

This paper focuses on the development of nature-inspired
deliberative agents using a Case-Based Reasoning
(CBR) (Aamodt & Plaza, 1994) architecture,
as a way to implement sensitive and adaptive systems
to improve assistance and health care support for
the elderly and people with disabilities, in particular
those with Alzheimer's disease. Agents in this context
must be able to respond to events, take the initiative
according to their goals, communicate with other agents,
interact with users, and make use of past experiences to
find the best plans to achieve goals. We therefore propose
the development of an autonomous deliberative agent that
incorporates a Case-Based Planning (CBP) mechanism,
derived from Case-Based Reasoning (CBR) (Bajo,
Corchado & Castillo, 2006) and specially designed for
plan construction. CBP-BDI facilitates learning
and adaptation, and therefore a greater degree of
autonomy than that found in a pure BDI (Belief, Desire,
Intention) architecture (Bratman, 1987). BDI agents
can be implemented by using different tools, such as
Jadex (Pokahr, Braubach & Lamersdorf, 2003), which
deals with the concepts of beliefs, goals and plans as
Java objects that can be created and handled within the
agent at execution time.



BACKGROUND 

During the last three decades the number of Europeans
over 60 years old has risen by about 50%. Today they
represent more than 25% of the population, and it is
estimated that in 20 years this percentage will rise to
one third of the population, meaning 100 million
citizens (Camarinha-Matos & Afsarmanesh, 2002). This
situation is not exclusive to Europe, since studies in
other parts of the world show similar tendencies
(Camarinha-Matos & Afsarmanesh, 2002). The importance
of developing new and more reliable ways to provide
care and support to the elderly is underlined by this
trend (Camarinha-Matos & Afsarmanesh, 2002), and
the creation of secure, unobtrusive and adaptable
environments for monitoring and optimizing health care
will become vital. Some authors (Nealon & Moreno,
2003) consider that tomorrow's health care institutions
will be equipped with intelligent systems capable of
interacting with humans. Multi-agent systems and
architectures based on intelligent devices have recently
been explored as supervision systems for medical care
for the elderly or Alzheimer patients, aiming to support
them in all aspects of daily life, predicting potentially
hazardous situations and delivering physical and
cognitive support.

RFID technology is a wireless technology used
to identify and receive information on the move. An
RFID system basically contains four components: tags,
readers, antennas and software (Sokymat, 2006). The
configuration used in the system presented in this paper
consists of 125 kHz transponders mounted on bracelets
worn on the patient's wrist or ankle, several readers
installed over protected zones, with a capture range of
up to 2 meters, and a central computer where all the ID
numbers sent by the readers are processed.



MAIN FOCUS OF THE CHAPTER 

This article presents an autonomous planner agent for
health care. First, the autonomous nature-inspired health
care agent, named AGALZ, is described. Then, a case
study is presented, describing the main characteristics
of the ALZ-MAS architecture and its agents, including
AGALZ, and concluding with initial results obtained after
the implementation of a prototype in a real scenario.

Autonomous Nature-Inspired Health 
Care Agent 

We have developed AGALZ, an autonomous deliberative
Case-Based Planner (CBP-BDI) agent that
integrates with other agents into a multi-agent system,
named ALZ-MAS, as a proposal to improve the
efficiency of health care and supervision of patients in
geriatric residences. AGALZ presents a deliberative
architecture, based on the BDI (Belief, Desire,
Intention) model (Bratman, 1987). In this model, the internal
structure and capabilities of the agents are based on
human mental aptitudes, using beliefs, desires, and
intentions. Our method facilitates the incorporation of
CBR systems (Aamodt & Plaza, 1994) as a deliberative
mechanism within BDI agents, facilitating learning and
adaptation and providing a greater degree of autonomy
than a pure BDI architecture. A deliberative CBP-BDI
agent is specialized in generating plans and
incorporates a Case-Based Planning (CBP) mechanism. The
purpose of CBR agents is to solve new problems by
adapting solutions that have been used to solve similar
problems in the past (Aamodt & Plaza, 1994), and
CBP agents are a variation of CBR agents, based
on the plans generated from each case. A CBP planner
is used by AGALZ to find plans to give daily nursing
care in a geriatric residence (Tapia, Bajo, Corchado,
Rodriguez & Manzano, 2007). It is very important
to maintain a map with the location of the different
elements that take part in the system at the moment
of planning or replanning, so the use of RFID technology
greatly facilitates dynamic planning.

CBR is a type of human thinking based on reasoning
about past experiences. To introduce a CBR engine
into a BDI agent it is necessary to represent the cases
used in a CBR system by means of beliefs, desires and
intentions, and to implement a CBR cycle. A case is a past
experience composed of three elements: an initial state
or problem description that is represented as a belief;
a final state that is represented as a set of goals and
a solution (belief); and the sequence of actions that
makes it possible to evolve from the initial state to the
final state. This sequence of actions is represented as
intentions or plans. Figure 1 shows the internal structure
of a CBP-BDI agent.
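
The three-part case structure described above can be sketched as a small data type. This is only an illustrative sketch in Python, not the representation used by the authors: the class name `Case`, its field names and the example values are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Case:
    """One past experience, split into the three parts named in the text.

    All names here are hypothetical; the source only fixes the three roles.
    """
    # Initial state / problem description, held as a belief.
    initial_state: Dict[str, float]
    # Final state: the goals to reach plus the solution reached (a belief).
    goals: List[str]
    solution: Dict[str, float]
    # Ordered actions leading from the initial to the final state
    # (the agent's intentions, i.e. its plan).
    plan: List[str] = field(default_factory=list)


# Illustrative values only.
example = Case(
    initial_state={"pending_tasks": 3, "minutes_available": 120},
    goals=["all_tasks_done"],
    solution={"pending_tasks": 0, "minutes_used": 95},
    plan=["feeding", "oral_medication", "posture_change"],
)
```

Keeping the plan as an ordered list mirrors the statement that the sequence of actions is what carries the agent from the initial state to the final state.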

In a planner agent, the reasoning engine generates
plans using past experiences and planning strategies,
so the concept of Case-Based Planning is obtained
(Corchado & Laza, 2003; Glez-Bedia & Corchado,
2002). CBP consists of four sequential stages: a retrieve
stage to recover the past experiences most similar to
the current one; a reuse stage to combine the retrieved
solutions in order to obtain a new optimal solution;
a revise stage to evaluate the obtained solution; and a
retain stage to learn from the new experience.
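
The four stages can be read as a simple loop over a case base. The following Python sketch shows that control flow only, under stated assumptions: the function `cbp_cycle`, the `(problem, plan)` tuple layout, the ordered-union reuse step and the `accept` callback are hypothetical stand-ins, not the algorithms used in AGALZ.

```python
from typing import Callable, List, Tuple

# A stored case is modelled here as (problem vector, plan steps).
StoredCase = Tuple[List[float], List[str]]


def cbp_cycle(problem: List[float],
              case_base: List[StoredCase],
              similarity: Callable[[List[float], List[float]], float],
              accept: Callable[[List[str]], bool],
              k: int = 2) -> List[str]:
    # Retrieve: the k past problems most similar to the current one.
    ranked = sorted(case_base,
                    key=lambda c: similarity(problem, c[0]),
                    reverse=True)
    retrieved = ranked[:k]

    # Reuse: combine the retrieved plans. An ordered union of steps is
    # used here purely as a placeholder for a real combination strategy.
    plan: List[str] = []
    for _, past_plan in retrieved:
        for step in past_plan:
            if step not in plan:
                plan.append(step)

    # Revise: an external evaluator judges the proposed plan.
    approved = accept(plan)

    # Retain: learn from the new experience only if it was approved.
    if approved:
        case_base.append((problem, plan))
    return plan


def neg_distance(a: List[float], b: List[float]) -> float:
    # Larger is more similar: negative Manhattan distance.
    return -sum(abs(x - y) for x, y in zip(a, b))


base: List[StoredCase] = [
    ([1.0, 0.0], ["feeding"]),
    ([0.9, 0.1], ["feeding", "hygiene"]),
    ([0.0, 1.0], ["exercise"]),
]
new_plan = cbp_cycle([1.0, 0.1], base, neg_distance,
                     accept=lambda p: len(p) > 0)
```

Passing the similarity measure and the evaluator as parameters reflects the point that each stage can be served by different algorithms.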

The CBP cycle is implemented through goals and 
plans. When the goal corresponding to one of the 
stages is triggered, different plans (algorithms) can be 
executed concurrently to achieve the goal. Each plan 
can trigger new sub-goals and, consequently, cause 
the execution of new plans. Deliberative CBP-BDI 
agents, like AGALZ, are able to incorporate other 
reasoning mechanisms that can coexist with the CBP. 
AGALZ is an autonomous agent that can survive in
dynamic environments. However, it is possible to
incorporate communication mechanisms that allow it to be
easily integrated into a multi-agent system and to work
in coordination with other agents to solve problems in a
distributed way.

Figure 1. CBP-BDI Agent internal structure

The CBP planner constructs plans in such a way that
a plan is a sequence of tasks that need to be carried out
by a nurse. A task is a Java object that contains the date
of the requested service, the description of the service
and the time limits to carry it out.
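
As a rough illustration of the task and plan structures just described, the sketch below uses Python rather than the Java objects of the original system. The class name `Task`, the field names and the sample dates and times are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import List


@dataclass(frozen=True)
class Task:
    """Hypothetical analogue of the task object described in the text."""
    service_date: date   # date of the requested service
    description: str     # description of the service
    earliest: time       # time limits within which the task
    latest: time         # must be carried out


# A plan is a sequence of tasks to be carried out by one nurse.
Plan = List[Task]

plan: Plan = [
    Task(date(2007, 5, 14), "feeding", time(8, 30), time(10, 0)),
    Task(date(2007, 5, 14), "oral medication", time(8, 0), time(9, 0)),
]

# One simple ordering: earliest deadline first.
plan_by_deadline = sorted(plan, key=lambda t: t.latest)
```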

For each task one or more goals are established,
in such a way that the whole task is eventually
achieved. A problem description is formed by
the tasks that the nurse needs to execute, the resources
available, and the times assigned for their shift. In the
retrieve stage, those problem descriptions found within
a range of similarity close to the original problem
description are recovered from the beliefs base. In our
case, a tolerance of 20% has been permitted. To
do this, AGALZ allows the application of different
similarity algorithms (cosine, clustering, etc.). Once the
most similar problem descriptions have been selected,
the solutions associated with them are recovered. One
solution contains all the plans (sequences of tasks) carried
out in the past in order to achieve the objectives of AGALZ
for a problem description (assuming that replanning is
possible), as well as the efficiency of the solution
supplied. The chosen solutions are combined
in the reuse stage to construct a plan (Bajo, Corchado
& Castillo, 2006; Glez-Bedia & Corchado, 2002). The
reuse is focused on the objectives and resources needed
by each task, as well as on the objectives that the nurse
needs to achieve and the resources available in order to
carry out the global plan. The objectives of each nurse
are to attend to the patients and not to exceed eight
working hours. The time available is a problem
restriction. The resources necessary for some of the
tasks are food, equipment and rooms, among others.
AGALZ takes care of incidents and interruptions that
may occur during replanning (Bajo, Corchado & Castillo,
2006). Furthermore, AGALZ trusts the nurse in the
sense that the revision of a plan is made by the nurse.
Finally, AGALZ learns from this new experience. If
the evaluation of the plan is at least 90% similar, the
case is stored in the case memory.
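
The retrieve and retain thresholds can be made concrete with a short sketch. It assumes that problem descriptions are encoded as numeric vectors, that the 20% tolerance means "cosine similarity of at least 0.80", and that the 90% figure is a score in [0, 1] supplied by the reviser. None of these encodings is specified in the text, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

StoredCase = Tuple[List[float], List[str]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0


def retrieve(problem: Sequence[float],
             case_base: List[StoredCase],
             tolerance: float = 0.20) -> List[StoredCase]:
    # Keep cases whose similarity lies within `tolerance` of a perfect match.
    return [case for case in case_base
            if cosine(problem, case[0]) >= 1.0 - tolerance]


def retain(case_base: List[StoredCase],
           problem: List[float],
           plan: List[str],
           evaluation: float,
           threshold: float = 0.90) -> bool:
    # Store the new case only if its evaluation reaches the threshold.
    if evaluation >= threshold:
        case_base.append((problem, plan))
        return True
    return False


cases: List[StoredCase] = [
    ([1.0, 0.1, 0.9], ["plan A"]),
    ([0.0, 1.0, 0.1], ["plan B"]),
]
current = [1.0, 0.0, 1.0]
matches = retrieve(current, cases)
stored = retain(cases, current, ["plan A"], evaluation=0.95)
rejected = retain(cases, current, ["plan C"], evaluation=0.50)
```

Cosine similarity is only one of the measures the text mentions; a clustering-based retrieval would replace `retrieve` without affecting `retain`.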

Case Study 

A prototype of the system has been tested in several
geriatric residences, which have been interested in
improving the services offered to their patients and have
collaborated in the development of the technology
presented here, providing their know-how and
experimenting with the prototype developed.

Figure 2 shows a basic schema of the wireless
technology implemented in the residences. We selected 30
patients to test the system, so the hardware implemented
basically consisted of 42 ID door readers, one on each
door and elevator, 4 controllers, one at each exit, one
in the first floor hall and another in the second floor
hall, and 36 bracelets, one for each patient and the
nurses. The ID door readers get the ID number from
the bracelets and send the data to the controllers, which
send a notification to the Manager Agent, located in a
central computer. To test the system 30 Patient Agents,
10 AGALZ Agents, 2 Doctor Agents and 1 Manager
Agent were instantiated.

Figure 2. ALZ-MAS wireless technology organization
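
The notification path described above (bracelet, door reader, controller, Manager Agent) suggests a simple location map kept on the central computer. The sketch below is an assumption-laden illustration: the `Reading` record, the `LocationMap` class, the identifiers and the rule that a reading at a restricted reader raises an alert are all hypothetical and not taken from the ALZ-MAS implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Reading:
    bracelet_id: str   # ID number read from the bracelet
    reader_id: str     # door or elevator reader that captured it
    timestamp: float   # seconds since some reference time


class LocationMap:
    """Keeps the most recent reading per bracelet (hypothetical design)."""

    def __init__(self, restricted_readers: Set[str]) -> None:
        self.restricted_readers = restricted_readers
        self.last_seen: Dict[str, Reading] = {}
        self.alerts: List[Reading] = []

    def notify(self, reading: Reading) -> None:
        # Called when a controller forwards a reading.
        previous = self.last_seen.get(reading.bracelet_id)
        if previous is None or reading.timestamp >= previous.timestamp:
            self.last_seen[reading.bracelet_id] = reading
        # Assumed rule: any reading at a restricted reader is an alert.
        if reading.reader_id in self.restricted_readers:
            self.alerts.append(reading)

    def where(self, bracelet_id: str) -> Optional[str]:
        reading = self.last_seen.get(bracelet_id)
        return reading.reader_id if reading else None


location_map = LocationMap(restricted_readers={"exit-1"})
location_map.notify(Reading("patient-07", "room-12", 10.0))
location_map.notify(Reading("patient-07", "exit-1", 25.0))
location_map.notify(Reading("patient-07", "room-12", 5.0))  # late, stale
```

Ignoring readings older than the stored one keeps the map consistent if controllers deliver notifications out of order.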

ALZ-MAS: Alzheimer Health Care 
Multi-Agent System 

The characteristics of multi-agent systems make them
appropriate for deployment in geriatric residences
to improve the health care of patients (Nealon & Moreno,
2003). A multi-agent system is a distributed system
based on the cooperation of autonomous agents. The
relationships established between the agents of ALZ-MAS
are inspired by human behaviours (doctors,
nurses, patients, etc.) (Marrow, 2000).

After studying the requirements of the problem, we
concluded that ALZ-MAS is composed of four
different agent types, as shown in Figure 3:

- Patient Agent manages the patient's personal data
and behaviour (monitoring, location, daily tasks, and
anomalies). Every hour it validates the patient's location,
monitors the patient's state and sends a copy of its memory
base (patient state, goals and plans) to the Manager
Agent in order to maintain backups. The patient state
is instantiated at execution time as a set of beliefs, and
these beliefs are controlled through goals that must be
achieved or maintained. The beliefs found to
define a general patient state at the residences were:
weight, temperature, blood pressure, feeding, oral
medication, parenteral medication, posture change,
toileting, personal hygiene, and exercise. The beliefs
and goals used for every patient depend on the plan
(treatment) or plans that the doctors prescribe. The Patient
Agent monitors the patient state by means of the goals.
It must maintain continuous communication
with the rest of the ALZ-MAS agents, especially with
AGALZ (through which the nurse can communicate
the result of her assigned tasks). At least once per day,
depending on the corresponding treatment, Patient
Agents must communicate with AGALZ and Doctor
Agents. Finally, Patient Agents must ensure that all
actions indicated in the treatment are carried out. Patient
Agents run on a central computer.

- Manager Agent plays two roles: the security role,
which monitors the patients' location and manages
locks and alarms; and the manager role, which
manages the medical record database and the
doctor-patient and nurse-patient assignments. It
must provide security for the patients and medical
staff, and the assignment of patients to doctors and
nurses must be efficient. This assignment is carried
out through a CBR reasoning engine, which is
incorporated within the Manager Agent. When
a new assignment of tasks to nurses or doctors needs
to be carried out, both past experiences,
such as the profile of the nurse or doctor, and the
needs of the current situation are recalled. In this
way, tasks are allocated to nurses. A nurse profile
includes the nurse's preferences, such as holidays.
The Manager Agent runs on a central computer.

Figure 3. ALZ-MAS architecture: Doctor, AGALZ, Patient, and Manager Agents, with their interactions

Planning Agent for Geriatric Residences

- Doctor Agent treats patients. It needs to interact
with Patient Agents to order treatments and receive
periodic reports, with the Manager Agent
to consult medical records and assigned patients,
and with AGALZ agents to ascertain patients'
evolution.

- AGALZ schedules the nurse's working day,
obtaining dynamic plans depending on the tasks
needed for each assigned patient. AGALZ manages
nurses' profiles, tasks, available time and
resources. The generated plans must guarantee
that all the patients assigned to the nurse are given
care. Nurses cannot exceed 8 working hours. Every
agent generates personalized plans depending on
the nurse's profile and working habits. AGALZ
Agents run on mobile devices, where each nurse
can see her plans task by task. A plan can be
interrupted for different reasons: a resource fails;
a patient suffers a crisis and requires unforeseen
attention; a patient has an unexpected visit; etc.
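
The two constraints stated for AGALZ plans (every assigned patient receives care, and the nurse does not exceed 8 working hours) can be checked mechanically. The following is a minimal sketch, assuming a plan is a list of `(patient, task, minutes)` triples; the function name `check_plan` and that layout are hypothetical. The sample durations reuse per-task averages reported in the results of this article.

```python
from typing import Dict, List, Tuple

SHIFT_MINUTES = 8 * 60  # the 8-hour limit stated in the text

PlanStep = Tuple[str, str, int]  # (patient, task, minutes), assumed layout


def check_plan(plan: List[PlanStep],
               assigned_patients: List[str]) -> Dict[str, object]:
    total = sum(minutes for _, _, minutes in plan)
    covered = {patient for patient, _, _ in plan}
    uncovered = sorted(set(assigned_patients) - covered)
    return {
        "total_minutes": total,
        "within_shift": total <= SHIFT_MINUTES,
        "uncovered_patients": uncovered,
    }


plan = [
    ("p1", "feeding", 18),
    ("p1", "oral medication", 8),
    ("p2", "exercise", 60),
]
report = check_plan(plan, assigned_patients=["p1", "p2", "p3"])
```

A planner could call such a check after every replanning step, for example when an interruption adds an unforeseen task.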

Extracting Results from ALZ-MAS 

Figure 4 shows the average number of nurses working
simultaneously (for each of the 24 hours of the day) before
and after the implementation of the system prototype in
a test residence, with data collected for 6 months. The
average number of patients was the same before and
after the implementation. Tasks executed by nurses were
divided into two categories: direct action tasks (where
nurses are in contact with patients) and indirect action
tasks (where nurses are not directly involved with
patients, like monitoring, written reports, managing
personal visits to the patients, etc.). During the first 3
months, the problem was analysed, the residence was
observed and data was retrieved. Finally, averages of the
time spent by nurses in carrying out the tasks for
every patient were obtained, taking into account that a
task depends on the dependency level of a patient and
the nurse's skill. For the direct action tasks, the following
times were obtained for each patient: 35' cleaning, 18'
feeding, 8' oral medication, 30' parenteral medication,
25' posture change, 8' toileting, 60' exercise and 10'
others. We are especially interested in the time spent on
indirect action tasks; the daily average times obtained for
every kind of task before and after the implementation
can be seen in Table 1.

The system facilitates a more flexible assignment
of the working shifts at the residence, since the
workers have reduced the time spent on routine tasks
and can assign this time to extra activities. Their work
is automatically monitored, as well as the patients'
activities. The stored information may be analysed
with knowledge discovery techniques and may help
to improve the quality of life for the patients and the
efficiency of the centre (Marrow, 2000). The security
of the centre has also been improved in two ways: first, the
system monitors the patients and guarantees that each
one of them is in the right place; second, only
authorised personnel can gain access to the residence's
protected areas.

Figure 4. Number of nurses working simultaneously in the residence

Table 1. Time (minutes) spent on indirect tasks

          Monitoring   Reports   Visits   Other   TOTAL
Before       167          48       73      82      370
After        105          40       45      60      250
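
The figures in Table 1 can be checked and summarised with a few lines of arithmetic. The sketch below only recomputes quantities from the table; the variable names are introduced here for illustration.

```python
# Daily average minutes spent on indirect tasks, copied from Table 1.
before = {"Monitoring": 167, "Reports": 48, "Visits": 73, "Other": 82}
after = {"Monitoring": 105, "Reports": 40, "Visits": 45, "Other": 60}

# Minutes saved per task category.
saved = {task: before[task] - after[task] for task in before}

total_before = sum(before.values())
total_after = sum(after.values())

# Relative reduction in total indirect-task time, in percent.
reduction_pct = 100 * (total_before - total_after) / total_before
```

The category sums agree with the TOTAL column (370 and 250 minutes), which corresponds to a saving of 120 minutes, or roughly a third of the time previously spent on indirect tasks. Monitoring accounts for the largest share of that saving.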



FUTURE TRENDS 

In the future, health care will require the use of new 
technologies that allow medical personnel to carry 
out their tasks more efficiently (Camarinha-Matos & 
Afsarmanesh, 2002). We are interested in the use of 
Ambient Intelligence (Ducatel, Bogdanowicz, Scapolo, 
Leijten & Burgelman), which provides a framework 
for the development of transparent, ubiquitous and 
unobtrusive environments. The objective of Ambient
Intelligence is to adapt existing technologies to
human needs (Emiliani & Stephanidis, 2005). In
this sense, the planner proposed in this work should be
adapted to other possible technologies and evaluated
in similar environments.



CONCLUSION 

We have shown the potential of deliberative AGALZ
agents in a distributed multi-agent system focused
on health care, providing a way to respond to some
challenges of health care, related, for example, to
identification, control and health care planning. In
addition, the use of RFID technology (Sokymat, 2006) on
people provides a high level of interaction among users
and patients through the system and is fundamental in
the construction of an intelligent environment.
Furthermore, mobile devices, when used well, can
facilitate social interactions and knowledge transfer.



KEY TERMS

Ambient Intelligence (AmI): Refers to electronic
environments that are sensitive and responsive to
context and to people's needs and characteristics. It is
characterized by systems and technologies that are
embedded, context-aware, ubiquitous, non-intrusive,
personalized, adaptive and anticipatory.

Case-Based Reasoning: A type of reasoning based
on the use of past experiences. The purpose of CBR
systems is to solve new problems by adapting
solutions that have been used to solve similar problems in
the past. The main concept when working with CBR
is the concept of case, which can be defined as a past
experience.

Case-Based Planning: A specialization of Case- 
Based Reasoning in which the solution proposed by 
the system is a plan (a sequence of actions). 

CBR-BDI: A deliberative BDI agent that incorporates
a CBR engine as its reasoning mechanism.

CBP-BDI: A deliberative BDI agent specialized in 
generating plans. It incorporates a Case-Based Plan- 
ning mechanism. 

Multi-Agent System: A system composed of several
intelligent autonomous agents, collectively capable
of reaching goals by solving problems in a distributed
way.

Radio Frequency Identification: A wireless 
technology used to identify and receive information 
on the move. An RFID system contains basically four 
components: tags, readers, antennas and software. 



INTRODUCTION

Data mining has evolved from a need to make sense 
of the enormous amounts of data generated by orga- 
nizations. But data mining comes with its own cost, 
including possible threats to the confidentiality and 
privacy of individuals. This chapter presents a back- 
ground on privacy-preserving data mining (PPDM) 
and the related field of statistical disclosure limitation 
(SDL). We then focus on privacy-preserving estimation 
(PPE) and the need for a data-centric approach (DCA) 
to PPDM. The chapter concludes by presenting some 
possible future trends. 



BACKGROUND 

The maturity of information, telecommunications,
storage and database technologies has facilitated the
collection, transmission and storage of huge amounts
of raw data, unimagined until a few years ago. For
raw data to be utilized, they must be processed and
transformed into information and knowledge that have
added value, such as helping to accomplish tasks more
effectively and efficiently. Data mining techniques and
algorithms attempt to aid decision making by analyzing 
stored data to find useful patterns and to build decision- 
support models. These extracted patterns and models 
help to reduce the uncertainty in decision-making 
environments. 

Frequently, data may have sensitive information 
about previously surveyed human subjects. This raises 
many questions about the privacy and confidentiality 
of individuals (Grupe, Kuechler, & Sweeney, 2002). 
Sometimes these concerns result in people refusing 
to share personal information, or worse, providing 
wrong data. 

Many laws emphasize the importance of privacy 
and define the limits of legal uses of collected data. In 



the healthcare domain, for example, the U.S. Depart- 
ment of Health and Human Services (DHHS) added 
new standards and regulations to the Health Insurance 
Portability and Accountability Act of 1996 (HIPAA) to 
protect "the privacy of certain individually identifiable 
health data" (HIPAA, 2003). Grupe et al. (2002, Exhibit 
1, p. 65) listed a dozen privacy-related legislative acts 
issued between 1970 and 2000 in the United States. 

On the other hand, these acts and concerns limit,
legally and/or ethically, the releasing of datasets
for legitimate research or to obtain competitive
advantage in the business domain. Statistical offices
face a dilemma of legal conflict or what can be called
"war of acts." While they must protect the privacy
of individuals in their datasets, they are also legally
required to disseminate these datasets. The conflicting
objectives of the Privacy Act of 1974 and the Freedom
of Information Act are just one example of this dilemma
(Fienberg, 1994). This has led to an evolution in the
field of statistical disclosure limitation (SDL), also
known as statistical disclosure control (SDC).

SDL methods attempt to find a balance between 
data utility (valid analytical results) and data security 
(privacy and confidentiality of individuals). In general, 
these methods try to either (a) limit the access to the 
values of sensitive attributes (mainly at the individual 
level), or (b) mask the values of confidential attributes 
in datasets while maintaining the general statistical 
characteristics of the datasets (such as mean, standard 
deviation, and covariance matrix). Data perturbation 
methods for microdata are one class of masking methods 
(Willenborg & Waal, 2001). 
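
To illustrate masking that preserves summary statistics, the sketch below perturbs a confidential column with random noise and then rescales it so that the mean and standard deviation match the original. This is a deliberately simple stand-in, not a specific method from the SDL literature; the function name `mask_additive`, the `noise_ratio` parameter and the sample data are assumptions.

```python
import random
import statistics
from typing import List


def mask_additive(values: List[float],
                  noise_ratio: float = 0.3,
                  seed: int = 0) -> List[float]:
    """Add Gaussian noise, then restore the original mean and std. dev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant column carries no spread to preserve.
        return list(values)

    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, noise_ratio * sd) for v in values]

    noisy_mean = statistics.fmean(noisy)
    noisy_sd = statistics.pstdev(noisy)
    if noisy_sd == 0:
        return list(values)

    # Linear rescaling: individual values stay perturbed, but the
    # column-level mean and standard deviation match the original.
    return [mean + (x - noisy_mean) * sd / noisy_sd for x in noisy]


salaries = [30.0, 42.0, 55.0, 61.0, 78.0]
masked = mask_additive(salaries)
```

An analyst working with `masked` obtains the same mean and standard deviation as with `salaries`, while individual records no longer show their true values. Preserving the covariance matrix across several attributes requires more elaborate methods than this single-column sketch.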

Data Mining vs. Statistical Analysis 

Statisticians and researchers conduct surveys and
collect datasets that are considered to be large when they
contain a few hundred records (Hand, 1998). Traditional
statistical techniques are the main (and the most
suitable) tools for analyzing these datasets to make
inferences and estimate population parameters. When the
size of datasets is large, traditional statistical analysis
techniques may not be the appropriate tools (Hand,
1998, 2000; Hand, Blunt, Kelly, & Adams, 2000). First,
traditional statistical analysis may be inappropriate
because almost any small difference in a large dataset
becomes statistically significant. Second, large datasets
may suggest that data was not collected for inference
(parameter estimation) about the population. Third, in
businesses, a significant amount of data is generated
because of unplanned activities (e.g., transactional
databases) and not from planned activities (e.g.,
experiment or survey designs). Therefore, for large datasets,
data mining becomes more appropriate.
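
The first point, that small differences become statistically significant in large datasets, can be seen with a standard two-sample z-test. The sketch assumes two groups of equal size n and a known common standard deviation; the function name and the chosen numbers are illustrative only.

```python
import math


def two_sample_z_pvalue(diff: float, sigma: float, n: int) -> float:
    """Two-sided p-value for a difference in means between two groups
    of size n each, assuming a known common standard deviation."""
    z = diff / (sigma * math.sqrt(2.0 / n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# The same tiny difference (1% of a standard deviation) at two sizes.
p_small_sample = two_sample_z_pvalue(diff=0.01, sigma=1.0, n=100)
p_large_sample = two_sample_z_pvalue(diff=0.01, sigma=1.0, n=1_000_000)
```

With 100 records per group the difference is far from significant, while with a million records per group the same difference yields a p-value well below conventional thresholds, even though its practical importance has not changed.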

Examples of large datasets are abundant. Market- 
Touch, a company located in Georgia, USA, supports 
direct marketers with data and analytical tools (DMRe- 
view.com, 2004). It has a six-terabyte database called 
Real America Database (RADBO), which provides 
information about more than 93 million households 
and 200 million individuals. It is updated monthly with 
more than 20 million records. 

Statistical agencies also experience this phenomenon 
of rapidly growing datasets. The US Census Bureau 
(Census, 2001) reported that the Census 2000 data 
consist of "information about the 115.9 million housing
units and 281.4 million people across the United
States." These large sizes suggest the need for analytical
tools that are suitable for large datasets, and again, data 
mining tools naturally come into play. Consequently, 
the Bureau provides programs with data mining ca- 
pabilities such as DataFerrett (Federated Electronic 
Research, Review, Extraction and Tabulation Tool), 
which can be used to analyze and extract data from 
TheDataWeb - a repository of datasets that cover more 
than 95 subject areas. 

Motivation for Privacy-Preserving Data 
Mining (PPDM) 

Data mining techniques may lead to more significant 
threats to privacy and confidentiality than statistical 
analysis. Domingo-Ferrer and Torra (2003) make a con- 
nection between SDL methods and some data-mining AI 
(artificial intelligence) tools and suggest that disclosure 
and re-identification threats can be magnified. 



DM tools can be used to aggregate or combine 
masked copies of a specific original dataset to reverse 
masking and re-build the original dataset, which raises 
a confidentiality issue. This is particularly true when 
unsophisticated SDL techniques are used and many 
masked copies are released. DM tools can also be used
to enforce data integrity and consistency in distributed
datasets by re-identifying different records belonging
to the same individual, which raises a privacy issue.

These concerns about privacy and confidentiality 
when DM tools are used have led to the birth of pri- 
vacy-preserving data mining (PPDM). The main goal 
of PPDM is to find useful patterns and build accurate 
models from datasets without accessing the individuals'
precise original values in records of datasets (Agrawal
& Srikant, 2000). 

Related Work in Privacy-Preserving Data 
Mining (PPDM) 

Similar to the classification of data mining (DM) 
techniques proposed by Berry and Linoff (2004), 
privacy-preserving data mining (PPDM) techniques 
can be classified as: (a) directed PPDM techniques: 
privacy-preserving estimation and privacy-preserving 
classification, and (b) undirected PPDM techniques: 
privacy-preserving association rules and privacy-pre- 
serving clustering. 

Directed PPDM techniques try to model the relation- 
ship between a dependent variable and other (indepen- 
dent) variables in masked datasets. Estimation deals 
with continuous dependent variables and classification 
with categorical or binary dependent variables. The 
models obtained from the masked data using directed
PPDM techniques must be the same as (or similar to)
those obtained from the original dataset at the aggregate
level, while protecting privacy and confidentiality at the
individual level.

In undirected PPDM, there is no concept of a
dependent variable. Instead, the goal is to find unknown
patterns and rules. Clustering is used to discover (and
usually profile) homogeneous subsets of data records
and is often used as a preprocessing tool (to segment the
customer base, for example) before applying other DM
techniques (Berry & Linoff, 2004). Association rules
are used to discover which items go together (are
associated). Again, the goal of PPDM is to obtain similar
patterns from both the masked and original data. Figure 1,
reproduced from Al-Ahmadi (2006), shows an abstract
view of the privacy-preserving data mining (PPDM) literature
broken down by technique. Details on the references
may be found in Al-Ahmadi (2006).

Figure 1. Privacy-preserving data mining (PPDM) literature



PRIVACY-PRESERVING ESTIMATION 
(PPE) 

We focus on privacy-preserving estimation (PPE) (also 
called privacy-preserving regression). PPE is still in its 
infancy compared to other PPDM methods, with some 
approaches showing promise. Sanil et al. (2004) proposed
an algorithm for computing the exact coefficients
of multiple linear regression for a vertically-distributed
(or partitioned) dataset without sharing original values.
The dataset is assumed to contain a single shared,
non-confidential dependent variable. The unshared,
confidential independent variables are owned by more
than two parties (agents) involved in the estimation
process. The algorithm utilizes the secure summation
algorithm (Benaloh, 1987; Clifton, Kantarcioglu, Vaidya,
Lin, & Zhu, 2002) to share a statistical summary (total)
populated partially by each party, without revealing
how much each party contributes to that statistic. This
total is needed for estimating the regression coefficients
iteratively. Thus, each party can accurately calculate
the coefficients of the variables it owns and share
them with the other parties.
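
The idea behind secure summation can be illustrated with a short sketch. It is not the protocol of Sanil et al. (2004); it is a minimal ring-style version under stated assumptions: the parties pass a running total around in turn, all values are non-negative integers, the true total is smaller than the modulus, and no two parties collude. The function name and the modulus are hypothetical choices made for this illustration.

```python
import random


def secure_sum(private_values, modulus=2**61 - 1, rng=None):
    """Return the total of the parties' private values.

    Each intermediate running total is offset by a random mask that
    only the initiating party knows, so a party that sees the running
    total learns nothing about the contributions made so far.
    """
    rng = rng or random.SystemRandom()
    mask = rng.randrange(modulus)  # secret of the initiating party

    # The initiating party adds its own value to the mask and passes it on.
    running_total = (mask + private_values[0]) % modulus

    # Every other party adds its private value to what it receives.
    for value in private_values[1:]:
        running_total = (running_total + value) % modulus

    # Back at the initiating party, the mask is removed.
    return (running_total - mask) % modulus


total = secure_sum([12, 7, 30])
```

Only the final total is disclosed; each party sees just a masked running total, which is what allows a shared statistic to be assembled without revealing individual contributions.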

Karr et al. (2004) suggest two approaches for
building multiple linear regression on the union of a
horizontally-distributed dataset. The first approach
(secure data integration) integrates horizontally-distributed
datasets from multiple parties (agents) into one
dataset, while protecting the identity of the data source.
Each party can then locally run linear regression analysis
on the integrated dataset. This approach only protects
the identity of the data sources (i.e. the identity of the
involved parties, not the identity or confidentiality of
surveyed human subjects). The second approach is based
on the additive nature of linear regression analysis.
Statistics (rather than data) needed to calculate the least
squares estimators of linear regression coefficients are
shared and integrated in a secure manner using the
secure summation algorithm (Benaloh, 1987; Clifton
et al., 2002; Schneier, 1996).
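
The additive property can be made concrete with a small sketch. It uses a single independent variable rather than full multiple regression, and it adds the statistics in the clear rather than through secure summation; both are simplifications made for illustration, and the function names are hypothetical. Each party computes sums over its own records only, and the least squares line fitted from the combined sums equals the line that would be fitted to the pooled records.

```python
def local_statistics(xs, ys):
    """Sufficient statistics one party computes on its own records."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(statistics):
    """Add the parties' statistics component by component."""
    return tuple(sum(component) for component in zip(*statistics))


def fit_line(stats):
    """Least squares intercept and slope from the combined statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two parties hold different records of the same variables (y = 2x + 1).
party_a = local_statistics([1, 2, 3], [3, 5, 7])
party_b = local_statistics([4, 5], [9, 11])
intercept, slope = fit_line(combine([party_a, party_b]))
```

In a privacy-preserving setting, each component of the combined tuple would be obtained with secure summation instead of a plain sum.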

Remote regression servers (cf. Duncan & Mukherjee,
2000; Keller-McNulty & Unger, 1998; Schouten
& Cigrang, 2003) are access-limitation (not masking) 
methods for protecting microdata for building linear 
regression models. Although this approach builds 
linear regression models using original values, users 
do not usually have any means of checking the fit 
of their models. Reiter (2003) proposed a method to 
overcome this limitation based on releasing artificial, 
simulated (marginally-wise) dependent and indepen- 
dent variables, residuals and fitted values that mimic 
the original relationships of the built models. 

Because many multivariate methods, including mul- 
tivariate linear regression, depend on matrix computa- 
tions such as matrix multiplication and matrix inverse, 
Du et al. (2004) proposed secure two-party matrix 
computations protocols. These enable two agents to col- 
laboratively run matrix computations without knowing 
or accessing the other party's original, sensitive values, 
and without the involvement of a third party. 

The above approaches to PPE focus exclusively
on linear relationships. This makes them
somewhat limited for more general purpose PPE, where 
nonlinear relationships found in the original data may 
need to be preserved in the masked data. 

Data-Centric Approach (DCA) for 
Privacy-Preserving Data Mining 

One of the problems with many existing PPDM ap- 
proaches is that they create a dependency between the 
algorithm and the dataset (Thuraisingham, 2005); see, 
for example, Agrawal and Srikant (2000). The PPDM 
algorithm is usually a modification of a specific DM 
algorithm, for a specific protection technique. The 
masked data can therefore be analyzed using only that 
particular (tailored) data mining algorithm. Otherwise
there is no guarantee that the results from analyzing the
masked dataset will be the same as, or similar to, those
from analyzing the original dataset. This is problematic
for two reasons. First, data miners usually employ more
than one algorithm to mine a dataset, and examining all
data mining algorithms, as well as modifying them, is not
feasible. Second, once a dataset is released, there is no
guarantee as to which algorithm might be applied, possibly
leading to incorrect conclusions and actions.

Instead, as suggested by Al-Ahmadi et al. (2004),
datasets should be protected or masked without reference
to a specific DM algorithm. Oliveira and Zaïane
(2004b) support the concept of a Data-Centric Approach
(DCA), in which the masking algorithm is not tied to
the data mining algorithm but is based on the
characteristics of the dataset and its subsequent use.
For example, a good PPE algorithm will mask the dataset
based on the kind of relationships that need to be
maintained in the masked dataset. However, it will not
mandate that a particular data mining algorithm be used
to perform the estimation using the masked data.
Al-Ahmadi (2006) demonstrates some PPE algorithms that
utilize the DCA. Oliveira and Zaïane (2004a) also applied
the DCA concept by developing a new PPDM clustering
algorithm called Rotation-Based Transformation (RBT)
that allows any distance-based clustering algorithm
to be used on the masked datasets.
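
The reason a rotation suits distance-based clustering can be shown in a few lines. The sketch below is not the RBT algorithm itself (which also prescribes how attribute pairs and angles are chosen); it only assumes that two numeric attributes of every record are rotated by the same angle, and the function names are hypothetical. Because a rotation preserves Euclidean distances, any distance-based algorithm sees the same geometry in the masked data.

```python
import math


def rotate_attribute_pair(records, i, j, theta_degrees):
    """Rotate attributes i and j of every record by the same angle."""
    theta = math.radians(theta_degrees)
    c, s = math.cos(theta), math.sin(theta)
    masked = []
    for record in records:
        row = list(record)
        # Both right-hand values use the original attributes.
        row[i], row[j] = c * row[i] - s * row[j], s * row[i] + c * row[j]
        masked.append(row)
    return masked


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


records = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
masked = rotate_attribute_pair(records, 0, 1, 30.0)

# The attribute values change, but the distance between records does not.
original_distance = euclidean(records[0], records[1])
masked_distance = euclidean(masked[0], masked[1])
```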



FUTURE TRENDS 

Data perturbation and SDL masking methods can be 
a good starting point for implementing DCA in PPE 
and PPDM. One protection method used is Simple 
Additive Data Perturbation Method (SADP) (Traub, 
Yemini, & Wozniakowski, 1984), which has undesir- 
able characteristics in terms of data utility and data 
security (Muralidhar, Parsa, & Sarathy, 1999). Most 
of the newer and more sophisticated data perturbation 
and SDL masking methods, such as C-GADP (Sarathy, 
Muralidhar, & Parsa, 2002), IPSO (Burridge, 2003), 
EGADP (Muralidhar & Sarathy, 2005) and data shuffling
(Muralidhar & Sarathy, 2003, 2006), have not been
investigated in the PPE and the general PPDM domain. 
The only exception is the GADP method (Muralidhar et 
al., 1999), which appears in a few privacy-preserving 
classification studies (Islam & Brankovic, 2004; Wilson 
& Rosen, 2002, 2003; Wilson, Rosen, & Al-Ahmadi, 
2005a, 2005b). Hence, there is a need to investigate 
the possibilities of using some of these advanced SDL 
masking methods in PPE and PPDM. 
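
A small simulation illustrates one well-known utility problem of simple additive perturbation: adding independent zero-mean noise leaves the mean approximately unchanged but inflates the variance of the masked attribute. The sketch below assumes normally distributed data and noise; the parameter values, seed and function name are hypothetical and chosen only for illustration.

```python
import random
import statistics


def additive_mask(values, noise_sd, rng):
    """Mask each value by adding independent zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


rng = random.Random(0)
original = [rng.gauss(50.0, 10.0) for _ in range(10000)]
masked = additive_mask(original, 5.0, rng)

mean_shift = abs(statistics.fmean(masked) - statistics.fmean(original))
variance_original = statistics.pvariance(original)
variance_masked = statistics.pvariance(masked)
```

Here the masked variance exceeds the original variance by roughly the noise variance, so any analysis that depends on variances or correlations is biased unless the method corrects for it, which is what the more sophisticated perturbation methods aim to do.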

From another perspective, different types of relationships
can exist in a dataset. For instance, multivariate
normal datasets guarantee that all existing relationships
among variables are linear. For this special case, some
existing SDL masking methods are readily available
and can perfectly preserve linear relationships. This is
due to the fact that most SDL methods are developed to
preserve linear relationships. However, most (business)
datasets contain nonlinear relationships (Zhang, 2004),
which can be monotonic or non-monotonic (Fisher,
1970). "A truth about data mining not widely discussed
is that the relationships in data the miner seeks are
either very easy to characterize, or very, very hard"
(Pyle, 2003, p. 314). Therefore, there is a need to
develop masking methods for PPE and PPDM that maintain
more complicated types of relationships (i.e. monotonic
nonlinear and non-monotonic relationships).



CONCLUSION 

This article introduced privacy-preserving data mining 
(PPDM) and related concepts. It gave a brief over- 
view of the four main PPDM techniques: estimation, 
classification, clustering, and association rules. Then, 
it reviewed some of the work that has been done in 
Privacy-Preserving Estimation (PPE). It concluded 
by discussing some of the possible future trends in 
PPDM and PPE including the need for research into 
data-centric SDL-based masking techniques for solving 
complicated PPE problems.



KEY TERMS

Confidentiality: The status accorded to specific 
attributes (such as salary) in datasets, whose original 
values should not be revealed. Generally, some type
of protection such as masking must be provided before
these confidential attributes are disseminated.

Data Mining Algorithm: A systematic, practical 
method to implement a data mining technique. Dif- 
ferent algorithms can be used to implement the same 
data mining technique. For example, decision tree
algorithms (CART, C4.5, C5, etc.) and logistic regression
are among the algorithms of the classification data
mining technique.

Data Mining Technique: The main purpose or 
objective of the data mining modelling process. Each 
technique can be implemented using different DM 
algorithms. 

Data-Centric Approach (DCA): The concept that
data protection techniques must be independent of 
(standard) DM algorithms. That is, the masked data 
must be analyzable using multiple DM algorithms 
while providing results comparable to the results from 
analyzing the original data. 

Privacy: Privacy is the desire of individuals to 
control their personal information. Generally, in the 
SDL literature, it relates to the identity of an individual, 
while confidentiality relates to specific information 
about the individual (such as salary). 

Statistical Disclosure Limitation (SDL) or Statis- 
tical Disclosure Control (SDC): A set of methods that 
attempt to protect privacy and confidentiality of data, 
while preserving the overall statistical characteristics of 
original datasets (such as mean and covariance matrix) 
in the protected dataset.



INTRODUCTION

Prediction of protein secondary structure (alpha-helix,
beta-sheet, coil) from the primary sequence of amino acids
is a very challenging task, and the problem
has been approached from several angles. A protein
is a sequence of amino acid residues and can thus be
considered as a one-dimensional chain of 'beads' where
each bead corresponds to one of the 20 different amino
acid residues known to occur in proteins. The length of
most protein sequences ranges from 50 residues to about
1000 residues, but longer proteins are also known, e.g.
myosin, the major protein of muscle fibers, consists of
1800 residues (Altschul et al. 1997). Researchers have
used many techniques to predict protein secondary
structure, but the most commonly used technique for
protein secondary structure prediction is the neural
network (Qian et al. 1988).

This chapter discusses a new method combining
profile-based neural networks (Rost et al. 1993b),
Simulated Annealing (SA) (Akkaladevi et al. 2005;
Simons et al. 1997), the Genetic Algorithm (GA)
(Akkaladevi et al. 2005) and decision fusion algorithms
(Akkaladevi et al. 2005). Researchers used the neural
network (Hopfield 1982) combined with the GA and SA
algorithms, then applied two decision fusion methods,
the committee method and the correlation method,
and obtained improved prediction accuracy
(Akkaladevi et al. 2005). Sequence profiles of
amino acids are fed as input to the profile-based neural
network. The two decision fusion methods improved
the prediction accuracy, but one method worked better
for some input sequence profiles and the other method
worked better for others (Akkaladevi et al. 2005).
Instead of compromising on some of the good solutions
that could have been generated by either approach, a
combination of the two approaches is used to obtain
better prediction accuracy. This criterion is the basis
for the Bayesian inference method (Anandalingam et al.
1989; Schmidler et al. 2000; Simons et al. 1997). The
results obtained show that the prediction accuracy
improves by about 2% using the combination of the
decision fusion approach and the Bayesian inference
method.



BACKGROUND 

A lot of interesting work has been done on the protein
secondary structure prediction problem, and over the last
10 to 20 years the methods have gradually improved
in accuracy. The most successful application of neural
networks (Hopfield 1982) to secondary structure prediction
was obtained by Rost and Sander (Rost et al. 1993b;
Rost et al. 1993c; Rost 1996; Rost et al. 1994), which
resulted in the prediction mail server called PHD (Rost
et al. 1993c). Using a profile-based neural network and
a few other methods, the performance of the network
is reported to be up to 67.2% (Rost et al. 1993b).

In the problem of the protein secondary structure 
prediction, the inputs are the amino acid sequence 
profiles while the output is the predicted structure (also 
called conformation, which is the combination of alpha 
helices, beta sheets and loops) (Banavar et al. 2001; 
Branden et al. 1999). A typical protein sequence and 
its conformation class are shown below: 

Protein Sequence: ADADADADCCQQFFFAAAQQAQQA
Conformation Class: HHHH EEEE HHHHHHHH

H stands for Helical, E for Extended, and blanks
are the remaining coiled conformations.

A typical protein contains about 32% alpha helices,
21% beta sheets and 47% loops or non-regular structure
(Rost et al. 1993b). It is possible to predict loop regions
with higher accuracy than alpha helices or beta sheets
(Rost et al. 1993c). The seven-fold cross-validation
technique is used for training and testing on the set of
126 non-homologous globular proteins from (Rost &
Sander, 1994), which is called the RS126 data set
(Rost et al. 1994).

The protein secondary structure accuracy is calculated
using the three-state per-residue accuracy (Q3), which
gives the percentage of residues correctly predicted
in any of the three states (classes): alpha helix,
beta strand or loop region (Qian et al. 1988;
Rost 1996):

$$Q_3 = \frac{P_{\alpha} + P_{\beta} + P_{loop}}{T} \times 100\%$$

$P_{\alpha}$, $P_{\beta}$ and $P_{loop}$ are the numbers of
residues predicted correctly in the alpha helix, beta
strand and loop states, respectively, while $T$ is the
total number of residues.
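
The Q3 measure is simple to compute once the predicted and observed structures are written as strings of state labels. The sketch below assumes the labels H (helix), E (strand) and C (loop or coil); the function name and the example strings are hypothetical.

```python
from collections import Counter


def q3(predicted, observed):
    """Three-state per-residue accuracy, in percent."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("structures must be non-empty and of equal length")
    # Count the correctly predicted residues separately for each state.
    correct = Counter(o for p, o in zip(predicted, observed) if p == o)
    return 100.0 * (correct["H"] + correct["E"] + correct["C"]) / len(observed)


# Ten of the twelve residues agree.
accuracy = q3("HHHHEEEECCCC", "HHHEEEEECCCH")
```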



PROTEIN SECONDARY STRUCTURE 
PREDICTION BY VARIOUS 
APPROACHES 

In this research the RS126 dataset is used, which
contains 126 sequences with more than 23,300 amino
acid positions and 20 amino acids (Rost et al. 1994).
An orthogonal encoding scheme is used for the input
that is sent to the profile-based neural network.
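
Orthogonal encoding represents each residue as a vector with a single 1 in the position of its amino acid and 0 elsewhere. The sketch below assumes the 20 standard one-letter amino acid codes in alphabetical order; the ordering, the function name and the absence of an extra unit for window padding are assumptions made for illustration, not details taken from the cited work.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
INDEX = {residue: k for k, residue in enumerate(AMINO_ACIDS)}


def orthogonal_encode(sequence):
    """Encode each residue as a 20-element vector with a single 1."""
    vectors = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        vector[INDEX[residue]] = 1
        vectors.append(vector)
    return vectors


encoded = orthogonal_encode("ACD")
```

In a profile-based network the single 1 is replaced by the amino acid frequencies observed at that position in a multiple alignment, so the input keeps the same 20-element shape.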

Protein Secondary Structure Prediction using se- 
quence profiles - The profile-based neural network is 
used for this research. Using profiles at the input level 
generally has been shown to yield better results than 
using profiles at the output level (Baldi et al. 1999; 
Rost et al. 1993b). Using this approach the secondary
structure prediction accuracy (Q3) is 66.8%.

GA and the profile-based Neural Networks for
protein secondary structure prediction - The predicted
structure from the profile-based neural network is
given to the GA; the GA performs a series of mutation and
crossover operations on the predicted structure from
the profile-based neural network to generate new solutions
(offspring) (Akkaladevi et al. 2005). After an
offspring is generated, its fitness is calculated by
comparing it to the already known true structure using
the Q3 function. The GA accepts or rejects this solution
depending on the fitness value, which in this case is the
prediction accuracy Q3. Finally, the error value is
calculated and back-propagated to adjust the weights of
the profile-based neural network. The mutation probability
for the GA in this research is set at 0.25, the number of
generations at 75, the population size at 30 and the
crossover probability at 100% (Akkaladevi et al. 2005).
Using this approach the secondary structure prediction
accuracy (Q3) is 69.2%.
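
The GA step can be sketched as follows, using the parameter values quoted above. This is not the implementation of Akkaladevi et al. (2005): the one-point crossover, the single-residue mutation applied once per offspring, the rule that an offspring replaces its parent only when its Q3 is at least as high, and all function names are assumptions made for illustration. The back-propagation step is omitted.

```python
import random

STATES = "HEC"


def q3(predicted, observed):
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def crossover(a, b, rng):
    """One-point crossover of two structure strings (assumed operator)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(structure, rng):
    """Replace one randomly chosen residue state (assumed operator)."""
    pos = rng.randrange(len(structure))
    return structure[:pos] + rng.choice(STATES) + structure[pos + 1:]


def evolve(initial, true_structure, generations=75, pop_size=30,
           mutation_prob=0.25, rng=None):
    rng = rng or random.Random()
    population = [initial] * pop_size
    for _ in range(generations):
        next_population = []
        for parent in population:
            mate = rng.choice(population)
            child = crossover(parent, mate, rng)  # crossover probability 100%
            if rng.random() < mutation_prob:
                child = mutate(child, rng)
            # Accept or reject the offspring on its fitness (Q3).
            if q3(child, true_structure) >= q3(parent, true_structure):
                next_population.append(child)
            else:
                next_population.append(parent)
        population = next_population
    return max(population, key=lambda s: q3(s, true_structure))


true_structure = "HHHHEEEECCCC"
network_output = "HHHCEEECCCCH"
best = evolve(network_output, true_structure, rng=random.Random(1))
```

Because an offspring is kept only when it is at least as fit as the parent it replaces, the fitness of the returned structure can never fall below that of the network output it started from.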

SA and the profile-based Neural Networks for protein
secondary structure prediction - The predicted structure
from the profile-based neural network is sent to the SA
algorithm for further processing (Akkaladevi et al. 2005).
The SA algorithm generates new solutions and compares
them with the true secondary structure, which is already
known, to calculate the prediction accuracy Q3. The error
is then calculated from the value of Q3. This error value
is then back-propagated to adjust the weights of the
profile-based neural network. The starting temperature for
SA in this research is set at 600, the final temperature
at 0.20, the temperature cooling rate at 0.84, and the
number of iterations per temperature at 20 (Akkaladevi
et al. 2005). Using this approach the secondary structure
prediction accuracy (Q3) is 68.3%.
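
The quoted SA parameters determine the cooling schedule once a cooling rule is fixed. The sketch below assumes geometric cooling (each temperature is 0.84 times the previous one) and the standard Metropolis acceptance rule; neither is stated explicitly in the text, and the function names are hypothetical.

```python
import math
import random


def temperature_levels(start=600.0, final=0.20, rate=0.84):
    """Yield temperatures from start down to final (geometric cooling)."""
    t = start
    while t > final:
        yield t
        t *= rate


def accept(delta_q3, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept losses."""
    if delta_q3 >= 0:
        return True
    return rng.random() < math.exp(delta_q3 / temperature)


levels = list(temperature_levels())
moves = len(levels) * 20  # 20 iterations per temperature
```

Under these assumptions the schedule passes through 46 temperature levels, that is, 920 candidate moves per run. Early levels accept almost any move, while the last levels accept almost only improvements in Q3.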

Prediction of protein secondary structure using
the Committee method and the profile-based Neural
Network - In the committee-based method (Mazurov
et al. 1987) of applying decision fusion, the secondary
structure values are calculated using a combined
profile-based neural network (PNN) with GA, a combined
profile-based neural network with SA, and the independent
profile-based neural network. The output obtained
from the profile-based neural network, the combined
profile-based neural network plus GA and the combined
profile-based neural network plus SA is routed to the
decision fusion algorithm for fusing the solutions, as
shown in Figure 1 (Akkaladevi et al. 2005).

Figure 1. Fusing the various solutions according to the fusion rules

The decision fusion (Abidi et al. 1992) algorithm
works on the basis of a committee (committee method
or voting method), where each individual in the
committee decides on the best solution according to
pre-determined rules and then casts its vote for the
best approach (Mazurov et al. 1987). In the event of
a tie, the tie is broken by one more rule: the priority
assigned to each algorithm. The algorithm with the
highest priority wins. The Committee fusion algorithm
is outlined below:



1. Given a secondary structure output of n elements
obtained by the profile-based neural network, denote
its elements by N_i, where i = 1, 2, ..., n. (Here for
'H' we assume a value of 2, for 'E' a value of 3, and
for 'C' a value of 4. These are arbitrarily chosen
values.) Similarly, represent the outputs from GA and
SA by G_i and S_i, respectively.

2. Calculate the following values:

$$G = \sum_{i=1}^{n} (N_i - G_i)^2 \qquad (1)$$

$$S = \sum_{i=1}^{n} (N_i - S_i)^2 \qquad (2)$$

$$N = 0 \qquad (3)$$

3. Compute N_i - G_i. If N_i - G_i > 0, then
(bin+) <- N_i - G_i; else if N_i - G_i < 0, then
(bin-) <- N_i - G_i, where bin+ and bin- are the
so-called positive and negative bins. If the result of
the operation is zero, it is not stored in any of the
bins.

4. Evaluate bin+ and bin-, the positive and negative
bins for G; if they are equal or if the positive
bin has a higher count compared to the negative
bin, G is assigned a positive sign (+G), else G is
assigned a negative sign (-G). Always consider
N = 0.

5. Repeat steps 3 and 4 to calculate S.

6. Use max(N, G, S) to be the secondary structure for
calculating Q3, which is used to determine the error
for back-propagation for weight adjustments. Each
algorithm votes for the best solution by comparing
its value with the other algorithms' values. The
algorithm with the majority of votes wins the race.
In the event of a tie, the tie is broken according to
the algorithm's priority, and the algorithm which
wins calculates the prediction accuracy using
the function Q3 to determine the error that is to
be back-propagated to the profile-based neural
network for weight adjustments.

7. The profile-based neural network (PNN) secondary
structure values are assigned the highest
priority, followed by the combination of profile-based
neural network and GA (PNN+GA), and
then followed by the combination of profile-based
neural network and SA (PNN+SA) (Akkaladevi
et al. 2005). Using this approach the secondary
structure prediction accuracy (Q3) is 70.8%.
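
The committee steps above can be sketched in a few lines. This is one reading of the outline, not the authors' code: it assumes that the bins are compared by the number of entries they hold, that the winner is the candidate with the largest signed value among N, G and S, and that ties fall to the priority order PNN, PNN+GA, PNN+SA. All function names are hypothetical.

```python
CODES = {"H": 2, "E": 3, "C": 4}  # the arbitrary values given in step 1


def signed_score(network, other):
    """Steps 2 to 4: squared difference with the sign chosen by the bins."""
    diffs = [CODES[n] - CODES[o] for n, o in zip(network, other)]
    magnitude = sum(d * d for d in diffs)
    positive_bin = sum(d > 0 for d in diffs)  # zero differences are skipped
    negative_bin = sum(d < 0 for d in diffs)
    return magnitude if positive_bin >= negative_bin else -magnitude


def committee_winner(network, ga, sa):
    """Steps 5 to 7: pick max(N, G, S); ties go to the higher priority."""
    candidates = [
        ("PNN", 0),  # N is always taken as 0
        ("PNN+GA", signed_score(network, ga)),
        ("PNN+SA", signed_score(network, sa)),
    ]
    # max() keeps the first maximal item, so list order acts as priority.
    return max(candidates, key=lambda item: item[1])[0]


tie_case = committee_winner("HHEEC", "HHEEC", "CCEEC")
ga_case = committee_winner("HHEEC", "HHHHC", "CCEEC")
```

In the first call G is 0 and S is -8, so N and G tie and the network wins on priority; in the second call G is +2 and the GA output is selected.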

Prediction of protein secondary structure using
the Correlation method and the profile-based Neural
Network - This method is very similar to the committee
method but with some minor changes (Akkaladevi et
al. 2005; Ho et al. 1994). In this method the algorithm
that wins after decision fusion is applied is used to
calculate the prediction accuracy using the function Q3 to
determine the error that is to be back-propagated to the
profile-based neural network for weight adjustments.
After this adjustment of weights on the profile-based
neural network, the previous protein sequence is again
used for testing to check whether better prediction
accuracy is achieved. The new weights are used if we get
an improvement of more than 1.5%; otherwise, from the
previously calculated prediction accuracies of (PNN),
(PNN+GA) and (PNN+SA), the method which produces the
highest prediction accuracy is chosen to determine the
error that is to be back-propagated to the profile-based
neural network for weight adjustments (Akkaladevi et al.
2005). Using this approach the secondary structure
prediction accuracy (Q3) is 71.4%.



PREDICTION OF PROTEIN 
SECONDARY STRUCTURE BY THE 
BAYESIAN INFERENCE METHOD 

In this method, Bayesian inference is applied
to the output generated by the committee and
correlation methods of decision fusion (Anandalingam
et al. 1989; Schmidler et al. 2000). In the Bayesian
inference approach both these methods are used by
assigning a specific probability value to them, and then
generating a new value using the Bayesian equation
(Anandalingam et al. 1989; Simons et al. 1997). This
new value is used to decide which of the two
methods (committee method or correlation method)
is used for calculating the error that is to be
back-propagated to the profile-based neural network for
weight adjustments. The following Bayesian equation
is used to calculate the value for judging between the
two methods (Anandalingam et al. 1989):

$$P(H_1 \mid D) = \frac{P(H_1) \times P(D \mid H_1)}{P(H_1) \times P(D \mid H_1) + P(H_2) \times P(D \mid H_2)}$$

To illustrate, let $H_1$ correspond to the correlation
method, and $H_2$ to the committee method.
Since the correlation method produces better prediction
accuracy than the committee method, for our
first instance we assume that $P(H_1) = 0.51$ and
$P(H_2) = 0.49$ (assigning more probability to choosing
the correlation method).

For example, if we obtain a prediction accuracy of
71% using the correlation method and a prediction
accuracy of 70.5% using the committee method, then
$P(D \mid H_1) = 0.71$ and $P(D \mid H_2) = 0.705$. The
Bayesian equation then yields:

$$P(H_1 \mid D) = \frac{0.51 \times 0.71}{0.51 \times 0.71 + 0.49 \times 0.705} \approx 0.5118$$



If the probability obtained is greater than or equal to 
0.5, the correlation method is used for calculating the 
error that is to be back-propagated to the profile-based 
neural network for weight adjustments. 

For example, if we obtain a prediction accuracy of
69% using the correlation method and a prediction
accuracy of 72% using the committee method, then
$P(D \mid H_1) = 0.69$ and $P(D \mid H_2) = 0.72$. The
Bayesian equation then yields:

$$P(H_1 \mid D) = \frac{0.51 \times 0.69}{0.51 \times 0.69 + 0.49 \times 0.72} \approx 0.4994$$



If the probability obtained is less than 0.5, the com- 
mittee method is used for calculating the error to be 
back-propagated for weight adjustments. 
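
The selection rule in the two examples can be checked with a short sketch. The function names are hypothetical; the priors, the accuracies used as likelihoods and the 0.5 threshold are those given in the text.

```python
def posterior_correlation(prior_h1, accuracy_h1, accuracy_h2):
    """P(H1 | D) for two hypotheses whose priors sum to one."""
    prior_h2 = 1.0 - prior_h1
    numerator = prior_h1 * accuracy_h1
    return numerator / (numerator + prior_h2 * accuracy_h2)


def choose_method(prior_h1, accuracy_correlation, accuracy_committee):
    """Pick the method whose error is back-propagated."""
    p = posterior_correlation(prior_h1, accuracy_correlation,
                              accuracy_committee)
    return "correlation" if p >= 0.5 else "committee"


first = posterior_correlation(0.51, 0.71, 0.705)
second = posterior_correlation(0.51, 0.69, 0.72)
```

With a prior of 0.51 the correlation method is chosen unless the committee method is more accurate by a margin large enough to outweigh the prior, as in the second example.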

Similarly, this new approach is tested using various
values of probability for $P(H_1)$ and $P(H_2)$, always
choosing $P(H_1)$ greater than $P(H_2)$. From the several
test cases, it is concluded that the values of 0.506
for $P(H_1)$ and 0.494 for $P(H_2)$ produce the greatest
prediction accuracy. Using the Bayesian approach,
the prediction accuracy (Q3) obtained is 73.3%.
This method produces the highest protein secondary
structure prediction accuracy of all the
methods investigated in this research.



SIMULATION RESULTS 

The simulations are performed using code written in
Java on a 3.6 GHz Intel Pentium IV PC with
hyper-threading, running Microsoft Windows XP with 2 GB
of RAM and a 160 GB hard disk. A multi-threading
approach is used for running the GA and SA algorithms
and the decision fusion methods in parallel. Table 1
provides a summary of the prediction accuracies
achieved using the various approaches in this research.

It is evident from Table 1 that the Bayesian
inference method improves the prediction accuracy
by about 2 percentage points compared to the correlation
method (from 71.4% to 73.3%), and overall by 6.5
percentage points compared to the profile-based neural
network, which is a significant achievement.

Table 1. Comparison of prediction accuracy (Q3) for various approaches

Approach Used                                   Prediction Accuracy (Q3)
Profile-based Neural Network                    66.8%
Profile-based Neural Network & GA               69.2%
Profile-based Neural Network & SA               68.3%
Decision fusion (Committee method)
using Profile-based Neural Network              70.8%
Decision fusion (Correlation method)
using Profile-based Neural Network              71.4%
Bayesian Inference method                       73.3%



of using this approach is that it does not compromise the
advantages provided by either the committee or the
correlation method of decision fusion.



KEY TERMS

Bayesian Inference: Bayesian inference is statis- 
tical inference in which evidence or observations are 
used to update or to newly infer the probability that a 
hypothesis may be true. 

Decision Fusion: The process of combining clas- 
sifiers is called decision fusion. Results from different 
methods, algorithms, sources or classifiers can often 
be combined (fused) to give estimates of a better qual- 
ity than could be obtained from any of the individual 
sources alone. 

Genetic Algorithm: Genetic Algorithms (GAs) are
adaptive heuristic search algorithms premised on the
evolutionary ideas of natural selection and genetics. The
basic concept of GAs is designed to simulate processes
in natural systems necessary for evolution, specifically
those that follow the principles first laid down by Charles 
Darwin of survival of the fittest. As such they represent 
an intelligent exploitation of a random search within a 
defined search space to solve a problem. 

Neural Network: A Neural Network is an information
processing paradigm that is inspired by the way
biological nervous systems, such as the brain, process 
information. The key element of this paradigm is the 
novel structure of the information processing system. It 
is composed of a large number of highly interconnected 
processing elements (neurons) working in unison to 
solve specific problems. 

Profile-Based Neural Network: This type of 
neural network configuration results when we feed the 
multiple alignments in the form of a sequence profile 
(for each position an amino acid frequency vector is 
fed to the network) instead of a base sequence to a 
neural network. 

Protein: A large molecule composed of one or more
chains of amino acids in a specific order determined by 
the base sequence of nucleotides in the DNA coding 
for the protein. 

Secondary Structure: In biochemistry and struc- 
tural biology, secondary structure is the general three- 
dimensional form of local segments of biopolymers 
such as proteins and nucleic acids (DNA/RNA). 

Simulated Annealing Algorithm: Simulated annealing
(SA) is a generic probabilistic meta-algorithm
for the global optimization problem, namely locating a
good approximation to the global optimum of a given
function in a large search space.



INTRODUCTION

Bioinformatics has become an important tool to support
clinical and biological research, and the analysis
of functional data is a common task in bioinformatics
(Schleif, 2006). Gene analysis in the form of micro array
analysis (Schena, 1995) and protein analysis (Twyman,
2004) are the most important fields, leading to multiple
sub-omics disciplines like pharmacogenomics,
glycoproteomics or metabolomics. Measurements of such
studies are high dimensional functional data with few
samples for specific problems (Pusch, 2005). This leads
to new challenges in the data analysis. Spectra of mass
spectrometric measurements are such functional data
requiring an appropriate analysis (Schleif, 2006). Here
we focus on the determination of classification models
for such data. In general, the spectra are transformed
into a vector space followed by training a classifier
(Haykin, 1999). Hereby the functional nature of the data
is typically lost. We present a method which takes these
specific data aspects into account. A wavelet encoding
(Mallat, 1999) is applied to the spectral data, leading
to a compact functional representation. Subsequently
the Supervised Neural Gas classifier (Hammer, 2005)
is applied, which is capable of handling functional
metrics as introduced by Lee & Verleysen (Lee, 2005).
This allows the classifier to utilize the functional nature
of the data in the modelling process. The presented method
is applied to clinical proteome data showing good results
and can be used as a bioinformatics method for
biomarker discovery.



BACKGROUND

Applications of mass spectrometry (MS) in clinical
proteomics have gained tremendous visibility in the
scientific and clinical community (Villanueva, 2004)
(Ketterlinus, 2005). One major objective is the search
for potential classification models for cancer studies,
with strong requirements for validated signal patterns
(Ransohoff, 2005). Initial optimistic results as given in
(Petricoin, 2002) are now considered more carefully,
because the task of biomarker discovery
and appropriate data processing has been observed
to be more challenging than expected (Ransohoff,
2005). Consequently the main recent work in this field
is focusing on optimization and standardisation. This
includes the biochemical part (e.g. Baumann, 2005),
the measurement (Orchard, 2003) and the subsequent
data analysis (Morris, 2005) (Schleif, 2006).



PROTOTYPE BASED ANALYSIS IN 
CLINICAL PROTEOMICS 

Here we focus on classification models. A powerful
tool to achieve such models with high generalization
abilities is available with the prototype based Supervised
Neural Gas algorithm (SNG) (Villmann, 2002).
Like all nearest prototype classifier algorithms, SNG
relies heavily on the data metric d, usually the standard
Euclidean metric. For high-dimensional data as they
occur in proteomic patterns, this choice is not adequate
for two reasons: first, the functional nature of the
data should be kept as far as possible. Second, the noise
present in the data set accumulates and likely disrupts
the classification when taking a standard Euclidean
approach. A functional representation of the data with
respect to the used metric, and a weighting or pruning of
irrelevant function parts of the inputs (which are not
known a priori), would be desirable. We focus on a
functional distance measure as recently proposed in
(Lee, 2005), referred to as the functional metric.
Additionally a feature selection is applied based on a
statistical pre-analysis of the data. For this, a
discriminative data representation is necessary. The
extraction of such discriminant features is crucial for
spectral data and is typically done by a parametric peak
picking procedure (Schleif, 2006). This peak picking is
often a target of criticism, because peaks may be
insufficiently detected and the functional nature
of the data is partially lost. To avoid these difficulties
we focus on a wavelet encoding. The obtained wavelet
coefficients are sufficient to reconstruct the signal, still
containing all relevant information of the spectra, but
are typically more complex, and hence a robust data
analysis approach is needed. The paper is structured
as follows: first the bioinformatics methods are presented.
Subsequently the clinical data are described
and the introduced methods are applied in the analysis
of the proteome spectra. The introduced method aims
at replacing the classical three step procedure
of denoising, peak picking and feature extraction by
a compact wavelet encoding which gives a
more natural representation of the signal.



BIOINFORMATIC METHODS 

The classification of mass spectra generally involves two steps:
peak picking, to locate and quantify the positions of peaks, and
feature extraction from the obtained peak list. In the first step
a number of procedures such as baseline correction, denoising,
noise estimation and normalization are applied in advance. On
these prepared spectra the peaks have to be identified by
scanning all local maxima. The procedure of baseline correction
and recalibration (alignment) of multiple spectra is standard,
and has been done here using ClinPro Tools (Ketterlinus, 2006).
As an alternative we propose a feature extraction procedure that
preserves all (potentially small) peaks containing relevant
information by use of the discrete wavelet transformation (DWT).
The DWT has been computed using the Matlab Wavelet-Toolbox (see
http://www.mathworks.com). Due to the local analysis property of
wavelet analysis, the features can still be related back to the
original mass position in the spectral data, which is essential
for further biomarker analysis. For feature selection the
Kolmogorov-Smirnov test (KS-test) (Sachs, 2003) has been applied.
The test was used to identify features which show a significant
(p < 0.01) discrimination between the two groups (cancer,
control). A generalization to a multiclass experiment is given in
(Waagen, 2003). The reduced data set has then been further
processed by SNG to obtain a classification model with a small
ranked set of features. The whole procedure has been validated by
10-fold cross-validation.
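
The KS-based feature selection described above can be sketched as
follows. This is a minimal illustration, not the implementation
used in the study: the function names (ks_statistic, ks_pvalue,
select_features) are hypothetical, and the p-value is computed
from the asymptotic Kolmogorov distribution with a common
small-sample correction, which is an assumption made here because
the text does not say how the p-values were obtained.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # advance both samples past the current value x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution
    (assumption: the source does not specify the p-value computation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series is numerically 1 in this range
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


def select_features(group_a, group_b, alpha=0.01):
    """Indices of features whose distributions differ between the groups."""
    selected = []
    for j in range(len(group_a[0])):
        a = [sample[j] for sample in group_a]
        b = [sample[j] for sample in group_b]
        if ks_pvalue(ks_statistic(a, b), len(a), len(b)) < alpha:
            selected.append(j)
    return selected
```

Each spectrum is passed as a list of feature values (for example
wavelet coefficients); the returned indices are the features kept
for the classifier.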



WAVELET TRANSFORMATION IN MASS 
SPECTROMETRY 

Wavelets have been developed as powerful tools (Rieder, 1998) for
noise removal and data compression. The discrete version of the
continuous wavelet transform leads to the concept of a
multi-resolution analysis (MRA), which allows a fast and stable
wavelet analysis and synthesis. The analysis becomes more precise
if the wavelet shape is adapted to the signal to be analyzed. For
this reason one can apply the so-called bi-orthogonal wavelet
transform (Cohen, 1992), which uses two pairs of scaling and
wavelet functions: one for the decomposition/analysis and the
other for the reconstruction/synthesis, giving a higher degree of
freedom for the shape of the scaling and wavelet function. In our
analysis such a smooth synthesis pair was chosen. It can be
expected that a signal in the time domain can be represented by a
small number of coefficients out of a relatively large set of
coefficients in the wavelet domain. The spectra are reconstructed
depending on a chosen approximation level L of the MRA. The
denoised spectrum looks similar to the reconstruction depicted in
Figure 1.

One obtains approximation and detail coefficients (Cohen, 1992).
The approximation coefficients describe a generalized peak list,
encoding primal spectral information. For linear MALDI-TOF
spectra a device resolution of 500-800 Da can be expected. This
limits the minimal peak width in the spectrum, and hence the
reconstruction level of the wavelet analysis should be able to
model the corresponding peaks. A level L = 4 is appropriate for
our problem (see Figure 1). Applying this procedure, including
the KS-test, to the spectra with an initial number of 22306
measurement points per spectrum, one obtains 602 wavelet
coefficients used as representative features per spectrum, still
allowing a reliable functional representation of the data. The
coefficients were used to reconstruct the spectra and the final
functional representation of the signal.

Figure 1. Wavelet reconstruction of the spectra with L = 4, 5;
x: mass positions, y: arbitrary units. Original signal: solid
line. For L = 5 (right plot) the peak approximation is too rough.
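
The idea of keeping only the level-L approximation coefficients
can be illustrated with a short sketch. Note the substitution:
the study uses a bi-orthogonal wavelet from the Matlab Wavelet
Toolbox, whereas the code below swaps in the much simpler Haar
wavelet so that it stays self-contained. All function names are
hypothetical, and padding odd-length signals by repeating the
last sample is an assumption of this sketch.

```python
import math


def haar_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        # assumption: pad odd-length input by repeating the last sample
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_approximation(signal, level):
    """Approximation coefficients after `level` decomposition steps."""
    approx = list(signal)
    for _ in range(level):
        approx, _detail = haar_step(approx)
    return approx


def reconstruct_from_approximation(approx, level, length):
    """Inverse transform with all detail coefficients set to zero,
    i.e. a smoothed (denoised) version of the original signal."""
    s = math.sqrt(2.0)
    rec = list(approx)
    for _ in range(level):
        # each coefficient expands into two equal samples
        rec = [c / s for c in rec for _copy in (0, 1)]
    return rec[:length]
```

With this Haar stand-in, each level roughly halves the number of
coefficients, so a higher level L gives a coarser reconstruction,
in line with the behaviour described for Figure 1.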



PROTOTYPE CLASSIFIERS 

Supervised Neural Gas (SNG) is considered here as a
representative of the prototype based classification approaches
introduced by Kohonen (Kohonen, 1995). Different prototype
classifiers have been proposed so far (Kohonen, 1995) (Sato,
1996) (Hammer, 2005) (Villmann, 2002) as improvements of the
original approach. SNG has been introduced in (Villmann, 2002)
and combines ideas from the Neural Gas algorithm (NG) introduced
in (Martinetz, 1993) with the Generalized Learning Vector
Quantizer (GLVQ) given in (Sato, 1996).

Subsequently we give some basic notation and remarks on the
integration of alternative metrics into Supervised Neural Gas
(SNG). Details on SNG, including convergence proofs, can be found
in (Villmann, 2002). Let us first clarify some notation: let
$c_v \in L$ be the label of input $v$, where $L$ is a set of
labels (classes). Let $V \subseteq \mathbb{R}^{D_V}$ be a finite
set of inputs $v$. LVQ uses a fixed number of prototypes (weight
vectors, codebook vectors) for each class. Let $W = \{w_r\}$ be
the set of all codebook vectors and $c_r$ be the class label of
$w_r$. Furthermore, let $W_c = \{w_r \mid c_r = c\}$ be the
subset of prototypes assigned to class $c \in L$. The task of
vector quantization is realized by the map $\Psi$ as a
winner-take-all rule, i.e. a stimulus vector $v \in V$ is mapped
onto that prototype $s$ whose pointer $w_s$ is closest to the
presented stimulus vector $v$, measured by a distance
$d^\lambda(v,w)$. Here $d^\lambda(v,w)$ is an arbitrary
differentiable similarity measure which may depend on a parameter
vector $\lambda$. For the moment we take $\lambda$ as fixed. The
neuron $s(v)$ is called winner or best matching unit. If the
class information of the weight vectors is used, the above scheme
generates decision boundaries for the classes (details in
(Villmann, 2002)). A training algorithm should adapt the
prototypes such that for each class $c \in L$ the corresponding
codebook vectors $W_c$ represent the class as accurately as
possible. Detailed equations and the cost function for SNG are
given in (Villmann, 2002). Here it is sufficient to keep in mind
that in the cost function of SNG the distance measure can be
replaced by an arbitrary (differentiable) similarity measure,
which finally leads to new update formulas for the gradient
descent based prototype updates.
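
The winner-take-all mapping with an exchangeable distance can be
sketched as below. This only illustrates the classification step
of a nearest prototype classifier; it is not the SNG training
procedure, whose cost function and updates are given in
(Villmann, 2002). The function names and the scaled squared
Euclidean distance used as the default measure are assumptions
chosen for the example.

```python
def scaled_squared_euclidean(v, w, lam=None):
    """Example distance d(v, w) with optional per-dimension weights lam."""
    if lam is None:
        lam = [1.0] * len(v)
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, v, w))


def winner(v, prototypes, distance=scaled_squared_euclidean, lam=None):
    """Index s(v) of the prototype closest to v (best matching unit)."""
    return min(range(len(prototypes)),
               key=lambda r: distance(v, prototypes[r], lam))


def classify(v, prototypes, labels, distance=scaled_squared_euclidean,
             lam=None):
    """Assign v the class label of its winning prototype."""
    return labels[winner(v, prototypes, distance, lam)]
```

Because the distance is passed in as a function, a different
measure can be substituted without touching the mapping itself,
which mirrors the remark that the metric in SNG is exchangeable.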

Incorporation of a functional metric into SNG

As pointed out before, the similarity measure $d^\lambda(v,w)$ is
only required to be differentiable with respect to $\lambda$ and
$w$. The triangle inequality does not necessarily have to be
fulfilled (Hammer, 2005). This leads to great freedom in the
choice of suitable measures and allows the use of non-standard
metrics in a natural way. For spectral data, a functional metric
as given in (Lee, 2005) is more appropriate. The derivations
obtained there can be plugged into the SNG equations, leading to
SNG with a functional metric. Here the data are functions
represented by vectors and, hence, the vector dimensions are
spatially correlated. Common vector processing does not take this
spatial order of the coordinates into account. As a consequence,
the functional aspect of spectral data is lost. For proteome
spectra the order of signal features (peaks) is due to the nature
of the underlying biological samples and the measurement
procedure. The masses of the measured chemical compounds are
given in ascending order, so peaks encoding chemical structures
with a higher mass follow those of chemical structures with lower
masses. In addition, multiple peaks with different masses may
encode parts of the same chemical structure and, hence, are
correlated. Lee proposed an appropriate norm with a constant
sampling period $\tau$ (Lee, 2005), in which each coordinate
$x_k$ contributes two terms that are, respectively, the areas of
the triangles on the left and right sides of $x_k$. Just as for
$L_p$, the value of p is assumed to be a positive integer. At the
left and right ends of the sequence, $x_0$ and $x_D$ are assumed
to be equal to zero. The derivatives for the functional metric
taking p = 2 are given in (Lee, 2005). Now we consider the scaled
functional norm, where each dimension $v_i$ is scaled by a
parameter $\lambda_i \in (0, 1]$ and all $\lambda_i$ sum up to 1.
Using this parameterization one can emphasize or neglect
different parts of the function for classification.



ANALYSIS OF PROTEOMIC DATA 

The proposed data processing scheme is applied to clinical MS
spectra taken from a cancer study (45 cancer, 50 control
samples). Sample preparation and profile spectra analysis were
carried out using the CLINPROT system (Bruker Daltonik, Bremen,
Germany [BDAL]). The preprocessed set of spectra and the
corresponding wavelet coefficients are then analyzed using SNG
extended by a functional metric. We reconstructed the spectra
based upon the discriminative wavelet coefficients determined by
the Kolmogorov-Smirnov test as explained above and used the
corresponding intensities as features. We used all features for
the parameterized functional norm, i.e. all $\lambda_i = 1$. The
original signal with approx. 22000 sampling points was thus
processed with only 600 remaining points, which still encode the
significant parts of the signal relevant for discrimination
between the classes. The SNG classifier obtains a
cross-validation accuracy of 84% using the functional metric and
82% using the standard Euclidean metric. The results from the
wavelet processed spectra are slightly better than those obtained
with standard peak lists, which reach 81% cross-validation
accuracy.



FUTURE TRENDS 

The proposed method generates a compact but still complex
functional representation of the spectral data. While the bior3.7
wavelet gives promising results, they are still not optimal,
because signal oscillations lead to negative intensities in the
reconstruction. Further, the functional nature of the data
motivates the use of a functional data representation and
similarity calculation, but some encoded regions of the spectra
contain no meaningful biological information, only measurement
artefacts. In principle it should be possible to remove this
overlaying artificial function from the real signal. It could
also be interesting to incorporate additional knowledge about the
peak width, which increases along the mass axis.



CONCLUSION 

The presented interpretation of proteome data demonstrates that
functional analysis and model generation using SNG with a
functional metric, in combination with a wavelet based data
pre-processing, provides an easy and efficient way to obtain
classification models. The use of wavelet encoded spectra
features is especially helpful for detecting small differences
which may easily be ignored by standard approaches, and for
generating a significantly reduced number of points needed in
further processing steps. The signal need not be shrunk to peak
lists but can be preserved in its functional representation. SNG
was able to process high-dimensional functional data and shows
good regularization. By use of the Kolmogorov-Smirnov test we
found a ranking of the features related to mass positions in the
original spectrum, which allows the most relevant feature
dimensions to be identified and irrelevant regions of the
spectrum to be pruned. Alternatively, one could optimize the
scaling parameters of the functional norm directly during
classification learning by so-called relevance learning, as shown
in (Hammer, 2005) for the scaled Euclidean metric. In conclusion,
wavelet spectra encoding combined with SNG and a functional
metric is an interesting alternative to standard approaches. It
combines efficient model generation with automated data
pre-treatment and intuitive analysis.



KEY TERMS

Bioinformatics: Generic term for a research field as well as a
set of methods used in computational biology or medicine to
analyse multiple kinds of biological or clinical data. It
combines the disciplines of computer science, artificial
intelligence, applied mathematics, statistics, biology, chemistry
and engineering in the field of biology and medicine. Typical
research subjects are problem-adequate data pre-processing of
measured biological sample information (e.g. data cleaning,
alignments, feature extraction), supervised and unsupervised data
analysis (e.g. classification models, visualization, clustering,
biomarker discovery) and multiple kinds of modelling (e.g.
protein structure prediction, analysis of the expression of genes
and proteins, gene/protein regulation networks/interactions) for
one- or multidimensional data including time series. The most
common problem is the high dimensionality of the data combined
with the small number of samples, which in general makes standard
approaches (e.g. classical statistics) inapplicable.

Biomarker: Mainly in clinical research, one goal of experiments
is to determine patterns which are predictive for the presence or
prognosis of a disease state; such a pattern is frequently called
a biomarker. Biomarkers can be single or complex (pattern)
indicator variables taken from multiple measurements of a sample.
The ideal biomarker has a high sensitivity and specificity and is
reproducible (under standardized conditions) with respect to
control experiments in other labs. Further, it can be expected
that the marker vanishes or changes during a treatment of the
disease.



Clinical Proteomics: Proteomics is the field of research related
to the analysis of the proteome of an organism. Clinical
proteomics focuses on research mainly related to disease
prediction and prognosis in the clinical domain by means of
proteome analysis. Standard methods for proteome analysis are
provided by mass spectrometry.

Mass Spectrometry: An analytical technique used to measure the
mass-to-charge ratio of ions. In clinical proteomics, mass
spectrometry can be applied to extract fingerprints of samples
(such as blood, urine or bacterial extracts), whereby
semi-quantitative intensity differences between sample cohorts
may indicate biomarker candidates.

Prototype Classifiers: A specific kind of neural network, related
to the kNN classifier. The classification model consists of
so-called prototypes, which are representatives for a larger set
of data points. Classification is done by a nearest neighbour
classification using the prototypes. Nowadays prototype
classifiers can be found in multiple fields (robotics, character
recognition, signal processing or medical diagnosis), trained to
find (non)linear relationships in data.

Relevance Learning: A method, typically used in supervised
classification, to determine problem-specific metric parameters.
Depending on the metric and learning scheme used, univariate,
correlative and multivariate relations between data dimensions
can be analyzed. Relevance learning typically leads to
significantly improved, problem-adapted metric parameters and
classification models.

Wavelet Analysis: Method used in signal processing to analyse a
signal by means of frequency and local information. The signal is
encoded in a representation of wavelets, which are specific kinds
of mathematical functions. The wavelet encoding allows the
representation of the signal at different resolutions; the
coefficients contain frequency information but can also be
localized in the signal.



INTRODUCTION

Proposed in 1962, the Hough transform (HT) has been widely
applied and investigated for detecting curves, shapes, and
motions in the fields of image processing and computer vision.
However, the HT has several shortcomings, including high
computational cost, low detection accuracy, vulnerability to
noise, and the possibility of missing objects. For decades many
efforts have aimed at solving some of these problems, while the
key idea has remained more or less the same. Proposed in 1989 and
further developed thereafter, the Randomized Hough Transform
(RHT) manages to considerably overcome these shortcomings via
innovations in the fundamental mechanisms, with random sampling
in place of pixel scanning, converging mapping in place of
diverging mapping, and dynamic storage in place of an
accumulation array. This article provides an overview of advances
and applications of RHT in the past one and a half decades.



BACKGROUND 

Taking straight line detection as an example, the upper part of
Fig. 1 shows the key idea of the Hough Transform (HT) (Hough,
1962). A set of points on a line $y = kx + b$ in the image is
mapped into a set of lines passing through a point (k, b) in the
parameter space. A uniform grid is placed on a window in the
(k, b) space, with an accumulator a(k, b) at each bin. As each
point (x, y) in the image is mapped into a line in the (k, b)
space, every associated accumulator a(k, b) is incremented by 1.
We can detect lines by finding every accumulator whose score
a(k, b) is larger than a given threshold.

Figure 1. From Hough transform to randomized Hough transform
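
The grid accumulation just described can be sketched as follows.
This is a minimal illustration under stated assumptions, not a
reference implementation: the function names are hypothetical,
the slope axis is given as an explicit list of grid values, and
the intercept axis is a uniform grid starting at b_min with
spacing b_step.

```python
def hough_accumulate(points, k_grid, b_min, b_step, n_b):
    """Standard HT voting: each point votes along a line in (k, b) space."""
    acc = [[0] * n_b for _ in k_grid]
    for x, y in points:
        for i, k in enumerate(k_grid):
            # the point (x, y) lies on y = kx + b for b = y - kx
            j = round((y - k * x - b_min) / b_step)
            if 0 <= j < n_b:
                acc[i][j] += 1
    return acc


def detect_lines(acc, k_grid, b_min, b_step, threshold):
    """Return (k, b) for every accumulator whose score exceeds threshold."""
    return [(k_grid[i], b_min + j * b_step)
            for i, row in enumerate(acc)
            for j, score in enumerate(row)
            if score > threshold]
```

Every point touches one accumulator per slope value, which is the
one-to-many mapping whose cost and false votes are discussed in
the remainder of this section.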

The Hough Transform was brought to the atten- 
tion of the mainstream image processing community 
by Rosenfeld (1969). Then Duda and Hart (1972) not 
only introduced the polar parameterization technique for 
more efficient line detection, but also demonstrated how 
a circle can be detected. Kimme, Ballard and Sklansky 
(1975) made circular curve detection significantly more 
effective by using the gradient information of pixels. 
Merlin and Faber (1975) showed how the HT could 
be generalized to detect an arbitrary shape at a given 
orientation and a given scale. Ballard (1981) eventually 
generalized the HT to detect curves of a given arbitrary 
shape for any orientation and any scale. Since then, a 
lot of applications, variants and extensions of the HT 
have been published in the literature. A survey on these 
developments of the HT is given by Illingworth and
Kittler (1988).

However, the HT has several critical drawbacks
as follows:

a. All pixels are mapped, and every bin in the grid
needs an accumulator. If there are d parameters,
each represented by M bins or grid points, one
needs $M^d$ accumulators.

b. To reduce the computational cost, the quantization
resolution cannot be high, which blurs the peaks
and leads to low detection accuracy.

c. Each pixel activates every accumulator located on
a line, but only one of them represents the correct
line while all the others are disturbances.

d. If the grid window is set inappropriately, some
objects may be located outside the window and thus
cannot be detected.

e. Disturbing and noisy pixels cause many interfering
accumulations.

Many efforts have been made to alleviate these problems. Using
the gradient information of pixels is one of them. Another is
analyzing noise and error sensitivity (van Veen, 1981; Brown,
1983; Grimson & Huttenlocher, 1990). A third is the use of
hierarchical voting accumulation (Li, Lavin & LeMaster, 1986) or
multiresolution (Atiquzzaman, 1992). Yet another is improving the
effect of quantization through the use of kernels (Palmer,
Petrou, & Kittler, 1993) or error propagation analysis (Ji &
Haralick, 2001), as well as hypothesis testing (Princen,
Illingworth, & Kittler, 1994). However, none of these suggestions
offers any fundamental change to the key mechanisms of HT.

Proposed in 1989 and further investigated thereafter (Xu, Oja, &
Kultanen, 1990; Xu & Oja, 1993), the Randomized Hough Transform
(RHT) tackles the above problems by a fundamental innovation: the
one-to-many diverging mapping from the image space to the
parameter (accumulator) space, as shown in the upper part of
Fig. 1(a), is replaced by a many-to-one converging mapping, as
shown in the bottom part of Fig. 1(a). This fundamental change
further enables several joint improvements, such as random
sampling in place of pixel scanning, a small dynamic storage in
place of the array of $M^d$ accumulators, and an adaptive
detection in place of enumerating all the pixels and picking
those accumulators with scores larger than a threshold. As a
result, not only have time and storage complexity been reduced
significantly, but the detection accuracy has also been improved
considerably.

Subsequently, many studies have been made on RHT. On the one
hand, there are various real applications such as medical images
(Behrens, Rohr, & Siegfried, 2003), range images (Ding, et al,
2005), motion detection (Heikkonen, 1995), object tracking for a
mobile robot (Jean & Wu, 2004), soccer robots (Claudia, Rous, &
Kraiss, 2004), mine detection (Milisavljevic, 1999), and others
(Chutatape & Guo, 1999). On the other hand, there are also many
further developments of RHT, including an efficient
parameterization for ellipse detection (McLaughlin, 1998), an
extension to motion detection (Kalviainen, Oja, & Xu, 1991; Xu,
2007), the use of local gradient information, local connectivity
and neighbor-orientation for further improvements (Brailovsky,
1999; Kalviainen & Hirvonen, 1997), an integration with error
propagation analysis (Ji & Xie, 2003), a modification of random
sampling to importance sampling (Walsha & Raftery, 2002), and
others (Xu, 2007). Due to limited space, it is not possible to
provide a complete survey here. For an early review of RHT
variants, see (Kalviainen, Hirvonen, Xu, & Oja, 1995); for recent
elaborations on RHT, see (Xu, 2007).

It should also be mentioned that the literature on RHT often
includes studies under the name of probabilistic HT (Bergen &
Shvaytser, 1991; Kiryati, Eldar & Bruckstein, 1991), which also
suggests using random sampling to replace the scanning in the
implementation of the standard HT and thus shares one of the
previously mentioned RHT features. For the purpose of
understanding, little generality is lost by regarding it as a
degenerate case of RHT, though there are some differences in
detail.



BASIC RHT MECHANISMS AND 
CHARACTERISTICS 

As shown in Fig. 1, one pixel is mapped into all the points on a
line passing through (k, b) by the diverging mapping mechanism of
HT, which actually incurs the above drawbacks (a)-(e). RHT
replaces this mechanism with a converging mapping mechanism such
that two or more pixels are picked to jointly determine a line,
i.e., they are mapped into one point (k, b). By this mechanism,
different points on the same line y = kx + b will hit the same
point (k, b), without creating a great number of false
accumulations. Also, the feature of being mapped into one point
at a time makes it possible to construct accumulators
dynamically, with no need to lay a grid on a pre-specified
window. We only need to accumulate a(k, b) at those locations
activated by the converging mappings. Also, the quantization
resolution may vary for different locations, and each
quantization bin can be replaced by a kernel. As a result, the
drawbacks (b), (c), (d) no longer exist.
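
The converging mapping with dynamically created accumulators can
be sketched as below for straight lines. This is a simplified
illustration, not the full RHT: the function name rht_lines is
hypothetical, accumulators are merged by rounding (k, b) to a
fixed resolution (an assumption standing in for the neighbourhood
test), and vertical pairs are simply skipped because they cannot
be written as y = kx + b.

```python
import random
from collections import defaultdict


def rht_lines(points, trials, threshold, resolution=0.1, seed=0):
    """Sample pixel pairs, map each pair to one (k, b), count the hits."""
    rng = random.Random(seed)
    acc = defaultdict(int)  # accumulators exist only where pairs land
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair: no finite slope in this parameterization
        k = (y2 - y1) / (x2 - x1)
        b = y1 - k * x1
        key = (round(k / resolution), round(b / resolution))
        acc[key] += 1
    return [(kq * resolution, bq * resolution, hits)
            for (kq, bq), hits in acc.items() if hits > threshold]
```

Only the cells actually hit by a sampled pair are stored, so no
grid window has to be chosen in advance, in contrast to the array
of accumulators used by the standard HT.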

Without considering the quantization effect, if there is a line
consisting of n pixels on an image, we get a peak with n counts
in its accumulated scores. Assume that in its neighbourhood there
is another peak of a false line consisting of m < n pixels; then
the ratio n/m describes the signal/noise ratio of a reliable
detection by HT. In RHT, assuming that we exhaust all the
possible pairs of pixels, the voting count for the line will be
n(n-1)/2 while the voting count for the disturbing false line
will be m(m-1)/2, i.e., the signal/noise ratio becomes
$\frac{n(n-1)}{m(m-1)}$, that is, $\frac{n-1}{m-1}$ times
increased compared to HT. Thus, the above problem (e) can also be
significantly improved.

Table 1. Missing probability versus false alarm probability

THE DETECTING RULE
On an image that consists of N pixels, detect a point
$\theta \in \Theta$ as a line if it is hit more than $k_0$ times.

MISSING PROBABILITY
Consider a line consisting of n pixels. A trial of randomly
sampling two pixels has a probability
$$p_c = \frac{n(n-1)}{N(N-1)}$$
that both pixels come from the line, i.e., the line is
successfully hit in the parameter space with probability $p_c$.
After M trials, the number of successful hits is a variable $\xi$
following a binomial distribution,
$$P_c(\xi = k) = \binom{M}{k} p_c^k (1 - p_c)^{M-k},$$
see eqn. (4b) in (Xu & Oja, 1993). The risk of missing this line
by the detecting rule has the probability
$$\sum_{k=0}^{k_0} P_c(\xi = k);$$
controlling it below a pre-specified rate, we can determine a
lower bound $M > M_c$.

FALSE-ALARM PROBABILITY
Consider a false line consisting of m pixels. A trial of randomly
sampling two pixels has a probability
$$p_r = \frac{m(m-1)}{N(N-1)}$$
that both pixels come from the line, i.e., the line is
successfully hit in the parameter space with probability $p_r$.
After M trials, the number of times it is hit is a variable $\xi$
following a binomial distribution too,
$$P_r(\xi = k) = \binom{M}{k} p_r^k (1 - p_r)^{M-k}.$$
In this case, taking a point $\theta \in \Theta$ that is hit more
than $k_0$ times as a line carries the risk of taking a false
line as a solution, with the probability
$$\eta_r = \sum_{k=k_0+1}^{M} P_r(\xi = k);$$
controlling it below a pre-specified rate, we can determine an
upper bound $M < M_r$.

Randomized Hough Transform

In fact, it is not necessary to exhaust all the possible pairs of
pixels for RHT to detect lines. By randomly sampling two pixels
for a converging mapping, we only need a small fraction of all
the possible pairs to get the degree $\frac{n(n-1)}{m(m-1)}$ with
a high probability, which solves the above problem (a) with a
significant reduction in both time and space complexity. A more
precise explanation is given in Tab. 1. We detect a point
$\theta \in \Theta$ as a line if it is hit more than $k_0$ times,
with a small probability of missing this line. Controlling this
risk below a pre-specified rate, we only need to run $M > M_c$
trials. On the other hand, controlling the probability $\eta_r$
of taking a false line as a solution, we can determine an upper
bound $M < M_r$. Even if a line is falsely detected, it can later
be discarded by evaluating all the detected lines against the
actual pixels in the image. Thus, a large $\eta_r$ will not
affect the performance too much, but will only waste computing
time.
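
The two risks can be evaluated numerically from the binomial
model. The sketch below is only an illustration under that model:
the function names are hypothetical, and the linear search used
to find the smallest number of trials is an assumption about how
one might obtain the lower bound in practice.

```python
from math import comb


def hit_probability(n, N):
    """Probability that two randomly sampled pixels both lie on a
    line of n pixels in an image of N pixels."""
    return n * (n - 1) / (N * (N - 1))


def binom_pmf(k, M, p):
    """Probability of exactly k hits in M independent trials."""
    return comb(M, k) * p ** k * (1 - p) ** (M - k)


def missing_probability(n, N, M, k0):
    """Risk that a true line is hit at most k0 times in M trials."""
    p = hit_probability(n, N)
    return sum(binom_pmf(k, M, p) for k in range(k0 + 1))


def false_alarm_probability(m, N, M, k0):
    """Risk that a false line of m pixels is hit more than k0 times."""
    p = hit_probability(m, N)
    return 1.0 - sum(binom_pmf(k, M, p) for k in range(k0 + 1))


def min_trials(n, N, k0, rate, M_max=100000):
    """Smallest M whose missing probability falls below `rate`
    (simple linear search; returns None if M_max is not enough)."""
    for M in range(k0 + 1, M_max + 1):
        if missing_probability(n, N, M, k0) < rate:
            return M
    return None
```

Increasing M lowers the missing probability but raises the
false-alarm probability, which is why the text gives a lower and
an upper bound on the number of trials.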



RHT GENERAL FORM AND 
EXTENSIONS 

In general, RHT is applicable to a curve that can be expressed by
a parametric equation $f(x, y, \theta) = 0$ with a number k of
free parameters. Solving the joint equations
$f(x_i, y_i, \theta) = 0$, $i = 1, \ldots, k$ yields a converging
mapping into a point $\theta \in \Theta$. A general algorithmic
form is given in Tab. 2.

Table 2. The general RHT in algorithmic form

Given $k_0$ and computing $M_c$ as in Tab. 1. Let the set of
candidate curves $S_\theta$ be initially empty and set a
pre-specified number for the number of candidate curves.

Step 1: Randomly sample a number of pixels and implement a
converging mapping into a point $\theta_t \in \Theta$.

Step 2: Check whether there is already an accumulator $a(\theta)$
with $\theta = \theta_t$ or $\theta_t \in N_\theta$, where
$N_\theta$ denotes a neighbourhood of $\theta$:

• if yes, set $a(\theta^{new}) = a(\theta) + 1$,
$\theta^{new} = \alpha\theta + (1 - \alpha)\theta_t$,
$\alpha > 0$, and delete the old $a(\theta)$;

• otherwise, set $a(\theta^{new}) = 1$ with
$\theta^{new} = \theta_t$.

Step 3: Check all the accumulators; if there is one
$a(\theta) > k_0$, then put the corresponding $\theta$ into
$S_\theta$ as a candidate solution.

Step 4: If the number of candidate solutions in $S_\theta$ is
larger than the pre-specified number, examine every candidate
$\theta \in S_\theta$ to see whether there are enough image
pixels that can be reasonably expressed by this candidate:

• if yes, refine $\theta$ by these pixels as a confirmed
solution and then remove the pixels from the image;

• otherwise, simply discard this $\theta$.

Step 5: $t \leftarrow t + 1$; if $t > M_c$, then stop; otherwise
go to Step 1.

We can obtain variants and extensions by modifying one or more of
the first four steps in Tab. 2. First, the converging mapping in
Step 1 can be altered by varying either the way of getting
samples, or the way of computing $\theta \in \Theta$ from these
samples, or both. Instead of random sampling, samples can be
obtained by searching a candidate solution in S via local
connectivity and neighbor-orientation (Kalviainen, Hirvonen, Xu,
& Oja, 1995; Brailovsky, 1999; Kalviainen & Hirvonen, 1997) or by
importance sampling (Walsha & Raftery, 2002). Instead of solving
joint equations, as discussed in (Xu, 2007), a solution can also
be obtained by a least square fitting, an L norm fitting, or
maximum likelihood estimation. Sometimes one may even consider
under-constrained equations by taking fewer samples, from which a
parametric curve or surface in the parameter space is obtained to
implement an array based accumulation similar to HT.
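
As one illustration of computing the parameters by fitting rather
than by solving joint equations, the sketch below fits a line
y = kx + b to more than two sampled pixels by ordinary least
squares. The function name is hypothetical and the choice of
ordinary least squares on y is an assumption; the text does not
prescribe a particular fitting procedure.

```python
def fit_line_least_squares(points):
    """Least-squares estimate of (k, b) for y = kx + b from sampled pixels."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("vertical sample: not expressible as y = kx + b")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b
```

The resulting (k, b) can then be used as the mapped point in
Step 1 in the same way as a solution of the joint equations.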

Second, there are also alternatives for Step 2 and Step 3. One
extreme is returning to an array based accumulation. The other
extreme is that all the mapped points in the parameter space are
stored as they are, and either cluster analysis or kernel based
density estimation is performed on them to find cluster centres
and density degrees for detecting curves or objects. Between the
two extremes, we may consider a trade-off or a combination of
both (Xu, 2007). Third, Step 4 can also be performed with
different choices, including a $\delta$-band test, a fitting
error threshold, and hypothesis testing (Xu, 2007).

Moreover, instead of checking candidate solutions at every time
t, we can let the procedure run until $t = M_c$, put those
accumulators with $a(\theta) > k_0$ into S as candidate solutions
and examine these candidates at Step 4. Also, checking and
examining candidates can be done at pre-specified intervals.
Furthermore, gradient information in a grey image may also
improve the converging mapping.

Last but not least, RHT has also been extended
to detect objects by a template, as shown in Fig. 2.



FUTURE TRENDS 

Challenges to RHT mainly come from the effects of noise and
quantization. Two types of noise are shown in Fig. 3. The first
type, in Fig. 3(a), has disturbing pixels added but leaves the
original pixels unaffected. This noise type may reduce the
signal/noise ratio, resulting in more computing time and space.
However, the accuracy of the detected line will not be affected.
The second type, in Fig. 3(b), has some original pixels deviating
from the exact line. The quantization effect can be regarded as a
special case of this type, in which uniformly distributed noise
is added to the coordinates of pixels. The second type not only
reduces the signal/noise ratio but also makes the detected line
inaccurate. As yet, a systematic theoretical analysis of how the
solution accuracy is affected by this second type is lacking.
More importantly, theoretical guidance is lacking on how to
control the accuracy of detected curves and objects.

The tasks of detecting curves and objects can also be performed
from the perspective of mixture based learning, which is much
more robust in the case of the second type of noise (Xu, 2003;
Liu, Qiao, & Xu, 2006; Xu, 2007). Solving pattern recognition
tasks by machine learning approaches has been a popular trend
over the past decade and remains so. In fact, the machine
learning perspective is complementary to the perspective of
HT/RHT type evidence accumulation. A trend is integrating the
strengths of both.

Figure 2. Use a template to match a shape via translation $\mu$,
rotation $\phi$ and scaling $\lambda$

Each pixel u = (x, y) on the image relates to a corresponding
point v on the template via the scaling $\lambda$, the rotation
matrix $R(\phi)$ of angle $\phi$ and the translation $\mu$.

CONVERGING MAPPING: randomly sampling $m \ge 2$ pairs of (u, v),
with u from the image and v from the template, the parameters
$\lambda$, $\phi$, $\mu$ can be solved jointly from the
$m \ge 2$ matrix equations, one for each pair.

Alternatively, setting $\lambda = 1$, the motion of an object can
be detected by the displacement $\mu$ and the 2D rotation $\phi$
after a time interval.

Figure 3. Different effects by two types of noises



CONCLUSION 

This article provides not only a brief overview of nearly two
decades of developments and applications of RHT for detecting
curves, shapes, and motions, but also a tutorial and
re-elaboration on the basic mechanisms, variants, and
extensions of RHT, as well as the challenges and future
trends of RHT studies. Recently, a general problem solving
paradigm has been developed and implemented by integrating
five essential mechanisms (Xu, 2007). Not only can the
difference between the machine learning perspective and the
HT/RHT perspective be understood via the handling of two
coupled core tasks, namely amalgamating evidences and
discriminating differences, but different implementations of
these mechanisms and differences in a specific integration
may also bring new results and potential directions for
future studies.



ACKNOWLEDGMENT 

The work is supported by the Chang Jiang Scholars Program of
the Chinese Ministry of Education for the Chang Jiang Chair
Professorship at Peking University.



REFERENCES 

Atiquzzaman, M., (1992), Multiresolution Hough transform - an
efficient method of detecting patterns in images, IEEE
Transactions Pattern Analysis Machine Intelligence 14,
1090-1095.

Ballard, D.H., (1981), Generalizing the Hough transform to
detect arbitrary shapes, Pattern Recognition, 13(2), 111-122.

Behrens, T., Rohr, K., & Stiehl, H.S., (2003), Robust
Segmentation of Tubular Structures in 3D Medical Images by
Parametric Object Detection and Tracking, IEEE Transactions
on Systems, Man, and Cybernetics - Part B: Cybernetics,
33(4), 554-561.

Bergen, J.R., & Shvaytser, H., (1991), A probabilistic
algorithm for computing Hough transforms, Journal of
Algorithms 12, 639-656.

Brailovsky, V., (1999), Fast and robust techniques for
detecting straight line segments using local models, Pattern
Recognition Letters 20, 865-877.

Brown, C.M., (1983), Inherent bias and noise in the Hough
transform, IEEE Transactions Pattern Analysis Machine
Intelligence 5, 493-505.

Chutatape, O., & Guo, L., (1999), A modified Hough transform
for line detection and its performance, Pattern Recognition
32, 181-192.

Claudia, G., Rous, M., & Kraiss, K.F., (2004), Real Time
Adaptive Colour Segmentation for the RoboCup Middle Size,
RoboCup 2004, LNAI 3276, Springer, 402-410.

Ding, Y.H., et al, (2005), Range image segmentation based on
randomized Hough transform, Pattern Recognition Letters 26,
2033-2041.

Duda, R.O., & Hart, P.E., (1972), Use of the Hough transform
to detect lines and curves in pictures, Communications of the
ACM 15(1), 11-15.

Grimson, W.E.L., & Huttenlocher, D.P., (1990), On the
sensitivity of the Hough transform for object recognition,
IEEE Transactions Pattern Analysis Machine Intelligence 12,
255-274.

Heikkonen, J., (1995), Recovering 3D motion parameters from
optical flow field using randomized Hough transform, Pattern
Recognition Letters 15, 971-978.

Hough, P.V.C., (1962), Method and means for recognizing
complex patterns, U.S. Patent 3069654, Dec. 18, 1962.

Illingworth, J., & Kittler, J., (1988), A survey of the Hough
Transform, Computer Vision Graphics and Image Processing 43,
221-238.

Illingworth, J., & Kittler, J., (1987), The adaptive Hough
Transform, IEEE Transactions Pattern Analysis Machine
Intelligence 9, 690-698.

Kimme, C.D., Ballard, D.H., & Sklansky, J., (1975), Finding
circles by an array of accumulators, Communications of the
ACM 18(2), 120-122.

Jean, J.H., & Wu, T., (2004), Robust visual servo control of
a mobile robot for object tracking in shape parameter space,
43rd IEEE Decision & Control Conference, 4016-4021.

Ji, Q., & Xie, Y., (2003), Randomised Hough transform with
error propagation for line and circle detection, Pattern
Analysis and Application 6, 55-64.

Ji, Q., & Haralick, R.M., (2001), Error propagation for Hough
Transform, Pattern Recognition Letters 22, 813-823.

Kälviäinen, H., & Hirvonen, P., (1997), An extension to the
randomized Hough transform exploiting connectivity, Pattern
Recognition Letters, 18(1), 77-85.

Kälviäinen, H., Hirvonen, P., Xu, L., & Oja, E., (1995),
Probabilistic and nonprobabilistic Hough transforms: Overview
and comparison, Image Vision Computing 13, 239-252.

Kälviäinen, H., Oja, E., & Xu, L., (1991), Motion Detection
Using Randomized Hough Transform, Proceedings 7th
Scandinavian Conference on Image Analysis, 72-79.

Kiryati, N., Eldar, Y., & Bruckstein, A.M., (1991), A
probabilistic Hough transform, Pattern Recognition, 24(4),
303-316.

Li, Z., Lavin, M.A., & LeMaster, R.J., (1986), Fast Hough
transform: a hierarchical approach, Computer Vision,
Graphics, and Image Processing 36, 139-161.

Liu, Z.Y., Qiao, H., & Xu, L., (2006), Multisets Mixture
learning based Ellipse Detection, Pattern Recognition, 39,
731-735.

McLaughlin, R.A., (1998), Randomized Hough transform:
improved ellipse detection with comparison, Pattern
Recognition Letters 19(3-4), 299-305.

Milisavljevic, N., (1999), Comparison of three methods for
shape recognition in the case of mine detection, Pattern
Recognition Letters 20(11-13), 1079-1083.

Olson, C.F., (1999), Constrained Hough transforms for curve
detection, Computer Vision and Image Understanding, 73(3),
329-345.

Merlin, P.M., & Farber, D.J., (1975), A parallel mechanism
for detecting curves in pictures, IEEE Transactions Computer
24, 96-98.

Palmer, P.L., Petrou, M., & Kittler, J., (1993), A Hough
transform algorithm with a 2D hypothesis testing kernel,
Computer Vision, Graphics, and Image Processing: Image
Understanding 58(2), 221-234.

Princen, J., Illingworth, J., & Kittler, J., (1994),
Hypothesis testing: A framework for analyzing and optimizing
Hough transform performance, IEEE Transactions Pattern
Analysis Machine Intelligence 16(4), 329-341.

Risse, T., (1989), Hough Transformation for line recognition:
complexity of evidence accumulation and cluster detection,
Computer Vision Graphics and Image Processing 46, 327-345.

Rosenfeld, A., (1969), Picture Processing by Computer,
Academic Press, New York.

Shapiro, S.D., & Iannino, A., (1979), Geometric constructions
for predicting Hough transform performance, IEEE Transactions
Pattern Analysis Machine Intelligence 1(3), 310-317.

Walsh, D., & Raftery, A.E., (2002), Accurate and efficient
curve detection in images: the importance sampling Hough
transform, Pattern Recognition 35, 1421-1431.

van Veen, T.M., & Groen, F.C.A., (1981), Discretization
errors in the Hough transform, Pattern Recognition 14(1-6),
137-145.

Xu, L., (2007), A unified perspective and new results on RHT
computing, mixture based learning, and multi-learner based
problem solving, Pattern Recognition 40, 2129-2153.

Xu, L., (2003), Data smoothing regularization,
multi-sets-learning, and problem solving strategies, Neural
Networks 16, 817-825.

Xu, L., & Oja, E., (1993), Randomized Hough Transform (RHT):
Basic Mechanisms, Algorithms and Complexities, Computer
Vision, Graphics, and Image Processing: Image Understanding
57, 131-154.

Xu, L., Oja, E., & Kultanen, P., (1990), A New Curve
Detection Method: Randomized Hough transform (RHT), Pattern
Recognition Letters 11, 331-338.



KEY TERMS 

δ Band Test: A pixel is said to fall in the δ band of p (p
denotes a curve or surface) in the image space if the
shortest distance from this pixel to p is less than a
pre-specified threshold δ. Pixels falling in the δ band of p
are regarded as belonging to p, and a δ band test can be
designed according to these pixels.

Cluster Analysis: Instead of using an accumulation array, in
the case of a converging mapping every mapped point in $R^k$
is memorized. After a sufficient number of converging
mappings, we get a set of points on which cluster analysis
can be performed to find the clusters' centres (mean or
median).

Diverging Mapping vs. Converging Mapping: Given m pixels, a
set of under-constrained equations specifies a curve or
manifold of dimension $\geq k - m$ in $R^k$ if $m < k$. E.g.,
from a line $y = kx + b$ passing through a given pixel in the
image, we have a line $b = y - kx$ in $R^2$. This case is
called diverging mapping because m pixels are mapped
diversely to the $R^k$ space. On the other hand, if
$m \geq k$, a unique point in the $R^k$ space may be
determined by solving a set of joint equations, or by
optimizing a cost when the joint equations are
over-constrained, i.e., we have a converging mapping that
maps m pixels into one point in $R^k$.

Kernel Estimator: Every mapped point is memorized as the
centre of a kernel function, e.g., a bell-shaped function
such as a Gaussian. Collectively, the mapped points form a
density estimate for a multi-mode distribution, with each
mode in place of the above cluster centre.

Random Sampling: Given a set of N pixels, we take m pixels,
each picked randomly with probability 1/N. Repeating this
sampling a sufficient number of times, a global configuration
of the N pixels will emerge, without enumerating all the N
pixels.

Threshold Based Voting vs. Local Maxima Finding: Given a
pre-specified threshold, an accumulator in an array is picked
if it receives more votes than the threshold, without
considering any neighborhood. Finding a local maximum means
finding an accumulator with more votes than the accumulators
located in its neighborhood area.

Under-Constrained vs. Over-Constrained Equations: For a
parametric equation of k free parameters, we have a set of
under-constrained equations with $m < k$ pixels and a set of
over-constrained equations with $m > k$ pixels in a
non-degenerate way.
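
The key terms above (random sampling, converging mapping, threshold based voting and the δ band test) can be put together in a few lines. The following Python sketch is only an illustration for straight lines y = kx + b under simplifying assumptions: the function names, the accumulator resolution, the vote threshold and the demo data are all hypothetical choices made here, vertical lines are simply skipped, and the accumulator is a dictionary rather than an array.

```python
import random
from collections import Counter

def rht_lines(pixels, n_samples=2000, resolution=0.1, min_votes=50, seed=0):
    """Sketch of RHT for lines y = k*x + b (two free parameters)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Random sampling: pick m = 2 distinct pixels.
        (x1, y1), (x2, y2) = rng.sample(pixels, 2)
        if x1 == x2:
            continue  # vertical line, not representable as y = k*x + b
        # Converging mapping: two pixels determine one point (k, b).
        k = (y2 - y1) / (x2 - x1)
        b = y1 - k * x1
        # Quantize (k, b) to an accumulator cell and vote for it.
        votes[(round(k / resolution), round(b / resolution))] += 1
    # Threshold based voting: keep cells with at least min_votes votes,
    # most voted first.
    return [(ck * resolution, cb * resolution, v)
            for (ck, cb), v in votes.most_common() if v >= min_votes]

def in_delta_band(pixel, k, b, delta):
    """Delta band test: is the pixel closer than delta to the line?"""
    x, y = pixel
    return abs(k * x - y + b) / (k * k + 1) ** 0.5 < delta

# Hypothetical demo data: 30 pixels on y = 2x + 1 plus 20 disturbing pixels
# (the first type of noise, which leaves the original pixels unaffected).
_noise = random.Random(1)
demo_line = [(x, 2 * x + 1) for x in range(30)]
demo_noise = [(_noise.uniform(0, 30), _noise.uniform(0, 60)) for _ in range(20)]
demo_found = rht_lines(demo_line + demo_noise)
```

Using a dictionary keyed by quantized parameters means that memory is only spent on cells that actually receive votes, which is one reason a converging mapping needs far less storage than a full accumulator array.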

Ranking Functions

INTRODUCTION

Ranking functions were introduced under the name of ordinal
conditional functions in Spohn (1988; 1990). They are
representations of epistemic states and their dynamics. The
most comprehensive and up-to-date presentation is Spohn
(manuscript).



BACKGROUND 

The literature on knowledge, belief, and uncertainty in
artificial intelligence is divided into two broad classes. In
epistemic logic (Hintikka 1961, Halpern & Fagin & Moses &
Vardi 1995), belief revision theory (Alchourrón & Gärdenfors
& Makinson 1985, Gärdenfors 1988, Rott 2001), and
nonmonotonic reasoning (Kraus & Lehmann & Magidor 1990,
Makinson 2005), qualitative approaches are used to represent
the epistemic state of an agent. In probability theory (Pearl
1988, Jeffrey 2004) and alternatives (Dempster 1968, Shafer
1976, Dubois & Prade 1988), epistemic states are represented
quantitatively as degrees of belief rather than yes-or-no
beliefs (see Halpern 2003 for an overview). One of the
distinctive features of ranking functions is that they are
quantitative, but nevertheless induce a notion of yes-or-no
belief that satisfies the standard requirements of
rationality, viz. consistency and deductive closure.



RANKING FUNCTIONS

Let W be a non-empty set of possibilities or worlds, and let
$\mathcal{A}$ be a field of propositions over W. That is,
$\mathcal{A}$ is a set of subsets of W that includes the
empty set ($\emptyset \in \mathcal{A}$) and is closed under
complementation with respect to W (if $A \in \mathcal{A}$,
then $W \setminus A \in \mathcal{A}$) and finite intersection
(if $A \in \mathcal{A}$ and $B \in \mathcal{A}$, then
$A \cap B \in \mathcal{A}$). A function p from the field
$\mathcal{A}$ over W into the natural numbers $\mathbb{N}$
extended by $\infty$,
$p: \mathcal{A} \to \mathbb{N} \cup \{\infty\}$, is a
(finitely minimitive) ranking function on $\mathcal{A}$ if
and only if for all propositions A, B in $\mathcal{A}$:

1. $p(W) = 0$

2. $p(\emptyset) = \infty$

3. $p(A \cup B) = \min\{p(A), p(B)\}$

If the field of propositions $\mathcal{A}$ is closed under
countable intersection (if $A_1 \in \mathcal{A}, \ldots,
A_n \in \mathcal{A}, \ldots$, $n \in \mathbb{N}$, then
$A_1 \cap \ldots \cap A_n \cap \ldots \in \mathcal{A}$) so
that $\mathcal{A}$ is a σ-field, a ranking function p on
$\mathcal{A}$ is countably minimitive if and only if it holds
for all propositions $A_1 \in \mathcal{A}, \ldots,
A_n \in \mathcal{A}, \ldots$:

4. $p(A_1 \cup \ldots \cup A_n \cup \ldots) = \min\{p(A_1), \ldots, p(A_n), \ldots\}$

If the field of propositions $\mathcal{A}$ is closed under
arbitrary intersection (if $\mathcal{B} \subseteq
\mathcal{A}$, then $\bigcap \mathcal{B} \in \mathcal{A}$) so
that $\mathcal{A}$ is a γ-field, a ranking function p on
$\mathcal{A}$ is completely minimitive if and only if it
holds for all sets of propositions
$\mathcal{B} \subseteq \mathcal{A}$:

5. $p(\bigcup \mathcal{B}) = \min\{p(A) : A \in \mathcal{B}\}$

A ranking function p on $\mathcal{A}$ is regular just in case
$p(A) < \infty$ for each non-empty or consistent proposition
A in $\mathcal{A}$.

The conditional ranking function $p(\cdot \mid \cdot):
\mathcal{A} \times \mathcal{A} \to \mathbb{N} \cup
\{\infty\}$ based on the ranking function p on $\mathcal{A}$
is defined such that for all propositions A, B in
$\mathcal{A}$:

6. $p(A \mid B) = p(A \cap B) - p(B)$ if $A \neq \emptyset$, and $p(\emptyset \mid B) = \infty$

$p(\cdot \mid B)$ is a ranking function on $\mathcal{A}$, for
each proposition B in $\mathcal{A}$.

A function $\kappa$ from the set of worlds W into the natural
numbers $\mathbb{N}$, $\kappa: W \to \mathbb{N}$, is a
pointwise ranking function on W if and only if
$\kappa(w) = 0$ for at least one world w in W. Each pointwise
ranking function $\kappa$ on W induces a regular and
completely minimitive ranking function $p_\kappa$ on every
field of propositions $\mathcal{A}$ over W by defining

7. $p_\kappa(A) = \min\{\kappa(w) : w \in A\}$ ($= \infty$ if $A = \emptyset$)



Huber (2006) discusses under which conditions a 
ranking function on a field of propositions A induces 
a pointwise ranking function on the underlying set of 
worlds W. 
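
On a small finite set of worlds the definitions of this section can be checked mechanically. The following Python sketch is an illustration only: the names `rank`, `conditional_rank` and `propositions`, the three worlds and their pointwise ranks are assumptions made for the example. It builds the ranking function induced by a pointwise ranking function (formula 7) and computes conditional ranks (formula 6). Conditioning on a proposition of infinite rank is rejected, a case the text above does not discuss.

```python
import math
from itertools import combinations

INF = math.inf

def rank(kappa, A):
    """Rank of proposition A (a set of worlds): the minimum pointwise rank
    of its worlds, and infinity for the empty proposition."""
    return min((kappa[w] for w in A), default=INF)

def conditional_rank(kappa, A, B):
    """p(A|B) = p(A & B) - p(B) for non-empty A, infinity for empty A.
    Assumption of this sketch: B must have finite rank."""
    if not A:
        return INF
    rank_b = rank(kappa, B)
    if rank_b == INF:
        raise ValueError("conditioning proposition has infinite rank")
    return rank(kappa, set(A) & set(B)) - rank_b

def propositions(worlds):
    """All subsets of a finite set of worlds (the power set as the field)."""
    ws = sorted(worlds)
    return [frozenset(c) for r in range(len(ws) + 1)
            for c in combinations(ws, r)]

# Hypothetical pointwise ranking function: at least one world has rank 0.
kappa = {"w1": 0, "w2": 1, "w3": 3}
W = frozenset(kappa)
```

With these values, `rank(kappa, W)` is 0, the empty proposition has rank infinity, and the rank of a union is the minimum of the ranks of its parts, as required by conditions 1 to 3.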

The rank of a proposition A, $p(A)$, represents the degree to
which an agent with ranking function p disbelieves A. If
$p(A) = 0$, the agent does not disbelieve A. However, this
does not mean that she believes A. She may well suspend
judgment and disbelieve neither A nor its complement or
negation $W \setminus A$ (in this case
$p(A) = p(W \setminus A) = 0$). Rather, belief in a
proposition is characterized by disbelief in its negation: an
agent with ranking function
$p: \mathcal{A} \to \mathbb{N} \cup \{\infty\}$ believes
$A \in \mathcal{A}$ if and only if $p(W \setminus A) > 0$.
The belief set $Bel_p$ of an agent with ranking function p is
the set of all propositions she believes:

$Bel_p = \{A \in \mathcal{A} : p(W \setminus A) > 0\}$

The axioms of ranking theory require an agent not to
disbelieve both a proposition and its negation, i.e. at least
one of A, $W \setminus A$ has to be assigned rank 0. Thus an
agent with ranking function p believes $A \in \mathcal{A}$ if
and only if $p(W \setminus A) > p(A)$. For a given p, this
suggests defining the belief function induced by p,
$\beta_p: \mathcal{A} \to \mathbb{Z} \cup \{\pm\infty\}$,
such that for all propositions A in $\mathcal{A}$:

$\beta_p(A) = p(W \setminus A) - p(A)$

$\beta_p$ assigns positive numbers to the propositions that
are believed, negative numbers to the propositions that are
disbelieved, and 0 to those propositions and their negations
with respect to which the agent suspends judgment. As a
consequence,

$Bel_p = \{A \in \mathcal{A} : \beta_p(A) > 0\}$

$Bel_p$ is consistent and deductively closed in the finite
sense, for every ranking function p on $\mathcal{A}$. That
is, $\bigcap \mathcal{B} \neq \emptyset$ for every finite
$\mathcal{B} \subseteq Bel_p$; and $A \in Bel_p$ if there is
a finite $\mathcal{B} \subseteq Bel_p$ such that
$\bigcap \mathcal{B} \subseteq A$, for any
$A \in \mathcal{A}$. If p is countably/completely minimitive,
$Bel_p$ is consistent and deductively closed in the following
countable/complete sense: $\bigcap \mathcal{B} \neq
\emptyset$ for every countable/arbitrary
$\mathcal{B} \subseteq Bel_p$; and $A \in Bel_p$ if there is
a countable/arbitrary $\mathcal{B} \subseteq Bel_p$ such that
$\bigcap \mathcal{B} \subseteq A$, for any
$A \in \mathcal{A}$. As will be seen below, from a diachronic
point of view the converse is true as well. However, first we
have to discuss how an epistemic agent is to update her
ranking function when she learns new information.



UPDATE RULES

A theory of epistemic states is incomplete if it does not
account for the way the epistemic states are updated when the
agent receives new information. As there are different
formats in which the agent may receive new information, there
are different update rules. The simplest and most unrealistic
case is that of the agent becoming certain of a new
proposition. This case is covered by

Plain Conditionalization

If the agent's epistemic state at time t is represented by
the ranking function p on $\mathcal{A}$, and if between t and
t' the agent becomes certain of the proposition
$E \in \mathcal{A}$ and of no logically stronger proposition
$E^+ \subset E$, $E^+ \in \mathcal{A}$, then the agent's
epistemic state at time t' should be represented by the
ranking function $p' = p(\cdot \mid E)$ on $\mathcal{A}$.
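
Belief sets and Plain Conditionalization can be traced on a finite example. The Python sketch below is an illustration under stated assumptions: the helper names, the three worlds and their ranks are invented for the example, and the ranking function is represented by pointwise ranks on worlds. After conditionalization the worlds outside E receive rank infinity, which goes beyond the definition of a pointwise ranking function given above (values in the natural numbers) and is used here only as an implementation convenience.

```python
import math
from itertools import combinations

INF = math.inf

def rank(kappa, A):
    """Induced rank of a set of worlds; infinity for the empty set."""
    return min((kappa[w] for w in A), default=INF)

def belief_degree(kappa, A):
    """Belief function: rank of the complement of A minus rank of A."""
    W = frozenset(kappa)
    return rank(kappa, W - frozenset(A)) - rank(kappa, A)

def belief_set(kappa):
    """All propositions whose complement has a positive rank."""
    W = frozenset(kappa)
    ws = sorted(W)
    props = (frozenset(c) for r in range(len(ws) + 1)
             for c in combinations(ws, r))
    return {A for A in props if rank(kappa, W - A) > 0}

def plain_conditionalize(kappa, E):
    """Pointwise version of p' = p(.|E): shift the worlds in E so that the
    best of them gets rank 0, and rule out all worlds outside E."""
    rank_e = rank(kappa, E)
    if rank_e == INF:
        raise ValueError("cannot become certain of a proposition of rank infinity")
    return {w: (kappa[w] - rank_e if w in E else INF) for w in kappa}

# Hypothetical agent: world "a" is not disbelieved, "b" and "c" are.
kappa = {"a": 0, "b": 1, "c": 2}
kappa_after = plain_conditionalize(kappa, {"b", "c"})
```

Before the update the agent believes exactly the propositions containing world "a"; after becoming certain of {"b", "c"} she believes exactly the propositions containing "b". In both cases the believed propositions have a non-empty intersection, in line with the consistency claim above.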

We usually do not learn by becoming certain of a proposition,
though. In most cases the new information merely changes the
strength of our beliefs in various propositions. This is
illustrated by a variation of an example due to Jeffrey
(1983). Let our agent be interested in the color of the
carpet of her hotel room. At time t, before checking in, she
neither believes nor disbelieves any of the following three
hypotheses: the carpet is beige (beige), the carpet is brown
(brown), the carpet is black (black). However, she is certain
that the carpet is either beige or brown or black. The
relevant part of her ranking function at time t thus looks as
follows: p(beige) = p(not beige) = p(brown) = p(not brown) =
p(black) = p(not black) = p(beige or brown or black) = 0,
p(neither beige nor brown nor black) = ∞.

At time t', after checking in and when opening the door to
her room, it appears to the agent that the carpet is rather
dark. As a consequence she now believes that the carpet is
either brown or black. But since it is late at night, the
curtains are closed, and she has not turned on the light yet,
she cannot tell whether the carpet is brown or black. Her
ranks for the relevant propositions thus change to the
following values: p'(beige) = 1, p'(not beige) = p'(brown) =
p'(not brown) = p'(black) = p'(not black) = 0. (Since black
entails not brown and p'(black) = 0, the rank of not brown
cannot exceed 0, and similarly for not black.)

A change in the strength of the agent's beliefs about 
the color of the carpet will affect the strength of her 
beliefs about the color of, say, the furniture in the hotel 
room. For instance, at time t, our agent is pretty confi- 
dent that the hotel room does not have dark furniture 
if the carpet is brown - and similarly if the carpet is 
black. She is also pretty confident that the hotel room 
has dark furniture if the carpet is beige. The relevant 
part of her ranking function at time t looks as follows: 
p(dark | brown) = p(dark | black) = 3, p(dark | beige) = 0.
This implies that, at time t, the agent neither believes the
furniture is dark nor that it is not dark, p(dark) =
p(not dark) = 0.

The important question now is how the agent should 
update the rest of her ranking function (including the 
ranks for the propositions about the color of the furni- 
ture) when her ranks for the propositions about the color 
of the carpet change as specified above. The answer, 
already formulated in Spohn (1988), is given by 

Spohn Conditionalization 

If the agent's epistemic state at time t is represented by
the ranking function p on $\mathcal{A}$, and if between t and
t' the agent's ranks on the partition
$\{E_i \in \mathcal{A} : i \in I\}$ change to
$n_i \in \mathbb{N} \cup \{\infty\}$ with
$\min_i\{n_i\} = 0$ ($n_i = 0$ if $E_i = W$ and
$n_i = \infty$ if $E_i = \emptyset$), and the agent's finite
ranks change on no finer partition, then the agent's
epistemic state at time t' should be represented by the
ranking function
$p' = \min\{p(\cdot \mid E_1) + n_1, \ldots, p(\cdot \mid E_i) + n_i, \ldots\}$
on $\mathcal{A}$.

Applied to our example this means that, at time t', the
agent's rank for the proposition that the furniture is dark
should be p'(dark) = min{p(dark | beige) + 1,
p(dark | brown) + 0, p(dark | black) + 0} = 1. That is, at
time t', the agent believes, if only very weakly, that the
furniture is not dark.
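
The carpet example can be recomputed step by step. The Python sketch below is an illustration, not a definitive implementation: worlds are modelled as (carpet, furniture) pairs, the function names are invented here, and the value 3 for p(not dark | beige) is an assumption added to complete the example (the text only says the agent is pretty confident of dark furniture given a beige carpet). Spohn Conditionalization is applied pointwise: within each cell of the partition, ranks are shifted so that the cell receives its new rank.

```python
import math

INF = math.inf

def rank(kappa, A):
    """Induced rank of a set of worlds; infinity for the empty set."""
    return min((kappa[w] for w in A), default=INF)

def spohn_conditionalize(kappa, partition, new_ranks):
    """Pointwise form of p' = min_i { p(.|E_i) + n_i }.
    Assumption of this sketch: every cell has a finite prior rank."""
    updated = {}
    for cell, n in zip(partition, new_ranks):
        old = rank(kappa, cell)
        if old == INF:
            raise ValueError("cell with infinite prior rank is not handled")
        for w in cell:
            updated[w] = kappa[w] - old + n
    return updated

# Ranks at time t: p(dark|brown) = p(dark|black) = 3, p(dark|beige) = 0.
# The rank 3 for ("beige", "light") is a hypothetical choice.
kappa = {
    ("beige", "dark"): 0, ("beige", "light"): 3,
    ("brown", "dark"): 3, ("brown", "light"): 0,
    ("black", "dark"): 3, ("black", "light"): 0,
}
carpets = ("beige", "brown", "black")
partition = [{w for w in kappa if w[0] == c} for c in carpets]
dark = {w for w in kappa if w[1] == "dark"}
light = {w for w in kappa if w[1] == "light"}

# At time t' the ranks on the partition change to 1, 0, 0.
kappa_new = spohn_conditionalize(kappa, partition, [1, 0, 0])
```

Here `rank(kappa, dark)` and `rank(kappa, light)` are both 0 at time t, while `rank(kappa_new, dark)` is 1 and `rank(kappa_new, light)` is 0 at time t', which reproduces the weak belief that the furniture is not dark.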

Spohn Conditionalization covers Plain Condition- 
alization as a special case. Shenoy (1991) presents an 
update rule for evidence of a still different format. 



JUSTIFICATION 

Ranking theory tells an epistemic agent how to orga- 
nize her beliefs, and how to update her beliefs when 
she receives new information of various formats. Why 
should the agent follow those prescriptions? 

The answer to this question requires a bit of terminol- 
ogy. An agent's degree of entrenchment for the proposi- 
tion A is the number of information sources providing 
the information A that it takes for the agent to give up 
her disbelief in A. If the agent does not disbelieve A to 
begin with, her degree of entrenchment for A is 0. If 
no finite number of information sources providing the 
information A makes the agent give up her disbelief in 
A, her degree of entrenchment for A is oo. 

Degrees of entrenchment are used to measure an epistemic
agent's degrees of disbelief. If you want to measure my
degree of disbelief for the proposition that Madrid is the
capital of Spain, you put me on a busy plaza in the center of
Madrid and count the number of people passing by and telling
me that Madrid is the capital of Spain. My degree of
entrenchment for the proposition that Madrid is the capital
of Spain equals n just in case I stop disbelieving that
Madrid is the capital of Spain after n people have passed by
and told me it is, provided all those people are independent
and equally reliable, indeed minimally positively reliable.
Most people (and certainly all people in Madrid) are more
than minimally positively reliable, though. An agent's degree
of disbelief in A is therefore defined as the number of
information sources providing the information A that it would
take for the agent to give up her disbelief that A if those
information sources were independent and minimally positively
reliable.

Now we can explain why an agent's degrees of 
disbelief should obey the ranking calculus and thus be 
ranks, and why she should update her ranks according 
to Spohn Conditionalization. She should do so because 
doing so is necessary and sufficient for her to always 
have consistent and deductively closed beliefs. More 
precisely, Huber (2007) proves the following. 

Consistency Theorem 

An agent's belief set is and will always be consistent and
deductively closed in the finite/countable/complete sense
(and possibly conditional on some evidential proposition) if
and only if this agent's degree of disbelief function is a
finitely/countably/completely minimitive ranking function and
the agent updates according to Plain/Spohn/Shenoy
Conditionalization when she receives information of the
appropriate format.

Seen this way, the axioms and update rules of 
ranking theory are nothing but a diachronic version 
of consistency and deductive closure. 



FUTURE TRENDS 

One question in artificial intelligence is how an agent 
should update her epistemic state if she learns new 
conceptual information without also learning anything 
factual about the world she lives in. There are several 
ways in which such a conceptual change may occur. The agent
may learn a new concept, as when an enological ignoramus
learns the concept barrique. Or the agent may learn that she
has omitted a possibility from her set of worlds, as when an
enological ignoramus learns that there are rosé wines besides
red and white wines. All these conceptual changes involve the
adoption of a new set of worlds W and, consequently, a new
field of propositions A on the side of the agent. None of
these conceptual changes seems to be adequately modeled by
any of the formalisms mentioned at the beginning. Ranking
theory is able to adequately model those conceptual changes
by employing the so-called ur or tabula rasa ranking, i.e.
that ranking function that assigns rank 0 to every
proposition. If the agent adds new possibilities to her set
of worlds she should simply assign rank 0 to all those new
possibilities. The same holds if the agent replaces the old
worlds by richer worlds. Huber (2009) discusses this and
other future trends.

CONCLUSION 

Ranking functions are an indispensable tool for artifi- 
cial intelligence. First, they seem to adequately model 
most if not all of those phenomena that are dealt with 
in both qualitative as well as quantitative approaches 
to uncertainty. Second, they provide a link between 
these two classes of approaches that has been miss- 
ing so far. Third, they can deal with phenomena that 
neither qualitative nor quantitative approaches seem 
to be able to deal with. 

KEY TERMS

Belief: An agent with ranking function
$p: \mathcal{A} \to \mathbb{N} \cup \{\infty\}$ believes A if
and only if $p(W \setminus A) > 0$; equivalently, if and only
if $p(W \setminus A) > p(A)$.

Belief Set: The belief set of an agent with ranking function
$p: \mathcal{A} \to \mathbb{N} \cup \{\infty\}$ is the set of
propositions the agent believes,
$Bel_p = \{A \in \mathcal{A} : p(W \setminus A) > 0\}$.

Conditional Ranking Function: The conditional ranking
function $p(\cdot \mid \cdot): \mathcal{A} \times \mathcal{A}
\to \mathbb{N} \cup \{\infty\}$ based on the ranking function
p on $\mathcal{A}$ is defined such that for all propositions
A, B in $\mathcal{A}$: $p(A \mid B) = p(A \cap B) - p(B)$ if
$A \neq \emptyset$, and $p(\emptyset \mid B) = \infty$.

Completely Minimitive Ranking Function: A ranking function p
on a γ-field of propositions $\mathcal{A}$ is completely
minimitive if and only if
$p(\bigcup \mathcal{B}) = \min\{p(A) : A \in \mathcal{B}\}$
for each set of propositions
$\mathcal{B} \subseteq \mathcal{A}$.

Countably Minimitive Ranking Function: A ranking function p
on a σ-field of propositions $\mathcal{A}$ is countably
minimitive if and only if
$p(A_1 \cup \ldots \cup A_n \cup \ldots) = \min\{p(A_1), \ldots, p(A_n), \ldots\}$
for all propositions
$A_1 \in \mathcal{A}, \ldots, A_n \in \mathcal{A}, \ldots$

Pointwise Ranking Function: A function $\kappa$ from the set
of worlds W into the natural numbers $\mathbb{N}$,
$\kappa: W \to \mathbb{N}$, is a pointwise ranking function
on W if and only if $\kappa(w) = 0$ for at least one world w
in W.

Degree of Disbelief: An agent's degree of disbelief in the
proposition A is the number of information sources providing
the information A that it would take for the agent to give up
her disbelief that A if those information sources were
independent and minimally positively reliable.

Degree of Entrenchment: An agent's degree of entrenchment for
the proposition A is the number of information sources
providing the information A that it takes for the agent to
give up her disbelief in A.

Ranking Function: A function p on a field of propositions
$\mathcal{A}$ over a set of worlds W into the natural numbers
extended by $\infty$,
$p: \mathcal{A} \to \mathbb{N} \cup \{\infty\}$, is a
(finitely minimitive) ranking function on $\mathcal{A}$ if
and only if for all propositions A, B in $\mathcal{A}$:
$p(W) = 0$, $p(\emptyset) = \infty$,
$p(A \cup B) = \min\{p(A), p(B)\}$.



INTRODUCTION

A necessary condition for monitoring and control of a Power
System (PS) is possessing a credible model of this system.
The PS model needed by dispatchers in a national control
centre is created in real time. An important element of such
a model is the topology model. PS Topology Verification
(PSTV) is an important problem in PS engineering. Often this
problem is solved together with PS state estimation
(Lukomski, & Wilkosz, 2000; Mai, Lefebvre, & Xuan, 2003).
Methods that enable such a solution of the problem are
sophisticated and usually time consuming. They require
successful state estimation, but convergence problems may
occur in the case of certain Topology Errors (TEs). Thus, a
robust method for PSTV that runs before state estimation is
desired.



BACKGROUND 

Artificial Neural Networks (ANNs) are increasingly being
applied to a number of PS problems (Haque, & Kashtiban,
2005). One such problem is PSTV. It can be considered as a
pattern recognition problem, and so the ANN technique can be
used to solve it (Alves da Silva, & Quintana, 1995; Souza,
Leite da Silva, & Alves da Silva, 1996, 1997, 1998). There
are many references in which PSTV with the use of ANNs is
described. In (Tian, Zhu & Zhang, 1995) the use of an ANN as
a part of an expert system for rule extraction is presented.
One of the first methods for such PSTV assumed the use of one
ANN for the whole PS (Vinod Kumar, Srivastava, Shah, &
Mathur, 1996). In this method the complexity of the ANN
structure grows rapidly with the size of the power network,
and large ANNs cause problems with the learning and
classification processes. Other attempts to solve the problem
of PSTV with ANNs use additional knowledge of the PS
(Garcia-Lagos, Joya, Marin, & Sandoval, 2003; Delimar, Hebel,
& Pavic, 2001, 2002, 2003a, 2003b). Such an approach allows
the size of the ANNs to be reduced. The learning and
classification processes become more effective and the
verification method is more efficient. This approach is also
used in the method presented below.



DESCRIPTION OF THE CONSIDERED 
SOLUTION 

To ensure that the described method uses more knowledge of
the PS than other methods for PSTV, so-called unbalance
indices are introduced. Taking into account the nature of the
problem, and to obtain the best features of the PSTV, Radial
Basis Function Networks (RBFNs) are used.

Power System Model 

Elements of the PS topology model are nodes (representing
electrical nodes) and branches (representing power lines,
transformers, loads etc.). It is assumed that every branch in
a PS model is modeled as the π-equivalent circuit (Fig. 1).
It is also assumed that a credible set of measurement data is
accessible for such quantities as: active and reactive power
flows at the ends of each branch, power injections, loads and
voltage magnitudes at each node. Usually, if a branch is not
included in the PS model, the measurement data related to the
branch are not taken into account in the analyses carried
out.

Figure 1. The assumed π model of the branch.
$Z_{kl} = R_{kl} + jX_{kl}$, $Y_{kl} = jB_{kl}$,
$Y_{lk} = jB_{lk}$, $B_{kl} = B_{lk} = B$. B is a half of the
capacitive susceptance of the branch.

Unbalance Indices

Using Kirchhoff's and Ohm's Laws, a PS can be described by
many relationships among measured quantities. If there are no
TEs, all these relationships are fulfilled. When a TE occurs,
some of the relationships are no longer fulfilled. It should
be underlined that if a branch is not included in the PS
model, the relationships for this branch are not considered,
because measurement data for it are not taken into account.
In the described approach, so-called unbalance indices for
nodes and branches are introduced (Lukomski, 2002) so that
the relationships can be examined for all nodes and all
branches, independently of their correct or incorrect
inclusion in the PS model. These indices are shown in
Table 1.



It should be noted that the nodal unbalance indices 
instead of power flow measurement data are taken into 
account when branch unbalance indices are calculated. 
This fact allows considering branch unbalance indices 
independently of correct or incorrect inclusion of 
branches in the PS model. 

Unbalance indices create characteristic sets of values for
different cases of modeling the PS. If the topology model is
correct and there are no errors burdening measurement data,
all nodal unbalance indices are equal to zero and branch
unbalance indices are near to zero, as well. The same holds
when a branch that is actually out of operation is included
in the topology model (the inclusion error). If a branch is
actually in operation in the PS but is not included in the
topology model (the exclusion error), then: (i) the unbalance
indices for the terminal nodes of this branch differ
considerably from zero, (ii) the unbalance indices for the
considered branch are equal to zero, (iii) the absolute
values of the unbalance indices for other branches that are
incident to the nodes mentioned under (i) are especially
large.

It should be stressed that the behavior of unbalance 
indices for active power and for reactive power is the 
same for the same TE. 
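
The behaviour of the nodal indices can be illustrated with a minimal sketch. It assumes a hypothetical lossless three-node network with made-up active power flows (in p.u.), and it treats the nodal injection as a flow to a fictitious neighbour named `inj` so that the balance at every node closes; the function name `nodal_unbalance` and all numerical data are illustrative assumptions, not values from the source. The index is computed as the sum of the flows measured at a node over the elements included in the topology model.

```python
def nodal_unbalance(node, modelled_neighbours, flows):
    """Sum of the active power flows P_ki measured at `node`
    over the neighbours i that are present in the topology model."""
    return sum(flows[(node, i)] for i in modelled_neighbours[node])


# Hypothetical measurements: flows[(k, i)] is the flow at node k towards i.
# "inj" is a fictitious neighbour that carries the nodal injection (assumption).
flows = {
    (1, 2): 0.5, (2, 1): -0.5,
    (1, 3): 0.25, (3, 1): -0.25,
    (2, 3): 0.125, (3, 2): -0.125,
    (1, "inj"): -0.75, (2, "inj"): 0.375, (3, "inj"): 0.375,
}

# Correct topology model: all three branches are included.
correct_model = {1: [2, 3, "inj"], 2: [1, 3, "inj"], 3: [1, 2, "inj"]}
# Exclusion error: the branch 2-3 is in operation but missing from the model.
faulty_model = {1: [2, 3, "inj"], 2: [1, "inj"], 3: [1, "inj"]}

w_correct = {k: nodal_unbalance(k, correct_model, flows) for k in correct_model}
w_faulty = {k: nodal_unbalance(k, faulty_model, flows) for k in faulty_model}
print("correct model:", w_correct)
print("exclusion error:", w_faulty)
```

With the correct model all three indices vanish. When the branch 2-3 is excluded from the model, only the indices of its terminal nodes 2 and 3 differ from zero, while the index of node 1 is unaffected.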

Analyzing the unbalance indices for nodes and branches, one can observe that the exclusion error of the branch j has no influence on: (i) the unbalance indices for nodes that are not terminal nodes of the branch j, (ii) the unbalance indices for branches that are not incident to the



Table 1. Active and reactive power unbalance indices for nodes and branches

Active power
    Node:    $W_{Pk} = \sum_{i \in I_k} P_{ki}$
    Branch:  $W_{Pkl} = -W_{Pk} - W_{Pl} + R_{kl} W$

Reactive power
    Node:    $W_{Qk} = \sum_{i \in I_k} Q_{ki}$
    Branch:  $W_{Qkl} = -W_{Qk} - W_{Ql} + X_{kl} W - B_{kl} (V_k^2 + V_l^2)$

Description: $W_{Pk}$, $W_{Qk}$ - unbalance indices for the node k for active and reactive power respectively; $W_{Pkl}$, $W_{Qkl}$ - unbalance indices for the branch connecting the nodes k and l for active and reactive power respectively; $I_k$ - the set of the nodes connected to the node k; $P_{ki}$, $Q_{ki}$ - active and reactive power flows in the branch connecting the nodes k and i at the node k; $R_{kl}$, $X_{kl}$, $B_{kl}$ - π model parameters for the branch connecting the nodes k and l (Fig. 1); $V_k$, $V_l$ - voltage magnitudes at the nodes k and l respectively;

$W = \frac{W_{Pk}^2 + (W_{Qk} + B_{kl} V_k^2)^2}{V_k^2}$
terminal nodes of the branch j. This observation shows the existence of the local effect of a TE. In this situation one can draw conclusions about the correctness of modeling the distinguished branch j on the basis of investigations of the unbalance indices for certain areas of the power network: $A_k^j$, $A_l^j$, where k, l are the numbers of the terminal nodes of the branch j. $A_x^j$, $x \in \{k, l\}$, is the area in which the branch j exists, with the central node x. The area $A_x^j$ comprises: (i) the node x (being one of the terminal nodes of the branch j), (ii) the branch j and all other branches incident to the node x, (iii) all nodes which are connected with the node x by the branches mentioned under (ii).

The Need of Use of ANNs

The earlier considerations regarding the unbalance indices pertain to the ideal situation. In real situations, measurement data are burdened with errors, and multiple TEs can also occur. In such situations, the earlier-described effects of a TE overlap with the effects of measurement errors and of TEs other than the incorrect modeling of the considered branch. In real situations, PSTV is therefore a complex problem (Lukomski, 2002). Taking into account the analysis of the behavior of the unbalance indices, one can state that in the described situation the PSTV problem can be treated as a pattern recognition problem, and the utilization of ANNs can then be considered a proper way of solving it.

On the basis of the earlier considerations it can be stated that the whole PSTV process can be decomposed into many simpler PSTV processes. One such process can be limited to the area $A_x^j$. If ANNs are used, a separate ANN should be constructed for each of the distinguished processes. ANNs that are as simple as possible and learn fast are desired, and therefore attention has been paid to RBFNs (Meireles, Almeida, & Simoes, 2003).

Principle of the Method

The proposed topology verification method consists of the following steps:

• calculation of unbalance indices for nodes and branches,
• pre-processing of the unbalance indices (pre-processing standardization),
• local classification,
• global classification.

The Pre-Processing Standardization

The pre-processing standardization of each unbalance index is realized using a Radial Basis Function (RBF) unit with a Gaussian transfer function:

$$f(w) = \exp\left(-\frac{w^2}{2\sigma^2}\right) \qquad (1)$$

where: w is the value of the considered unbalance index, σ is the width parameter.

If an unbalance index is close to zero, the RBF unit output is close to one. If an unbalance index is significantly different from zero, the RBF unit output is close to zero. The pre-processing standardization keeps the input values for the local classifiers (in the next step of the method) in the range (0; 1]. The σ parameter for the index $W_{Pk}$ is calculated as follows (errors are assumed to be independent):

$$\sigma_{W_{Pk}} = a \sqrt{\sum_{l \in I_k} \sigma_{P_{kl}}^2} \qquad (2)$$

where $\sigma_{P_{kl}}$ is the standard deviation of the data of the active power flow $P_{kl}$; a is a correction coefficient selected in an experimental way.

The width parameter for the branch unbalance index $W_{Pkl}$ is given by:

$$\sigma_{W_{Pkl}} = a \left(\sigma_{W_{Pk}} + \sigma_{W_{Pl}}\right) \qquad (3)$$

Width parameters for the unbalance indices for reactive power are calculated in a similar way. It has been assumed that a = 2 for active power unbalance indices and a = 1.8 for reactive power unbalance indices.

The Local Classification

The purpose of this step of the method is to classify the correctness of modeling the branches of the PS. During this step the local effect of TEs is taken into account.

Each local classifier is an RBFN. One local classifier corresponds to one node of the considered power network.
If the considered node has the number k, then the inputs of the local classifier that corresponds to this node are the results of the pre-processing of the active and reactive power unbalance indices for: (i) the node k, (ii) the nodes having numbers from the set $I_k$, (iii) each branch connecting the node k and a node l, under the assumption $l \in I_k$. The number of outputs of a local classifier is equal to the number of branches connecting the node k with the nodes having numbers from the set $I_k$. The criterion for taking a decision on the correctness of modeling a branch is as follows:

$$D_l = \begin{cases} \text{the branch } l \text{ is incorrectly modelled} & \text{when } Y_l < -0.5 \\ \text{the neutral decision} & \text{when } Y_l \in (-0.5, 0.5) \\ \text{the branch } l \text{ is correctly modelled} & \text{when } Y_l > 0.5 \end{cases} \qquad (4)$$

where: $D_l$ is a decision, $Y_l$ is the output value corresponding to the branch between the node k and the node l.
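
The pre-processing of Eq. (1) and the decision criterion of Eq. (4) can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the width of a nodal index is assumed to be the correction coefficient times the square root of the sum of the flow variances (one reading of Eq. (2) under the stated independence of errors), the standard deviations are made up, and outputs lying exactly at -0.5 or 0.5 are assumed to give the neutral decision.

```python
import math


def node_width(flow_sigmas, a):
    """Assumed reading of Eq. (2): a * sqrt(sum of the flow variances)."""
    return a * math.sqrt(sum(s * s for s in flow_sigmas))


def standardize(w, sigma):
    """Eq. (1): Gaussian RBF unit applied to the unbalance index w."""
    return math.exp(-(w * w) / (2.0 * sigma * sigma))


def local_decision(y):
    """Eq. (4): map a classifier output y to a decision label.
    The boundary values -0.5 and 0.5 are treated as neutral (assumption)."""
    if y < -0.5:
        return "incorrect"
    if y > 0.5:
        return "correct"
    return "neutral"


# Made-up standard deviations of three flow measurements at one node (p.u.).
sigma = node_width([0.01, 0.01, 0.01], a=2.0)
print(standardize(0.0, sigma))    # index equal to zero
print(standardize(0.125, sigma))  # index far from zero
print(local_decision(0.93), local_decision(0.1), local_decision(-0.8))
```

An index equal to zero is mapped to one, whereas an index that is large in comparison with σ is mapped to a value close to zero, so the inputs of the local classifier stay in a common range.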

The Global Classification 

The global decision unit processes the decisions of the local classifiers and produces the final decisions on the correctness of modeling the branches of the PS. To take a final decision on the correctness of modeling a selected branch, the outputs of two local classifiers are considered. These classifiers correspond to the terminal nodes of the considered branch. If the decisions of the local classifiers are different and neither of them is the neutral decision, or if each of the local classifiers produces the neutral decision, then the final decision is the neutral one. In other cases the final decision is different from the neutral one, i.e. it is the non-neutral decision produced by the local classifiers.
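
The rule above can be written as a small function. This is a sketch with hypothetical names; the labels `correct`, `incorrect` and `neutral` stand for the three possible decisions, and the case of one neutral and one non-neutral local decision is assumed to yield the non-neutral one.

```python
NEUTRAL = "neutral"


def global_decision(d_k, d_l):
    """Combine the decisions of the two local classifiers assigned to
    the terminal nodes k and l of a branch."""
    if d_k == NEUTRAL and d_l == NEUTRAL:
        return NEUTRAL                      # no classifier decides
    if d_k != NEUTRAL and d_l != NEUTRAL and d_k != d_l:
        return NEUTRAL                      # contradictory decisions
    return d_k if d_k != NEUTRAL else d_l   # the available non-neutral decision


for pair in [("correct", "correct"), ("correct", "neutral"),
             ("correct", "incorrect"), ("neutral", "neutral")]:
    print(pair, "->", global_decision(*pair))
```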

Computational Example 

The presented method was implemented in the MATLAB environment. The method has been tested using the IEEE 14-bus test system (Fig. 2). It has been assumed that: (i) all branches are actually in operation, (ii) single and multiple TEs are considered, (iii) measurement data are burdened with small errors (Gaussian noise), (iv) a wide range of load curve changes is taken into account.

Learning of each local classifier (being an RBFN) has been performed using the Orthogonal Least Squares (OLS) algorithm (Chen, Cowan & Grant, 1991). Learning sets (200 - 400 learning patterns) were created separately for each RBFN. They comprised the results of pre-processing the unbalance indices and the appropriate verification decisions. Learning of an RBFN was stopped when the Sum Square Error (SSE) reached the level of $10^{-4}$.
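
The orthogonalisation at the core of the OLS algorithm can be sketched in a few lines. The code below is a plain classical Gram-Schmidt procedure on small Python lists, assuming linearly independent columns; it is not the full OLS learning algorithm of Chen, Cowan & Grant (there is no selection of centres and no error-reduction test), and the name `gram_schmidt` and the sample columns are illustrative choices.

```python
def dot(u, v):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(columns):
    """Make each new column orthogonal to all previously accepted ones.
    Assumes the columns are linearly independent (no zero result vector)."""
    ortho = []
    for v in columns:
        u = list(v)
        for q in ortho:
            coef = dot(v, q) / dot(q, q)   # projection of v onto q
            u = [ui - coef * qi for ui, qi in zip(u, q)]
        ortho.append(u)
    return ortho


columns = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
basis = gram_schmidt(columns)
for u in basis:
    print(u)
```

Because every accepted column is orthogonal to the earlier ones, the contribution of a newly added column to the reduction of the learning error can be evaluated independently of the others, which is what makes the incremental selection efficient.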

For the distinguished RBFN the number of hidden units depends on the number of branches incident to the appropriate node. For the particular nodes of the test system many different topologies of RBFNs were trained and tested. The characteristics of the local classifiers having the best performance are presented in Table 2.

Figure 2. The IEEE 14-bus test system

Testing of the local classifiers has been performed with the use of a test set of about 2000 patterns that had not been used in the training phase. Cases with single and double TEs were considered. In the cases with single TEs only correct decisions were produced. In the cases with double TEs both correct and neutral decisions were observed.

Table 3 shows the probability $p_n$ of taking the neutral decision in the verification process for the different branches of the test system when there are double TEs. During the test stage, some doubtful cases occurred and neutral decisions were taken for the branches with numbers 19 and 20. In these cases it was not possible to determine the correctness of modeling the considered branch in the test system. The reason was the relatively small level of power flows in the mentioned branches. The obtained results show that the efficiency of the RBF classifiers is very high.



FUTURE TRENDS 

Utilization of ANNs to handle the problem of PSTV seems to be very promising. However, the methods proposed to date do not give satisfying results in all possible real cases. The analyses have revealed that the application of pure neural models is not very effective. Combining ANNs with additional knowledge on PSs can result in much more efficient solutions. It should also be stressed that combining various artificial intelligence techniques can give interesting solutions from the viewpoint of the efficiency and performance time of a PSTV process.



CONCLUSION 

The presented method allows PSTV to be performed independently of state estimation. It combines knowledge on the PS with the utilization of RBFNs, and it exploits the local effect of TEs. The whole PSTV process comprises many local processes realized by the classifiers assigned to the nodes of a power network. This makes it possible to avoid constructing a large and complex ANN for the whole power network, as is done in (Vinod Kumar, Srivastava, Shah, & Mathur, 1996).



Table 2. Characteristics of the local RBF classifiers corresponding to the nodes of the IEEE 14-bus test sys- 
tem. N- the number of the node, N. - the number of inputs, N, - the number of hidden units, N , - number of 
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Table 3. A probability of taking the neutral decision p n in the verification process for the different branches of 
the test system when there are double topology errors. 
Taking into account the decomposition of the PSTV process, the described method is close to the method from (Garcia-Lagos, Joya, Marin & Sandoval, 1998, 2003) and also to the method from (Delimar, Hebel & Pavic, 2001, 2002, 2003a, 2003b). However, the described method utilizes more knowledge on the PS than the method from (Garcia-Lagos, Joya, Marin & Sandoval, 2003) or the method from (Delimar, Hebel & Pavic, 2001, 2002, 2003a, 2003b). As a consequence, the ANNs used by the method considered here are smaller than those used by the method from (Garcia-Lagos, Joya, Marin & Sandoval, 2003) or the method from (Delimar, Hebel & Pavic, 2001, 2002, 2003a, 2003b).

The method assumes that the local classifiers are RBFNs. Their learning process is relatively short compared with that of multilayer feedforward neural networks, and using the OLS algorithm gives fast learning convergence. However, it should be stressed that RBFNs have a much higher number of hidden units than multilayer feedforward neural networks.

The described method is capable of handling single and multiple TEs. It allows very efficient PSTV to be performed. Another advantage of the method is the low sensitivity of the quality of the PSTV process to changes of the PS load curve.
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KEY TERMS 

Neutral Decision: In fact, the lack of any deci- 
sion. 

Orthogonal Least Squares (OLS) Algorithm: Al- 
gorithm describing a Gram-Schmidt orthogonalisation 
process which ensures that each new column added to 
the result matrix of the growing subset is orthogonal to 
all previous columns. This considerably simplifies the 
equation for the change in learning error and results in 
a more efficient algorithm. 

Power System State Estimation: A process which leads to the calculation of a power system state vector using incoming measurement data and a mathematical power system model. A power system state vector fully specifies any state in which a power system can be.

Power System Topology Error: An inconsistency between the real power network connectivity and the power system topology model.

Power System Topology Model: A description of 
the physical connections in a power system. 

Power System Topology Verification: Proving or 
disproving the correctness of a power system topology 
model. 

Radial Basis Function Network: A type of artifi- 
cial neural network which uses radial basis functions 
as activation functions. Typically, it consists of one 
hidden layer of Radial Basis Function (RBF) neurons 
(units). RBF hidden layer units have a receptive field 
which has a centre: that is, a particular input value at 
which they have a maximal output. Their output tails 
off as the input moves away from this point. Generally, 
the hidden unit function is a Gaussian. They are used 
in classification and approximation problems. 

Unbalance Index: The left-hand side of the ap- 
propriate relationship, considered in the form in which 
its right-hand side is equal to zero. The mentioned 
relationship is a balance of active (reactive) powers at 
a node or a relationship among active (reactive) power 
flows at the ends of a branch. 
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INTRODUCTION 

Self-organising neural networks try to preserve the topology of an input space by means of their competitive learning. This capacity has been used, among others, for the representation of objects and their motion. In this work we use a kind of self-organising network, the Growing Neural Gas, to represent deformations in objects along a sequence of images. As a result of an adaptive process the objects are represented by a topology representing graph that constitutes an induced Delaunay triangulation of their shapes. These maps adapt to changes in the topology of the objects without resetting the learning process.



BACKGROUND 

Self-organising maps, by means of competitive learning, adapt the reference vectors of the neurons as well as the interconnection network among them, obtaining a mapping that tries to preserve the topology of an input space. Besides, they are capable of a continuous re-adaptation process even if new patterns are entered, with no need to reset the learning.

These capacities have been used for the representation of objects (Florez, Garcia, Garcia & Hernandez, 2001) (Figure 1) and their motion (Florez, Garcia, Garcia & Hernandez, 2002) by means of the Growing Neural Gas (GNG) (Fritzke, 1995), which has a more flexible learning process than other self-organising models, like Kohonen maps (Kohonen, 2001).

These two applications, the representation of objects and of their motion, in many cases have temporal constraints, which is why accelerating the learning process is of interest. In computer vision applications the stopping condition for the GNG algorithm is commonly defined by the insertion of a predefined number of neurons. The choice of this number can affect the quality of the adaptation, measured as the topology preservation of the input space (Martinetz & Schulten, 1994).



Figure 1. Representation of two-dimensional objects with a self-organising network 
In this work GNG has been used to represent deformations of the shapes of two-dimensional objects in sequences of images, obtaining a topology representing graph that can be used for multiple tasks like representation, classification or tracking. When the deformations in the topology of the objects are small and gradual between consecutive frames in a sequence of images, we can use the information of previous maps to place the neurons without resetting the learning process. Using this feature of GNG we achieve a high acceleration of the representation process.

One way of selecting points of interest in 2D shapes 
is to use a topographic mapping where a low dimensional 
map is fitted to the high dimensional manifold of the 
shape, whilst preserving the topographic structure of 
the data. A common way to achieve this is by using 
self-organising neural networks where input patterns 
are projected onto a network of neural units such that 
similar patterns are projected onto units adjacent in the 
network and vice versa. As a result of this mapping a 
representation of the input patterns is achieved that 
in post-processing stages allows one to exploit the 
similarity relations of the input patterns. Such models 
have been successfully used in applications such as speech processing (Kohonen, 2001), robotics (Ritter & Schulten, 1986), (Martinez, Ritter, & Schulten, 1990) and image processing (Nasrabati & Feng, 1988). However, most common approaches are not able to provide good neighborhood and topology preservation if the logical structure of the input pattern is not known a priori. In fact, the most common approaches specify in advance the number of neurons in the network and a graph that represents topological relationships between them, for example, a two-dimensional grid, and seek the best match to the given input pattern manifold. When the predefined structure does not match the manifold, the networks fail to provide good topology preservation, as for example in the case of Kohonen's algorithm.



REPRESENTATION AND TRACKING OF 
NON-RIGID OBJECTS WITH TOPOLOGY 
PRESERVING NEURAL NETWORKS 

This section is organized as follows: first we provide a detailed description of the topology learning algorithm GNG. Next, we explain how GNG can be applied to represent objects that change their shapes in a sequence of images. Finally, a set of experimental results using GNG to represent different input spaces is presented.

The approach presented in this paper is based on self-organising networks trained using the Growing Neural Gas learning method (Fritzke, 1995), an incremental training algorithm. The links between the units in the network are established through competitive Hebbian learning (Martinetz, 1994). As a result, the algorithm can be used in cases where the topological structure of the input pattern is not known a priori, and it yields topology preserving maps of the feature manifold (Martinetz & Schulten, 1994).

Recent studies have presented some modifications of the original GNG algorithm to improve the robustness of the cluster analysis (Cselenyi, 2005), (Cheng & Zell, 2000), (Qin & Suganthan, 2004), (Toshihiko, Iwasaki & Sato, 2003), but none of them uses the structure of the map as a starting point to represent deformations in a sequence of object shapes.

Growing Neural Gas 

With Growing Neural Gas (GNG) (Fritzke, 1995) a growth process takes place from a minimal network size, and new units are inserted successively using a particular type of vector quantisation (Kohonen, 2001). To determine where to insert new units, local error measures are gathered during the adaptation process, and each new unit is inserted near the unit which has the highest accumulated error. At each adaptation step a connection between the winner and the second-nearest unit is created, as dictated by the competitive Hebbian learning algorithm. This is continued until an ending condition is fulfilled, for example the evaluation of the optimal network topology based on some measure. The ending condition could also be the insertion of a predefined number of neurons or a temporal constraint. In addition, in GNG networks the learning parameters are constant in time, in contrast to other methods whose learning is based on decaying parameters.

In the remainder of this section we describe the growing neural gas algorithm and the ending condition as used in this work. The network is specified as:

• A set N of nodes (neurons). Each neuron $c \in N$ has its associated reference vector $w_c \in R^d$. The reference vectors can be regarded as positions in the input space of their corresponding neurons.

• A set of edges (connections) between pairs of neurons. These connections are not weighted, and their purpose is to define the topological structure. An edge aging scheme is used to remove connections that are invalid due to the motion of the neurons during the adaptation process.

The GNG learning algorithm to approach the network to the input manifold is as follows:

1. Start with two neurons a and b at random positions $w_a$ and $w_b$ in $R^d$.

2. Generate a random input pattern $\xi$ according to the data distribution $P(\xi)$ of each input pattern. In our case, since the input space is 2D, the input pattern is the (x, y) coordinate of the points belonging to the object. Typically, for the training of the network we generate 1000 to 10000 input patterns depending on the complexity of the input space.

3. Find the nearest neuron (winner neuron) $s_1$ and the second nearest $s_2$ using squared Euclidean distance.

4. Increase the age of all the edges emanating from $s_1$.

5. Add the squared distance between the input signal and the winner neuron to a counter error of $s_1$ such as:

$$\Delta error(s_1) = \| w_{s_1} - \xi \|^2 \qquad (1)$$

6. Move the winner neuron $s_1$ and its topological neighbours (neurons connected to $s_1$) towards $\xi$ by a learning step $\varepsilon_w$ and $\varepsilon_n$, respectively, of the total distance:

$$\Delta w_{s_1} = \varepsilon_w (\xi - w_{s_1}) \qquad (2)$$
$$\Delta w_{s_n} = \varepsilon_n (\xi - w_{s_n}) \qquad (3)$$

7. If $s_1$ and $s_2$ are connected by an edge, set the age of this edge to 0. If it does not exist, create it.

8. Remove the edges with an age larger than $a_{max}$. If this results in isolated neurons (without emanating edges), remove them as well.

9. Every certain number $\lambda$ of input signals generated, insert a new neuron as follows:
• Determine the neuron q with the maximum accumulated error.
• Insert a new neuron r between q and its further neighbour f:

$$w_r = 0.5 (w_q + w_f) \qquad (4)$$

• Insert new edges connecting the neuron r with neurons q and f, removing the old edge between q and f.
• Decrease the error variables of neurons q and f multiplying them with a constant $\alpha$. Initialize the error variable of r with the new value of the error variable of q and f.

10. Decrease all error variables by multiplying them with a constant $\beta$.

11. If the stopping criterion is not yet achieved, go to step 2. (In our case the criterion is the number of neurons inserted.)

Representation of 2D Objects with GNG

Given an image $I(x, y) \in R$ we perform the transformation $\psi_T(x, y) = T(I(x, y))$ that associates to each one of the pixels its probability of belonging to the object, according to a property T. For instance, in Figure 2, this transformation is a threshold function.

If we consider $\xi = (x, y)$ and $P(\xi) = \psi_T(\xi)$, we can apply the learning algorithm of the GNG to the image I, so that the network adapts its topology to the object. This adaptive process is iterative, so the GNG represents the object during all the learning.

As a result of the GNG learning we obtain a graph, the Topology Preserving Graph $TPG = \langle W, C \rangle$, with a vertex (neurons) set W and an edge set C that connects them (Figure 1). This TPG establishes a Delaunay triangulation induced by the object (O'Rourke, 2001).



Figure 2. Silhouette extraction 
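
The learning steps listed above, applied to the pixels selected by a threshold transformation, can be sketched as follows. This is a minimal, unoptimised illustration and not the authors' implementation: the function name `gng`, the synthetic 20 x 20 image and the small parameter values are assumptions chosen to keep the example short. The "further neighbour" f is read here as the most distant neighbour of q (Fritzke's original algorithm picks the neighbour with the largest error), and the error of the new neuron is initialised with the decreased error of q.

```python
import random


def gng(points, max_neurons=10, lam=50, eps_w=0.1, eps_n=0.001,
        alpha=0.5, beta=0.95, a_max=50, n_signals=2000, seed=0):
    """Minimal Growing Neural Gas on 2D points; returns (positions, edges)."""
    rng = random.Random(seed)
    w = {0: list(rng.choice(points)), 1: list(rng.choice(points))}  # step 1
    err = {0: 0.0, 1: 0.0}
    age = {}            # frozenset({i, j}) -> age of the edge i-j
    next_id = 2

    def dist2(c, p):
        return (w[c][0] - p[0]) ** 2 + (w[c][1] - p[1]) ** 2

    def neighbours(c):
        return [n for e in age if c in e for n in e if n != c]

    for t in range(1, n_signals + 1):
        xi = rng.choice(points)                               # step 2
        s1, s2 = sorted(w, key=lambda c: dist2(c, xi))[:2]    # step 3
        for e in age:                                         # step 4
            if s1 in e:
                age[e] += 1
        err[s1] += dist2(s1, xi)                              # step 5
        for c, eps in [(s1, eps_w)] + [(n, eps_n) for n in neighbours(s1)]:
            w[c][0] += eps * (xi[0] - w[c][0])                # step 6
            w[c][1] += eps * (xi[1] - w[c][1])
        age[frozenset((s1, s2))] = 0                          # step 7
        for e in [e for e in age if age[e] > a_max]:          # step 8
            del age[e]
        linked = set().union(*age)
        for c in [c for c in w if c not in linked]:
            del w[c], err[c]
        if t % lam == 0:                                      # step 9
            q = max(err, key=err.get)
            # Most distant neighbour of q (assumed reading of the text).
            f = max(neighbours(q), key=lambda n: dist2(n, w[q]))
            r, next_id = next_id, next_id + 1
            w[r] = [0.5 * (w[q][0] + w[f][0]), 0.5 * (w[q][1] + w[f][1])]
            del age[frozenset((q, f))]
            age[frozenset((q, r))] = age[frozenset((r, f))] = 0
            err[q] *= alpha
            err[f] *= alpha
            err[r] = err[q]       # new value of the error of q (assumption)
        for c in err:                                         # step 10
            err[c] *= beta
        if len(w) >= max_neurons:                             # step 11
            break
    return w, set(age)


# Threshold transformation on a synthetic image: bright pixels form the object.
image = [[1.0 if 5 <= x < 15 and 5 <= y < 15 else 0.0 for x in range(20)]
         for y in range(20)]
points = [(x, y) for y, row in enumerate(image)
          for x, v in enumerate(row) if v > 0.5]
positions, edges = gng(points)
print(len(positions), "neurons,", len(edges), "edges")
```

Sampling uniformly from the pixels that pass the threshold corresponds to a transformation that assigns the same probability to every object pixel. The returned neuron positions and edges play the role of the vertex set and the edge set of the resulting graph.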
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Representing Topology Deformations in 
Objects 

The model is also able to characterize different parts of an object, or several objects present in the scene that have the same values for the visual property T, without resetting the different data structures for each one of the objects. This is due to the capacity of GNG to divide itself into different parts when removing neurons, and it can be very useful for representing objects that change their topological structure, breaking into small pieces or changing their shapes along a sequence of images. In this case the original GNG algorithm must be modified by generating in step 2 a higher number of input signals to readapt the previous map to the new image and by avoiding steps 8 and 9, where neurons are deleted or added if necessary. None of the modifications of the original GNG algorithm to improve the robustness of the cluster analysis (Cselenyi, 2005), (Cheng & Zell, 2000), (Qin & Suganthan, 2004), (Toshihiko, Iwasaki & Sato, 2003) uses the structure of the map as a starting point to represent deformations in a sequence of object shapes.

In this work GNG has been used to represent deformations of the shapes of two-dimensional objects in sequences of images, obtaining a topology representing graph. When the deformations in the topology of the objects are small and gradual between consecutive frames in a sequence of images, we can use the information of previous maps to place the neurons without resetting the learning process. Using this feature of GNG we achieve a high acceleration of the representation process.

For example, Figure 3 shows some objects that have colour as a common feature in both images; the images represent the same objects, as a white foreground on the left and as a black background on the right.

Experiments 

To illustrate the capacity of GNG to represent topological deformations in objects, we have adapted the maps to an object shape that changes its topology from a compact square into four small squares in four steps (frames), obtaining graphs that represent the topology of the object shape along the image sequence without resetting the learning process for any image.

Figure 4 shows the original sequence of images used as input space for the self-organising map, where from a homogeneous square in the first image (on the left) four small squares are created in the last image (on the right). The bottom of the figure shows the results of the GNG adaptation, with white colour established as the visual property of the objects to be represented. From the first map (on the left), new maps are obtained based on the previous one without resetting the learning process. This feature of GNG allows an acceleration of the representation of the image sequence.

Figure 3. Representation of objects with similar visual properties as foreground and background

Figure 4. Results of GNG adaptation to changes in the input space

As can be seen in the sequence of images, the map is able to separate the neurons into four groups representing the different squares in the original images when the distance between them is higher than the average length of the edges that connect the neurons.
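
The groups into which a map has separated can be read off from its edge set as connected components. The following sketch uses a hypothetical map of eight neurons; the function name `components` and the sample edges are illustrative and are not taken from the experiments.

```python
def components(neurons, edges):
    """Group neurons into the connected components of the undirected graph."""
    adjacency = {n: set() for n in neurons}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in neurons:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                      # depth-first traversal
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(adjacency[n] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical map: eight neurons whose edges form four separate pairs.
neurons = list(range(8))
edges = [(0, 1), (2, 3), (4, 5), (6, 7)]
groups = components(neurons, edges)
print(len(groups), "groups:", groups)
```

Each component can then be treated as the representation of one fragment of the object or of one object in the image.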

Figure 5 represents a sequence of deformations from a small circle to an ellipse and finally to a square, used as input space to the GNG. The results of the adaptation of the map without resetting the learning algorithm between frames are shown.

The parameters used for the simulation are: N = 100, $\lambda = 1000$ for the first map and 10000-20000 for the subsequent maps, $\varepsilon_w = 0.1$, $\varepsilon_n = 0.001$, $\alpha = 0.5$, $\beta = 0.95$, $a_{max} = 250$.

The computational cost of representing a sequence of deformations is very low compared with methods based on the adaptation of a new map for every frame of the sequence, since our method does not reset the algorithm for new frames. This feature provides the method with real-time capabilities.



Figure 5. Object deformation with GNG adaptation 
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FUTURE TRENDS 

The iterative and parallel performance of the presented 
representation model is the departure point for the 
development of high performance architectures that 
supply a characterization and tracking of non-rigid 
objects depending on the time available. 



CONCLUSION 

In this paper, we have demonstrated the capacity of GNG to represent two-dimensional objects. By establishing a suitable transformation function, the model is able to adapt its topology to the shape of an object. In this way, a simple but very rich representation of the objects is obtained.

The model, by its own adaptation process, is able to divide itself so that it can characterize different fragments of an object or different objects in the same image. In addition, GNG can represent deformations in the topology of objects along a sequence of images without resetting the learning process. This feature accelerates the process of representation and tracking of objects.
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KEY TERMS 

Growing Neural Gas: A self-organizing neural 
model where the number of units is increased during 
the self-organization process using a competitive Heb- 
bian learning for the topology generation. 

Hebbian Learning: A time-dependent, local, highly interactive mechanism that increases synaptic efficacy as a function of pre- and post-synaptic activity.






Representing Non-Rigid Objects with Neural Networks 



Non-Rigid Objects: A class of objects that suffer deformations, changing their appearance over time.

Object Representation: The construction of a formal description of the object using features based on its shape, contour or specific region.

Object Tracking: A task within the field of computer vision that consists of extracting the motion of an object from a sequence of images, estimating its trajectory.

Self-Organising Neural Networks: A class of artificial neural networks that are able to organize themselves to recognize patterns automatically, without previous training, preserving neighbourhood relations.

Topology Preserving Graph: A graph that represents and preserves the neighbourhood relations of an input space.
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INTRODUCTION 

Updating is a central issue in relational databases and knowledge bases. In recent years, it has been well studied in the non-monotonic reasoning paradigm. Several semantics for logic program updates have been proposed (Brewka, Dix, & Knonolige 1997), (De Schreye, Hermenegildo, & Pereira, 1999) (Katsumo & Mendelzon, 1991). More recently, a set of proposals has put forward update mechanisms based on logic and logic programming. All these mechanisms are built on semantics based on structural properties (Eiter, Fink, Sabattini & Thompits, 2000) (Leite, 2002) (Banti, Alferes & Brogi, 2003) (Zacarias, 2005). Furthermore, all these semantics coincide in considering the AGM proposal as the standard model in update theory, owing to its wealth of properties. The AGM approach, introduced in (Alchourron, Gardenfors & Makinson, 1985), is the dominating paradigm in the area, but in the context of monotonic logic. All these proposals analyze and reinterpret the AGM postulates under Answer Set Programming (ASP), as in (Eiter, Fink, Sabattini & Thompits, 2000). However, the majority of the adapted AGM and update postulates are violated by update programs, as shown in (De Schreye, Hermenegildo, & Pereira, 1999).



UPDATES 

Update theory deals with knowledge bases represented by a propositional theory, and with incorporating new knowledge about a dynamic world. This dynamism arises because knowledge comes from the real world, which means that knowledge evolves over time. This kind of change mainly concerns the extensional part of knowledge bases, whereas the problem of updating the intensional part of a knowledge base (rules and descriptions of actions) remains basically unexplored. Nevertheless, the problem of updates has in recent years attracted the attention of researchers who deal with such updates in the setting of logic programs, and some interesting proposals founded on Answer Set Programming (ASP) exist, such as (Eiter, Fink, Sabattini & Thompits, 2000) (Leite, 2002) (Banti, Alferes & Brogi, 2003) (Osorio & Zacarias, 2003).

Answer Set Programming is a recent paradigm that has been applied to the update problem, and it has gained considerable momentum within update theory. A great deal of theoretical work on updates under ASP has been developed by renowned researchers such as Pereira, Alferes, Eiter, Osorio, Leite, Zacarias, and others. In recent years, much theoretical work has also been devoted to exploring the relationships between intuitionistic logic and ASP (Pearce, 1999) (Lifschitz, Pearce & Valverde, 2001). These results have provided a characterization of ASP by intuitionistic logic as follows: a literal is entailed by a program in the answer set semantics if and only if it belongs to every intuitionistically complete and consistent extension of the program formed by adding only negated literals (Pearce, 1999). The idea of these completions, using intermediate logics in general, is due to Pearce (Lifschitz, Pearce & Valverde, 2001). This logical approach provides the foundations to define the notion of non-monotonic inference of any propositional theory (using the standard connectives) in terms of a monotonic logic (namely intuitionistic logic); see (Lifschitz, Pearce & Valverde, 2001) (Pearce, 1999). 



STARTING WITH AGM 

We start with an analysis of the AGM postulates and then examine them with respect to update sequences. All the proposals mentioned above are based on the same principle, the causal rejection principle. As is well known, if new knowledge of the world is somehow obtained and it does not conflict with the previous knowledge, then this new knowledge simply expands the knowledge base. If, on the contrary, the new knowledge is inconsistent with the previous knowledge, and we want the knowledge base to remain consistent at all times, the conflict must be resolved somehow. We point out that new information is incorporated into the current knowledge base subject to a causal rejection principle, which enforces that, in case of conflicts between rules, more recent rules are preferred and older rules are overridden. 

An update theory is a knowledge base represented by a logic program. Let P be the program representing the current knowledge base. If it is updated by another program U, then P_U is an updated program of P if and only if the models of P_U are the result of updating each of the models of P according to a given semantics S: the update request U is applied to each of these models to obtain a new set of models M, and P_U is any logic program whose models are exactly M. 

The AGM approach proposes three basic operations on a belief set $K$: (a) expansion $K + \phi$, which simply adds the new information $\phi \in \mathcal{L}_B$ to $K$; (b) revision $K * \phi$, which sensibly revises $K$ in the light of $\phi$ (in particular, when $K$ contradicts $\phi$); and (c) contraction $K - \phi$, which removes $\phi$ from $K$. 

On the other hand, AGM proposes a set of postulates, (K*1) to (K*8), that any revision operator $*$ mapping a belief set $K \subseteq \mathcal{L}_B$ and a sentence $\phi \in \mathcal{L}_B$ into the revised belief set $K * \phi$ should satisfy. If we assume that $K$ is represented by an epistemic state $E$, then the postulates (K*1) to (K*8) can be reformulated as in (Eiter, Fink, Sabattini & Thompits, 2000) as follows: 

(K1) $E * \phi$ represents a belief set. 

(K2) $\phi \in Bel(E * \phi)$. 

(K3) $Bel(E * \phi) \subseteq Bel(E + \phi)$. 

(K4) $\neg\phi \notin Bel(E)$ implies $Bel(E + \phi) \subseteq Bel(E * \phi)$. 

(K5) $\bot \in Bel(E * \phi)$ only if $\phi$ is unsatisfiable. 

(K6) $\phi_1 \equiv \phi_2$ implies $Bel(E * \phi_1) = Bel(E * \phi_2)$. 

(K7) $Bel(E * (\phi \wedge \psi)) \subseteq Bel((E * \phi) + \psi)$. 

(K8) $\neg\psi \notin Bel(E * \phi)$ implies $Bel((E * \phi) + \psi) \subseteq Bel(E * (\phi \wedge \psi))$. 
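The first postulates can be checked on a small example. The following is only a sketch under stated assumptions: belief states are represented by their sets of propositional models over a two-atom vocabulary, and revision is a Dalal-style distance-based operator chosen purely for illustration (it is not the update operator discussed in this article). The names `worlds`, `models`, `expand` and `revise` are hypothetical.

```python
from itertools import combinations

ATOMS = ("p", "q")  # illustrative vocabulary (an assumption of this sketch)

def worlds():
    """All interpretations over ATOMS, each given as the set of true atoms."""
    return [frozenset(c) for k in range(len(ATOMS) + 1)
            for c in combinations(ATOMS, k)]

def models(formula):
    """Models of a formula given as a Python predicate on worlds."""
    return {w for w in worlds() if formula(w)}

def expand(e, phi):
    """E + phi: keep the models of E that also satisfy phi."""
    return e & phi

def revise(e, phi):
    """E * phi (Dalal-style): the models of phi closest to some model of E."""
    if not e or not phi:
        return set(phi)
    # Distance between worlds = number of atoms on which they differ.
    dist = {w: min(len(w ^ v) for v in e) for w in phi}
    best = min(dist.values())
    return {w for w, d in dist.items() if d == best}

# The agent believes p and q, then learns "not p" (conflict) or "q" (no conflict).
E = models(lambda w: "p" in w and "q" in w)
not_p = models(lambda w: "p" not in w)
q = models(lambda w: "q" in w)

# In terms of models, Bel(X) is included in Bel(Y) exactly when Mod(Y) is
# included in Mod(X), so the postulates become subset tests on model sets.
k2_holds = revise(E, not_p) <= not_p                # (K2)
k3_holds = expand(E, not_p) <= revise(E, not_p)     # (K3)
k4_holds = revise(E, q) <= expand(E, q)             # (K4), since q is consistent with E
```

In the conflicting case the expansion has no models at all (the belief set becomes inconsistent), whereas the revision keeps q and gives up p; in the non-conflicting case revision and expansion coincide, as (K3) and (K4) together require.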

Katsuno and Mendelzon (1991) proposed a set of postulates for update in which the belief base B and the change $\phi$ applied to it are propositional sentences over a finitary language. One notable difference between the AGM postulates and those of Katsuno and Mendelzon is that, under AGM, revision should yield the same result as expansion $E + \phi$ provided that $\phi$ is compatible with E, which is not desirable for update in general. Postulate 8 of Katsuno and Mendelzon says that if E can be decomposed into a disjunction of states (e.g., models), then each case can be updated separately and the overall result is formed by taking the disjunction of the emerging states. 

Darwiche and Pearl (1997) have proposed postulates for iterated revision. This set of postulates is very simple, yet the majority of the adapted AGM and update postulates are violated by update programs. Another set of postulates for iterated revision, corresponding to a sequence E of observations, has been formulated by Lehmann (1995). Notice that in general the postulates proposed for iterated revision fail, with the exception of some postulates in the case where each change is given by a single rule. Even so, the two views described above amount to the same at a technical level. 

All these approaches consider the update issue as a process of belief revision. However, following Gardenfors and Makinson (1991; 1994), belief revision can be related to non-monotonic reasoning by interpreting it as an abstract consequence relation on sentences, where the epistemic state is fixed. In the same way as Eiter et al., we can interpret update programs as abstract consequence relations on logic programs. These proposals should therefore also be considered; for example, Makinson (1993) considered a set of (desirable) properties for non-monotonic reasoning, and analyzed the behavior of some reasoning formalisms with respect to these properties. 

We now comment, in a general way, on the proposal of Alferes et al. (2000). They introduced the concept of dynamic logic programs as a generalization of both the idea of updating interpretations through revision programs and that of updating programs as defined by Alferes and Pereira (1997) and by Leite and Pereira (1997). Syntactically, dynamic logic programs are based on generalized logic programs (GLPs), which allow default negation in the head of rules, but no strong negation whatsoever (Eiter, Fink, Sabattini & Thompits, 2000). The way in which the models of an update sequence are defined by Alferes et al. is similar to the transformation used by Eiter et al.: they are defined as the stable models of the program resulting from a syntactic rewriting. This is called a dynamic update. The elements of the sequence are generalized logic programs. 

In (Alferes & Pereira, 2002), Alferes et al. defined the semantics of the update language LUPS by means of a dynamic logic program generated by a sequence of commands. They then gave a translation of these commands (a LUPS program) into a generalized logic program whose stable models correspond exactly to the semantics of the original LUPS program. In this proposal the authors consider that knowledge evolves from one knowledge state to another. Thus, given the current knowledge state KS, its successor knowledge state KS[U] is produced as a result of the occurrence of a non-empty set U of simultaneous updates. Each of the updates can be viewed as a set of actions, and consecutive knowledge states are obtained as: 

$KS_n = KS[U_1][U_2] \ldots [U_n]$ 

where the $U_i$ represent consecutive sets of updates. This state is denoted by: 

$KS_n = U_1 \otimes U_2 \otimes \cdots \otimes U_n$ 

Thus, in dynamic logic programming the models of a sequence of updates are defined as the stable models of the program resulting from a syntactic rewriting. In (Alferes & Pereira, 2002) it is demonstrated that revision programs and dynamic updates are equivalent, provided that the original knowledge is extensional, i.e., the initial program contains only rules of the form $A \leftarrow$ or $not\ A \leftarrow$. 

One major difference can immediately be identified between our update programs and dynamic updates: in dynamic updates, the value of each atom is determined from the bottom level $P_1$ upwards towards $P_n$. The different evaluation strategy leads in effect to different semantics. Furthermore, Alferes et al. (2000) use a slightly non-standard concept of stable models. There is thus a semantic difference between dynamic updates and updates according to Eiter et al. (Eiter, Fink, Sabattini & Thompits, 2000). 



On the other hand, one of the best received proposals on updates is that of (Eiter, Fink, Sabattini & Thompits, 2000). There the authors redefine and implement an update process inspired by the proposal of Alferes et al. (Alferes & Pereira, 2002). The work makes an exhaustive analysis of recent proposals based on non-monotonic logic. A syntactic redefinition of dynamic logic programs is presented, and semantic properties are investigated. In particular, the authors study whether dynamic logic programs satisfy well-known postulates of belief revision (Alchourron, Gardenfors & Makinson, 1985). Structural properties of logic program updates are also studied in (Eiter, Fink, Sabattini & Thompits, 2000). However, as happens in all the works presented so far, most of the properties considered are not satisfied. This fact motivated our investigation towards a properties-based theory. 

The proposal of Eiter et al. is an approach to updating non-monotonic knowledge bases represented as extended logic programs under the answer set semantics. The authors also consider refinements of the semantics based on the notion of minimality of change. The proposal defines a mechanism for updates based on a sequence of logic programs, which is translated into a single update program. Informally, this program expresses layered derivability of a literal L, beginning from the top layer $P_n$ and continuing downwards to the bottom layer $P_1$. A rule r at layer $P_i$ is only applicable if it is not refuted by a literal derived at a higher level that is incompatible with H(r). Inertia rules propagate a locally derived value for L downwards to the first level, where the local value is made global. 
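The causal rejection principle can be illustrated on small ground programs. The sketch below does not implement the syntactic rewriting into an update program; instead it assumes the commonly given rejection-set reading of causal rejection: a candidate set S is accepted if it is an answer set of all rules of the sequence minus those older rules that are contradicted by a newer, non-rejected rule whose body also holds in S. The rule encoding (head, positive body, default-negated body), the "-" prefix for strong negation, the brute-force enumeration and all function names are assumptions made for illustration only.

```python
from itertools import combinations

def neg(lit):
    """Strong complement of a literal: "a" <-> "-a"."""
    return lit[1:] if lit.startswith("-") else "-" + lit

def satisfies(s, rule):
    """True if the body of rule (head, pos, naf) holds in the literal set s."""
    _, pos, naf = rule
    return all(p in s for p in pos) and all(n not in s for n in naf)

def is_answer_set(s, rules):
    """Gelfond-Lifschitz test for a consistent candidate s."""
    reduct = [(h, pos) for h, pos, naf in rules if not any(n in s for n in naf)]
    model, changed = set(), True
    while changed:  # least model of the negation-free reduct
        changed = False
        for h, pos in reduct:
            if h not in model and all(p in model for p in pos):
                model.add(h)
                changed = True
    return model == s

def rejected(s, programs):
    """Older rules overridden, with respect to s, by newer non-rejected rules."""
    rej = set()
    for i in range(len(programs) - 2, -1, -1):  # from the top layer downwards
        for r in programs[i]:
            if not satisfies(s, r):
                continue
            for j in range(i + 1, len(programs)):
                if any(r2[0] == neg(r[0]) and (j, r2) not in rej
                       and satisfies(s, r2) for r2 in programs[j]):
                    rej.add((i, r))
                    break
    return rej

def update_answer_sets(programs):
    """Brute-force search over consistent literal sets (tiny programs only)."""
    lits = sorted({l for p in programs for h, pos, naf in p
                   for l in (h, *pos, *naf)})
    found = []
    for k in range(len(lits) + 1):
        for cand in combinations(lits, k):
            s = set(cand)
            if any(neg(l) in s for l in s):
                continue  # inconsistent candidates are skipped in this sketch
            rej = rejected(s, programs)
            kept = [r for i, p in enumerate(programs) for r in p
                    if (i, r) not in rej]
            if is_answer_set(s, kept):
                found.append(s)
    return found

# Older knowledge P1, updated by newer knowledge P2 (illustrative example).
P1 = [("sleep", (), ("tv_on",)), ("night", (), ()),
      ("tv_on", (), ()), ("watch_tv", ("tv_on",), ())]
P2 = [("power_failure", (), ()), ("-tv_on", ("power_failure",), ())]
result = update_answer_sets([P1, P2])
```

Here the older fact tv_on is overridden by the newer rule deriving -tv_on, so the agent concludes sleep instead of watch_tv. The example also shows why the outcome depends on the syntax of the rules: rejection is decided rule by rule, on heads and bodies.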

Continuing in this direction, we have been working on finding properties that our update operator satisfies (Osorio & Zacarias, 2003) (Zacarias & Osorio, 2005) (Zacarias, Osorio, & Arrazola, 2005). Our purpose is to build a semantics based on structural properties; this is our main objective in update theory. In (De Schreye, Hermenegildo, & Pereira, 1999) (Osorio & Zacarias, 2003) (Zacarias, Osorio & Arrazola, 2005) (Zacarias, 2005) the authors present a set of properties that the update operator satisfies. In this paper we continue this line of research, presenting a novel proposal that aims to enrich the update theory begun in (Osorio & Zacarias, 2003) (Zacarias, Osorio & Arrazola, 2005) (Zacarias, 2005). A first benefit of this proposal is that it preserves many of the properties presented in previous works (Osorio & Zacarias, 2003) (Zacarias, Osorio & Arrazola, 2005) (Zacarias, 2005), such as Weak Irrelevance of Syntax (WIS). This property is similar to one of the postulates proposed by AGM, but formulated for non-monotonic logic under Answer Set Programming (ASP), as introduced and defined by (Gelfond & Lifschitz, 1988). 

On the other hand, we conclude that many approaches to program updates do not satisfy many of the properties defined in the literature (Alchourron, Gardenfors & Makinson, 1985) (Eiter, Fink, Sabattini & Thompits, 2000) (Katsuno & Mendelzon, 1991) (Banti, Alferes & Brogi, 2003). This is partly explained by the non-monotonicity of logic programs and by the causal rejection principle embodied in the semantics, which strongly depends on the syntax of rules. Furthermore, we consider that a good update theory is based fundamentally on a set of properties. 

As a result of a first analysis of the proposal presented in (Eiter, Fink, Sabattini & Thompits, 2000), we introduced in (Osorio & Zacarias, 2003) a new update operator. This proposal satisfies several of the AGM postulates as well as a new property called Weak Irrelevance of Syntax. These properties give an agent added value with respect to other proposals that do not satisfy them. It is also worth highlighting the simplicity of our proposal, which allows an agent to respond correctly and in a timely way. 

Continuing our analysis on updates we present 
our main results about updates of logic programs: a 
properties-based approach published in (Zacarias, 
2005). In this proposal we presented several proper- 
ties on theory updates. We consider these properties 
from a non-monotonic reasoning perspective, by natu- 
rally interpreting program updates as non-monotonic 
consequence relations. In this proposal we consider 
our properties under N logic. Additionally, we have 
presented in (Zacarias, 2005) some examples about 
updates on answer set programming. 

In (Zacarias, 2005) we introduced a new proposal towards the enrichment of the update operator "⊕". There, we presented a refinement of the stable model semantics for the update operator. We also presented a new property that allows us to handle updates in which the new information contains rules that define a conservative extension. In this way we gave an extension, under N logic, of the properties proven in (Osorio & Zacarias, 2003). This approach is based on the work of Eiter et al. (Eiter, Fink, Sabattini & Thompits, 2000), and inspired by a recent approach presented by Alferes et al. (Banti, Alferes & Brogi, 2003). With this work, we improve and enrich the update operator proposed by Eiter et al. (Eiter, Fink, Sabattini & Thompits, 2000), obtaining as a result a new update operator. 



FUTURE TRENDS 

We agree with (Eiter, Fink, Sabattini & Thompits, 2000) that there is an apparent lack of minimality of change, which led us to consider refinements of the semantics in terms of minimal and strictly minimal answer sets. Several issues remain for further work. An interesting point raised in (Eiter, Fink, Sabattini & Thompits, 2000) concerns the formulation of postulates (principles or properties) for update operators on logic programs and, more generally, on non-monotonic theories. As shown in (Eiter, Fink, Sabattini & Thompits, 2000), several postulates from the area of logical theory change fail for update programs. This may be explained by the dominant role of syntax in updates, embodied by the causal rejection of rules. 



CONCLUSIONS 

In this paper, we considered a new proposal to provide our agents with an update process. Our proposal is a novel and simple methodology that allows an agent to keep its knowledge base up to date at all times. This allows an agent to behave in a rational way, similar to human behavior. Furthermore, it is an appropriate proposal for applications that require answers in real time. This proposal also opens up possibilities for building real-life applications, such as intelligent agents whose rational component is modelled by a knowledge base, which is in turn maintained using update logic programs. 
KEY TERMS 

Beliefs: An agent whose knowledge base is the theory T believes F if and only if F belongs to every intuitionistically complete and consistent extension of T formed by adding only negated literals. 

Causal Rejection Principle: A principle which enforces that, in case of conflicts between rules, more recent rules are preferred and older rules are overridden. 

Equivalence: Two programs are equivalent if they 
have exactly the same answer sets. 

Expansion: The operation of simply adding the new information A to the knowledge base KB. 

Principle of Irrelevance of Syntax: The meaning 
of the knowledge that results from an update must be 
independent of the syntax of the original knowledge, 
as well as independent of the syntax of the update 
itself. 
Update: Let P be the program representing the current knowledge base. If it is updated by another program U, then P_U is an updated program of P if and only if the models of P_U are the result of updating each of the models of P according to a given semantics S: the update request U is applied to each of these models to obtain a new set of models M, and P_U is any logic program whose models are exactly M. 

Weak Irrelevance of Syntax: $T_1 \equiv T_2$ implies $Bel(K \oplus T_1) = Bel(K \oplus T_2)$, where K, $T_1$ and $T_2$ are any theories, Bel(T) defines the set of answer sets of T, $\oplus$ is the update operator, and equivalence is understood to mean that both programs ($T_1$ and $T_2$) have the same answer sets. 
INTRODUCTION 

A general goal of biologically inspired robotics is to 
learn lessons from actual biological systems and to find 
applications in robot design. Neural controllers and 
adaptive algorithms are major tools to model, at some 
level of abstraction, functions, structures, and behaviors 
present in biological systems. This involves, of course, 
identifying in virtue of what biological systems exhibit 
the behavioral characteristics we want to explore. One 
of the biological phenomena of great interest is emo- 
tion. Despite the effort of leading researchers to raise 
the question "whether machines can be intelligent 
without any emotions" (Minsky, 1988), AI interest 
in emotional phenomena has increased only in the 
last decade. An underlying assumption is that many 
cognitive functions, such as memory, attention, learn- 
ing, decision making and planning, are at least partly 
based on emotional mechanisms in biological systems 
(Damasio, 1995). 

One of the qualities of emotional behavior is its 
flexibility (Frijda, 1986), which contrasts with the 
rigidity of stereotyped behaviors such as reflexes or 
habits. Hence, it is relevant to investigate what it is that 
makes emotional behavior flexible. The body, through 
mostly chemical channels, produces diffuse effects on 
the neural system, processes at the root of emotional 
phenomena. Parisi has recently argued that in order "to 
understand the behavior of organisms more adequately 
we also need to reproduce in robots the inside of the 
body of organisms and to study the interactions of the 
robot's control system with what is inside the body" 



(Parisi, 2004), using the term internal robotics to de- 
note the study of the interactions between the (neural) 
control system and the rest of the body. 

Mechanisms that control homeostasis, based on 
hormonal modulation, can motivate appropriate be- 
haviors (Avila-Garcia & Canamero, 2004; Gadanho 
& Hallam, 2001). Emergent behaviors from the interaction of a motivational system with the environment may be called emotional. Canamero's architecture, for example, consists of "a set of motivations; a repertoire of behaviors that can satisfy those internal needs or motivations as their execution carries a modification in the levels of specific variables; and a set of 'basic' emotions" (Canamero, 2005). 

We consider emotional phenomena to emerge from 
a dynamic interaction between internal states, current 
perceptions and environmental relations, such that 
certain neural/physiological states have a close causal 
link with relational situations. This is, in a nutshell, 
the embodied appraisal hypothesis (Carlos Herrera, 
2002; Prinz, 2004). We use two major concepts from 
the dynamical systems (DS) approach to cognition 
(Clark, 1997; Kelso, 1995): collective variables and 
control parameters. In (Carlos Herrera, 2002) we ar- 
gue that internal states can be interpreted as collective 
variables of agent/environment interaction that allow 
tracing concern-relevant situations. These variables are 
"non-specific: they do not prescribe or contain a code 
for the emerging structure" (Kelso, 1995). They also 
can be considered control parameters, as activation 
in the agent's physiological substrate affects overall 
action readiness (response, including perceptual and 
cognitive readiness). 



BACKGROUND 

An architecture for the design of emotional appraisal 
and response in artificial agents must take into account 
that emotions bear an intrinsic dynamic relation- 
ship between internal mechanisms, embodiment and 
situation (Frijda, 1993; Lazarus, 1991; Lewis, 2005). 
Emotions are emergent patterns that involve relational 
behavior as well as physiological and psychological processes. In this section we argue that physiological states are essential for understanding emotion appraisal and response: they allow the agent to trace agent-environment relations, and their modification is a mechanism for control of dynamics. 

Appraisal is the process by which an agent is capable 
of recognizing that a situation is relevant to some of its 
concerns. From an information-processing perspective, 
an agent requires the capacity to differentiate situa- 
tions which anticipate that a concern may be at stake 
if no proper response is carried out. Cognitive models 
consider appraisal the product of a reasoning engine 
(Zajonc, 1980), and robotic models often simplify this 
problem by manipulating the environment so that the 
concern-relevance of specific objects/stimuli is particu- 
larly salient (e.g. red color for dangerous objects). 

Appraisal involves categorization, or hot cognition 
(Zajonc, 1980). The theory of embodied appraisal 
argues that the body plays an essential role in structur- 
ing sensory-motor patterns that, once processed by the 
brain, result in appraisal (Damasio, 2000). In the case 
of emotion certain physiological states are indicative of 
concern-relevant situations (Prinz, 2004). A high level 
of adrenaline, for instance, correlates with a wide class 
of emotional situations. The fact that the correlation is 
not one-to-one (physiological states are not sufficient 
to determine emotions) does not imply that they have 
no relationship to interactive relations. We understand 
embodied appraisal as dynamical coupling (attunement) 
in which some internal states are representative (collec- 
tive variables) of agent/environmental interactions. 

But emotion is not only about appraisal, but also 
response. Emotion theorists have proposed the notion 
of action tendency to explain the inherent relational 
purpose of emotional behavior: it establishes or modi- 
fies a relationship between the agent and the world "at 



large" (Frijda, 1986). That is, "[a]ction tendencies 
are hypothesized ... for theoretical reasons: to account 
for latent readiness and to account for behavior flex- 
ibility" (Frijda, 1986). Tendencies imply a direction, 
although they are "not usually guided by a prior goal 
representation" (Frijda, 1986). It is also important to 
distinguish between action tendency and the function 
of emotional behavior. For example, the tendency in 
fear is withdrawal. The function, on the other hand, is 
protection. Similarly, the tendency in shock or surprise 
is interruption of ongoing activity, whilst the function 
is reorientation (Frijda, 1986). Withdrawal can come 
as, for example, freeze, flight, or faint; responses with 
very different functional roles. Hence, emotions are 
far from reflex-like responses. Even though emotion 
responses are often stereotyped and the product of 
evolution, we "should not conceive affect programs as 
fixed and peremptory" (Lazarus, 1991), i.e. "[t]o the 
extent that action programs are fixed and rigid, action 
tendency loses much of its meaning" (Frijda, 1986). 
On the contrary, emotional responses are dynamically 
situated, that is, outward behavior is configured in 
dynamic interaction with the environment. 

For modeling the mechanisms underlying action 
tendencies in biological agents, it should be noted 
how different physiological subsystems are dynami- 
cally interrelated. In particular, certain hormones (e.g. 
adrenaline), can affect, on the one hand, the autonomic 
system, whose activation involves a process of energy 
mobilization, and thus action readiness (Frijda, 1986). 
Hormones also act as neuromodulators, affecting the 
general processing of the nervous system, thus produc- 
ing forms of cognitive or attention readiness. 



MAIN FOCUS OF THE CHAPTER 

This article presents a robotic approach for the emergence of a coupled agent-environment interaction with the ability (a) to appraise the concern-relevance of the situation, and (b) to control the interaction through activation of action readiness. The specific mechanisms that allow the emergence of such phenomena are based on sensitivity to overall patterns of interaction through the production of hormonal regulation. 

The model illustrated in Figure 1 (Carlos Herrera, 2006) is intended to illustrate the relationships between nervous system, body and world. The basic feature is that a number of internal variables in the body (such as hormonal levels) allow the agent to trace the dynamics of some concern-relevant aspects of the relationship to the environment. Their processing can be conceived as a simple form of body maps or feeling (Damasio, 1995, 2000). 

Figure 1. Left: Model of dynamic appraisal in embodied interaction. Relation between (a) physiological states and situations, (b, c) nervous system and rest of the body, (e, d) world and body. Right: Proposed controller for a Khepera robot (prey). 

[Figure 1, right panel labels: sensory inputs (s_i); gland (secretion, decay); hormonal level; neuromodulation; motor neurons.] 

This activation of internal states is integrated with 
current cognitive, perceptual and sensory-motor pro- 
cesses. The nervous system participates in the homeo- 
static balance, as hormonal production is a function of 
nervous activation - thus the collective variables are 
to some extent control variables: a change produces a 
change in action readiness, reflected in the dynamic 
relationship towards concern-relevant aspects of the 
situation. Sensory-motor activity (relationship of the 
nervous system to the environment), in conjunction 
with further nervous processing (secondary appraisal), 
produces a change in action tendency. Emotional be- 
havior is the result of this process. 

Experimental Setup 

In this section we present a simple, preliminary experimental setup for the implementation of this model (Carlos Herrera, 2006). We apply an evolutionary robotics approach, evolving connection weights in the neurocontrollers of simulated Khepera robots in a predator/prey scenario inspired by (Nolfi & Floreano, 1998). Two robots, equipped with infrared sensors, are placed in random positions in a square environment surrounded by walls. The predator, which also has a camera, is rewarded for hitting the prey, which in turn is rewarded for avoiding the predator. Both robots are controlled by feedforward networks that map sensory inputs to motor outputs/activations. To allow rich relational dynamics, the maximum speed of the prey is set to twice the predator's. 

We abstractly model some of the functions of the 
endocrine system as a simple gland that secretes one type 
of hormone. The resulting hormonal level is intended to 
be a collective variable of the interaction, i.e. produce 
a function that allows us to trace concern-relevant situ- 
ations. The model also requires that the hormone level 
has an effect on the generation of behavior through 
modulation of the neural controller. In order to achieve 
this, we feed it to the neural controller as an extra input. 
We have established the level of hormonal secretion 
as a function of the activity of the sensory cells and a 
fixed rate of hormonal absorption. 
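The idea of feeding the hormone level to the controller as an extra input can be sketched as follows. This is only an illustration under stated assumptions: a single-layer feedforward mapping with a logistic activation is assumed, the function name `motor_outputs` is hypothetical, and the weights below are arbitrary placeholders, whereas in the experiments the weights are obtained by evolution.

```python
import math

def motor_outputs(sensors, hormone, weights, biases):
    """Map sensor activations plus the hormone level to motor activations.

    weights holds one row per motor neuron; each row has one weight per
    sensor followed by one weight for the hormone input.
    """
    inputs = list(sensors) + [hormone]  # the hormone is just one more input
    outputs = []
    for row, bias in zip(weights, biases):
        activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-activation)))  # logistic unit
    return outputs

# Two sensors, two motors; the last weight in each row modulates the motor
# response according to the hormone level (placeholder values).
weights = [[0.5, -0.5, 1.0],
           [-0.5, 0.5, 1.0]]
biases = [0.0, 0.0]

calm = motor_outputs([0.2, 0.2], hormone=0.0, weights=weights, biases=biases)
alarmed = motor_outputs([0.2, 0.2], hormone=2.0, weights=weights, biases=biases)
```

With identical sensor readings, the two calls differ only in the hormone input, so any difference in motor activation is due to the modulatory weight. This is the sense in which the hormone level acts as a bias on behavior.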

As mentioned above, the level of hormonal release 
is intended to be a collective variable of the interaction. 
In order to achieve this, we paid attention to what situations are concern-relevant in the prey/predator scenario. 
We are interested in situations in which there is danger 
of being caught. Despite several possible strategies of 
approach and avoidance, given the speed of the prey, 
danger is most present when the prey is caught between 
predator and walls, whereas if no walls interfere, the prey 
can produce optimal escape behaviors. Extrapolating 
this observation, if a robot is near a wall and a predator, 
the sum of the activation of all sensors will be larger 
than when only one of them is present. Therefore it 
makes sense to establish a linear relationship between 
sensory activation and hormonal release, in line with 
intensity theories of emotion that relate emotion elici- 
tation to "densities of neural firing" (Tomkins, 1962) 
(we do not claim intensity theories to be complete, 
though). We therefore define hormonal release as the 
sum of the activation of all other sensors. The decay 
function of the hormone level, or rate of absorption, 
is set to 2, i.e. at every time step, the hormone level 
will be divided by 2. The resulting level is fed back into the neural controller as an extra input, playing the 
role of a parametric bias with a neuromodulatory effect 
on the motor output. The hormonal level can thus be 
expressed as a function of current level and sensory 
states as follows: 



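$H_t = \dfrac{H_{t-1} + \sum_i s_i}{2}$ 

where $H_t$ is the hormone level at time step t and $s_i$ is the activation of sensor i. 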
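The hormone dynamics described above can be sketched in a few lines. The sketch assumes the reading given in the text: at every time step the sum of the sensor activations is added to the current level and the result is divided by the decay constant 2. The function names `hormone_step` and `hormone_trace` and the sensor values are illustrative, not taken from the original experiments.

```python
def hormone_step(level, sensors, decay=2.0):
    """One time step: add the summed sensor activation, then apply the decay."""
    return (level + sum(sensors)) / decay

def hormone_trace(sensor_readings, level=0.0):
    """Hormone level after each time step of a sequence of sensor readings."""
    trace = []
    for sensors in sensor_readings:
        level = hormone_step(level, sensors)
        trace.append(level)
    return trace

# One step with two fully active sensors (e.g. wall and predator both near),
# followed by two steps with no sensor activity.
trace = hormone_trace([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
```

Under this reading, a constant total sensor activation c drives the level towards c itself (the fixed point of H = (H + c) / 2), and once stimulation stops the level halves at every step. The hormone thus acts as a short-term memory of recent overall sensory activation.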
The weights of the controller and the modulation effect of the hormonal level are evolved, while the production and absorption of hormones are kept fixed. If, as assumed, the level of hormone is indicative of a class of situations (danger), then the robot can be expected to use it for the evolution of adaptive emotional behavior. 

Results 

As with other experiments in co-evolution, performance 
of prey and predator along the evolutionary process 
are co-dependent, and cycles can be observed. It is 
nevertheless possible to analyze cross generational 
strategies and fitness (Nolfi & Floreano, 1998). Due to 
limited space, we will here only analyze the behavior of 
a single generation (100). In this analysis we will verify 
whether: (a) the hormonal level can be considered a 
collective variable of the dynamics of interaction, that 
is, it allows us to track situations in which the prey is 
between wall and predator, (b) the hormonal level acts 
as a parametric bias for the neural controller so as to 
generate an action tendency that changes the relation- 
ship to the environment (whose function is to safeguard 
the prey's concern, i.e. to escape), and (c) the resulting 
behavior shows a degree of flexibility measured as 
robustness in unforeseen circumstances. 

Figure 2 shows the prey's behavior and its rela- 
tionship to the hormone level. A high hormonal level 
modulates the normal behavior (circular), producing 
a straightforward fast motion (right). This change in 
behavior is correlated to dangerous situations (caught 
between predator and wall). The prey thus is capable 
of appraising concern-relevant situations, by means of 
attunement through an appropriate collective variable 



Figure 2. Left: Interaction between predator (black, discontinuous trace) and prey. At the marked points (1, 2, 3,
4) the prey was caught between wall and predator. Center: level of hormonal activation throughout the interaction.
Right: close-up of an escape behavior.
Table 1. Performance in original and modified environments

                  No obstacles   One obstacle   Two obstacles
Prey escapes          56%            40%            33%
Predator kills        39%            39%            36%
Prey crashes           5%            21%            31%


represented by a simulated physiological function.
Further analysis of this behavior and control dynamics
can be found in Herrera, Ziemke, & Moffat (2006).

In terms of flexibility our hypothesis was that, in any 
scenario in which the concern-relevance of the robot 
is represented by such a variable, the prey's ability for 
escape should not be seriously reduced. We tested the 
performance of prey and predator with one and two 
obstacles placed in random positions, for 1000 runs 
(note that the original evolutionary process was carried 
out without obstacles). Given that the evolved predator 
cannot determine whether a prey is hiding behind an 
obstacle, we exclude the runs in which the predator 
crashed without entering the prey's sensor range. 

As the table shows, environmental changes do affect
the prey's ability to escape. The presence of obstacles
has a limited effect on the functionality of the prey's
behavioral strategy for escaping the predator, although
the proportion of runs in which the prey crashes increases
significantly. Comparing this flexibility with that of
evolved reactive controllers, we found that for the latter
the rate of successful escapes decreases more rapidly in
favor of predator catches, while crashes remain stable
(Herrera, 2006). We could draw a parallel from such
results to real emotional behavior: if we react emotionally
to dangerous situations, we will be more likely to
escape, but also more likely to harm ourselves. Being
involved in a fast fleeing behavior implies that the
danger of bumping into further danger increases. This
is congruent with emotion theory: "not all behavior
elicited by emotional events can be considered coping
activity . . . Instrumental behavior, too, shows dysfunctional
or non-functional features, among them sheer
disturbance manifestations like decreased precision
of skilled movements" (Frijda, 1986).



FUTURE TRENDS 

Despite increasing interest in the modeling of emotions 
in robotics, it remains one of the cornerstones of Arti- 
ficial Intelligence. In this article, we have presented a 
dynamical-embodied approach. Some of the obvious 
limitations of this initial experimental setup should 
be avoided in further work. For instance, it would be
interesting to let the control of the 'endocrine gland',
as well as the modulation of behavior, co-evolve or
co-develop, and to investigate the role evolution and
learning may play in emotional attunement. This would
involve research into the capacities of neural networks 
to learn temporal patterns of concern-relevance. We 
have here also considered the physiological system only 
in its relationship to the nervous system, and not in its 
relation to body dynamics, leaving aside the autonomic 
system and its energy mobilization role. This allowed 
us to functionally replace the physiological system 
by a one-dimensional hormone level. More complex 
robotic physiologies, exploring relationships between 
body states and their relationship to sensors, motors, 
and nervous system should be investigated. 

There are two short-term experimental goals for
further work. First, given that it is not always possible
for the designer to identify concern-relevant situations
and possible ways to trace them through internal mechanisms,
and in search of increased robotic autonomy,
we plan to find self-organizing techniques to achieve
similar results. In particular, we are considering the use
of evolution and learning, as well as novel mechanisms
such as anticipatory networks and liquid state machines,
to allow an internal structure to identify and gain sensitivity
to such situations. A more realistic model of
the physiological systems involved is also necessary,
as arguably the control architecture presented here is
'just' a form of recurrent neural network. Finally, we
will explore the role of a hormonal regulation system
within a framework of behavioral attractors, in order
to be able to carry out detailed dynamical systems analysis
of parametric/behavioral biases and the resulting action
tendencies.



KEY TERMS

Action Readiness / Tendencies: Physiological
states affect the readiness for engagement in certain
dynamics of the interaction.

Artificial Emotion: The attempt to synthesize
in robots or artificial systems some of the functional
properties of emotion.

Collective Variable / Control Parameter: In
dynamical systems theory, collective variables allow
tracing global dynamic patterns; control parameters
lead the system through such patterns.

Concerns: The conditions under which a system
can continue to function.

Embodied Appraisal: Theory that asserts that sensitivity
to concern-relevant situations is facilitated
by physiological and homeostatic mechanisms in an
embodied agent.

Emotions: Phenomena present in biological systems
by which an adaptive agent is capable of appraising
the concern-relevance of situations and providing flexible
responses through generation of physiological,
cognitive and behavioral readiness.

Hormonal Modulation: Change in the functionality
of neural, sensory and motor systems achieved through
changes in hormonal levels.

Neuro-Robotics: Approach to robot control through
the use of neural networks.

ENDNOTE

tion and its relation to standard cognitive/perceptual
mechanisms and representational content.
INTRODUCTION

The new paradigm in engineering education demands
hands-on training of students using technology-oriented
projects. The roots of this approach can be traced
back to the work of Seymour Papert in the 1970s, when
he built a programmable turtle with a reflective light
sensor (Papert, 1971). His ideas ultimately led to the
educational theory of constructionism (Papert, 1986;
Harel & Papert, 1991). According to this theory,
students learn very effectively when they are involved
in the creation of an external object that lives in the
real world. Learners use this object to think with, and
to relate ideas of their subject of inquiry (Bourgoin,
1990). From an educational point of view, the theory
of Papert can be linked to the constructivist theory of
Jean Piaget (Piaget, 1972). According to this theory,
learning comes from an active process of knowledge
construction. This knowledge can be gained through
real-life experiences and linked to a learner's previous
knowledge. The concept of the turtle was developed further
at MIT and became the famous Programmable Brick
by Fred Martin, who also developed new learning
environments and methodologies based on this concept
(Martin, 1988; Martin, 1994). The unusual
idea put forward by the Brick, at least at the time of
its invention, was the incorporation of "design"
work into the learning process. Students were not only
users in this case, but were actively involved in the
design process while solving their problems (Martin,
1996a). The 'Brick' was later adopted and incorporated
into the LEGO MINDSTORMS kit (RCX in 1998 and
NXT, made available in 2006). The name
"MINDSTORMS" can also be traced back to the book
by Seymour Papert (Papert, 1980). Versions of these
Bricks for economically challenged communities have
also been proposed recently (Sipitakiat et al., 2004).

The active learning methodology (Harmin and
Toth, 2006) uses this philosophy of involving students
in their own learning through class discussions and
group problem solving, and proves to be effective at
least in certain cases. Robots have become a major
player in this area and have been employed in improving
the quality and level of student learning, ranging
from primary schools to graduate level. As pointed out
by Resnick and Martin (1990),
"Creatures built from Electronic Bricks fall on the
fuzzy boundary between animals and machines, forcing
students to come to terms with how machines can be
like animals, and vice versa". In engineering courses
incorporating the constructionist approach, the students
are asked to design and program a robot for a specific
task. They also work in small teams and help and learn
from each other.

However, it is important to know what is currently
available to an educator so that he/she can develop the
required skills, abilities, attitudes and values in students.
In this article we identify some of the major research
centres working in the area of education utilizing robots
and discuss some of the robotic kits now available to
educators. We also comment on the famous robotic
competitions worldwide.



BACKGROUND 

Many researchers have tried to include a project-oriented
approach in the teaching of engineering subjects. This
approach has the benefit of allowing students to seek
information on their own while developing a well-defined
product. The use of robots in enhancing the quality of
education at university level has been discussed by
many authors (Takahashi et al., 2006; Gage & Murphy,
2003; Matsushita et al., 2006). Students from school to
undergraduate level have been involved in
microcontroller-based robotic projects. They can design, build
and test their robots themselves, and that helps them
later in their education. Mukai and McGregor (2004)
have gone as far as teaching
control to eighth graders in public schools.

Robots can help educators in teaching and learners
in learning various branches of basic sciences. This is
in addition to their obvious use in engineering courses.
Mathematics (algebra, geometry, matrices, calculus) and
physics (electricity, force, Newton's laws, momentum,
rotations and angular momentum) are a few examples
(Yousuf et al., 2006). Their connection to biology comes
through understanding and linking of human sensors to
robotic or electronic sensors. Bratzel (2005)
uses engineering principles to teach physics and physical
science by incorporating LEGO robots. She introduces,
in chapters of increasing difficulty, concepts of motion,
forces, fluids, stability, work and energy, etc. Bratzel
has also correlated the activities in her book with the
national science content standards (in the USA) for grades
5-12, which makes the book a good choice for educators
at that level.

The purpose of integrating robotics is not just to 
create excitement among students but to use this excite- 
ment to help them in learning what they find difficult 
to learn using conventional methods. All educators
want to develop certain abilities, values and attitudes
among their students. Some of the international
accrediting organizations, like the Accreditation Board
for Engineering and Technology, recommend the use
of a "competency-based learning" methodology for
course development (Earnest, 2005; Criteria for
Accrediting Engineering Programs, 2006). The core
of this system is that all activity (in the classroom,
laboratory or projects) must be focused on pre-stated
competencies by using structured learning objectives.
This system also demands the evaluation to be based 
on the competencies developed by the students. This 
can only be done by looking at concrete evidence (e.g., 
electrical or mechanical systems developed, software, 
technical reports, etc). Once again, the use of robots 
provides the educator with well defined competencies 
to be evaluated precisely. 



EDUCATIONAL ROBOTS 

We divide this part into subsections discussing various 
aspects of the main theme. Since each of these subjects 
is sufficiently broad in itself, we concentrate on a few 
representative cases only. 

Research Groups in the Area of 
Educational Robotics 

This is an area of immense activity and the number of
research groups active in this area is extremely large.
Almost all the robotic research groups have an interest
in the educational aspects of the subject. Many have
tried to involve their own undergraduate students in
the process and have gained new and deeper insights
into student behaviour and learning.

Massachusetts Institute of Technology 

MIT has been active in the area of robotics for a long
time, but it was in 1989 that Fred Martin (Martin, 2001)
started a worldwide movement in educational robotics
by introducing his now famous undergraduate design
contest. This was also the launch of the corresponding
robot "brain" called the Handyboard (Handyboard).
It is now being used by educators worldwide together
with the Interactive C language to program the system
(Butler et al., 2006). This system is powerful enough to
have industrial applications. The work of Fred Martin
was continued later with Mitchel Resnick's Life-Long
Kindergarten group (Kindergarten Group). This work
was partly sponsored by the LEGO Group and became
the foundation for the LEGO MINDSTORMS Robotics
Invention System (to be discussed later).

NASA Robotics Alliance Project 

The Robotics Alliance Project is an initiative of
NASA, the National Aeronautics and Space Administration
of the USA. It is based on the idea that NASA is
going to need many more robot engineers for its space
endeavours in the future and that the only way to have
quality engineers in the future is to invest in their training
(RAP). Hence the project starts at the K-12 level
and works through a variety of robotics programs,
competitions and curriculum development. Its web
site offers links to curriculum resources from
primary to doctoral level. It also lists some of the major
robotic competitions, internship opportunities for
students, etc. NASA also provides video and webcast
archives for educators.

Carnegie-Mellon University 

The Robotics Institute at Carnegie-Mellon University
is one of the largest of its type and has various
projects with an educational impact. The CREATE
project (Community Robotics, Education and Technology
Empowerment) has research programs in curriculum
design for teaching robot programming at the secondary
school level and beyond. They are also developing
curricula that will help middle and high school educators.
Another valuable contribution by the same group is the
development of a fully accredited robotic exploration
course for high school juniors. The course is offered in
summer and allows students to build robots using special
fast-build kits. These kits have also been designed at CMU
and even include a vision system, allowing students to
develop rover missions in the classroom and home
environments (Nourbakhsh et al., 2005). Students then
take the robot home with them, so they can
keep working on the subject after leaving the center.

Fraunhofer Institut Intelligente Analyse und
Informationssysteme

The Fraunhofer AIS in Germany (Fraunhofer),
sponsored by the Federal Ministry of Education and
Research, is active in the educational aspects of robotics.
They have developed a robot called Roberta,
which conveys knowledge about engineering and
computer science to youngsters in an exciting way. Their
particular focus is the female population. Dozens of tutors
have been guided in the use of this methodology and
a few hundred students (more than three-fourths of them
girls) have been trained. Tutors get training at the Fraunhofer
AIS with specially developed teaching material to support
learning. A national network of regional Roberta
centers is being established to support tutors locally,
to ensure nation-wide exchange of experience, and to
disseminate the results of this project.

Educational Robot Kits 

Educational robotic kits provide the users everything 
needed to build and program a robot. Some of them are 
more flexible than others, but each comes with its own 
programming language or programming environment. 
We discuss here some of them. 

LEGO MINDSTORMS RCX and
NXT Robots

As mentioned in the introduction above, this was an idea
developed at MIT and introduced to the mass market
by the LEGO Group (LEGO) in 1998. The LEGO
MINDSTORMS robots are perhaps the simplest kits
to start with, yet they are general and broad enough to
be used as pedagogic platforms for training even at the
university level. Banking upon the students' familiarity
with LEGO (those who are not familiar need very
little extra time to start with), and utilizing a specially
developed, highly visual programming language, the
system helps kids from six years upward to learn
and enjoy robotics. The system comes complete with
online tutorials and is backed by innumerable web
sites, books and tutorials. The newer version is called
MINDSTORMS NXT and is even more flexible and
powerful, with some new sensors added and some of
the older sensors upgraded. The new servo motors
have been fitted with rotation sensors, allowing precise
position control.

Parallax Systems

Parallax, Inc. (Parallax) is a developer of electronic
systems (including robots), generally for higher-level
students, though they do have systems for ages eight
and above. The Scribbler™ Robot, for example, is meant for
first-time programmers and roboticists aged eight and up.
The more advanced systems designed for experienced
users include the Toddler® Robot, QuadCrawler and
HexCrawler robots. The big advantage for educators
is the large number of books, manuals and curriculum
material available for these systems. Most of it is
available free from the company web site and/or included
in the kits.

Fischertechnik Reconfigurable Robot Kits

The Fischertechnik systems (Fischerwerke Artur
Fischer GmbH), developed in Germany, are some of
the most advanced robotic invention systems available.
These systems allow students of all ages (even
adults) to enjoy the field with flexible robotic kits. They
also provide curriculum material for those willing to
incorporate Fischertechnik into their classes. These
systems are easy to program and come with a variety
of sensors, motors, LEDs, etc. The kits can also be used to
teach advanced concepts in engineering, including
programmable logic controllers (PLCs).



Robotic Competitions 

Robotic competitions are an ideal way to keep the
interest of the students alive and to give them a
well-defined target to achieve. Most of the competitions also
give very strict guidelines as to what can be used in
the construction of the system and who can participate.
Many of the international competitions discourage
teacher participation in the final presentation and hence
allow students to grow and develop into self-responsible
persons. These competitions also grade students based
on their group work, cleanliness, and presentation skills.
In a nutshell, they are an excellent way of "standardizing"
curriculum and assessment.

According to the Manchester-based organization
For Inspiration and Recognition of Science and
Technology (FIRST), the FIRST Robotics Competition
involved around 32,500 students in 2007. The
junior version, called the FIRST LEGO League, has
been designed for children in the 9-14 age group.
An estimated 88,000 children participated
in this activity in 2006. There are dozens of other local
robotic competitions all over the world for which no
statistics are available. However, a search on any of
the Internet search engines brings up a large number of
pages (e.g. the Robot Competition FAQ). Many of these
competitions are confined to a university, college or school,
but in many cases the models can be followed and replicated
at other places. The famous 6.270 Autonomous Robot
Design Competition at MIT (MIT 6.270) is a good example
and has been running successfully for more than
two decades. Most of the research groups mentioned
above may also be contacted, as they frequently arrange
national-level competitions.



FUTURE TRENDS 

The field of educational robots is full of promising
directions. One important factor is the development of
generic robot systems and the standardization of the
corresponding robot control software. Microsoft Robotics
Studio (Microsoft Robotics Studio) is one of the most
recent efforts in the direction of software standardization.
It allows any robot to be controlled through a
single platform. Companies like Parallax have already
developed free examples for users to try on their Boe-Bot.
On the hardware side, too, we are going to see more
modular systems with flexibility and extensibility. The
MIT Tower (Lyon, 2003) is a typical product of this
type. Though not yet commercially available, it allows
the user to start with a basic system and then to go on
adding functionality depending upon requirements.
Currently available modules are for sensing, actuation,
data storage, and infrared communication. The developers
plan to add new ones for enhanced display output,
high-speed wireless communication, etc.



CONCLUSION 

In this article we have attempted to provide a
brief overview of the state of the art in educational
robotics, the work of major research groups and the
offerings of commercial vendors for educators.
Comments on the future directions in educational robotics
have been made. Most of the efforts in this direction
have started to take advantage of the experiences of
others, and we hope to see more balanced and well
thought-out curricula developed for each of the
major areas of basic sciences.
KEY TERMS

Competency Based Learning: A system of learning
which represents a dynamic mixture of knowledge,
understanding, capacity and ability. The competencies
are measurable outcomes of learning and hence can be
evaluated at the end of the process.

Constructionism: According to the constructionist
learning theory, people learn most effectively when they
are involved in the creation of an external artefact in the
world. This artefact becomes an "object to think with,"
which is used by the learner to explore and embody
ideas related to the topic of inquiry (Martin, 1996b).

Constructivism: An educational theory or school of 
learning, based on the idea that knowledge is constructed 
by the learner based on mental activity. Learners create 
a mental image of how the world operates and they 
adapt and transform their understanding using their 
earlier knowledge. 

Industrial Robotic Manipulators: Mechanical 
arms used in industry, with sensor feedback and auto- 
matic control software. 

Mobile Robots: Robots with the capability to move 
autonomously from one place to the other, including 
wheeled, legged, submerged and flying robots, etc. 

Pedagogy: The art (or science) of being a teacher 
but commonly referred to as the technique used in 
instruction. 



Active Learning: The methodology which demands 
students to participate actively in their own learning, 
guided and supervised by the educator. 
INTRODUCTION

Feedforward neural networks (FFNs) are often considered
universal tools and find applications in
areas such as function approximation, pattern recognition,
or signal and image processing. One of the main
advantages of using FFNs is that they usually do not
require, in the learning process, exact mathematical
knowledge about input-output dependencies. In other
words, they may be regarded as model-free approximators
(Hornik, 1989). They learn by minimizing some
kind of error function to fit the training data as closely as
possible. Such a learning scheme does not take into account
the quality of the training data, so its performance
depends strongly on whether the assumption that the data
are reliable and trustworthy holds. This is
why, when the data are corrupted by large noise,
or when outliers and gross errors appear, the network
builds a model that can be very inaccurate.

In most real-world cases the assumption that errors
are normal and independent and identically distributed
(iid) simply does not hold. The data
obtained from the environment are very often affected
by noise of unknown form or by outliers, suspected to be
gross errors. The quantity of outliers in routine data
ranges from 1 to 10% (Hampel, 1986). They usually
appear in data sets while the information is being obtained
and pre-processed, when, for instance, measurement
errors, long-tailed noise, or human mistakes
may occur.

Intuitively we can define an outlier as an observation
that significantly deviates from the bulk of the data.
Nevertheless, this definition does not help in classifying
an outlier as a gross error or a meaningful and important
observation. To deal with the problem of outliers
a separate branch of statistics, called robust statistics
(Hampel, 1986; Huber, 1981), was developed. Robust
statistical methods are designed to act well when the
true underlying model deviates from the assumed
parametric model. Ideally, they should be efficient and
reliable for the observations that are very close to the
assumed model and simultaneously for the observations
containing larger deviations and outliers.

The other way is to detect and remove outliers
before the beginning of the model building process.
Such methods are more universal but they do not take
into account the specific type of modeling philosophy
(e.g. modeling by FFNs). In this article we propose
a new robust FFN learning algorithm based on the least
trimmed squares estimator.



BACKGROUND 

The most popular FFN learning scheme makes use of
the backpropagation (BP) strategy and minimization
of the mean squared error (mse). Until now, several
robust BP learning algorithms have been proposed.
Generally, they take advantage of the idea of
robust estimators. This approach was adapted to
neural network learning algorithms by replacing the
mse with a loss function of such a shape that the
impact of outliers may be, under certain conditions, reduced
or even removed.

Chen and Jain (1994) proposed Hampel's hyperbolic
tangent as a new error criterion, with the scale
estimator β that defines the interval supposed to contain
only clean data, depending on the assumed quantity of
outliers or current error values. This idea was combined
with the annealing concept by Chuang and Su (2000),
who applied an annealing scheme to decrease the value
of β. Liano (1996) introduced the logistic error
function derived from the assumption that the errors are
generated with the Cauchy distribution. In a recent
work Pernia-Espinoza et al. (2005) presented an error
function based on tau-estimates. An approach based on
the adaptive learning rate was also proposed (Rusiecki,
2006). Such modifications may significantly improve
the network performance for corrupted training sets.
However, even these approaches suffer from several
difficulties and cannot be considered universal (also
because of properties of the applied estimators). Besides,
very few of them have been proposed until today and
they exploit the same basic idea, so we still need to
look for new solutions.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Robust Learning Algorithm with LTS Error Function



ROBUST LTS LEARNING ALGORITHM

Least Trimmed Squares

The least trimmed squares estimator (LTS), introduced
by Rousseeuw (1984, 1985), is a classical high
breakdown point robust estimator, similar to the slower
converging least median of squares (LMS) (Rousseeuw,
1984). The estimator and its evaluations are often used
in linear and nonlinear regression problems, in sensitivity
analysis, small-sample corrections, or simply in
detecting outliers. The main difference between the
LTS estimator and the least sum of squares, but also
M-estimators, is the operation performed on the
residuals. In this case, robustness is achieved
not by replacing the square by another function but by
replacing the full summation with a sum over only the
smallest ordered squared residuals.
The nonlinear least trimmed squares estimator is then
defined as:



$$\hat{\theta} = \arg\min_{\theta \in R^p} \sum_{i=1}^{h} (r^2)_{i:n} \qquad (1)$$

where $(r^2)_{1:n} \le \dots \le (r^2)_{n:n}$ are the ordered squared
residuals $r_i^2(\theta) = \{y_i - \eta(x_i, \theta)\}^2$, $y_i$ represents the dependent
variable, $x_i = (x_{i1}, \dots, x_{ik})$ the independent input vector,
and $\theta \in R^p$ denotes the underlying parameter vector for
the general nonlinear regression model. The trimming
constant h must be chosen as $n/2 < h < n$ so that the n-h
observations with the largest residuals do not directly
affect the estimator. Under certain assumptions the
estimator should be robust not only to outliers (Stromberg,
1992) but also to leverage points (grossly aberrant
values of $x_i$) (Rousseeuw, 1987).
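
As a minimal illustration of the objective in Eq. (1), the following Python sketch evaluates the sum of the h smallest squared residuals for a fixed set of residuals and compares it with the ordinary sum of squares. The function name `lts_objective` and the sample residuals are illustrative assumptions, not part of the original work, and the minimization over the parameter vector is not shown.

```python
def lts_objective(residuals, h):
    """Sum of the h smallest squared residuals (the quantity minimized in Eq. 1)."""
    n = len(residuals)
    if not 0 < h <= n:
        raise ValueError("h must satisfy 0 < h <= n")
    ordered_squares = sorted(r * r for r in residuals)
    return sum(ordered_squares[:h])


# Hypothetical residuals: four small ones and one gross error.
residuals = [0.1, -0.2, 0.05, 0.15, 10.0]

sum_of_squares = sum(r * r for r in residuals)  # dominated by the gross error
trimmed = lts_objective(residuals, h=4)         # ignores the largest residual

print("least squares objective:", sum_of_squares)
print("LTS objective (h=4):    ", trimmed)
```

With h = 4 out of n = 5 the single large residual is excluded from the sum, which is the sense in which the n-h largest residuals do not directly affect the estimator.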

Derivation of the LTS Algorithm 

For simplicity, let us consider a simple three-layer
feedforward neural network with one hidden layer. The
net is trained on a set of n training pairs:

$\{(x_1, t_1), (x_2, t_2), \dots, (x_n, t_n)\}$,

where $x_i \in R^p$ and $t_i \in R^q$. For the given input vector
$x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$, the output of the jth neuron of the
hidden layer may be obtained as:

$$z_{ij} = f_1\left(\sum_{k=1}^{p} w_{jk} x_{ik} - b_j\right) = f_1(inp_{ij}), \quad \text{for } j = 1, 2, \dots, l, \qquad (2)$$



where $f_1(\cdot)$ is the activation function of the hidden
layer, $w_{jk}$ is the weight between the kth net input and
the jth neuron, and $b_j$ is the bias of the jth neuron. Then
the output vector of the network $y_i = (y_{i1}, y_{i2}, \dots, y_{iq})^T$
is given as:

$$y_{iv} = f_2\left(\sum_{j=1}^{l} w'_{vj} z_{ij} - b'_v\right) = f_2(inp_{iv}), \quad \text{for } v = 1, 2, \dots, q. \qquad (3)$$



Here $f_2(\cdot)$ denotes the activation function, $w'_{vj}$ is
the weight between the vth neuron of the output layer
and the jth neuron of the hidden layer, and $b'_v$ is the
bias of the vth neuron of the output layer. Now, we
introduce the robust LTS error criterion, based on the
Least Trimmed Squares estimator. The new error function
is defined as:

$$E_{LTS} = \sum_{i=1}^{h} (r^2)_{i:n} \qquad (4)$$

In this case, $(r^2)_{1:n} \le \dots \le (r^2)_{n:n}$ are ordered squared
residuals of the form

$$r_i^2 = \left\{\sum_{v=1}^{q} |y_{iv} - t_{iv}|\right\}^2 \qquad (5)$$

The trimming constant h must be carefully chosen
because it determines the number of patterns
suspected to be outliers.
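
The forward pass and the per-pattern residual described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the hidden activation is taken to be tanh and the output activation linear, and the names `forward` and `residual`, the weights, and the toy patterns are hypothetical rather than taken from the original experiments.

```python
import math


def forward(x, w_hidden, b_hidden, w_out, b_out,
            f1=math.tanh, f2=lambda s: s):
    """One-hidden-layer forward pass; biases are subtracted, as in the text.

    f1 (tanh) and f2 (identity) are assumed activations for this sketch.
    """
    # Hidden layer: z_j = f1(sum_k w_jk * x_k - b_j)
    z = [f1(sum(w * xk for w, xk in zip(row, x)) - bj)
         for row, bj in zip(w_hidden, b_hidden)]
    # Output layer: y_v = f2(sum_j w'_vj * z_j - b'_v)
    y = [f2(sum(w * zj for w, zj in zip(row, z)) - bv)
         for row, bv in zip(w_out, b_out)]
    return z, y


def residual(y, t):
    """Residual r_i of one pattern: sum over outputs of |y_v - t_v|."""
    return sum(abs(yv - tv) for yv, tv in zip(y, t))


# Hypothetical toy network: 2 inputs, 1 hidden neuron, 1 output.
w_hidden, b_hidden = [[0.0, 0.0]], [0.0]
w_out, b_out = [[1.0]], [-0.5]

# Hypothetical training pairs (x_i, t_i); the second target acts as an outlier.
patterns = [([1.0, 2.0], [0.5]), ([0.0, 1.0], [3.5])]

residuals = [residual(forward(x, w_hidden, b_hidden, w_out, b_out)[1], t)
             for x, t in patterns]
print(residuals)
```

The squared values of these residuals, sorted in ascending order, are the terms that enter the trimmed sum of the LTS error function.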

We assume, for simplicity, that weights are updated
according to the gradient-descent learning algorithm,
but this can be extended to any other gradient-based
algorithm. Then to each weight is added ($\alpha$ denotes a
learning coefficient):

$$\Delta w_{jk} = -\alpha \, \partial E_{LTS} / \partial w_{jk}, \qquad (6)$$

$$\Delta w'_{vj} = -\alpha \, \partial E_{LTS} / \partial w'_{vj}, \qquad (7)$$

where

$$\partial r_i / \partial w_{jk} = f_2'(inp_{iv}) \, w'_{vj} \, f_1'(inp_{ij}) \, x_{ik}, \qquad (8)$$

and

$$\partial r_i / \partial w'_{vj} = f_2'(inp_{iv}) \, z_{ij}. \qquad (9)$$

The main problem that may occur here is calculating
the $E_{LTS}$ derivative. It is not continuous and it can
be written as:

$$\frac{\partial E_{LTS}}{\partial r_i} = \begin{cases} 2 r_i & \text{for } r_i^2 \le (r^2)_{h:n}, \\ 0 & \text{for } r_i^2 > (r^2)_{h:n}. \end{cases} \qquad (10)$$

As was experimentally demonstrated, this shape
of the derivative function is smooth enough for the BP
learning algorithm.

In the use of robust learning algorithms, there are
some problems, concerning mainly the choice of a
starting point for the method. In fact, we can divide it
into two tasks: choosing initial network parameters,
and choosing the right scale estimator. If the initial
weights of the network are not properly selected, the
learning process may move in the wrong direction
and the algorithm may get stuck in a local minimum. In
this case the network performance might become very
poor. The scale estimator or its equivalent (here, the
trimming constant h) is responsible for the number of
outliers that are to be rejected during the training. It is
clear, then, that if h is chosen incorrectly, gross errors
may be regarded as good data and desired points may
be rejected.

Following (Chen and Jain, 1994), we decided to
use our LTS robust algorithm after a period of training
by the traditional BP algorithm to set the initial
parameters. We proposed two strategies of choosing
the trimming parameter h. In the first approach (LTS1) we
assumed a predefined value of h, depending on the expected
percentage of outliers in the training data. In
this case, additional a priori knowledge of the error
distribution is needed, so the strategy is of limited practical use.
The second approach (LTS2) is to choose h by using
the median of all errors as:

$$h = \left| \{ r_i : |r_i| < c \cdot \mathrm{median}(|r_i|), \; i = 1, \ldots, n \} \right|, \qquad (11)$$

where c=1.483 for the MAD scale estimate (Huber, 
1981). Errors used for calculating h were the errors 
obtained after the last epoch of the traditional back- 
propagation algorithm, so the value of h is set constant 
for the training process. 
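
The LTS2 choice of h and the piecewise derivative of the LTS error can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the residuals are assumed to be those left after the preliminary backpropagation phase, and ties at the cut-off are kept.

```python
from statistics import median

C_MAD = 1.483  # constant quoted in the text for the MAD scale estimate


def trimming_constant(residuals, c=C_MAD):
    """Number of residuals with |r_i| < c * median(|r_i|), the LTS2 rule for h."""
    threshold = c * median(abs(r) for r in residuals)
    return sum(1 for r in residuals if abs(r) < threshold)


def lts_derivative(residuals, h):
    """dE_LTS/dr_i: 2*r_i for the h smallest squared residuals, 0 otherwise."""
    if h <= 0:
        return [0.0 for _ in residuals]
    cutoff = sorted(r * r for r in residuals)[h - 1]  # (r^2)_{h:n}
    return [2.0 * r if r * r <= cutoff else 0.0 for r in residuals]


# Illustrative residuals with one gross error (8.0).
residuals = [0.1, -0.2, 0.05, 8.0, -0.15]
h = trimming_constant(residuals)      # fixed once, before robust training starts
grads = lts_derivative(residuals, h)  # the gross error receives a zero gradient
```

Because h is computed once from the errors of the last backpropagation epoch, only the ordering of the squared residuals changes from epoch to epoch, not the number of patterns that are trimmed.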

Simulation Results 

The LTS learning algorithm was tested on function
approximation tasks. In this paper we present only a few
of the many testing situations. The first function
to be approximated is $y = x^{-2/3}$, proposed by Chen and
Jain (1994); the second one is a two-dimensional spiral
given as $x = \sin y$, $z = \cos y$.

To simulate real data containing noise and outliers 
we used different models, defined as follows: 

• Clean data without noise and outliers;
• Data corrupted with the Gross Error Model (Type 1):
$F = (1-\delta)G + \delta H$, where $F$ is the error distribution,
and $G \sim N(0.0, 0.1)$ and $H \sim N(0.0, 10.0)$ are Gaussian
noise and outliers, occurring with probability $1-\delta$
and $\delta$ respectively;
• Data with high value random outliers (Type 2),
proposed in (Pernia-Espinoza et al., 2005), of
the form $F = (1-\delta)G + \delta(H_1 + H_2 + H_3 + H_4)$, where
$H_1 \sim N(15, 2)$, $H_2 \sim N(-20, 3)$, $H_3 \sim N(30, 1.5)$,
$H_4 \sim N(-12, 4)$;
• Data with outliers generated from the Gross Error
Model, injected into the input vector $x_i$ (Type 3).
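
As an illustration of the Gross Error Model used for the Type 1 data, the sketch below draws errors from the mixture $F = (1-\delta)G + \delta H$. It rests on assumptions: the second parameter of each normal distribution is treated as a standard deviation (the text does not say whether it is a variance), and the function name and seed are hypothetical.

```python
import random


def gross_error_noise(n, delta, rng, sd_noise=0.1, sd_outlier=10.0):
    """Draw n errors: with probability delta from H (outliers), otherwise from G (noise)."""
    errors = []
    for _ in range(n):
        if rng.random() < delta:
            errors.append(rng.gauss(0.0, sd_outlier))  # gross error component H
        else:
            errors.append(rng.gauss(0.0, sd_noise))    # ordinary noise component G
    return errors


rng = random.Random(0)  # fixed seed so the contaminated set is reproducible
noise_type1 = gross_error_noise(100, delta=0.1, rng=rng)
```

Adding these errors to the target values gives data of Type 1; adding them to the inputs instead corresponds to the Type 3 setting described above.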

The performances of the traditional backpropagation
algorithm (BP), the robust LMLS algorithm, and both
variations of the novel robust LTS algorithm, LTS1
and LTS2, were compared.

Looking at Table 1, we can see that for the clean
data of the first task all algorithms perform relatively well.
For the data containing gross errors, the two variations
of LTS perform best, and it is hard
to say which of them is better. For the data with
high value outliers, only LTS2 and LMLS ensure a good
fit to the testing data, while LTS1, though still better
than the BP algorithm, performs rather poorly.

For the data containing outliers injected into the input,
the algorithms LTS1 and LTS2 presented the best
performance, and the error of LTS1 is over 25%
lower than for LMLS and BP.

Table 1. The mean MSE for the 100 trials for the networks trained to approximate a function of one variable

              Clean     Data with gross      Data with high value   Data with gross errors in
              data      errors (Type 1)      outliers (Type 2)      the input vector (Type 3)
Algorithm     δ=0.0     δ=0.1     δ=0.2      δ=0.1     δ=0.2        δ=0.1     δ=0.2
BP            0.0007    0.0398    0.0809     1.7929    4.0996       0.0140    0.0180
LMLS          0.0007    0.0061    0.0088     0.0050    0.0053       0.0151    0.0177
LTS1          -         0.0054    0.0056     0.0632    0.1454       0.0104    0.0120
LTS2          0.0013    0.0049    0.0067     0.0051    0.0061       0.0112    0.0149

Table 2. The mean MSE for the 100 trials for the networks trained to approximate a two-dimensional spiral

              Clean     Data with gross      Data with high value   Data with gross errors in
              data      errors (Type 1)      outliers (Type 2)      the input vector (Type 3)
Algorithm     δ=0.0     δ=0.1     δ=0.2      δ=0.1                  δ=0.1     δ=0.2
BP            0.0000    0.3967    0.7722     24.9154                0.0014    0.0057
LMLS          0.0000    0.0584    0.1442     0.0682                 0.0006    0.0034
LTS1          -         0.0318    0.0390     1.7108                 0.0001    0.0023
LTS2          0.0006    0.0284    0.0534     0.0311                 0.0007    0.0023

Figure 1. Simulation results for the network trained to approximate a one-dimensional function (data Type 1):
backpropagation algorithm (dash-dot line), LMLS alg. (dashed line), LTS1 alg. (dotted line), LTS2 alg. (solid
line)

Figure 2. Simulation results for the network trained to approximate a one-dimensional function (data Type 3):
backpropagation algorithm (dash-dot line), LMLS alg. (dashed line), LTS1 alg. (dotted line), LTS2 alg. (solid
line)

Figure 3. Simulation results for the network trained to approximate a two-dimensional spiral (data Type 1):
backpropagation algorithm (dash-dot line), LMLS alg. (dashed line), LTS1 alg. (dotted line), LTS2 alg. (solid
line)

Results obtained for the second approximation task
are, generally, similar. For the data containing outliers,
the superiority of the LTS algorithm is clearly evident.
LTS2 also performs well for the high value outliers,
showing the lowest error. For gross errors in
the input vector, LTS1 and LTS2 again appear to be
the best.

To summarize, both LTS algorithms
performed better than the other two
algorithms for the data containing gross errors in the
input as well as in the output vector.



FUTURE TRENDS 

Potentially, robust learning algorithms based on a modified
error function can also be designed for training
other types of NN structures, such as recurrent or
self-organizing networks. Moreover, for FFNs there are
plenty of techniques (adaptive learning rate, transfer
functions, etc.) that can be used to make their learning
process more robust to outliers.



CONCLUSION 

In this paper a novel robust LTS learning algorithm
was proposed. As was experimentally demonstrated,
it behaves better than the traditional algorithm and the
robust LMLS algorithm in the presence of outliers in
the training data. Moreover, it is the
first robust learning algorithm that also takes into account
gross errors injected into the input vector of the
training patterns (leverage points). Especially in its
second version (LTS2), with the median error used to set
the trimming constant h, it can be considered a simple
and effective means to increase learning performance
on contaminated data sets. It does not need any
additional a priori knowledge of the assumed error
distribution to ensure relatively good training results
in any conditions. The robust LTS learning algorithm
can be easily adapted to many types of neural network
learning strategies.
KEY TERMS

Feedforward Neural Networks: Artificial NN 
consisting of units arranged in layers with only forward 
connections to units in subsequent layers. 

Gross Errors: Large value errors, often caused by 
human mistakes, measurement errors, etc. 



Leverage Points: Grossly aberrant values of measured
or assumed system inputs.

Outlier: Observation that is significantly different
from the majority of the data.

Robust Estimator: Estimator able to classify data
into outliers and clean observations, and to find a
reasonable fit to the bulk of data.

Robust Learning Algorithm: NN learning
algorithm that can perform well even if outliers or leverage
points are present in the training sets.

Robust Statistics: Part of statistics developing
methods that should give useful results when certain
assumptions (for example, of i.i.d. light-tailed errors)
are relaxed.
INTRODUCTION

Neuro-fuzzy hybridization is the oldest and most popular
methodology in soft computing (Mitra & Hayashi,
2000). It is known in the literature as Fuzzy
Neural Networks or Neuro-Fuzzy Systems (NFS)
(Lin & Lee, 1996; Mitra & Hayashi,
2000). An NFS is capable of abstracting a fuzzy model
from given numerical examples using neural learning
techniques in order to formulate accurate predictions on unseen
samples. The fuzzy model incorporates the human-like
style of fuzzy reasoning through a linguistic model that
comprises if-then fuzzy rules and linguistic terms
described by membership functions. Hence, the main
strength of NFS in modeling data is universal
approximation (Tikk, Koczy, & Gedeon, 2003) with the ability
to solicit interpretable if-then fuzzy rules (Guillaume,
2001). However, modeling data using NFS involves
the contradictory requirements of interpretability versus
accuracy. NFS that focus on accuracy typically either
employ optimization that results in membership
functions that deviate from human-interpretable
linguistic terms, or employ a large number of if-then
fuzzy rules on high-dimensional data, exceeding
the human capacity for interpretation.

This article presents a novel hybrid intelligent 
Rough set-based Neuro-Fuzzy System (RNFS). RNFS 
synergizes the sound concept of knowledge reduction 
from rough set theory with NFS. RNFS reinforces the 
strength of NFS by employing rough set-based tech- 
niques to perform attribute and rule reductions, thereby 
improving the interpretability without compromising 
the accuracy of the abstracted fuzzy model. 



BACKGROUND 

The core problem in soft computing is bridging
the gap between subjective knowledge and objective
data (Dubois & Prade, 1998). There are two approaches
to addressing this problem: modeling data,
in which a function is built to accurately mimic the
data, and abstracting data, in which a system is built to
produce articulated knowledge, preferably in natural
language form (Dubois & Prade, 1998). The emphasis
of the former is on the ability to reproduce what has
been observed. Neural networks, with their prominent
learning capabilities inspired by biological systems,
are highly suitable for this approach. On the other hand,
the emphasis of the latter is on the ability to explain
the data in a human-interpretable way. Fuzzy systems,
with their ability to model linguistic terms that are
expressions of human language, are likewise highly
effective in this approach. In fuzzy systems, linguistic
expressions are formulated from explicit knowledge
in the form of if-then fuzzy rules where the linguistic
terms of the antecedents and consequents are fuzzy
sets. However, the parameters of these linguistic
expressions are sometimes difficult to specify and have
to be manually tuned. In contrast, although neural
networks are capable of learning from data, they are
black-box models, and thus soliciting knowledge from
neural networks is not a straightforward task. Hence,
a neural network is capable of modeling data, but a
user cannot learn from it. On the other hand, a user
can learn from a fuzzy system, but it is not capable of
learning from data.

Neuro-fuzzy hybridization synergizes these two
techniques by combining the human-like reasoning
style of fuzzy systems with the learning and
connectionist structure of neural networks. Thus, Neuro-Fuzzy
Systems (NFS) are gray-box models that are capable
of abstracting a fuzzy model from given numerical
examples using neural learning techniques. Hence, a
Neuro-Fuzzy System learns and, at the same time, a user
can learn from it as well. However, the use of NFS in
abstracting data involves two contradictory
requirements in fuzzy modeling: interpretability versus
accuracy (Casillas, Cordon, Herrera, & Magdalena, 2003).
In practice, only one of the two properties prevails.
Hence, NFS can be classified as linguistic NFS, which are
focused on interpretability and mainly use the Mamdani
model (Mamdani & Assilian, 1975), and precise NFS,
which are focused on accuracy and mainly use the
Takagi-Sugeno-Kang model (Takagi & Sugeno, 1985).

Prevailing research on modeling data using linguistic
NFS focused on increasing accuracy as much as possible
but neglected interpretability (Casillas et al., 2003).
Existing linguistic NFS such as FALCON (Lin & Lee,
1996), POPFNN (Quek & Zhou, 2001) and GenSoFNN
(Tung & Quek, 2002) employ the hybrid learning
approach to abstract a model from numerical data. In this
approach, clustering is used in the first stage to generate
the membership functions and competitive learning is
used to identify the if-then fuzzy rules, followed by
supervised learning that uses backpropagation in the
final stage to optimize the membership functions. The
unconstrained optimization in the final stage increases
the accuracy of the abstracted model, but it results
in membership functions that deviate from
human-interpretable linguistic terms (de Oliveira, 1999).
Although the definition of interpretability and its criteria
are subject to controversial discussion, interpretable
linguistic variables are often associated with the shape
and mutual overlapping of the membership functions
(Mikut, Jakel, & Groll, 2005). Nevertheless, formal
definitions of the semantic properties of interpretable
linguistic variables have been proposed (Mikut et al., 2005;
de Oliveira, 1999), namely coverage, normalized,
convex and ordered. Interpretability is vital to NFS in
modeling data because, if it is neglected, they degenerate into
black-box models in which the advantages over other
methods such as neural networks are lost (Casillas et
al., 2003; Mikut et al., 2005). Therefore, abstracting a
fuzzy model that is not humanly interpretable defeats
the fundamental purpose of using NFS.

In addition, a large number of if-then fuzzy rules are
required to model high-dimensional data, which in turn
exceeds the human interpretation capacity (Casillas et
al., 2003). This interpretability issue with large numbers
of if-then rules motivates the complexity reduction of
NFS. This is similar to the problems encountered by
numerical data-driven techniques in data mining (Han
& Kamber, 2001). These techniques rely on heuristics
to guide or reduce their search space horizontally or
vertically (Lin & Cercone, 1997). Horizontal reduction
is realized by the merging of identical data tuples
or the quantization of continuous numerical values,
while vertical reduction is realized by feature selection
methods. In linguistic NFS, the former corresponds
to the conversion of numerical inputs from a continuous
range to a finite number of linguistic terms using
membership functions, while the latter corresponds to
fuzzy if-then rule pruning and reduction. In some existing
linguistic neuro-fuzzy systems, vertical reduction is
employed by identifying fewer if-then fuzzy rules using
a certain heuristic threshold (Quek & Zhou, 2001), or by
applying pruning based on certainty factors (Tung &
Quek, 2002). However, if the number of if-then rules
is bounded as a practical limitation through the use of
heuristic thresholds, then the universal approximation
property is lost (Moser, 1999).

Recently, rough set theory (Pawlak, 1991), one of the
methodologies in soft computing, has been shown to provide
efficient techniques for finding hidden patterns in data
(Pawlak, 2002). Rough set-based methods have shown
the potential for feasible feature selection with the
ability to significantly reduce the pattern dimensionality in
neural networks. This motivates the synergy of rough
set-based methods with NFS to increase the
interpretability of the abstracted model without compromising
the accuracy.



ROUGH SET-BASED NEURO-FUZZY 
SYSTEM 

This article presents the hybrid intelligent Rough set- 
based Neuro-Fuzzy System (RNFS) (Ang & Quek, 
2006a), which synergizes the sound concept of knowl- 
edge reduction in rough set theory with the human-like 
reasoning style of fuzzy systems and the learning and 
connectionist structure of neural networks. Details on 
the architecture and learning process of the RNFS are 
described in the following sections. 

Architecture of RNFS 

Figure 1. Architecture of rough set-based neuro-fuzzy system

The architecture of the RNFS is a five-layer neural
network shown in Figure 1. Its architecture is
developed using the Pseudo Outer-Product based Fuzzy
Neural Network using the Compositional Rule of
Inference and Singleton fuzzifier (POPFNN-CRI(S))
(Ang, Quek, & Pasquier, 2003) as a foundation. For
simplicity, only the interconnections for the output $y_m$
are shown. Each layer in RNFS performs a specific
fuzzy operation, so the nodes and operations of each
layer are noted with a superscript of I to V for clarity.
The inputs and outputs of the RNFS are represented
as the non-fuzzy vector $X = [x_1, x_2, \ldots, x_i, \ldots, x_{n_1}]$ and the
non-fuzzy vector $Y = [y_1, y_2, \ldots, y_m, \ldots, y_{n_5}]$ respectively. The
fuzzification of the inputs and the defuzzification of
the outputs are performed by the input
and output linguistic layers respectively, while the
reasoning mechanism is collectively performed by the
condition, rule and consequence layers. Each rule node
$R_k^{III}$ in Figure 1 is linked to only one input-label node
from each input node and only one output-label node
from each output node. The links of rule nodes to the
antecedent in the condition layer and the consequent in
the consequence layer are mathematically denoted as
sets $(C, D)$. The antecedent of rule nodes is represented
by a set $C = \{c_1, c_2, \ldots, c_i, \ldots, c_{n_1}\}$, where $c_i$ is a condition
variable such that $c_i \in IL_i$. The set of condition labels
$c_i$ can assume is semantically represented by the set of
input-label nodes $IL_i = \{IL_{i,1}^{II}, IL_{i,2}^{II}, \ldots, IL_{i,j}^{II}, \ldots, IL_{i,J_i}^{II}\}$,
but a computational notation of $IL_i = \{1, 2, \ldots, j, \ldots, J_i\}$
is used in this article. Similarly, the consequent of rule
nodes is represented by a set $D = \{d_1, d_2, \ldots, d_m, \ldots, d_{n_5}\}$,
where $d_m$ is a consequent variable such that $d_m \in OL_m$.
The set of consequent labels $d_m$ can assume is
semantically represented by the set of output-label nodes
$OL_m = \{OL_{m,1}^{IV}, OL_{m,2}^{IV}, \ldots, OL_{m,l}^{IV}, \ldots, OL_{m,L_m}^{IV}\}$, but a
computational notation of $OL_m = \{1, 2, \ldots, l, \ldots, L_m\}$ is used in
this article. The specific links of a rule node $R_k^{III}$ to the
antecedent in the condition layer and the consequent in
the consequence layer are denoted as $(C_k, D_k)$.

The novel characteristics of RNFS are:

• Vertical reduction of if-then fuzzy rules is performed
in RNFS: the Rough Set-based
Pseudo Outer-Product (RSPOP) algorithm (Ang
& Quek, 2005) is used to identify if-then fuzzy
rules and to perform attribute reduction and rule
reduction using rough set-based knowledge reduction.
This vertical reduction process is performed
autonomously without relying on user-defined
heuristic thresholds to identify fewer if-then fuzzy
rules.
• Supervised learning is employed in RNFS:
the Supervised Pseudo Self-Evolving
Cerebellar (SPSEC) algorithm (Ang & Quek, 2006a)
is used to generate membership functions and the
RSPOP algorithm is used to identify the if-then
fuzzy rules, instead of using backpropagation or
a hybrid learning approach.
• Automated model abstraction is performed by
RNFS without the need for user-defined parameters
in the neural learning algorithms that generate the
membership functions and identify the if-then
fuzzy rules. Thus, it does not require specialized
skills and knowledge.

Supervised Learning Process of RNFS 

The RNFS employs a novel supervised learning
approach that consists mainly of two algorithms: the
SPSEC (Ang & Quek, 2006a), which generates
membership functions, and the RSPOP (Ang & Quek, 2005),
which identifies the if-then fuzzy rules.

Figure 2 illustrates the neural learning process of the
Supervised Pseudo Self-Evolving Cerebellar (SPSEC)
membership function generation algorithm (refer to
(Ang & Quek, 2006a) for details on the algorithm).
Figure 2(a) shows steps 1-2, where SPSEC constructs
a cerebellar structure with m regularly spaced neurons
that spans the input space. The cerebellum
is the part of the brain that is involved in the learning
of motor skills and provides precise coordination of
motor control for the body parts (Kandel, Schwartz,
& Jessell, 1995). These steps model the first-stage
development process of the nervous system, where
the basic architecture and coarse connection patterns
are laid out without any activity-dependent processes
(Kandel et al., 1995). Figure 2(b) shows steps 3-4, where
SPSEC performs structural learning by performing a
one-pass weight learning using Gaussian neighborhood
learning to determine the distribution of the
training data, and pseudo self-evolves this cerebellar
structure by identifying surviving neurons with a high
trophic factor whose weights form a peak, while the
remaining neurons are removed. These steps model
the second-stage development process of the nervous
system, where the initial architecture is refined in
activity-dependent ways (Kandel et al., 1995). Figure 2(c)
shows step 5, where the surviving neurons' weights are
the parameters of the resulting Gaussian membership
functions that reconcile with the semantics of
interpretable linguistic variables.

The SPSEC algorithm (Ang & Quek, 2006a) is
capable of generating effective membership functions
that reconcile with the semantics of interpretable linguistic
variables, namely coverage, normalized, convex and
ordered (Mikut et al., 2005). The membership
functions generated do not require the further optimization
process used in the hybrid learning approach to increase
the accuracy of the abstracted model. Eliminating the
burden of a further optimization process ensures that the
membership functions generated do not deviate from
human-interpretable terms.
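
To show how Gaussian membership functions map a numerical input to a linguistic label, here is a minimal sketch. It is not the SPSEC algorithm: the centres and widths below are hypothetical values chosen by hand, whereas in RNFS they would come from the surviving neurons' weights, and the exact Gaussian form used by the authors is an assumption.

```python
import math


def gaussian_membership(x, centre, width):
    """Degree of membership of x in a Gaussian fuzzy set (assumed form)."""
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


def best_label(x, labels):
    """Return the linguistic label whose membership function responds most to x."""
    return max(labels, key=lambda name: gaussian_membership(x, *labels[name]))


# Hypothetical ordered labels covering the input range [0, 2].
labels = {"low": (0.0, 0.5), "medium": (1.0, 0.5), "high": (2.0, 0.5)}
label = best_label(0.9, labels)
```

Each function peaks at 1 at its centre (normalized), has a single peak (convex), and the centres are arranged in increasing order (ordered); together the three functions leave no part of the range without a responding label (coverage).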

Figure 3 illustrates the rough-set based knowledge 
reduction process of the Rough set-based Pseudo 
Outer-Product (RSPOP) if-then fuzzy rule identification 
algorithm (refer to (Ang & Quek, 2005) for details on 
the algorithm). Figure 3(a) shows an example of a set of 
influential if-then fuzzy rules identified by the RSPOP 
Rule Identification steps using Hebbian learning (Hebb, 
1949). Figure 3(b) shows the if-then fuzzy rules that 
are reduced by the RSPOP Attribute Reduction steps 
using rough set-based knowledge reduction (Pawlak, 
1991). Figure 3(c) shows the if-then fuzzy rules that 
are further reduced by the RSPOP Rule Reduction steps 
using rough set-based knowledge reduction (Pawlak, 
1991). 

The numbers 0-4 in Figure 3, which represent the condition
and consequent labels in Figure 1, do not by themselves show
that the if-then fuzzy rules identified are interpretable.
To illustrate the intuitiveness of the if-then fuzzy rules
identified, an example of the mapping of semantic
labels to each condition and consequent label is
given in Table 1.

The example in Figure 3 and Table 1 shows that
the RSPOP algorithm is capable of identifying fewer
but effective if-then fuzzy rules that facilitate human
interpretation. The smaller set of if-then fuzzy rules
identified is effective because RSPOP integrates the
knowledge reduction technique of rough set theory with
the Hebbian learning technique to identify non-redundant
if-then fuzzy rules. The accuracy of the abstracted
model is not compromised because RSPOP applies only
those reductions that do not deteriorate the accuracy of
the abstracted model.
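
The idea of removing dispensable attributes while keeping the rule base consistent can be sketched as follows. This is a generic rough set-style consistency check written for illustration, not the RSPOP algorithm itself; the rule base, the attribute names and the greedy removal order are assumptions.

```python
def is_consistent(rules, attrs):
    """True if no two rules agree on attrs but differ in their consequent."""
    seen = {}
    for conditions, consequent in rules:
        key = tuple(conditions[a] for a in attrs)
        if seen.setdefault(key, consequent) != consequent:
            return False
    return True


def reduce_attributes(rules, attrs):
    """Greedily drop each attribute whose removal keeps the rules consistent."""
    kept = list(attrs)
    for a in attrs:
        trial = [b for b in kept if b != a]
        if trial and is_consistent(rules, trial):
            kept = trial  # attribute a is dispensable
    return kept


# Hypothetical rules: condition labels per input, followed by a consequent label.
rules = [
    ({"x1": 0, "x2": 1, "x3": 0}, 0),
    ({"x1": 0, "x2": 2, "x3": 1}, 1),
    ({"x1": 1, "x2": 1, "x3": 0}, 1),
    ({"x1": 1, "x2": 2, "x3": 1}, 2),
]
reduced = reduce_attributes(rules, ["x1", "x2", "x3"])
```

In this example x2 and x3 carry the same information, so one of them can be removed without making any two rules contradict each other. Which one is removed depends on the order in which attributes are tried, so a greedy pass of this kind yields one possible reduct rather than a unique answer.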



FUTURE TRENDS 

The proposed RNFS architecture is based on the
synergy of the sound concept of knowledge reduction
in rough set theory with NFS (Ang & Quek, 2005).
Existing NFS are not capable of abstracting accurate
and interpretable models from high-dimensional data
without first performing feature selection to reduce
the data dimensionality to a manageable quantity.
The synergy with rough set-based knowledge reduction
opens up the potential application of NFS
to high-dimensional data such as microarray
gene expressions and functional Magnetic Resonance
Imaging (fMRI). Hence, further investigation on the
adequacy of abstracting high-dimensional data using
RNFS should be carried out and compared against
other prevailing approaches.

In addition, existing NFS rely on user-defined
parameters to avoid overfitting the training data in
order to generalize well on unseen test data. This issue
is known as overfitting avoidance and Occam's razor
in the pattern recognition literature (Duda, Hart, & Stork,
2001). Although the RNFS is capable of autonomously
abstracting a model from numerical data without
requiring user-defined parameters, preliminary investigations
revealed that RNFS may abstract a model that
overfits the training data. Therefore, further studies on
how RNFS can avoid overfitting the training data
without relying on user-defined parameters should be
carried out.

Figure 2. SPSEC neural learning process: (a) Steps 1-2 construct a basic cerebellar structure consisting of
regularly spaced neurons; (b) Steps 3-4, active neurons stabilize through the update of trophic factors while
competitors die; (c) Step 5, surviving neurons form the resulting membership functions that reconcile with the semantics
of linguistic variables

Figure 3. RSPOP rule identification process: (a) RSPOP identifies an initial set of if-then fuzzy rules
using Hebbian learning; (b) RSPOP performs attribute reduction on the initial set of if-then fuzzy rules using rough
set-based knowledge reduction; and (c) RSPOP performs a further rule reduction on the if-then fuzzy rules



CONCLUSION 

This article presents a Rough set-based Neuro-Fuzzy
System (RNFS), which is a hybrid intelligent system
that synergizes the sound concept of knowledge
reduction in rough set theory with the human-like reasoning
style of fuzzy systems and the learning and
connectionist structure of neural networks. The main strength of
Neuro-Fuzzy hybridization is universal approximation
(Tikk et al., 2003) with the ability to solicit interpretable
if-then fuzzy rules (Guillaume, 2001). Rough set-based
Neuro-Fuzzy hybridization strengthens it further by
improving the interpretability as well as the accuracy
of existing Neuro-Fuzzy hybridization. Recently, the
Rough set-based Neuro-Fuzzy approach of abstracting
models from data has been successfully applied
to various applications such as traffic flow prediction
(Ang & Quek, 2005), financial stock trading (Ang &
Quek, 2006b) and the classification of biomedical data
(Ang & Quek, 2006a). Hence, the potential of RNFS
is exciting, as it improves the interpretability of NFS
without compromising the accuracy of the abstracted
model.
Table 1. Example on the semantic interpretation of the if-then fuzzy rules identified in Figure 3

KEY TERMS

Attribute Reduction: The process whereby dis- 
pensable attributes are removed from the knowledge 
while maintaining knowledge consistency. 

Fuzzy System: A system whose variables range over 
states that are fuzzy sets. A fuzzy system is capable of 
modelling the linguistic terms that are expressions of 
human language. 

Knowledge Reduction: Knowledge reduction in
rough set theory comprises attribute reduction and
partial attribute reduction.

Neural Network: A network of many simple processors
called units or neurons. A neural network is capable
of learning the nonlinear relationships in data.

Neuro-Fuzzy System: A hybrid intelligent system
that synergizes the human-like reasoning style of fuzzy
systems with the learning and connectionist structure
of neural networks.

Rough Set: A rough set is a formal approximation of
a crisp set in terms of a pair of sets that give the lower
and upper approximation of the original set.

Rough Set-Based Neuro-Fuzzy System: A hybrid
intelligent system that synergizes the sound concept of
knowledge reduction in rough set theory with
neuro-fuzzy systems.

Rule Reduction: The process of partial attribute
reduction whereby dispensable attributes from certain
rules of the knowledge are removed while maintaining
knowledge consistency.
INTRODUCTION

Expert systems are successfully applied to a number
of domains. Often built on generic rule-based systems,
they can also exploit optimized algorithms.

On the other hand, being based on loosely coupled
components and peer-to-peer infrastructures for
asynchronous messaging, multi-agent systems allow
code mobility, adaptability, and ease of deployment and
reconfiguration, thus fitting distributed and dynamic
environments. They also have good support for
domain-specific ontologies, an important feature when
modelling human experts' knowledge.

The possibility of obtaining the best features of
both technologies is concretely demonstrated by the
integration of JBoss Rules, a rule engine efficiently
implementing the Rete-OO algorithm, into JADE, a
FIPA-compliant multi-agent system.



BACKGROUND 
Rule Engines 

The advantages of rule-based systems over procedural
programming environments are well recognized and
widely exploited, above all in the context of business
applications. Working with rules helps keep the
logic separated from the application code: the logic can be
modified by non-developers and, being centralized
in one place, it can be analyzed and validated. Rule
engines are often well optimized, being able to
efficiently reduce the number of rules to match against
the updated knowledge base.

Rule-based systems can also be augmented with
ideas and techniques developed in other research
fields, leading for example to fuzzy rule-based systems,
which exploit fuzzy logic to deal with imprecision
and uncertainty about the knowledge base. Moreover,
sometimes these systems are coupled with genetic
algorithms and evolutionary programming to generate
complex classifiers.

Among the most notable applications of rule-based
systems are expert systems, where the rule set is a
representation of an expert's knowledge. In such systems,
the AI (Artificial Intelligence) is supposed to perform
in a similar manner to the expert when exposed to
the same data.

Among the different mechanisms to implement a
rule engine, the Rete algorithm (Forgy, 1982) has gained
more and more popularity, mainly thanks to the high
degree of optimization that can be obtained. At NASA
Johnson Space Center, the Rete algorithm was implemented
in a whole generation of rule engines. OPS5 was soon
replaced by its descendant, ART, and in 1984 by the
more famous CLIPS.

Nowadays, one of the most widespread engines
implementing Rete is Jess (Friedman-Hill, 2000), at first
developed as a Java port of CLIPS at Sandia National
Laboratories in the late 1990s. Jess has also been widely
adopted by the agent community to realize rule-based
agent systems (Cardoso, 2007).

A different yet promising rule engine is JBoss
Rules (Proctor, Neale, Frandsen, Griffith & Tirelli,
2007), formerly Drools. It is a fairly new, but already
well known, freeware tool implementing the so-called
Rete-OO algorithm.

Its open-source availability is a clear advantage
over Jess, but an even greater advantage comes from its
implementation of an adaptation of the Rete
algorithm for the object-oriented world, rather than
a literal one. This way, the burden of integrating the
rule engine and application rules with existing external
objects is greatly reduced. In fact, JBoss Rules uses plain
Java objects to represent rules and facts, which can be
modified through their public methods and properties.
Rules can be specified through a dedicated syntax,
or through XML structures, and their conditions and
consequences can be expressed using different scripting
languages, such as Python, Groovy and Java. Jess, instead,
only accepts rules written in the CLIPS language, thus
requiring developers to learn a new Lisp-like language
and to spend additional effort adapting it to their
object-oriented development environment.

Figure 1. Basic example from the JBoss Rules handbook

package org.drools.examples

import org.drools.examples.HelloWorldExample.Message;

// Fires for any Message fact whose status is HELLO
rule "Hello World"
    when
        m : Message( status == Message.HELLO, message : message )
    then
        System.out.println( message );
        m.setMessage( "Goodbye cruel world" );
        m.setStatus( Message.GOODBYE );
        update( m );
end

// Fires once the fact has been updated to GOODBYE
rule "GoodBye"
    no-loop true
    when
        m : Message( status == Message.GOODBYE, message : message )
    then
        System.out.println( message );
end
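The match-and-fire cycle behind the two rules of Figure 1 can be illustrated with a minimal Python sketch. This is not the Drools API and it does not implement Rete: every name here (`Message`, `Rule`, `run`) is a hypothetical stand-in, matching is a naive scan over all rules and facts, and the `fired` set is a crude substitute for the engine's own activation management.

```python
from dataclasses import dataclass
from typing import Callable, List

HELLO, GOODBYE = 0, 1  # hypothetical status constants


@dataclass
class Message:
    message: str
    status: int


@dataclass
class Rule:
    name: str
    condition: Callable[[Message], bool]           # the "when" part
    action: Callable[[Message, List[str]], None]   # the "then" part


def hello_action(m: Message, out: List[str]) -> None:
    out.append(m.message)
    m.message = "Goodbye cruel world"
    m.status = GOODBYE  # modifying the fact makes the second rule match


def goodbye_action(m: Message, out: List[str]) -> None:
    out.append(m.message)


RULES = [
    Rule("Hello World", lambda m: m.status == HELLO, hello_action),
    Rule("GoodBye", lambda m: m.status == GOODBYE, goodbye_action),
]


def run(facts: List[Message], rules: List[Rule], max_cycles: int = 10) -> List[str]:
    """Naive forward chaining: match every rule against every fact, fire one, repeat."""
    out: List[str] = []
    fired = set()  # (rule name, fact id) pairs that have already fired
    for _ in range(max_cycles):
        agenda = [(r, f) for r in rules for f in facts
                  if (r.name, id(f)) not in fired and r.condition(f)]
        if not agenda:
            break
        rule, fact = agenda[0]
        fired.add((rule.name, id(fact)))
        rule.action(fact, out)
    return out
```

Calling `run([Message("Hello World", HELLO)], RULES)` collects the two messages in the order in which the rules fire. A real engine avoids the repeated full scan by caching partial matches, which is the point of Rete.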

Agent-Based Systems 

Multi-agent systems (MAS) show some complementary
features which can be useful in many rule-based
applications, above all asynchronous interaction protocols
and semantic languages. In multi-agent systems, in fact,
many intelligent agents interact with each other. The
agents are considered to be autonomous entities, and
their interactions can be either cooperative or selfish
(i.e. they can share a common goal, as in a production
line, or they can pursue their own interests, as in an
open marketplace).

The Foundation for Intelligent Physical Agents
(FIPA, 2002) develops open specifications to support
interoperability among agents and agent-based
applications. Specifications for infrastructures include a
communication language for agents and services for agents,
and they provide for the management of domain-specific
ontologies. A set of application domains is also
specified, including personal assistance for travel, network
management, electronic commerce, and distribution of
audio-visual media. At the core of the FIPA model is the
communication among agents; in particular, the model describes
how the agents can exchange semantically meaningful
messages with the aim of completing activities required
by the overall application.
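As an illustration of what such a message carries, the sketch below models a few of the parameters defined for FIPA ACL messages (performative, sender, receiver, content, language, ontology, conversation identifier) as a Python data class. The class, the helper `make_reply`, the agent names and the content strings are assumptions made for this example; they are not part of any FIPA-compliant library.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ACLMessage:
    """A reduced, illustrative subset of the FIPA ACL message parameters."""
    performative: str                      # e.g. "request", "inform", "refuse"
    sender: str
    receiver: str
    content: str
    language: Optional[str] = None         # language in which content is expressed
    ontology: Optional[str] = None         # vocabulary giving meaning to content
    conversation_id: Optional[str] = None  # ties related messages together


def make_reply(msg: ACLMessage, performative: str, content: str) -> ACLMessage:
    """Build a reply: swap sender and receiver, keep ontology and conversation."""
    return replace(msg, performative=performative,
                   sender=msg.receiver, receiver=msg.sender, content=content)


# Hypothetical exchange between two agents in an e-commerce setting
request = ACLMessage("request", "buyer", "seller", "(price item42)",
                     ontology="e-commerce", conversation_id="c1")
reply = make_reply(request, "inform", "(price item42 10)")
```

Keeping the ontology and conversation identifier in the reply is what lets the receiver interpret the content and associate it with the original request.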

Various implementations of FIPA-compliant
platforms exist (FIPA implementations, 2003). Among
them, JADE (Bellifemine, Caire, Poggi & Rimassa,
2003) has gained popularity over the years, as
more and more core functionalities and third-party
plug-ins have been developed. Currently it supports
most of the infrastructure-related FIPA specifications,
such as transport protocols, message encoding, and white
and yellow pages agents. Moreover, it has various tools
that ease agent debugging and management.

The possibility of using rules to realize agent
systems seems promising. On the one hand, rules
have been shown to be suitable for defining abstract and real
agent architectures and have been used for realizing
so-called "rule-based agents", that is, agents whose
behaviour and/or knowledge is expressed by means of
rules (Shoham, 1993) (Rao, 1996) (Hindriks, de Boer,
van der Hoek & Meyer, 1998) (Schroeder & Wagner,
2000). On the other hand, given that rules are an easy
and suitable means to realize reasoning, learning and
knowledge acquisition tasks, rules have been used in
so-called "rule-enhanced agents", that is, agents whose
behaviour is not normally expressed by means of rules,
but that use a rule engine as an additional component
to perform specific reasoning, learning or knowledge
acquisition tasks (Gutknecht, Ferber & Michel, 2000)
(Katz, 2002). Both approaches have advantages
and disadvantages. Rule-based agents provide all
the advantages of rule-based systems and a uniform way
to program them, but their performance is inadequate
for some kinds of applications. Rule-enhanced agents
allow the use of different programming paradigms;
therefore, it is possible to use the most appropriate
paradigm for each task, both to simplify development
and to satisfy performance requirements, but there is
an additional cost for managing the integration and
synchronization of such heterogeneous tasks. With
behaviour-based agents, as in JADE, the rule engine can
be integrated into an agent as a behaviour. This approach
can provide either the advantages of full rule-based
agents or those of rule-enhanced agents. In fact, both
procedural and rule-based behaviours can be seamlessly added
to each deployed agent, according to the application
features and requirements.
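The idea of mixing procedural and rule-based behaviours in one agent can be sketched as follows. This is a Python analogy only, not the JADE API: the classes `Behaviour`, `CounterBehaviour`, `RuleBehaviour` and `Agent`, and the round-robin scheduling, are assumptions chosen for brevity.

```python
class Behaviour:
    """Base class: one unit of agent activity, run once per scheduling round."""
    def step(self, agent):
        raise NotImplementedError


class CounterBehaviour(Behaviour):
    """A procedural behaviour: plain imperative code."""
    def step(self, agent):
        agent.memory["ticks"] = agent.memory.get("ticks", 0) + 1


class RuleBehaviour(Behaviour):
    """A rule-based behaviour: wraps a list of (condition, action) pairs."""
    def __init__(self, rules):
        self.rules = rules

    def step(self, agent):
        for condition, action in self.rules:
            if condition(agent.memory):
                action(agent.memory)


class Agent:
    def __init__(self, behaviours):
        self.memory = {}
        self.behaviours = list(behaviours)

    def run(self, rounds):
        # Round-robin scheduling over all behaviours, whatever their paradigm
        for _ in range(rounds):
            for behaviour in self.behaviours:
                behaviour.step(self)


# Hypothetical rule: raise an alarm after three ticks
alarm_rule = (lambda mem: mem.get("ticks", 0) >= 3 and "alarm" not in mem,
              lambda mem: mem.update(alarm=True))
agent = Agent([CounterBehaviour(), RuleBehaviour([alarm_rule])])
agent.run(3)
```

An agent built only from `RuleBehaviour` instances would correspond to a fully rule-based agent, while the mix shown here corresponds to a rule-enhanced one.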

Code Mobility and Security 

Mobile code proves useful in many contexts (Fuggetta,
Picco & Vigna, 2000), thanks to its ability to overcome
network latency, reduce network load, allow
asynchronous execution and autonomy, adapt dynamically,
operate in heterogeneous environments, and provide robust
and fault-tolerant behaviours. Mobile code technologies
vary from applets and other dynamic code downloading
mechanisms to full mobile agent systems, adhering to
models such as code on demand, remote evaluation, and
mobile agents. When a rule engine is integrated into a
multi-agent system, two different cases are possible: asking
a remote agent to execute a task, or asking it to apply a
new rule to its knowledge base. While mobile rules fall
into the class of asynchronous requests with deferred
execution, mobile tasks fall into the synchronous class. In
both cases the moved entity is a fragment of code, to
be interpreted by a scripting engine on the target agent,
and not a complete thread of execution.

The different security threats that a mobile code
system could face, and the relevant security
countermeasures that could be adopted, should also be
analyzed. Jansen & Karygiannis (2000) identify two different
classes of attacks, depending on their
target: those targeting the execution environment
of mobile code, and those targeting the code itself.
While the fact that mobile code could pose threats to its
hosting environment is widely accepted, threats against
the hosted code are often not taken into
consideration. This is certainly due
to a lack of effective countermeasures to prevent the
hosting environment from stealing data and algorithms
from the mobile code, executing it too slowly to
be effective, altering its execution flow, or stopping
its execution. Experimental algorithms exist to at least
detect this type of threat a posteriori, including
partial result encapsulation, mutual itinerary
recording, itinerary recording with replication and voting,
and execution tracing. Some algorithms even try to prevent
some types of attacks on code hosted in malicious
environments, but their real effectiveness has yet to be
proved; these include environmental key generation,
computing with encrypted functions, and obfuscated
code (sometimes called time-limited blackbox). On
the other hand, potential threats posed by hosted code
include masquerading, denial of service, eavesdropping,
and alteration. Available security countermeasures to
protect the execution environment against potentially
malicious mobile code often rely on algorithms to
prevent attacks, such as software-based fault isolation,
safe code interpretation, authorization and attribute
certificates, and proof-carrying code. Other techniques
focus on detecting attacks on the environment and
tracing them to their origin; these include state appraisal,
signed code, and path histories.

Systems based on Java can leverage the security
means provided by the virtual machine, and extend
them as needed. In particular, it is possible to define
precise protection domains on the basis of authorization
certificates (Poggi, Tomaiuolo & Vitaglione, 2004).
These certificates, attached to mobile code, list a set of
granted permissions and are signed by local resource
managers. Access rights can also be delegated to other
agents, to allow them to complete the requested tasks
or to achieve delegated goals (Somacher, Tomaiuolo
& Turci, 2002). Finally, masquerading and alteration
threats can be prevented by establishing authenticated,
signed and encrypted channels between remote
components of the system.



INTEGRATION OF RULES AND AGENTS 

Among the different implementations of rules-enhanced
multi-agent systems, the analysis of Drools4JADE
(Drools4JADE) can be particularly interesting, as the system
resulted from the evaluation of existing technologies
in various fields. In fact, the purpose of this project
was not to start from scratch and develop a totally
new agent platform, but instead to build on existing
solutions, which had already proved to be a sound
layer on which more advanced functionalities could
be added.

In this case, the chosen agent system is JADE
(Bellifemine, Caire, Poggi & Rimassa, 2003). Its successful
adoption in large international projects, like Agentcities
(Poggi, Tomaiuolo & Turci, 2004), openNet and TechNet
(Willmott, 2004), proved it to be preferable to other
solutions, thanks to its simplicity, flexibility, scalability
and soundness. As already argued, its integration with
an open-source object-oriented rule engine, such as JBoss
Rules, is in many contexts to be favoured over the
more traditional JADE-Jess pairing (Cardoso, 2007).

FIPA Interface to the Rule Engine 

To the rich features of JBoss Rules, an agent
environment can add, above all, support for communication
through ACL (Agent Communication Language)
messages, typical of FIPA agents. Rules can reference ACL
messages in both their precondition and consequence
fields. Moreover, complete support for manipulating
facts and rules on rules-enhanced agents through ACL
messages can be provided.

Inside the JBoss Rules environment a rule is
represented by an instance of the Rule class: it specifies all
the data of the rule itself, including the preconditions
making the rule valid and the actions to be performed
as a consequence of the rule. When a rule is scheduled
for execution, i.e. all its preconditions are satisfied by
asserted facts, the engine creates a new instance of
the embedded scripting environment, sets the needed
variables inside it and invokes the interpreter to
execute the code contained in the consequence section
of the rule.



In Drools4JADE, rules-enhanced agents expose a
complete API to allow the manipulation of their internal
working memory through ACL requests. Their
ontology defines requests to add rules and to assert, modify and
retract facts. All these requests must be accompanied by
an authorization certificate. Only authorized agents,
i.e. the ones that show a certificate listing all needed
permissions, can perform the requested actions.
Moreover, the accepted rules will be confined in a specific
protection domain, instantiated according to their own
authorization certificate.
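The request handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Drools4JADE API: the `Certificate` and `RuleAgent` classes, the action names and the `wm.*` permission strings are all hypothetical, and signature verification of the certificate is omitted. The returned strings reuse the names of the FIPA performatives `agree`, `refuse` and `not-understood`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Certificate:
    """Hypothetical authorization certificate (signature checking omitted)."""
    subject: str
    permissions: frozenset


@dataclass
class RuleAgent:
    facts: set = field(default_factory=set)    # stand-in for working memory
    rules: list = field(default_factory=list)

    def handle(self, action, payload, cert):
        """Apply a requested action only if the certificate lists its permission."""
        if "wm." + action not in cert.permissions:
            return "refuse"
        if action == "assert":
            self.facts.add(payload)
        elif action == "retract":
            self.facts.discard(payload)
        elif action == "modify":
            old, new = payload
            self.facts.discard(old)
            self.facts.add(new)
        elif action == "add-rule":
            self.rules.append(payload)
        else:
            return "not-understood"
        return "agree"


# A requester allowed to assert and modify facts, but not to add rules
cert = Certificate("alice", frozenset({"wm.assert", "wm.modify"}))
rule_agent = RuleAgent()
r1 = rule_agent.handle("assert", "temp=20", cert)
r2 = rule_agent.handle("modify", ("temp=20", "temp=25"), cert)
r3 = rule_agent.handle("add-rule", "rule-text", cert)
```

In this sketch the third request is refused because the certificate does not list the corresponding permission, so the rule list stays empty.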

Security Issues 

Mobility of rules and code among agents paves the way
for really adaptive applications, but it cannot be fully
exploited if security issues are not properly addressed.
The security means implemented in Drools4JADE
greatly benefit from the existing infrastructure provided
by the underlying Java platform and by JADE. The
security model of JADE deals with traditional
user-centric concepts, such as principals, resources and
permissions. Moreover, it provides means to allow delegation
of access rights among agents, and the implementation
of precise protection domains, by means of
authorization certificates.

In the security framework of JADE, a principal
represents any entity whose identity can be authenticated.
Principals are bound to single persons, departments,
companies or any other organizational entity. Single
agents are also bound to a principal; with respect to
his or her own agents, a user constitutes a parent principal,
which allows particular permissions to be granted to all
agents launched by a single user.

Resources that the JADE security model protects
include those already covered by the Java security model
(i.e. file system, network connections, environment
variables, database connections). Resources typical
of multi-agent systems, to be protected against
unauthorized access, include agents themselves and their
execution environment.

A permission represents the capability to perform
actions on system resources. To decide whether to allow
access to a resource, access control functions
compare the permissions granted to the principal with
the permissions required to execute the action; access is
allowed if all required permissions are owned.
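The access decision just described, together with the delegation of access rights mentioned earlier in this section, reduces to set inclusion. The following Python sketch makes that explicit; the function names and the permission strings are hypothetical, and real certificates would also carry an issuer and a signature, which are left out here.

```python
def is_allowed(granted, required):
    """Access is allowed only if every required permission has been granted."""
    return set(required) <= set(granted)


def delegate(holder_permissions, requested):
    """Delegate a subset of permissions; the holder must own all of them."""
    requested = frozenset(requested)
    if not requested <= frozenset(holder_permissions):
        raise PermissionError("cannot delegate permissions that are not held")
    return requested


# Hypothetical permissions held by a principal
granted = {"file.read", "net.connect"}
ok = is_allowed(granted, {"file.read"})
denied = is_allowed(granted, {"file.read", "file.write"})
delegated = delegate(granted, {"net.connect"})
```

Because a delegated set is always a subset of what the delegating agent holds, a chain of delegations can never widen the original grant.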

When an agent is requested to accept a new rule
or task, a first level of access protection involves
authenticating the requester and checking its authorization to
perform the action; i.e., can the agent really add a new
rule, or submit a task to be performed on its behalf?
To perform these actions, the requester needs particular
permissions.

Moreover, to exploit the full power of task delegation
and rule mobility, the target agent should be able to
restrict the set of resources made accessible to mobile
code. Agents should be provided with means to delegate not
only tasks, but also the access rights needed to perform
those tasks. This is exactly what is made possible
through the security package of JADE, where distributed
security policies can be checked and enforced on the
basis of signed authorization certificates.

Figure 2. Actions supported by rules-enhanced agents

In this kind of system, every requested action can
be accompanied by a certificate, signed by a local
resource manager, listing the permissions granted to
the requester. Permissions can be obtained directly
from a policy file, or through a delegation process.
Through this process, an agent can further delegate a
set of permissions to another agent, if it can prove
possession of those permissions.

The final set of permissions received through the
request message can then be used by the agent serving
the request to create a new protection domain to wrap the
mobile code during its execution, protecting access
to system resources as well as application resources.



FUTURE TRENDS

A system architecture founded on rule engines and
multi-agent systems is a good starting point to build
a fully distributed environment, where the distributed
knowledge can include both data and code. For example,
it can be used to realize systems for distributed signal
and alarm handling, network management, etc.

The development of advanced grid features, such as
transparent and reconfigurable functions for dynamic
load balancing, distribution of facts and rules among
remote engines, and failure detection and recovery, could
add even greater value to the system, paving the way for
the development of distributed computing environments
founded on networks of FIPA agents and platforms.



CONCLUSION

The integration of an object-oriented rule engine and a
scripting engine into an agent development framework
can provide many advantages. The resulting system
combines the soundness of a platform for distributed
multi-agent systems with the expressive power of rules and
the ability to adapt to changing conditions granted by
mobile code.

Of course, the development of real-world
applications poses serious security requirements, which can
be addressed by means of detailed security policies and
delegation of authorizations through signed
certificates. Application areas include, but certainly are not
limited to, e-learning, e-business, service composition, and
network management.



KEY TERMS

Authorization Certificate: A digital document
that describes a permission from the issuer to use a
service or a resource that the issuer controls or has
access to. It is usually signed by means of a public-key
algorithm. In some cases the permission can also
be delegated.

Expert System: Encodes the knowledge of an expert 
into the rule set of a rule-based system. When exposed 
to the same data, the expert system AI will perform in 
a similar manner to the expert. 

Multi-Agent System: A software system based on
the interaction of several agents. Individual agents may
not have all the data or resources needed to achieve an
objective, and so need to collaborate with other agents. In
this case, data is decentralized and execution is
asynchronous. Earlier, related fields include Distributed
Artificial Intelligence (DAI) and distributed problem
solving (DPS).

Ontology: An explicit specification of a conceptu- 
alization, formally describing the entities involved in a 
particular domain and the relationships among them. 

Production System (or production rule system):
A rule-based system whose rules (termed productions)
consist of two parts: a sensory precondition (or "if"
statement) and an action (or "then"). If a production's
precondition (left-hand side or LHS) matches the
current state of the world, then the production is said to
be triggered. If a production's action is executed, it is
said to have fired. The rule interpreter must provide a
mechanism for prioritizing productions when more than
one is triggered. Rule interpreters generally execute a
forward chaining algorithm for selecting productions
to execute.

Rule-Based System: Created using a set of
assertions, which collectively form the "working memory"
(a database which maintains data about current state or
knowledge), a set of rules, specifying how to act on the
assertion set, and a rule engine or interpreter. Basically,
rule-based systems can consist of little more than
a set of if-then statements, but they provide the basis for
so-called "expert systems".

Software Agent: A software entity able to
act with a certain degree of autonomy, in order to
accomplish tasks on behalf of its user. While objects are
defined in terms of methods and attributes, agents are
defined in terms of their behaviours. Usually agents
show persistence, autonomy, social ability, and reactivity.



Sequence Processing with Recurrent Neural Networks

INTRODUCTION

Sequence processing involves several tasks such as
clustering, classification, prediction, and transduction
of sequential data, which can be symbolic, non-symbolic
or mixed. Examples of symbolic data patterns
occur in modelling natural (human) language, while
the prediction of the water level of the River Thames is an
example of processing non-symbolic data. If the content
of a sequence varies across time steps, the sequence
is called temporal or time-series.
In general, a temporal sequence consists of nominal
symbols from a particular alphabet, while a time-series
sequence deals with continuous, real-valued elements
(Antunes & Oliveira, 2001). Processing both kinds of
sequence mainly consists of applying the currently known
patterns to produce or predict future ones, while a
major difficulty is that the range of data dependencies
is usually unknown. Therefore, an intelligent system
with memorising capability is crucial for effective
sequence processing and modelling.

A recurrent neural network (RNN) is an artificial
neural network in which self-loop and backward
connections between nodes are allowed (Lin & Lee, 1996;
Schalkoff, 1997). Compared to feedforward neural
networks, RNNs are well known for their power to
memorise time dependencies and model nonlinear
systems. RNNs can be trained from examples to map
input sequences to output sequences and in principle
they can implement any kind of sequential behaviour.
They are biologically more plausible and computationally
more powerful than other modelling approaches,
such as Hidden Markov Models (HMMs), which have
non-continuous internal states, and feedforward neural
networks and Support Vector Machines (SVMs), which
do not have internal states at all.

In this article, we review RNN architectures and
we discuss the challenges involved in training RNNs
for sequence processing. We provide a review of
learning algorithms for RNNs and discuss future trends in
this area.



BACKGROUND 

One of the first RNNs was the avalanche network
developed by Grossberg (1969) for learning and processing
an arbitrary spatiotemporal pattern. Jordan's sequential
network (Jordan, 1986) and Elman's simple recurrent
network (Elman, 1990) were proposed later.

The first RNNs did not work very well in practical
applications, and their operation was poorly
understood. However, several variants of these models were
developed for real-world applications, such as robotics,
speech recognition, music composition, and vision, and their
potential for solving real-world problems has motivated
a lot of research in the area of RNNs.

Current research in RNNs has overcome some of
the major drawbacks of the first models. This progress
has come in the form of new architectures and learning
algorithms, and has led to a better understanding of
RNN behaviour.



ARCHITECTURES OF RECURRENT 
NETWORKS 

In the literature, several classification schemes have
been proposed to organise RNN architectures, starting
from different principles for the classification: some
consider the loops of nodes in the hidden layers,
while others take the types of output into account. For
example, they can be organised into canonical RNNs
and dynamic MLPs (Tsoi, 1998a); autonomous
converging and non-autonomous non-converging (Bengio
et al., 1993); locally recurrent (receiving feedback from the
same or a directly connected layer), output feedback, and
fully connected (i.e. all nodes are capable of receiving
and transferring feedback signals to the other nodes, even
within different layers) RNNs (dos Santos & von Zuben,
2000); binary and analog RNNs (Orponen, 2000).

From a mathematical point of view (Kremer, 2001),
assuming that y and z are respectively the response of
the output layer and the output of the hidden layer, a
static feedforward neural network can be formulated
as follows:

$$ y = \phi(W^{II} z + b^{II}) \tag{1} $$
$$ z = \phi(W^{I} x + b^{I}) \tag{2} $$

where $\phi(\cdot)$ denotes a nonlinear activation function,
$W^{I}$ and $W^{II}$ the weights of the hidden layer and the
output layer, $x$ the input vector, and $b$ the biases. This
general form could be easily transformed to describe
a Feed-Forward Time-Delayed (FFTD) RNN by
substituting the following delayed equations with time
index t,

$$ y(t) = \phi(W^{II} z(t) + b^{II}) \tag{3} $$
$$ z(t) = \phi(W^{I} s(t) + b^{I}) \tag{4} $$
$$ s(t) = \{ x(t) \otimes x(t-1) \otimes \cdots \otimes x(t-d) \} \tag{5} $$

where $s(t)$ denotes the state vector at time t, $\otimes$ the
Cartesian product, and d the number of delays. By adding a
feedback connection from the hidden layer to the delay
unit, Eq. (4) can be stated as

$$ z(t) = \phi(A z(t-1) + W^{I} x(t) + b^{I}) \tag{6} $$

where A is a diagonal matrix, which describes an
Elman-type RNN.

For the Nonlinear Autoregressive Network with
Exogenous Inputs (NARX) the state is described as

$$ s(t) = \{ x(t) \otimes x(t-1) \otimes \cdots \otimes x(t-d+1) \} \otimes \{ y(t-1) \otimes y(t-2) \otimes \cdots \otimes y(t-m) \} \tag{7} $$

where m is the number of output feedbacks. The
formulation of a fully recurrent RNN can also be derived by
combining Eqs. (3) and (7) with the following one:

$$ z(t) = \phi(A z(t-1) + W^{I} s(t) + W^{I} x(t) + b^{I}) \tag{8} $$
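To make the recurrences concrete, the sketch below implements one time step of the Elman-type update of Eqs. (3) and (6) and the construction of the NARX state of Eq. (7) in plain Python. Several choices are assumptions made only for this illustration: the activation is taken to be tanh, vectors are Python lists, the diagonal matrix A is stored as the list of its diagonal entries, and the inputs and outputs in `narx_state` are scalars so that the Cartesian product becomes a simple concatenation.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def phi(v):
    """Element-wise activation; tanh is an assumption of this sketch."""
    return [math.tanh(a) for a in v]


def elman_step(x, z_prev, A_diag, W1, b1, W2, b2):
    """One step of an Elman-type RNN.

    Hidden layer, Eq. (6): z(t) = phi(A z(t-1) + W^I x(t) + b^I), A diagonal.
    Output layer, Eq. (3): y(t) = phi(W^II z(t) + b^II).
    """
    wx = matvec(W1, x)
    z = phi([a * zp + u + b for a, zp, u, b in zip(A_diag, z_prev, wx, b1)])
    y = phi([u + b for u, b in zip(matvec(W2, z), b2)])
    return y, z


def narx_state(xs, ys, t, d, m):
    """NARX state of Eq. (7): d most recent inputs, then m previous outputs."""
    if t < max(d - 1, m):
        raise ValueError("not enough history to build the state")
    inputs = [xs[t - k] for k in range(d)]          # x(t), ..., x(t-d+1)
    outputs = [ys[t - k] for k in range(1, m + 1)]  # y(t-1), ..., y(t-m)
    return inputs + outputs


# One hidden unit, one input, one output; all weights are illustrative
y, z = elman_step(x=[0.0], z_prev=[1.0], A_diag=[0.5],
                  W1=[[1.0]], b1=[0.0], W2=[[1.0]], b2=[0.0])
state = narx_state(xs=[1, 2, 3, 4], ys=[10, 20, 30, 40], t=3, d=2, m=2)
```

Iterating `elman_step` over a sequence, feeding each returned `z` back in as `z_prev`, gives the hidden-state memory that distinguishes these networks from the static form of Eqs. (1) and (2).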



Table 1 provides an overview of the various
architectures and of the relevant literature.

Table 1. Classification summary of RNNs

Recurrence  Reference                                    Equations
----------  -------------------------------------------  ----------------
Globally    Brouwer (2005); Kremer & Kolen (2000);       (3), (4), (7)
            Puskorius & Feldkamp (1994)
Locally     Assaad et al. (2005); Bone & Cardot (2005);  (3), (6)
            Sperduti & Starita (1997); Temurtas et al.
            (2004); Tino & Mills (2005)
Fully       Pedersen (1997)                              (3), (7), (8)
Partially   All, except Pedersen (1997)                  Globally/Locally



LEARNING ALGORITHMS FOR
RECURRENT NETWORKS

With regard to training RNNs and storing information
in their internal representations, Gradient Descent-based
learning algorithms (GD) are the most commonly
applied methods, even though it has been claimed that
GD has some drawbacks (Bengio et al., 1994). Firstly,
when the delays or recursive connections are very
deep, i.e. when long-term memory is required, the
backpropagated error may vanish and the training
process may become inefficient. Secondly, the most
common way to apply GD algorithms to RNNs is to
unfold the recursive layers and train the whole network
as a feedforward network. Another drawback is that
generalisation is highly affected by the samples in the
training dataset. In temporal processing it is difficult
to extract or prepare negative samples from a given
training dataset, and the specific RNN then predicts or
classifies newly arriving samples according to the learned
knowledge only.
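The vanishing of the backpropagated error can be observed directly on a toy example. The following Python sketch unfolds a scalar recurrence z(t) = tanh(a z(t-1) + w x(t)) in time and accumulates the chain-rule factors from the last step back to the first, which is the core of Backpropagation Through Time. The scalar network, the tanh activation and the parameter values are assumptions chosen only to keep the example small.

```python
import math


def bptt_sensitivity(a, w, xs, z0=0.0):
    """Return |dz(T)/dz(T-k)| for k = 0, 1, ..., T in a scalar RNN.

    Forward pass:  z(t) = tanh(a * z(t-1) + w * x(t)).
    Backward pass: dz(t)/dz(t-1) = a * (1 - z(t)**2), multiplied along
    the unfolded network from the last time step back to the first.
    """
    zs = [z0]
    for x in xs:
        zs.append(math.tanh(a * zs[-1] + w * x))

    grads = [1.0]  # sensitivity of z(T) to itself
    g = 1.0
    for t in range(len(xs), 0, -1):
        g *= a * (1.0 - zs[t] ** 2)
        grads.append(abs(g))
    return grads


# With |a| < 1 every factor has magnitude below 1, so the product shrinks
sens = bptt_sensitivity(a=0.5, w=1.0, xs=[0.1] * 30)
```

Here each backward factor is at most 0.5 in magnitude, so after 30 steps the sensitivity of the final state to the initial one is bounded by 0.5 to the power 30, and the contribution of early inputs to the gradient becomes negligible.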

Besides Backpropagation Through Time (BPTT) and
real-time gradient computation, Bengio et al. (1993) also
review approaches with space and/or time locality.
However, local algorithms can be applied only to
some specific local feedback RNNs and for short-term
memorisation, due to their inherent representation
capabilities. The inefficiency of GD in learning
long-term dependencies arises mainly because previous
information is treated initially as noise and is gradually ignored
(Bengio et al., 1993; Bengio et al., 1994). Therefore
two alternative algorithms are reviewed and discussed in
Bengio's works: the time-weighted pseudo-Newton and
the discrete error propagation. The former applies the
unfolding method to pseudo-Newton optimisation
and the latter considers the limited case of propagation
only; it remains to be verified whether this would work in
other, more general situations.

Two types of learning algorithms are discussed by
Pearlmutter (1995): fixed point and non-fixed
point. Well-known algorithms such as BPTT and
Real-time Recurrent Learning (RTRL) are included
in this classification, and a way of introducing time
constants and time delays is also suggested. The
method of extended RTRL (eRTRL) is also discussed,
and other relevant approaches, such as Elman nets,
Jordan nets, the moving targets method, feedforward
nets with state, teacher forcing in continuous time and
the Kalman filter, are reviewed. Pearlmutter (1995) also
compares the complexity in both time and space, and
discusses the learning mode, stability and locality of
these algorithms.

For fully connected hidden layer networks and
dynamic MLPs, Tsoi (1998b) has investigated two
first-order gradient learning algorithms. This work
discusses some drawbacks of these methods, such as
slow convergence and generalisation, and derives two
2nd-order approaches to speed up convergence and
to tackle the issue of weight pruning; it also provides a
discussion on output sensitivity. The lower the sensitivity
of the output to a specific adjustable parameter, the better
the performance of the network. Although the related
formulas are well defined in this work (Tsoi, 1998b),
there is still a crucial constant, used to set the
level of sensitivity, that must be defined by the user.
Quasi-2nd-order methods, such as conjugate gradient,
scaled conjugate gradient and the Newton approach, are
also mentioned and pointed out as suitable only
for batch training, while the Kalman filter and the extended
Kalman filter are classified as 2nd-order GD-based
learning algorithms that can be used in online
mode; the extended Kalman filter could also be used to
prune weights from an RNN.

Kremer (2001) reviews 14 kinds of memories used
in spatiotemporal connectionist networks, capable of
computing the state vectors, and provides a general
formulation for computing output vectors. The author
also summarises 10 different kinds of updating rules,
such as full GD, truncated GD, autoassociative GD,
and stack learning. The review examines three open issues:
temporal credit assignment, representation capabilities
and knowledge encoding.

From the point of view of time-series modelling,
Kolehmainen (2003) covers BPTT and RTRL for
learning RNNs, while Dietterich (2002) suggests BPTT. In
the same vein as Baldi (1993) and Pearlmutter (1995),
fixed point networks are also considered and five related
algorithms, such as BPTT and GD learning of time
constants, gains and delays, are summarised.

Most RNN applications still use first-order
learning algorithms despite the drawbacks of GD.
Some attempts have been made to propose second-order
learning algorithms, e.g. dos Santos & von Zuben (2000)
proposed a quasi-2nd-order method. Also, simulated
annealing has given some promising results, but the training
time is relatively long (Bengio et al., 1994).

Table 2 provides an overview of RNN learning,
giving examples of training algorithms for locally and
globally recurrent networks in various applications.

Table 2. Recurrent neural networks applications and learning algorithms



FUTURE TRENDS

Recent directions in RNN research focus on investigating
and proposing new ways for better modelling of
non-stationarity in sequences, such as sequences produced
when modelling speech or handwritten characters, with
no temporal independence assumptions.

With regard to architectures, hybrid models based
on combinations of Hidden Markov Models and RNNs,
as well as modular structures, are considered promising
approaches to solve sequence processing problems
that occur in natural language and speech
processing. In addition, a number of applications of the
so-called Long Short-Term Memory RNN (Hochreiter &
Schmidhuber, 1997) have provided some encouraging
results, demonstrating that these recurrent architectures
can overcome several of the fundamental problems of
traditional RNNs, and efficiently learn to solve many
previously unlearnable tasks.

As far as RNN training is concerned and despite 
the popularity of gradient descent approaches, which 
enforce the monotone decrease of the learning error, 
there are new learning algorithms that are based on 
evolutionary algorithms (Schmidhuber et al., 2007) 
and nonmonotone learning strategies (Peng and Ma- 
goulas, 2007) that have shown potential for effective 
RNN training. 



CONCLUSION 

Recurrent networks constitute an elegant way of 
increasing the capacity of feedforward networks to 
deal with complex data in the form of sequences of 
patterns. Recurrent neural networks are well known 
for their power to model temporal dependencies and 
process sequences for classification, recognition, and 
transduction. Modern RNN architectures are capable 
of learning to solve many previously unlearnable tasks, 
even in partially observable environments. In this article, 
we presented several RNN models. We identified 
the main challenges involved in training RNNs and 
discussed several algorithmic approaches for training 
RNNs for sequence processing. Lastly, we presented 
some future directions for work in this area. 
KEY TERMS 

Artificial Neural Network: A network of many 
simple processors, called "units" or "neurons", which 
provides a simplified model of a biological neural 
network. The neurons are connected by links that carry 
numeric values corresponding to weightings and are 
usually organised in layers. Neural networks can be 
trained to find nonlinear relationships in data, and are 
used in applications such as robotics, speech recogni- 
tion, signal processing or medical diagnosis. 

Backpropagation through Time: An algorithm 
for recurrent neural networks that uses the gradient 
descent method. It attempts to train a recurrent neural 
network by unfolding it into a multilayer feedforward 
network that grows by one layer for each time step, 
a procedure also called unfolding in time. 

Extended Kalman Filter: An online learning 
algorithm for determining the weights in a recurrent 
network given target outputs as it runs. It is based on the 
idea of Kalman filtering, which is a well-known linear 
recursive technique for estimating the state vector of a 
linear system from a set of noisy measurements. 

Gradient Descent: A popular training algorithm 
that minimises the total squared error of the output 
computed by a neural network. To find a local minimum 
of the error function using gradient descent, one takes 
steps proportional to the negative of the gradient (or 
the approximate gradient) of the function at the cur- 
rent point. 

Neural Architecture: Particular organisation of 
artificial neurons and connections between them in an 
artificial neural network. 



Real-Time Recurrent Learning: A general ap- 
proach to training an arbitrary recurrent network by 
adjusting weights along the error gradient. This algo- 
rithm usually requires very low learning rates because 
of the inherent correlations between successive node 
outputs. 

Recurrent Neural Network: An artificial neural 
network with feedback connections. This is in contrast 
to what happens in a feedforward neural network, where 
the signal simply passes from the input neurons, through 
the hidden neurons, to the output nodes. 

Sequence Processing: A sequence is an ordered 
list of objects, events or data items. Processing of a 
sequence may involve one or a number of operations, 
such as classification of the whole sequence into a cat- 
egory; transformation of a sequence into another one; 
prediction or continuation of a sequence; generation 
of an output sequence from a single input. 

Training Algorithm: A step-by-step procedure 
for adjusting the connection weights of an artificial 
neural network. In supervised training, the desired 
(correct) output for each input vector of a training 
set is presented to the network, and many iterations 
through the training data may be required to adjust 
the weights. In unsupervised training, the weights are 
adjusted without specifying the correct output for any 
of the input vectors. 
INTRODUCTION 

In the artificial intelligence domain, an emerging re- 
search field that rapidly gains momentum is Automated 
Negotiations (Fatima, Wooldridge, & Jennings, 2007) 
(Buttner, 2006). In this framework, building intelli- 
gent agents (Silva, Romao, Deugo, & da Silva, 2001) 
adequate for participating in negotiations and acting 
autonomously on behalf of their owners is a very chal- 
lenging research topic (Saha, 2006) (Jennings, Faratin, 
Lomuscio, Parsons, Sierra, & Wooldridge, 2001). In 
automated negotiations, three main items need to be 
specified (Faratin, Sierra, & Jennings, 1998) (Rosenschein, 
& Zlotkin, 1994): (i) the negotiation protocol & 
model, (ii) the negotiation issues, and (iii) the negotia- 
tion strategies that the agents will employ. 

According to (Walton, & Krabbe, 1995), "Negotia- 
tion is a form of interaction in which a group of agents, 
with conflicting interests and a desire to cooperate try 
to come to a mutually acceptable agreement on the 
division of scarce resources". These resources do not 
only refer to money, but also include other parameters, 
over which the agents' owners are willing to negotiate, 
such as product quality features, delivery conditions, 
guarantee, etc. (Maes, Guttman, & Moukas, 1999) 
(Sierra, 2004). In this framework, agents operate fol- 
lowing predefined rules and procedures specified by 
the employed negotiation protocol (Rosenschein, & 
Zlotkin, 1994), aiming to address the requirements of 
their human or corporate owners as much as possible. 
Furthermore, the negotiating agents use a reasoning 
model based on which their responses to their opponent's 
offers are formulated (Muller, 1996). This policy is 
widely known as the negotiation strategy of the agent 
(Li, Su, & Lam, 2006). 



This paper elaborates on the design of negotiation 
strategies for autonomous agents. The proposed strate- 
gies are applicable in cases where the agents have strict 
deadlines and they negotiate with a single party over 
the value of a single parameter (single-issue bilateral 
negotiations). Learning techniques based on multilayer 
perceptron (MLP) and Generalized Regression (GR) 
Neural Networks (NNs) are employed by the client 
agents in order to predict their opponents' behaviour and 
achieve a timely detection of unsuccessful negotiations. 
The proposed NN-assisted strategies have been evaluated 
and turn out to be highly effective at reducing 
the duration of negotiation threads that 
cannot lead to agreements. 

The rest of the paper is structured as follows. In the 
second section, the basic principles of the designed 
negotiation framework are presented, while the for- 
mal problem statement is provided. The third section 
elaborates on the NN-assisted strategies designed and 
provides the configuration details of the NNs employed. 
The fourth section presents the experiments conducted, 
while the fifth section summarizes and evaluates the 
results of these experiments. Finally, in the last section, 
conclusions are drawn and future research plans 
are outlined. 



THE AUTOMATED NEGOTIATION 
FRAMEWORK BASICS 

This paper studies a single-issue, bilateral automated 
negotiation framework. Thus, there are two negotiating 
parties (Client and Provider) that are represented by 
mobile intelligent agents. The agents negotiate over 
a single issue based on an alternating offers protocol 
(Kraus, 2001), aiming to maximize the utilities of the 
parties they represent. 

We hereafter consider the case where the negotiation 
process is initiated by the Client Agent (CA), which 
sends to the Provider Agent (PA) an initial Request for 
Proposal (RFP) specifying the features of the service/product 
its owner is interested in obtaining. Without loss 
of generality, it is assumed that the issue under negotiation 
is the price of the product or service. Thus, the PA 
negotiates aiming to agree on the maximum possible 
price, while the CA aims to reduce the agreement price 
as much as possible. Once the PA receives the RFP of 
the CA, it either accepts to be engaged in the specific 
negotiation thread and formulates an initial price of- 
fer, or rejects the RFP and terminates the negotiation 
without a proposal. At each round, the PA sends to the 
CA a price offer, which is subsequently evaluated by 
the CA against its constraints and reservation values. 
Then, the CA generates a counter-offer and sends it to 
the PA that evaluates it and sends another counter-offer 
to the CA. This process continues until a mutually ac- 
ceptable offer is proposed by one of the negotiators, or 
one of the agents withdraws from the negotiation (e.g. 
in case its time deadline is reached without an agree- 
ment being in place). Thus, at each negotiation round, 
the agents may: (i) accept the previous offer, if their 
constraints are addressed, (ii) generate a counter-offer, 
or (iii) withdraw from the negotiation. 

Quantity $p_l^a$ denotes the price offer proposed by 
negotiating agent $a$ during negotiation round $l$. A 
price proposal $p_l^b$, made by the opposing agent $b$, 
is always rejected by agent $a$ if 
$p_l^b \notin [p_m^a, p_M^a]$, where $[p_m^a, p_M^a]$ denotes 
agent $a$'s acceptable price interval. In case an agreement is reached, 
we call the negotiation successful, while in case one of 
the negotiating parties quits, it is called unsuccessful. 
In any other case, we say that the negotiation thread 
is active. The objective of our problem is to predict 
the PA's behaviour in the future negotiation rounds 
until the CA's deadline expires. More specifically, the 
negotiation problem studied can formally be stated as 
follows: 

Given: (i) two negotiating parties: a Provider that 
offers a specific good and a Client that is interested 
in this good's acquisition, (ii) the acceptable price 
interval $[p_m^c, p_M^c]$ for the Client, (iii) a deadline $T^c$ up 
to which the Client must have completed the negotiation 
with the Provider, (iv) the final negotiation round 
index $L^c$ for the Client, (v) a round threshold $L_d^c$ until 
which the Client must decide whether to continue being 
engaged in the negotiation thread or not, and (vi) 
the vector $P_l^p = \{p_l^p\}$, where $l = 2k - 1$ and 
$k = 1, \ldots, \lfloor L_d^c / 2 \rfloor$, of 
the prices that were proposed by the Provider during 
the initial $L_d^c - 1$ negotiation rounds, find (i) the vector 
$P_{l'}^p = \{p_{l'}^p\}$, where $l' = 2k' - 1$ and 
$k' = \lfloor L_d^c / 2 \rfloor + 1, \ldots, \lfloor L^c / 2 \rfloor$, of the 
prices that will be proposed by the Provider during 
the last $L^c - L_d^c$ rounds, and (ii) decide on whether the 
Client should continue being engaged in the specific 
negotiation thread or not. 



A NEGOTIATION STRATEGY BASED 
ON NEURAL NETWORKS 

The policy employed by negotiating agents in order to 
generate a new offer is called negotiation strategy. In 
principle, three main families of automated negotia- 
tion strategies can be distinguished: time-dependent, 
resource-dependent and behaviour-dependent strategies 
(Faratin, Sierra, & Jennings, 1998). These strategies 
are well defined functions that may use various input 
parameters in order to produce the value of the issue 
under negotiation to be proposed at the current nego- 
tiation round. The proposed mechanism enhances any 
of the legacy strategies with learning techniques based 
on Neural Networks (NNs). In the studied framework, 
the NN-assisted strategies are used by the CA in order 
to estimate the future behaviour of the PA. This sec- 
tion presents the proposed NN-assisted strategy and 
describes the specifics of the NNs employed. 

Enabling PA Behaviour Prediction 

As already mentioned, the research presented in this 
paper aims to estimate the parameters governing the 
PA's strategy enabling the CA to predict the PA's fu- 
ture price offers. The objective is to decide at an early 
round whether to aim for an agreement with the specific 
PA, or withdraw from the negotiation thread as early 
as possible, if no agreement is achievable. For this 
purpose, two different Neural Networks (NNs) have 
been employed. These NNs are trained off-line with 
proper training sets and are then used during the on- 
line negotiation procedure whenever the CA requires 
so. The negotiation starts normally and, as soon as the 
PA has made enough proposals, the CA uses the 
NNs to make a reliable prediction of its opponent's 
strategy. This requires only a few negotiation rounds 
(compared to the CA's deadline expiration round), 
which is the main reason why this technique turns out to 
be significantly useful. 

In addition to the $[p_m^a, p_M^a]$ interval, there are mainly 
three other parameters that determine the agent's negotiation 
strategy: the parameter $k^a \in [0,1]$ that determines the 
initial offer made by the agent at $t = 0$, the concession 
rate $\beta > 0$ (Faratin, Sierra, & Jennings, 1998), and the 
PA's last round $L^p$. In this paper, $k^a$ is not among 
the parameters to be predicted, as it is safely assumed 
that the PA initiates the procedure from its maximum 
price offer. Without loss of generality, we focus on the 
case where the PA follows a polynomial strategy of 
arbitrary concession rate and timeout. 
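
The article does not spell out the polynomial offer function. The sketch below assumes the usual polynomial time-dependent tactic from the family described by Faratin, Sierra, & Jennings (1998), in which the conceded fraction of the price range grows as $(t / L^p)^{1/\beta}$. The function and argument names are illustrative, not the authors' own.

```python
def pa_offer(t, beta, L_p, p_min=0.0, p_max=100.0, k=0.0):
    """Provider price offer at round t under an assumed polynomial tactic.

    beta < 1 gives Boulware behaviour (concessions come late),
    beta > 1 gives Conceder behaviour (concessions come early),
    beta = 1 gives a linear strategy.
    """
    # Fraction of the price range conceded so far; it runs from k to 1.
    conceded = k + (1.0 - k) * (min(t, L_p) / L_p) ** (1.0 / beta)
    # The provider starts at p_max (when k = 0) and moves down towards p_min.
    return p_min + (1.0 - conceded) * (p_max - p_min)


if __name__ == "__main__":
    for beta in (0.5, 1.0, 2.0):
        print(beta, round(pa_offer(100, beta, L_p=200), 2))
```

With $k = 0$ the first offer equals the maximum price, which matches the assumption stated above that the PA starts from its maximum price offer.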

The CA negotiates based on a legacy strategy until 
round $L_d^c$. Then, the CA makes use of the NNs to obtain 
estimates of $\beta$ and $L^p$. Round $L_d^c$ will be hereafter called 
the prediction round. In the experiments conducted we 
have $L_d^c = 30$ and $L^c = 100$. Based on the history of 
the PA's price offers, the NNs attempt to produce a valid 
estimation of the PA's offer generation function. Then, 
the CA may determine whether the current negotiation 
thread can lead to an agreement or whether this is not feasible 
given the CA's deadline. Thus, the NN-assisted strategy 
enables the CA to save time and withdraw early from 
negotiation threads that will end unsuccessfully. 
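
The withdrawal decision at the prediction round can be sketched as follows. This is a minimal illustration under stated assumptions: the PA is taken to follow a polynomial time-dependent tactic starting from its maximum price, its price interval is taken as known, and `beta_hat` and `L_p_hat` stand for the NN estimates. All names are hypothetical.

```python
def predicted_pa_offer(t, beta, L_p, p_min, p_max):
    # Assumed polynomial tactic: the conceded fraction grows as (t / L_p)^(1/beta).
    conceded = (min(t, L_p) / L_p) ** (1.0 / beta)
    return p_min + (1.0 - conceded) * (p_max - p_min)


def should_withdraw(beta_hat, L_p_hat, p_min, p_max, client_max,
                    prediction_round=30, client_deadline=100):
    """Return True if no predicted PA offer falls within the CA's budget."""
    # The PA makes its offers in the odd rounds l = 2k - 1.
    pa_rounds = [l for l in range(prediction_round + 1, client_deadline + 1)
                 if l % 2 == 1]
    lowest = min(predicted_pa_offer(l, beta_hat, L_p_hat, p_min, p_max)
                 for l in pa_rounds)
    # If even the lowest predicted offer exceeds the CA's maximum acceptable
    # price, the thread cannot end in an agreement before the CA's deadline.
    return lowest > client_max


if __name__ == "__main__":
    print(should_withdraw(0.5, 200, 0.0, 100.0, client_max=50.0))
    print(should_withdraw(2.0, 200, 0.0, 100.0, client_max=50.0))
```

The first call models a Boulware opponent whose predicted offers stay above the budget, so the CA would withdraw. The second models a Conceder opponent, with which the CA would keep negotiating.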

The Neural Networks Employed 

In our framework, where the prediction of a continuous 
function is required, we chose to study two types 
of NNs having no feedback loops: the multilayer 
perceptron (MLP) NN and the Generalized Regres- 
sion (GR) NN. The latter is a special case of a Radial 
Basis Function (RBF) NN that is more appropriate for 
on-line function approximation (Haykin, 1999). Both 
networks were selected because of their suitability in 
such kinds of problems (Haykin, 1999). 

For the MLP, we used a training function based on 
the Levenberg-Marquardt algorithm (Hagan, Demuth, 
& Beale, 1996) as it is the most convenient for such 
problems. The network was trained over 190 
different input vectors (200 epochs each), each representing a 
different combination of PA's offers based on a specific 
strategy. The best MLP architecture was decided after 
extensive experiments and set to 23 (log-sigmoid) - 3 
(linear) neurons, for the hidden and the output layer 
respectively. Similarly, the GR network was trained over 
280 different vectors to achieve accurate performance 
characteristics resulting in a 280 (hidden RBF neurons) 
- 3 (output) architecture. 
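
A GR NN is commonly formulated as normalised kernel regression with one hidden RBF unit per training vector, which is consistent with the 280 hidden neurons reported for 280 training vectors. The sketch below illustrates that general formulation in plain Python. It is not the authors' configuration: the Gaussian kernel width `sigma` and the toy data are assumptions.

```python
import math


def grnn_predict(x, train_inputs, train_targets, sigma=1.0):
    """Predict an output vector as a kernel-weighted average of the targets."""
    # Hidden layer: one Gaussian RBF unit per stored training input.
    weights = []
    for xi in train_inputs:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(weights)  # may underflow to 0 for inputs far from all samples
    # Output layer: normalised weighted sum, one value per output neuron.
    n_outputs = len(train_targets[0])
    return [sum(w * y[j] for w, y in zip(weights, train_targets)) / total
            for j in range(n_outputs)]


if __name__ == "__main__":
    inputs = [[0.0], [1.0]]
    targets = [[0.0], [10.0]]
    print(grnn_predict([0.5], inputs, targets))
```

Because the training vectors are stored directly as hidden units, "training" such a network requires no iterative weight optimisation.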



Both NNs are employed by CAs and can provide 
reliable prediction of the PA's behaviour, once sufficient 
input samples (proposals) are available. The experiments 
conducted and the evaluation of the NNs' performance 
are presented in the two following sections. 



EXPERIMENTS 

In this section, the experiments conducted to evaluate 
the performance of the designed MLP and GR 
NNs concerning the estimation of the future behaviour 
of the negotiating PA are presented. The first family 
of experiments aims to compare the actual behaviour 
of the PA with the one predicted by the MLP and the 
GR NNs, when $[p_m^p, p_M^p] = [0, 100]$, $L^p = 200$ and 
$\beta \in [0.1, 10]$. The sample values for $\beta$ are derived from a 
uniformly distributed random vector of 100 values in 
the aforementioned range: 50 with $\beta < 1$ (Boulware) and 50 
with $\beta > 1$ (Conceder). The estimated parameters include: 
the future PA offers until the 100th negotiation round, 
the minimum PA price offer until then and the PA's 
concession rate ($\beta$). The second family of experiments 
investigates the case where $[p_m^p, p_M^p] = [0, 100]$, 
$\beta = 1$ and $L^p \in [150, 250]$. The sample values for $L^p$ are: 
150:1:250. The estimated parameters include: the future 
PA price offers until the 100th negotiation round and 
the minimum PA price offer until then. 

As illustrated in Figure 1, where the first experiment 
set is depicted, the MLP- and the GR-NN perform 
very similarly, managing to accurately predict the PA's 
price offer in general. In the same figure, one may observe 
that both NNs are used only for $\beta < 2.8$. For higher 
concession rates and for polynomial PA strategies, an 
agreement is reached before the 30th round and the NN 
is not necessary for opponent behaviour prediction. As 
depicted in Figure 2, where the NNs are tested over 
linear PA strategies, the MLP- and the GR-NN perform 
almost identically, estimating the PA's price offer with 
a low error margin. However, the deviation between 
the actual and the estimated PA offers increases as the 
round index increases and the PA timeout decreases. 
This is due to the fact that both NNs have a tendency 
to slightly underestimate the PA's concession rate, especially 
when $\beta > 0.5$. This is confirmed by Figure 3a, 
where the MLP-NN and the GR-NN estimations of the 
concession rate are depicted along with the actual $\beta$ of 
the PA, over the entire set of conducted experiments. 
Finally, as depicted in Figure 3b, with regards to the 
estimation of the PA's minimum price offer, the MLP 
slightly outperforms the GR. A brief analysis of all these 
findings is presented in the subsequent section. 

Figure 1. Actual PA price offer and PA price offer predicted by (a) an MLP-NN and (b) a GR-NN, for 100 negotiation 
rounds when $L^p = 200$, $p_m^p = 0$, $p_M^p = 100$ and $\beta \in [0.1, 10]$ 

Figure 2. Actual PA price offer and PA price offer predicted by (a) an MLP-NN and (b) a GR-NN, for 100 negotiation 
rounds when $\beta = 1$, $p_m^p = 0$, $p_M^p = 100$ and $L^p \in [150, 250]$ 

Figure 3. (a) Actual and estimated (by MLP and GR NNs) concession rate values when $L^p = 200$, $p_m^p = 0$, $p_M^p = 100$ 
and $\beta \in [0.1, 10]$. (b) Actual and estimated (by MLP and GR NNs) PA minimum price offer for all the 
experiments conducted in both families. 



EVALUATION 

In Table 1, comparative results for the two experiment families 
are illustrated with regards to the mean estimation 
errors of the MLP and the GR NNs concerning the PA 
price offer, the PA minimum price offer and the PA's 
concession rate. For all experiment families we have 
$[p_m^p, p_M^p] = [0, 100]$. The rest of the parameter settings 
are presented in the table's first column, while the 
second column shows the number of experiments in which 
the NN estimation was used. The results 
presented in the rest of the table indicate that the MLP 
NN slightly outperforms the GR NN with regards to 
the PA (minimum) price offer estimation, demonstrating 
0.5% - 2.4% higher accuracy on average. However, 
the opposite holds for the estimation of the PA's $\beta$, 
as the GR NN provides more accurate estimations by 
more than 3% on average. 

In Table 2, evaluation results for the two NN-assisted 
negotiation strategies are illustrated for both 
experiment families assuming that $p_M^c = 50$ (endnote 1). These 
results include the number of unsuccessful negotiation 
threads (UNTs), which are due to the fact that the PA's price 
offers exceed $p_M^c$, and the duration of the UNTs 
($L^c = 100$; endnote 2) in case no 
opponent behaviour prediction mechanism is used, the 
number of UNTs detected by the NNs at round 30, the 
ratio of UNTs that were detected and thus terminated early by 
the NNs, the mean duration of the UNTs and the mean 
UNT duration decrease. These results indicate that the 
MLP and the GR NNs manage to identify ~91% and 
~83% of the UNTs on average, respectively. Furthermore, 
the MLP and the GR NNs achieve ~64% and 
~58% reduction of the UNTs' duration on average, respectively. 
With regards to the elimination of the UNTs, 
the MLP-assisted strategy clearly outperforms the GR-assisted 
negotiation strategy. For the reasons above, it 
is estimated that MLP NNs are more appropriate for 
assisting negotiating intelligent agents to predict their 
opponent's behaviour at an early negotiation round in 
case the agent values a timely detection of unsuccessful 
negotiation threads. 
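
The derived columns of Table 2 can be reproduced from the raw counts with simple arithmetic. The sketch below assumes that a detected UNT is terminated at the prediction round (round 30), while an undetected one runs until the CA's deadline (round 100). The function name is illustrative.

```python
def unt_stats(total_unts, detected, detect_round=30, deadline=100):
    """Return (elimination ratio, mean UNT duration, mean duration decrease)."""
    elimination_ratio = detected / total_unts
    # Detected threads stop at the prediction round; the rest run to the deadline.
    mean_duration = (detected * detect_round
                     + (total_unts - detected) * deadline) / total_unts
    duration_decrease = 1.0 - mean_duration / deadline
    return elimination_ratio, mean_duration, duration_decrease


if __name__ == "__main__":
    # Overall row of Table 2: 101 UNTs, 92 detected by the MLP, 84 by the GR NN.
    for name, detected in (("MLP", 92), ("GR", 84)):
        ratio, duration, decrease = unt_stats(101, detected)
        print(name, f"{ratio:.1%}", f"{duration:.1f}", f"{decrease:.1%}")
```

Under these assumptions the computed values agree with the percentages and mean durations reported in the table, for example 91.1%, 36.2 rounds and 63.8% for the MLP overall.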



CONCLUSION 

This paper proposed to use Neural Networks in order 
to enhance negotiating agents with learning techniques 
enabling them to predict their opponents' negotiation 
behaviour. The designed NN-assisted negotiation strat- 
egy turns out to be very useful, as it leads to substantial 
duration reduction of unsuccessful negotiation threads, 
Table 1. Comparative results concerning the mean estimation error of the two NN-assisted negotiation strategies 
for the PA price offers, for the PA min offer and the PA concession rate. 

Experiment settings                      Times NN     Mean price-offer     Mean min-price-offer   Mean beta 
                                         estimation   estimation error     estimation error       estimation error 
                                         was used     MLP       GR         MLP       GR           MLP       GR 
---------------------------------------------------------------------------------------------------------------- 
$\beta \in [0.1, 10]$, $L^p = 200$       4118         0.97%     2.12%      0.41%     2.80%        15.65%    8.26% 
$L^p \in [150, 250]$, $\beta = 1$        7171         1.21%     1.71%      8.26%     8.91%        12.51%    12.73% 
OVERALL                                  11289        1.12%     1.86%      5.40%     6.68%        13.92%    10.72% 

Table 2. Comparative results concerning the unsuccessful negotiation thread detection by the two NN-assisted 
negotiation strategies 

Experiment settings                                # UNTs   Mean duration   # UNTs detected   UNTs' elimination   Mean UNTs'      Mean UNTs' 
                                                            of UNTs         at round 30       ratio               duration        duration decrease 
                                                            (no NN)         MLP     GR        MLP      GR         MLP     GR      MLP      GR 
------------------------------------------------------------------------------------------------------------------------------------------------ 
$\beta \in [0.1, 10]$, $L^p = 200$, $p_M^c = 50$   50       100             49      49        98.0%    98.0%      31.4    31.4    68.6%    68.6% 
$L^p \in [150, 250]$, $\beta = 1$, $p_M^c = 50$    51       100             43      35        84.3%    68.6%      41.0    52.0    59.0%    48.0% 
OVERALL                                            101      100             92      84        91.1%    83.2%      36.2    41.8    63.8%    58.2% 

(UNTs: unsuccessful negotiation threads.) 

due to the fact that the cases where agreements are not 
achievable are detected at an early stage. Thus, the NNs 
support the decision of the agents to withdraw or not 
from the ongoing negotiation threads. More specifically, 
when the CA uses the NN-assisted strategies it 
is capable of predicting its opponent's behaviour with 
significant accuracy, thus becoming aware of the potential 
outcome of the negotiation. Both the MLP and the GR 
NNs studied demonstrate an average opponent price offer 
estimation error lower than 2% and a PA min acceptable 
price estimation error of ~6%. Additionally, the unsuccessful 
negotiations are detected by the MLP NN in 
more than 90% of the cases on average, demonstrating 
~8% better overall performance than the GR NN. Thus, 
the MLP NN proves to be more appropriate when 
the CA aims to avoid a possible unprofitable or even 
unachievable agreement. This leads to minimization 
of the required time and processing resources and to 
maximization of the CA's overall profit from a series 
of threads for a single commodity. 
KEY TERMS 

Automated Negotiation: The process by which a 
group of actors communicate with one another, aiming 
to reach a mutually acceptable agreement on some 
matter, where at least one of the actors is an autonomous 
software agent. 

Bilateral Negotiation: A negotiation procedure, 
where exactly two parties are involved, i.e. a client 
and a provider. 

Backpropagation Algorithm: A supervised learn- 
ing technique used for training artificial NNs based 
on the minimisation of the error obtained from the 
comparison between the desired output and the actual 
one when applying specific inputs. 

Generalized-Regression (GR) NN: A GR NN is 
a special case of a RBF NN with a second linear layer 
and is often used for function approximation. 

Multi-Layer Perceptron (MLP): A fully con- 
nected feedforward NN with at least one hidden layer 
that is trained using back-propagation algorithmic 
techniques. 

Negotiation Protocol: The set of rules that govern 
the interactions between the negotiating parties. 

Negotiation Strategy: The reasoning model based 
on which the negotiating parties formulate their response 
to their opponent's offers. 

Neural Network (NN): A network modelled after 
the neurons in a biological nervous system with multiple 
synapses and layers. It is designed as an interconnected 
system of processing elements organized in a layered 
parallel architecture. These elements are called neu- 
rons and have a limited number of inputs and outputs. 
NNs can be trained to find nonlinear relationships in 
data, enabling specific input sets to lead to given target 
outputs. 

Radial Basis Function (RBF): Function that 
involves a distance criterion with respect to a centre, 
such as a circle, ellipse or Gaussian. 

RBF NN: It is an artificial NN, the activation func- 
tions of which are radial basis functions. It has two 
layers of processing, where the first maps the input onto 
each RBF neuron in the other (hidden) layer. 
ENDNOTES 

1 We selected $p_M^c$ to be equal to the median 
value in the PA's acceptable price interval. 

2 To be more accurate, the duration of UNTs is equal 
to $\min(L^c, L^p)$. However, in this paper's study, 
we always have $L^c < L^p$, and thus the duration of 
UNTs is equal to $L^c$. 
INTRODUCTION 

The agent paradigm has recently increased its influence 
in the research and development of computational 
logic-based systems. Logic Programming (LP) and Non-monotonic 
Reasoning, which allow clear and correct specifications, 
have been brought (back) to the spotlight. 
Also, the recent significant improvements in the 
efficiency of LP implementations for Non-monotonic 
Reasoning (De Schreye, Hermenegildo & Pereira, 
1999) have helped this resurgence. However, 
agents need to constantly update their knowledge base 
and, in particular, their intensional base (rules): 
the ability to react to changes in dynamic 
environments is of crucial importance within 
the context of software agents. Such a feature should 
correspond to the deliberative rational behavior wanted 
for our agents. 

The quality of the service that an agent offers 
depends on the way in which the agent combines rationality 
and reactivity. A reactive agent can offer well 
evaluated recommendations, but its response is based 
on outdated information, while a rational behavior may 
generate recommendations based on the most recently 
acquired information. So, we are interested in developing 
environment-aware agents. For this reason, it is 
very important to have an update process for agents, 
i.e., one that allows us to design agents with a rational 
component. 

Over recent years, several semantics for logic 
program updates have been proposed (Brewka, Dix, 
& Knonolige 1997) (De Schreye, Hermenegildo, & 
Pereira, 1999) (Katsumo & Mendelzon, 1991). All 
these semantics agree in considering the AGM 
proposal as the standard model in update theory, owing 
to its wealth of properties. The AGM approach, introduced 
in (Alchourron, Gardenfors & Makinson, 1985), is 
the dominant paradigm in the area, but in the context 
of monotonic logic. All these proposals analyze and 
reinterpret the AGM postulates under Answer Set 
Programming (ASP), as in (Eiter, Fink, Sabattini & 
Thompits, 2000). However, the majority of the adapted 
AGM and update postulates are violated by update 
programs, as shown in (De Schreye, Hermenegildo, & 
Pereira, 1999). For this reason, we have been working 
on finding properties that our update operator satisfies 
(Osorio & Zacarias, 2003) (Zacarias & Osorio, 2005) 
(Arrazola & Zacarias, 2005). Our purpose is to build 
a semantics based on structural properties. This is our 
main objective in update theory. In (De Schreye, 
Hermenegildo, & Pereira, 1999) (Osorio & Zacarias, 
2003) (Zacarias, Osorio & Arrazola, 2005) (Zacarias, 
2005) the authors present a set of properties that the 
update operator satisfies. In this paper we continue 
this line of research, presenting a novel proposal 
that aims to enrich the update theory that we 
began in (Osorio & Zacarias, 2003) (Zacarias, Osorio & 
Arrazola, 2005) (Zacarias, 2005). This novel proposal 
brings two benefits. First, we conserve many 
of the properties presented in previous works (Osorio 
& Zacarias, 2003) (Zacarias, Osorio & Arrazola, 2005) 
(Zacarias, 2005), such as Weak Irrelevance of Syntax 
(WIS). This property is similar to one postulate proposed 
by AGM, but in this case for nonmonotonic logic and 
under Answer Set Programming (ASP), introduced 
and defined by (Gelfond & Lifschitz, 1988). 
BACKGROUND 

In this section, we present advances in the updates 
context. Also, we give some general definitions for our 
theory. We define our theory about logic programs. 

Advances on Updates 

We consider the task of updating logic programs under 
non-monotonic reasoning and a purely logical view. 
Since an intelligent agent is situated in an environment 
which is subject to change, the agent is required to 
adapt over time. Agents that use logic programming 
techniques to represent their knowledge must 
therefore be capable of updating logic 
programs accordingly, in order to ensure adaptability. 
We chose one of the approaches, viz. update answer set 
semantics (Zacarias, 2005) (Osorio & Zacarias, 2003) 
(Eiter, Fink, Sabattini & Thompits, 2000) (Banti, Alferes 
& Brogi, 2003). Besides an underlying update semantics, 
which specifies how new, possibly inconsistent 
information has to be incorporated into the knowledge 
base, an agent needs to have a certain update policy, 
i.e., a specification of how to react upon the arrival of 
an update. The issue of how to specify change requests 
for knowledge bases has received growing attention 
more recently, and suitable specification languages for 
non-monotonic logic programs have been developed 
(Leite, 2001) (Leite, 2002). 

In (Zacarias, 2005) we introduced a new 
proposal towards the enrichment of the update operator 
"©". There, we presented a refinement of the 
stable model semantics for the update operator. We also 
presented a new property that allows us to handle updates 
where the new information contains rules that define a 
conservative extension. Thus, we gave an extension of the 
properties proven in (Osorio & Zacarias, 2003), under 
N logic. This approach is based on the work of 
Eiter et al. (Eiter, Fink, Sabattini & Thompits, 2000), 
and inspired by a recent approach presented by Alferes 
et al. (Banti, Alferes & Brogi, 2003). With this work, 
we improve and enrich the update operator proposed by 
Eiter et al. (Eiter, Fink, Sabattini & Thompits, 2000), 
resulting in a new update operator. 



UPDATES FOR REAL TIME 
APPLICATIONS 

In this section we present a novel mechanism that allows 
updating a knowledge base in a quick and easy way. 
Furthermore, this proposal satisfies similar structural 
properties to those that we have presented in previous 
works. So, we give the basic concepts for our theory 
and we present our main contribution based on signed 
formulae (Ariely, Denecker, Nuffelen & Bruynooghe, 
2004). 

Preliminary 

Rules are built from propositional atoms and the 0-place
connectives $\top$ and $\bot$ using negation as failure
($\neg$) and conjunction (,). A rule is an expression of
the form:

$$\mathit{Head} \leftarrow \mathit{Body} \tag{1}$$

If Body is $\top$ then we identify rule (1) with the rule
Head. If Head is $\bot$ then we identify rule (1) with a
restriction. A program is a set of rules. A logic program
$P$ is a (possibly infinite) set of rules. For a program $P$,
$I$ is a model of $P$, denoted $I \models P$, if $I \models L$
for all $L \in P$. As shown in (Brewka, Dix, & Knonolige, 1997),
the Gelfond-Lifschitz transformation for a program $P$
and a model $N \subseteq B_P$ ($B_P$ denotes the set of atoms
that appear in $P$) is defined by

$$P^N = \{ rule^N : rule \in P \}$$

where $(A \leftarrow B_1, \ldots, B_m, \neg C_1, \ldots, \neg C_n)^N$ is either:

a. $A \leftarrow B_1, \ldots, B_m$, if $\forall j \le n : C_j \notin N$;

b. $\top$, otherwise.

Note that $P^N$ is always a definite program. We can
therefore compute its least Herbrand model (denoted
as $M_{P^N}$) and check whether it coincides with the model
$N$ which we started with:

Definition 1. (Gelfond & Lifschitz, 1988) $N$ is a stable
model of $P$ iff $N$ is the minimal model of $P^N$.
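The transformation and Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the rule encoding as `(head, positive_body, negative_body)` triples and all function names are assumptions made here, restrictions (rules with head $\bot$) are not handled, and rules that the transformation maps to $\top$ are simply dropped, which leaves the least model unchanged.

```python
from itertools import chain, combinations

# Hypothetical encoding: a rule is (head, positive_body, negative_body).

def reduct(program, n):
    """Gelfond-Lifschitz transformation P^N: keep a rule (without its
    negative body) only if none of its negated atoms belongs to N."""
    return [(head, pos) for head, pos, neg in program if not (set(neg) & n)]

def least_model(definite_program):
    """Least Herbrand model of a definite program by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_program:
            if head not in model and set(pos) <= model:
                model.add(head)
                changed = True
    return model

def is_stable(program, n):
    """N is stable iff it equals the least model of P^N."""
    return least_model(reduct(program, n)) == set(n)

def stable_models(program):
    """Brute-force enumeration over all subsets of the atoms of P."""
    atoms = sorted({a for h, p, ng in program for a in chain([h], p, ng)})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(s) for s in subsets if is_stable(program, set(s))]

# p <- not q.    q <- not p.
program = [("p", (), ("q",)), ("q", (), ("p",))]
print(stable_models(program))
```

The enumeration is exponential in the number of atoms, so it is only meant to make the definition concrete, not to serve as a solver.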
DATABASE REPAIRS

In this document we present our most recent approach
to updates based on database repairs. The way in which
we approach the update problem is based on (Ariely,
Denecker, Nuffelen & Bruynooghe, 2004), i.e., on the
idea of repairing an inconsistent database. So, given
a possibly inconsistent database, this mechanism
represents the possible ways to restore its consistency in
terms of signed formulae. In our context, this mechanism
is used to incorporate new information that can
be inconsistent with the previous database containing
the agent's knowledge.

In a similar form as in (Ariely, Denecker, Nuffelen 
& Bruynooghe, 2004), L defines a first order language, 
S is a fixed database schema and D a fixed domain. A 
database instance D consists of atoms in the language 
L, where D has a finite active domain, A(D), which is 
a subset of D. 

Definition 2. (Ariely, Denecker, Nuffelen & Bruy- 
nooghe, 2004) a database is a pair (D, IC), where 
D is a database instance, and IC, the set of integrity 
constraints, is a finite and classically consistent set of 
formulae in L. 

So, a database is (Ariely, Denecker, Nuffelen &
Bruynooghe, 2004): $DB = (D, IC)$. Let

$$DB^{A} = D \cup IC^{A} = D \cup \{ \rho(\psi) \mid \psi \in IC,\ \rho : var(\psi) \rightarrow A(D) \}$$

where $\rho$ is a ground substitution of variables to the
individuals of $A(D)$, the active domain of $D$. $DB^{A}$ is
called the Herbrand expansion of $DB$. As $D$, $IC$ and
$A(D)$ are all finite sets, $DB^{A}$ is also finite, and so
$Z_{DB} = \{p_1, p_2, \ldots, p_n\}$, the set of the atomic
formulae that appear in $DB^{A}$, is finite as well.
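As an informal illustration of the Herbrand expansion, the following Python sketch grounds each integrity constraint with every substitution of its variables by individuals of the active domain. The string-template encoding of constraints, the predicate names `p` and `q`, and the function names are assumptions introduced only for this sketch.

```python
from itertools import product

def ground_substitutions(variables, active_domain):
    """All ground substitutions rho : variables -> active domain."""
    for values in product(sorted(active_domain), repeat=len(variables)):
        yield dict(zip(variables, values))

def herbrand_expansion(instance, constraints, active_domain):
    """D together with every ground instance rho(psi) of each constraint.

    A constraint is a hypothetical pair (variables, template), where the
    template is a format string with one placeholder per variable.
    """
    grounded = set()
    for variables, template in constraints:
        for rho in ground_substitutions(variables, active_domain):
            grounded.add(template.format(**rho))
    return set(instance) | grounded

instance = {"p(a)", "q(b)"}
active_domain = {"a", "b"}
constraints = [(("x",), "p({x}) -> q({x})")]
expansion = herbrand_expansion(instance, constraints, active_domain)
print(sorted(expansion))
```

Because the instance, the constraint set and the active domain are finite, the loop terminates and the resulting set is finite, which mirrors the finiteness argument in the text.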



UPDATES BASED ON SIGNED 
FORMULAE 

Owing to its simplicity, our new update process is suitable
for applications that have to give answers in real time.
In (Ariely, Denecker, Nuffelen & Bruynooghe, 2004)
the authors suppose that the database is inconsistent, and
then they give a general description of how to restore
the consistency of database instances that do not satisfy
a given set of integrity constraints. Here, we adapt this
method to updates of logic programs and we illustrate
that it can be used efficiently in our context.

Here, we consider updates in the setting of logic
programs, i.e., we consider that a database is represented
by a logic program. So, in our context, we start
from the fact that our database is updated with new
information that makes the database inconsistent. At
this moment, we apply this method with the objective
of making our database consistent. Next, we present
the general framework used in (Ariely, Denecker,
Nuffelen & Bruynooghe, 2004). A database is a pair
(D, IC), where D is a database instance, and IC, the
set of integrity constraints, is a finite and classically
consistent set of formulae in a language.

Given a possibly inconsistent database, our goal is 
to restore its consistency, i.e., to repair the database: 

Definition 3. Similarly to (Ariely, Denecker, Nuffelen
& Bruynooghe, 2004), an update of a database $DB =
(D, IC)$ is a pair $(Insert, Retract)$, s.t. $Insert \cap D =
\emptyset$ and $Retract \subseteq D$. A repair of $DB$ is an update
of $DB$ for which $((D \cup Insert) \setminus Retract, IC)$ is a
consistent database.

The intuitive meaning is as follows: a database is
updated by inserting the elements of Insert and
removing the elements of Retract. An update is a repair
when the resulting database is consistent. Note that if
$DB$ is consistent, then $(\emptyset, \emptyset)$ is a repair of $DB$.
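Definition 3 can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the database instance is a set of atom names, each integrity constraint is modelled as a Python predicate over that set, and the atoms and function names are chosen here only for illustration.

```python
def is_update(d, insert, retract):
    """(Insert, Retract) is an update iff Insert is disjoint from D
    and Retract is a subset of D."""
    return not (insert & d) and retract <= d

def apply_update(d, insert, retract):
    """The updated instance (D u Insert) \\ Retract."""
    return (d | insert) - retract

def is_repair(d, constraints, insert, retract):
    """An update is a repair iff the updated instance satisfies
    every integrity constraint."""
    if not is_update(d, insert, retract):
        return False
    updated = apply_update(d, insert, retract)
    return all(constraint(updated) for constraint in constraints)

# Hypothetical instance and constraint: the atoms "power-failure" and
# "tv-on" must never hold together.
d = {"night", "tv-on"}
constraints = [lambda s: not ({"power-failure", "tv-on"} <= s)]

print(is_repair(d, constraints, {"power-failure"}, {"tv-on"}))
print(is_repair(d, constraints, {"power-failure"}, set()))
print(is_repair(d, constraints, set(), set()))
```

The last call illustrates the remark that the empty update is a repair of a database that is already consistent.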

Next, we give some examples that illustrate
how this mechanism is adapted to updates.

Example 1. This example illustrates a daily update
regarding the energy flaw (Eiter, Fink, Sabattini &
Thompits, 2000). Suppose that you have the following
database:

DB: sleep ← ¬tv-on.
    night.
    tv-on.
    watch-tv ← tv-on.
    ← power-failure, tv-on.

Here, the DB is consistent; however, if we update
the DB with the following rule:

power-failure.

This DB is not consistent. Following the format
defined in (Ariely, Denecker, Nuffelen & Bruynooghe,
2004) this example is the pair:

DB = ({sleep ← ¬tv-on., night., tv-on., watch-tv ←
tv-on., power-failure.}, {← power-failure, tv-on.})

Therefore, we can adapt the method proposed in
(Ariely, Denecker, Nuffelen & Bruynooghe, 2004)
to carry out a correct update of this database as
follows:

First, we add the new information (update) to our
initial configuration:

DB = ({sleep ← ¬tv-on., night., tv-on., watch-tv ←
tv-on., power-failure.}, {← power-failure, tv-on,
power-failure.})

Second, we obtain the signed formula:

(¬power-failure ∨ ¬tv-on) ∧ power-failure.

Using the format of (Ariely, Denecker, Nuffelen &
Bruynooghe, 2004) we obtain:

(¬¬s_power-failure ∨ ¬¬s_tv-on) ∧ ¬s_power-failure

which is equivalent to:

(s_power-failure ∨ s_tv-on) ∧ ¬s_power-failure

Third, we calculate v_R, the valuation that is associated
with R, obtaining:

v_R(s_power-failure) = 1
v_R(s_tv-on) = 1
v_R(¬s_power-failure) =

Therefore, we have: ({power-failure}, {tv-on})
This means that the suggested update is:

Insert = {power-failure}
Retract = {tv-on}

As we can see, this is the wanted result.

Now, we show another example using the method
presented in (Eiter, Fink, Sabattini & Thompits,
2000).

Example 2. This program describes some knowledge
about the sky (Banti, Alferes & Brogi, 2003). Suppose
that you have the following database:

DB: day ← ¬night.
    ¬see-stars.
    ← see-stars, day.

Here, the DB is consistent; however, if we update
the DB with the following rule:

see-stars.

This DB is not consistent. Following the format
defined in (Ariely, Denecker, Nuffelen & Bruynooghe,
2004) this example is the pair:

DB = ({day ← ¬night, ¬see-stars.}, {← see-stars,
day.})

Therefore, we can adapt the method proposed in
(Ariely, Denecker, Nuffelen & Bruynooghe, 2004),
obtaining a correct update of this database as follows:

First, we add the new information (update) to our
initial configuration:

DB = ({day ← ¬night., ¬see-stars.}, {← see-stars,
day, see-stars.})

Second, we obtain the signed formula:

(¬see-stars ∨ ¬day) ∧ see-stars.

Using the format of (Ariely, Denecker, Nuffelen &
Bruynooghe, 2004) we obtain:

(¬¬s_see-stars ∨ ¬¬s_day) ∧ ¬¬s_see-stars

which is equivalent to:

(s_see-stars ∨ s_day) ∧ s_see-stars

Third, we calculate v_R, the valuation that is associated
with R, obtaining:

v_R(s_see-stars) = 1
v_R(s_day) = 1
v_R(s ) = 1

Therefore, we have: ({see-stars}, {day})
This means that the suggested update is:

Insert = {see-stars}
Retract = {day}

This is as desired.



time. Also, this proposal opens the possibilities for 
building real-life applications, like intelligent agents 
whose rational component is modelled by a knowledge 
base, which is in turn maintained using update logic 
programs. 



FUTURE TRENDS 
KEY TERMS

Beliefs: An agent whose knowledge base is the
theory T believes F if and only if F belongs to every
intuitionistically complete and consistent extension of
T by adding only negated literals.

Equivalence: Two programs are equivalent if they
have exactly the same answer sets.

Intelligent Agent: An intelligent agent is a software
(or hardware) component that perceives and acts
autonomously in an open and dynamic environment,
learning and cooperating with other agents (the
same user) to offer a benefit to its user.

Principle of Irrelevance of Syntax: The meaning
of the knowledge that results from an update must be
independent of the syntax of the original knowledge,
as well as independent of the syntax of the update
itself.

Strong Equivalence: (Lifschitz, Pearce & Valverde,
2001). We say that $P_1$ and $P_2$ are strongly equivalent
if for every program $P$, $P_1 \cup P$ and $P_2 \cup P$ have the
same answer sets.

Update: Let $P$ be the program representing the
current knowledge base. If it is updated by another
program $U$, then $P_U$ is an updated program of $P$ if and
only if the models of $P_U$ are the result of updating each
of the models of $P$ according to a given semantics $S$:
the update request $U$ is applied to each of these models
to obtain a new set of models $M$, and $P_U$ is any logic
program whose models are exactly $M$.

Weak Irrelevance of Syntax: $T_1 \equiv T_2$ implies
Bel(K V $T_1$) = Bel(K V $T_2$), where K, $T_1$ and $T_2$ are any
theories, Bel(T) defines the set of answer sets of T, V is
the update operator, and equivalence is understood to
mean that both programs ($T_1$ and $T_2$) have the same
answer sets.
INTRODUCTION

The prediction of hourly solar radiation data has
important consequences in many solar applications
(Markvart, Fragaki & Ross, 2006). Such data can be
regarded as a time series and its prediction depends
on accurate modeling of the stochastic process. The
computation of the conditional expectation, which is in
general non-linear, requires knowledge of the high
order distribution of the samples. Using finite data,
such distributions can only be estimated or fit into a
pre-set stochastic model. Methods like Auto-Regressive
(AR) prediction, Fourier Analysis (Dorvlo, 2000),
Markov chains (Jain & Lungu, 2002) (Muselli, Poggi,
Notton & Louche, 2001) and the ARMA model (Mellit,
Benghanem, Hadj Arab, & Guessoum, 2005) for
designing non-linear signal predictors are examples of
this approach. The neural network (NN) approach also
provides a good solution to the problem by utilizing its
inherent adaptive nature (Elminir, Azzam, Younes, 2007).
Since NNs can be trained to predict results from examples,
they are able to deal with non-linear problems. Once
the training is complete, the predictor can be fixed
for further prediction at high speed. A
number of researchers have worked on prediction of
global solar radiation data (Kaplanis, 2006) (Bulut &
Buyukalaca, 2007). In these works, the data is treated in
its raw form as a 1-D time series, therefore the inter-day
dependencies are not exploited. This article introduces
a new and simple approach for hourly solar radiation
forecasting. First, the data are rendered in a matrix to
form a 2-D image-like model. As a first attempt to test
the efficiency of the 2-D model, optimal linear image
prediction filters (Gonzalez, 2002) are constructed. In
order to take into account the adaptive nature of complex
and non-stationary time series, NNs are also applied to
the forecasting problem and results are discussed.

BACKGROUND


This article presents a two-dimensional model approach 
for the prediction of hourly solar radiation. Before 
proceeding with the prediction results, the following 
technical background is provided. Using the described 
tools, the approach is tested with optimal coefficient 
linear filters and artificial NNs (Hocaoglu, Gerek & 
Kurban, 2007). 

The 2-D Representation of Solar 
Radiation Data 

The collected hourly solar radiation data is a 1-D dis- 
crete-time signal. In this work, we render this data in 
a 2-D matrix form as given in equation 1. 



$$Rad = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} \tag{1}$$



where the rows and columns of the hourly solar radiation
matrix indicate days and hours, respectively. Such a
2-D representation provides significant insight into
the radiation pattern over time. First a surface plot of
the data is obtained, then an image view of the data is
obtained and given in Fig. 1.
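The rendering of the 1-D hourly series into the days-by-hours matrix of equation 1 can be sketched as follows. This is a minimal illustration using plain Python lists; the function name and the synthetic three-day series are assumptions made only for this example.

```python
def to_day_hour_matrix(hourly, hours_per_day=24):
    """Reshape a 1-D hourly series into rows of days and columns of hours."""
    if len(hourly) % hours_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    days = len(hourly) // hours_per_day
    return [hourly[d * hours_per_day:(d + 1) * hours_per_day]
            for d in range(days)]

# Synthetic placeholder series covering three days (not measured data).
hourly = list(range(72))
rad = to_day_hour_matrix(hourly)
print(len(rad), len(rad[0]))  # number of days, number of hours per day
print(rad[1][5])              # day index 1, hour index 5
```

With this layout, moving along a row follows the hours of one day, while moving down a column compares the same hour on consecutive days.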

By inspecting the image version of the data in Fig.
1, it is easy to interpret the daily and seasonal behavior
of solar radiation. Dark regions of the image indicate that
there is no sunshine on the horizontal surface. The
transition from black to white indicates that the solar
radiation falling on the horizontal surface is increasing
or decreasing. During winter time, the dawn to dusk period
is shorter, producing a narrower protruding blob.
Conversely, the white blob is wider during summer times,
indicating that the day-time is longer. The width behavior
of the white blob clearly indicates the seasonal changes of
sun-light periods. The horizontal and vertical correlations
within the 2-D data are quite pronounced. This
implies that, given the vertical correlation among the
same hours of consecutive days, it is beneficial to use
2-D prediction for hourly forecasting. The prediction
efficiency of the proposed model is illustrated with 2-D
optimum linear prediction filters and NNs.

Figure 1. Image view of solar radiation data

Optimal 2-D Linear Prediction Filter 
Design 

From the predictive image coding literature, it is known
that a 2-D matrix can be efficiently modeled by linear
predictive filters (Gonzales, 2002) (Sayood, 2000). The
prediction domain is a free parameter determined
according to the application. Consider a three-coefficient
prediction filter structure as given in expression 2:



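$$\begin{array}{cc} x_{i,j} & x_{i,j+1} \\ x_{i+1,j} & \hat{x}_{i+1,j+1} = \; ? \end{array} \tag{2}$$

The linear filter coefficients $a_1$, $a_2$ and $a_3$ are
optimized, and the prediction result $\hat{x}_{i+1,j+1}$ is
estimated as

$$\hat{x}_{i+1,j+1} = x_{i,j} \cdot a_1 + x_{i,j+1} \cdot a_2 + x_{i+1,j} \cdot a_3 \tag{3}$$

The prediction error for this term is:

$$\varepsilon_{i+1,j+1} = x_{i+1,j+1} - \hat{x}_{i+1,j+1} \tag{4}$$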



The total error energy corresponding to the whole 
image prediction can be calculated as: 



$$\varepsilon = \sum_{i=2}^{m} \sum_{j=2}^{n} \varepsilon_{ij}^{2} \tag{5}$$



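where $m$ and $n$ correspond to the numbers of rows (days)
and columns (hours) of the image, which are, for the solar
data, 365 and 24, respectively. The filter coefficients that
minimize this function can be found by setting its
derivatives with respect to the coefficients to zero:

$$\frac{\partial \varepsilon}{\partial a_1} = \frac{\partial \varepsilon}{\partial a_2} = \frac{\partial \varepsilon}{\partial a_3} = 0 \tag{6}$$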
The solution to equation 6 yields the following
matrix-vector equation:

$$\begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} \tag{7}$$



which is compactly written as $R \cdot a = r$, so the optimal
filter coefficients can be obtained as

$$a = R^{-1} \cdot r \tag{8}$$
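The following Python sketch shows one way the optimal coefficients could be computed from a 2-D data matrix. It rests on an assumption that is not spelled out in the text: the entries of $R$ and $r$ are taken to be the usual least-squares sums $R_{kl} = \sum u_k u_l$ and $r_k = \sum u_k x_{i+1,j+1}$, where $u = (x_{i,j}, x_{i,j+1}, x_{i+1,j})$ is the regressor triple of equation 3. All function names and the synthetic test image are hypothetical.

```python
def normal_equations(image):
    """Accumulate R (3x3) and r (3) over every 2x2 neighbourhood."""
    R = [[0.0] * 3 for _ in range(3)]
    r = [0.0] * 3
    for i in range(len(image) - 1):
        for j in range(len(image[0]) - 1):
            u = (image[i][j], image[i][j + 1], image[i + 1][j])
            target = image[i + 1][j + 1]
            for k in range(3):
                r[k] += u[k] * target
                for l in range(3):
                    R[k][l] += u[k] * u[l]
    return R, r

def solve(R, r):
    """Solve R a = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [row[:] + [rhs] for row, rhs in zip(R, r)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("R is singular; coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    a = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][c] * a[c] for c in range(row + 1, n))
        a[row] = (aug[row][n] - known) / aug[row][row]
    return a

def optimal_coefficients(image):
    return solve(*normal_equations(image))

def synthetic_image(first_row, first_col, coeffs):
    """Build a test image that obeys equation 3 exactly (illustration only)."""
    a1, a2, a3 = coeffs
    width = len(first_row)
    image = [list(first_row)] + [[v] + [0.0] * (width - 1) for v in first_col[1:]]
    for i in range(len(image) - 1):
        for j in range(width - 1):
            image[i + 1][j + 1] = (a1 * image[i][j] + a2 * image[i][j + 1]
                                   + a3 * image[i + 1][j])
    return image

image = synthetic_image([1.0, 2.0, 4.0, 3.0, 5.0],
                        [1.0, 3.0, 2.0, 5.0, 4.0],
                        (0.25, 0.5, 0.25))
a = optimal_coefficients(image)
print(a)
```

Because the synthetic image is generated with known coefficients, the recovered values give a simple sanity check of the solver; on measured data the residual of equation 5 would of course not vanish.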

A Brief Discussion on Learning 
Techniques of NNs 



There are several techniques to achieve high speed
NN algorithms. Among these techniques, heuristic
techniques were developed from an analysis of the
performance of the standard steepest descent algorithm
(Costa, Braga, Menezes, 2007). Other fast algorithms
use standard numerical optimization techniques such as
conjugate gradient, quasi-Newton, and
Levenberg-Marquardt. The basic
back propagation algorithm adjusts the weights in the
steepest descent direction. It turns out that, although the
function decreases most rapidly along the negative of
the gradient, this does not necessarily produce the
fastest convergence. In the conjugate gradient algorithms a
search is performed along conjugate directions, which
generally produces faster convergence than steepest
descent directions. Newton's method is an alternative
to the conjugate gradient methods, which often
converges faster. As a drawback, the method is complex
and expensive because it requires calculating the Hessian
matrix in feed forward NNs. The computationally simpler
quasi-Newton methods do not require calculation of
second derivatives. Similarly, the Levenberg-Marquardt
algorithm was also designed to approach second-order
training speed without having to compute the Hessian
matrix. There are a number of studies on different
subjects that compare training algorithms
(Ghaffari, Abdollahi, Khoshayand, Bozchalooi,
Dadgar & Tehrani, 2006) (Srinivasulu & Jain, 2006)
(Pereda, Lope & Maravall, 2006). Since the
Levenberg-Marquardt algorithm supplies faster convergence,
it is adopted and used in this article.



SOLAR RADIATION DATA 
FORECASTING RESULTS 

In order to reduce computational complexity and to
focus on the proposal, relatively short 1-D and 2-D
prediction filters are used in this work. The filter
templates are given in Fig. 2. These templates are also
widely used in predictive image and signal coding.

For the minimum RMSE linear prediction, the
optimal coefficients are analytically determined by
solving Eq. 8. The 2-D image data is fed to the
prediction system, and error figures are obtained for each
hour. The error figure for the 2-D 3-tap optimum filter is
given in Fig. 3.

As a second step prediction model, two NN structures
are applied to the data. In the first structure, the input is
treated as 1-D, and the network inputs are the i-th,
(i+1)-th and (i+2)-th elements of the data, where the output
is the (i+3)-th element for each sample in the data. In the
second structure, the proposed 2-D image matrix form
is used. The inputs of the network are the (i,j)-th,
(i+1,j)-th and (i,j+1)-th elements of the 2-D data matrix and
the output is the (i+1,j+1)-th element of the data matrix
for each i and j. A 2-month period is used for testing.
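The two input/output arrangements can be made concrete with a short Python sketch that only builds the training pairs (it does not train a network). The function names and the tiny example data are assumptions chosen for illustration.

```python
def samples_1d(series):
    """Inputs (x_i, x_{i+1}, x_{i+2}) with target x_{i+3}."""
    return [((series[i], series[i + 1], series[i + 2]), series[i + 3])
            for i in range(len(series) - 3)]

def samples_2d(matrix):
    """Inputs (x_{i,j}, x_{i+1,j}, x_{i,j+1}) with target x_{i+1,j+1}."""
    pairs = []
    for i in range(len(matrix) - 1):
        for j in range(len(matrix[0]) - 1):
            inputs = (matrix[i][j], matrix[i + 1][j], matrix[i][j + 1])
            pairs.append((inputs, matrix[i + 1][j + 1]))
    return pairs

series = [10, 20, 30, 40, 50]
matrix = [[1, 2, 3],
          [4, 5, 6]]
print(samples_1d(series))
print(samples_2d(matrix))
```

In the 2-D arrangement each target is predicted from the previous hour of the same day and from two values of the previous day, which is how the inter-day dependency enters the model.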

The sigmoid function and the gradient descent
algorithm with the Levenberg-Marquardt modification
are used during the learning process, with three neurons
in the hidden layer. To accelerate the learning
process a momentum term is used, which adds
a fraction of the previous weight update to the current
one. After the learning phase, the network is simulated
with the remaining image data and error samples are
obtained (Fig. 4).
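The role of the momentum term can be illustrated with a generic gradient-descent update. This is a sketch of the classical momentum rule only, not of the Levenberg-Marquardt procedure used in the article; the function name, the learning rate and the momentum factor are assumptions made for the example.

```python
def momentum_step(weights, gradients, previous_updates,
                  learning_rate, momentum):
    """One weight update: a gradient step plus a fraction of the
    previous update. Returns the new weights and the updates applied."""
    updates = [-learning_rate * g + momentum * p
               for g, p in zip(gradients, previous_updates)]
    new_weights = [w + u for w, u in zip(weights, updates)]
    return new_weights, updates

# Hypothetical single-weight example.
new_weights, updates = momentum_step(
    weights=[1.0], gradients=[2.0], previous_updates=[0.5],
    learning_rate=0.25, momentum=0.5)
print(new_weights, updates)
```

When consecutive gradients point in the same direction the previous update reinforces the current one, which is the acceleration effect mentioned above.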

Root Mean Square Error (RMSE) values that are 
obtained from proposed optimum linear prediction 
filters and NNs are presented in Table I. The correlation 
coefficients between actual data values and predicted 
data values are also tabulated here. 



FUTURE TRENDS 

2-D representation has potential uses for different 
meteorological parameters and different models such 
as surface matching, clustering based classification, 
etc. Dynamical time varying behavior of the model 
may also be analyzed. Such analysis can be regarded 
as future works of this study. 
Figure 2. 1-D and 2-D prediction templates used for
modeling the image (panels: 1-D prediction template,
2-D prediction template)



To emphasize the efficiency of the proposed 2-D rep- 
resentation, two feed-forward NN structures, one for 
1-D modeling and the other for the 2-D, are built and 
trained by the same data. The RMSE values are obtained 
as 42.012 and 38.66 for 1-D and 2-D case, respectively. 
This observation also justifies the efficiency of the 2-D 
data representation that exploits inter-day dependencies 
of the solar radiation pattern. Furthermore, it is clear 
that the 2-D NN structure provides better prediction 
than the optimum linear filter. 
Figure 3. Error image obtained from 2-D optimal linear filter

Figure 4. Test error image obtained from feed forward BP-NN

KEY TERMS

2-D Data Representation: A matrix containing 
vertical and horizontal indexes can also be considered 
as a 2-D image. A 2-D representation does not have 
to correspond to an image acquired by a camera or an 
imaging device. Here, the representation is used for 
the compact visualization of the solar data. 

Artificial Neural Networks: A network of many 
simple processors ("units" or "neurons") that imitates 
a biological neural network. The units are connected 
by unidirectional communication channels, which 
carry numeric data. Neural networks can be trained 
to find nonlinear relationships in data, and are used 
in applications such as robotics, speech recognition, 
signal processing or medical diagnosis. 



Backpropagation Algorithm: Learning algorithm 
of ANNs, based on minimising the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 

Optimal Coefficient Linear Filters: A linear pre- 
dictor takes a linear combination of past values in a time 
series, and assigns this combination as the prediction 
value. While taking the linear combination, the scales 
of each past sample should be calculated in a way that 
the prediction error has minimum amount of energy. 
Such a set of scales are called optimal coefficients of 
a linear filter. 

Prediction Error: Difference between the actually
measured and the previously forecasted value of time-series
data. Commonly represented in terms of RMSE.

RMSE: Root-Mean-Squared Error. A quantitative
error measure that defines the error between two sets
of data by taking the one-by-one differences, squaring
each difference, averaging the squared terms, and finally
taking the square root.
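The two figures of merit reported in the results (RMSE and the correlation coefficient between actual and predicted values) can be sketched as below. The averaging step in `rmse` follows the "mean" in the name of the measure, and the Pearson form of the correlation coefficient is an assumption, since the text does not state which coefficient is used. Function names and sample values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root of the mean of the squared one-by-one differences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def correlation(actual, predicted):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

print(rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```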

Solar Radiation: Radiant energy emitted by the 
sun from a nuclear fusion reaction that creates elec- 
tromagnetic energy. 
INTRODUCTION

It is possible to implement diagnostic support systems
for the evaluation of the phonatory system using
the speech signal, by means of techniques based on expert
systems. The application of these techniques allows the
early detection of alterations in the phonatory system or
the evaluation over time of patients under a certain
treatment, to mention some examples. The procedure of
measuring the voice quality of a speaker from a digital
recording consists of quantifying different acoustic
characteristics of speech, which makes it possible to
compare it with certain reference patterns, identified
previously by a "clinical expert".

A speech acoustic quality measurement based on 
an auditory assessment is very hard to assess as a 
comparative reference amongst different voices and 
different human experts carrying out the assessment 
or evaluation. 

In the current literature, some attempts have been
made to obtain objective measures of speech quality
by means of multidimensional clinical measurements
based on auditory methods. Well-known examples are:
the GRBAS scale from Japan (Hirano, M., 1981) and its
extension developed and applied in Europe (Dejonckere,
P. H., Remacle, M., Fresnel-Elbaz, E., Woisard, V.,
Crevier-Buchman, L., Millet, B., 1996), a set of perceptual
and acoustic characteristics in Sweden (Hammarberg, B. &
Gauffin, J., 1995), and a set of phonetic characteristics
with added information about the excitation of the vocal
tract. The aim of these (speech quality measurement)
procedures is to obtain an objective measurement from
a subjective evaluation.

There exist different works in which objective
measurements of speech quality obtained from a recording
are proposed (Alonso J. B., 2006), (Boyanov, B. &
Hadjitodorov, S., 1997), (Hansen, J.H.L., Gavidia-Ceballos,
L. & Kaiser, J.F., 1998), (Stefan Hadjitodorov & Petar
Mitev, 2002), (Michaelis D., Frohlich M. & Strube H.
W., 1998), (Boyanov B., Doskov D., Mitev P.,
Hadjitodorov S. & Teston B., 2000), (Godino-Llorente, J.I.,
Aguilera-Navarro, S. & Gomez-Vilda, P., 2000).

In these works a voiced sustained sound (usually
a vowel) is recorded and then used to compute speech
quality measurements. A voiced sustained sound is used
because, during the production
of this kind of sound, the speech system uses almost
all its mechanisms (constant glottal air flow, continuous
vocal fold vibration, ...), enabling us
to detect any anomaly in these mechanisms. In these
works different sets of measurements are suggested in
order to quantify speech quality objectively. All these
works reveal one important fact: it is necessary
to obtain different measurements of the speech signal
in order to capture the different aspects of the acoustic
characteristics of the speech signal.



BACKGROUND 

A speech recording gives different characteristics of 
the speech quality of a speaker. The recorded speech 
signal can be represented in different domains. Each 
domain shows some of the speech characteristics in a 
preferential way. The main domains studied in speech 
processing are: 

• Time Domain
• Spectral Domain
• Cepstral Domain
• Inverse Model Domain

Most works in digital speech signal processing are 
based on these domains. However, other works use 
new domains derived from the former ones. 

In the following section the most important features 
of each domain are described. 
Time Domain

A high quality speech signal possesses a more regular
envelope than a low quality speech signal. This fact is
more evident in short time intervals. The main phenomena
that enable us to distinguish between high quality
speech and low quality speech are:

• The energy of the speech signal in a short time
interval changes considerably between two consecutive
intervals in low quality speech, whereas in high quality
speech there is less change in energy.

• In low quality speech, aperiodic intervals (intervals
without periodicity) appear during voiced sustained
speech.

Spectral Domain

Low quality speech (a voiced sustained sound) has
the following characteristics:

• Less regularity of the spectral envelope, mainly
at low frequencies.

• A higher percentage of energy at low frequencies
with regard to the total energy.

• Energy blocks at high frequencies. These blocks
are caused by glottal noise.

• A great change of the power spectrum from one
frame with regard to contiguous frames.



Figure 1. Speech signal in the time domain: the five sustained Spanish vowels are illustrated. The upper panel
(Healthy speech) is a speaker with high quality speech. The lower panel (Pathological speech) is a speaker
with low quality speech.

Figure 2. Sustained voiced sound during a short time interval from high quality speech (left) and from low
quality speech (right). Horizontal axes: Time (s).
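The time-domain criterion of short-time energy variation between consecutive intervals can be sketched as follows. This is only one possible way to quantify it: the non-overlapping framing, the relative-change measure and the function names are assumptions, and the two toy signals are synthetic rather than recorded speech.

```python
def frame_energies(signal, frame_len):
    """Energy (sum of squares) of consecutive non-overlapping frames."""
    return [sum(s * s for s in signal[k:k + frame_len])
            for k in range(0, len(signal) - frame_len + 1, frame_len)]

def mean_relative_energy_change(signal, frame_len):
    """Average relative change of energy between consecutive frames."""
    energies = frame_energies(signal, frame_len)
    changes = [abs(b - a) / a
               for a, b in zip(energies, energies[1:]) if a > 0]
    return sum(changes) / len(changes) if changes else 0.0

# Synthetic signals: constant amplitude versus a sudden amplitude jump.
steady = [1.0, -1.0] * 8
unsteady = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0, 2.0, -2.0]
print(mean_relative_energy_change(steady, 4))
print(mean_relative_energy_change(unsteady, 4))
```

A larger value indicates stronger energy fluctuation between consecutive intervals, which the text associates with low quality speech.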
Figure 3. Estimated spectrogram of high quality speech (top) and of low quality speech (bottom). The five
Spanish vowels are pronounced. The sampling frequency is 22050 Hz. Horizontal axes: Time (s).




High quality speech concentrates its energy around
certain formants, mainly the first and the third formants,
whereas low quality speech has noise components
around the formants.

High quality speech has great spectral wealth.
However, low quality speech has few harmonic
components, mainly concentrated at very
low frequencies.

The amount of spectral wealth is a characteristic of
the voice of a certain speaker. However, the variation of
the spectral wealth in time (during the production of a
sustained voiced sound) is indeed an indicator of low
quality speech.

Another characteristic of low quality speech (during
the production of a sustained voiced sound) is the
variation in the vibration rhythm of the vocal folds, i.e.
variation of the pitch frequency.

Cepstral Domain 

Characteristics to evaluate the speech quality can 
be identified in the cepstral domain: envelope of the 
spectrum, spectral wealth, harmonics and noise com- 
ponents identification, etc. A sustained voiced sound 
with three times the pitch period in length is used in 
the cepstral domain. 

The spectral wealth of a speech signal can be
quantified by the amplitude and width of the cepstral
component of the pitch frequency. The existence of a
peak with great amplitude indicates the presence of
notable energy in that harmonic component. This
is a characteristic of high quality speech. A reduced
width of the cepstral peak corresponding to the pitch
indicates the high stability of the pitch frequency for
three consecutive periods of pitch. This is also a
characteristic of high quality speech. Characteristics such
as the amplitude and width of the cepstral peak
corresponding to the second harmonic can also be used to
distinguish between high and low quality speech. High
quality speech possesses a cepstral peak corresponding to
the first harmonic narrower than the cepstral peak
corresponding to the second harmonic.

Figure 4. A high quality speech (top) and a low quality speech (bottom) in the corrected power cepstrum.

Glottal noise in the speech signal can be estimated by
means of the relationships among different regions in
the cepstral domain: the harmonic component (cepstral
components of the pitch and its harmonics) and the noise
component (the remaining cepstral components).

Inverse Model Domain 

In this domain, the waveform of the air pulse produced by
the vocal folds during the production of a sustained
voiced sound is estimated. The air pulse is also called
residual signal or glottal flow. The estimate is
obtained with an inverse filter applied to the speech
signal in order to eliminate the vocal tract effect and
the lip radiation effect.

The quality of speech can be quantified by some
features of the glottal signal such as amplitude values,
the time at which the vocal folds start to open, the time at
which the vocal folds are completely open, the time at which
the closing phase of the vocal folds starts, and different
relationships between different times in the glottal cycle:
open quotient, speed quotient, closing quotient, etc.
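The quotients named above can be sketched from a few time instants of one glottal cycle. The text does not give their formulas, so the sketch below uses commonly quoted definitions as an assumption: open quotient = open-phase duration / period, speed quotient = opening duration / closing duration, closing quotient = closing duration / period. It also assumes, as a simplification, that the closing phase starts at the instant of maximum opening. The function name, argument names and example values are hypothetical.

```python
def glottal_quotients(t_open_start, t_max_open, t_close_end, period):
    """Quotients of one glottal cycle from three time instants (seconds
    or any consistent unit) and the cycle period."""
    opening = t_max_open - t_open_start   # folds moving apart
    closing = t_close_end - t_max_open    # folds coming together
    return {
        "open_quotient": (opening + closing) / period,
        "speed_quotient": opening / closing,
        "closing_quotient": closing / period,
    }

# Hypothetical cycle of 10 time units: opening lasts 3, closing lasts 2.
quotients = glottal_quotients(0.0, 3.0, 5.0, 10.0)
print(quotients)
```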

Non Linear Domain 

The main commercial systems to evaluate speech quality
objectively from a recording (Dr Speech (Tiger
Elemetric), SSVA (System for Single Voice Analysis),
MDVP (Multi-Dimensional Voice Program), EVA
(Evaluation Vocal Assistee), CSL (Computerized
Speech Laboratory), PRAAT, VISHA, CSRE (Computerized
Speech Research Environment), MEDIVOZ,
etc.) do not assess nonlinear characteristics of the speech
signal.

The most popular model of the voice production
system is a time-variant system based
on linear acoustic theory. It consists of a source/filter
model. The existence of variations in the spectral amplitude
of the speech signal at the fundamental frequency leads us
to consider a nonlinear behaviour of the speech signal.

A fundamental frequency (fs) and a subharmonic
(fs/2) have been identified in the speech signal (Xuejing
Sun & Yi Xu, 1995). The effect of a subharmonic is
amplitude modulation and/or frequency modulation.
Other authors indicate that 31% of speakers with
pathological speech have subharmonics in speech.
However, the existence of subharmonics has also
been identified in high quality speech (Haben, Kost
& Papagiannis, 2003). It is estimated that 10.5% of
speakers with healthy speech have subharmonics, so
their presence is not by itself a symptom of anomalous
speech.

There exist two theories to justify the presence of
subharmonics:

Titze theory (Titze, I.R., 1994): subharmonics are
due to mechanical or geometric asymmetries between
vocal folds.

Svec theory (Svec JG, Schutte HK, Miller DG, 1996):
subharmonic frequency is due to the combination of
two vibrational modes (biphonation: the presence of
two main frequencies) whose frequencies have a 3:2
relation.

Nevertheless, the two theories are equivalent according to (Neubauer
J., Eysholdt P., Eysholdt U., Herzel H., 2001), where the authors point
out that biphonation is due to asymmetry between the left and right
vocal folds or to desynchronization in the back-and-forth vibration
(Haben C.M., Kost K. & Papagiannis G., 2003). Asymmetry and
desynchronization are caused by differences in mass and viscoelastic
properties between the vocal folds. This can be modelled by nonlinear
phonation. In (Ayache S., Maurice Ouaknine, Dejonkere P., Prindere P. &
Giovanni A., 2004) nonlinear models are suggested in order to explain
the effect of the mucus viscosity of the vocal folds (mucus on the
vocal fold surface generates surface tension and causes adhesion).

In the traditional model of the vocal tract, sound wave propagation is
assumed to be plane wave propagation. However, sound pressure and
volume variation measurements are better fitted by a nonlinear fluid
dynamics model. This stems from turbulence (or even periodic
turbulence) produced by the cavities between the vocal folds and the
false vocal folds. This turbulence excites the vocal tract in the
closing phase of the vocal folds.

Fractal dimension has been studied by some authors, who conclude that
a high quality speech signal has low dimensionality. It is stated in
(Orlikoff R.F., Baken R.J., 2003) that the amount of nonlinearity in
the speech system is an indicator of abnormal phonation, and it has
been suggested that the phase space dimensionality, used to
characterize the attractor, could be related to the amount of mass of
the vocal folds.



QUALITY SPEECH QUANTIFICATION 

The previous section described the different features of the speech
signal in different domains. These features permit us to evaluate
speech quality. Each feature characterizes a physical phenomenon that
is involved in voice production, and a physical phenomenon can appear
in different domains. In this work, a set of physical phenomena that
allow voice quality to be documented correctly has been identified.
The four physical phenomena identified are:

Voice stability: this is the ability of a speaker to create an air
flow of constant intensity in order to excite the vocal folds (during
a sustained voiced sound). This physical phenomenon is quantified from
measurements of speech stability.

Spectral richness: this is the ability to generate a periodic movement
of the vocal folds (during a sustained voiced sound) and to produce a
voiced excitation of the vocal tract with a large number of spectral
components. This physical phenomenon is quantified by computing the
pitch frequency stability and the number of harmonics with high energy
in different frequency bands.

Presence of noise: this is related to the presence of glottal noise in
the speech signal during the phonation of a sustained voiced sound.
Glottal noise is due to problems in the closing phase of the vocal
folds. This physical phenomenon is quantified by measuring the
presence of nonstationary noise in the speech signal.

Vocal fold irregularities: nonlinearities in the speech system are
caused by anomalous functioning of the vocal folds. This is due to
irregularities in the masses involved in the closing phase of the
vocal folds, asymmetric movement of the vocal folds, and factors
related to the vocal fold mucus. These phenomena are quantified by
means of the nonlinear behaviour of the speech signal.

Speech is considered abnormal when at least one of the values
quantifying the four physical phenomena is outside the normal range.
This speech quality quantification procedure permits us to identify
anomalous speech qualities of diverse origins. These four kinds of
physical phenomena can be quantified in different domains, and
different objective measurements of speech quality exist, each capable
of quantifying a single physical phenomenon with more or less
accuracy.
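
The decision rule described above can be sketched in a few lines. This is only an illustration: the measure names and the numeric ranges below are hypothetical placeholders, not values from this work, and real limits would have to come from clinical data.

```python
# Hypothetical normal ranges for one measure per physical phenomenon.
# The names and limits are illustrative assumptions only.
NORMAL_RANGES = {
    "voice_stability": (0.80, 1.00),
    "spectral_richness": (0.60, 1.00),
    "noise_presence": (0.00, 0.20),
    "vocal_fold_irregularity": (0.00, 0.30),
}


def out_of_range(measurements, normal_ranges=NORMAL_RANGES):
    """Return the names of the measures that fall outside their normal range."""
    flagged = []
    for name, value in measurements.items():
        low, high = normal_ranges[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged


def is_abnormal(measurements, normal_ranges=NORMAL_RANGES):
    """Speech is flagged as abnormal if at least one measure is out of range."""
    return len(out_of_range(measurements, normal_ranges)) > 0
```

Returning the list of flagged measures, rather than only a yes/no answer, keeps track of which physical phenomenon is responsible for the anomaly.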



FUTURE TRENDS 

In general, it is impossible to identify a pathology of the phonatory
system using only a speech recording. This is stated by various
authors, and it stems from the fact that the acoustic characteristics
of two speakers with different pathologies of the phonatory system can
be similar. Even with a visual inspection of the larynx, the identity
of the pathology cannot always be determined. Furthermore, the
coexistence of more than one pathology of the phonatory system is also
frequent.

Nevertheless, several works have focused on identifying the presence
of anomalies in the phonatory system. An automatic system for
detecting anomalies in the speech system has the same block diagram as
a voice recognition system (see Figure 5).

The "Voice acquisition" block digitizes the speech signal. In this
block, the speech signal is usually discriminated from noise, and the
speech signal is also segmented into frames.

In the "Parameterization" block, the speech quality is quantified
using diverse quality measurements for each frame into which the
speech signal is divided. The quantification enables us to identify
differential characteristics among the different classification units.
In our case, the classification units are healthy and pathological
speech. In this block, each speech frame is turned into a
characteristics vector (or measurements vector). Some measurements
average certain quantifications of an acoustic characteristic or
evaluate its time evolution during the phonation.
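
As a minimal sketch of the framing and parameterization steps, the code below splits a digitized signal into overlapping frames and turns each frame into a small characteristics vector. The two measurements used here (mean energy and zero-crossing rate) are generic stand-ins chosen for brevity, not the quality measurements of this work, and the frame length and hop size are assumed values.

```python
def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def frame_features(frame):
    """Turn one frame into a characteristics vector [energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [energy, zcr]


def parameterize(samples, frame_len=400, hop=200):
    """Return one characteristics vector per frame of the signal."""
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]
```

The resulting list of vectors is what a classifier would receive as input, one vector per frame.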

An automatic classification of the characteristics vector is made in
the "Classification" block. The classification systems include Support
Vector Machines, Neural Networks, etc. In our case, each
characteristics vector is classified as either healthy speech or
pathological speech.

We propose carrying out clinical studies in order to assess the
usefulness of automatic speech quality quantification systems in
speech therapy, otolaryngology and phoniatrics. These studies will
permit the application of the proposed protocol for measuring speech
quality in fields such as the assessment of a surgical operation, the
documentation of the evolution of a treatment, medical-legal
documentation and telemedicine.

It will be possible to implement systems that automatically classify
speech as healthy or pathological from databases with different
qualities of speech, or even systems capable of automatically giving a
measurement of the level of dysphonia. These systems can be used in a
screening evaluation or in a speech therapist's evaluation.



CONCLUSION 

In this work, the different physical phenomena which characterize
voice quality have been identified. These phenomena have been
quantified in order to document voice quality correctly. The
quantification of nonlinear behaviour in the speech signal has been
introduced to describe the behaviour of the vocal folds in a more
realistic way.

Voice quality quantification allows for the implementation of systems
that help in diagnosing pathologies of the phonatory system by means
of supervised automatic recognition systems such as Support Vector
Machines (SVM) or Neural Networks (NN).

Figure 5. Block diagram of the automatic detection system of anomalies in the phonatory system

Advances in voice quantification applied to the voice synthesis field
will improve the naturalness of synthetic voices. The development of
automatic mood detection (for example, detection of sadness, anger or
happiness) is a possibility, by applying the knowledge acquired in
measurements of voice quality. With these systems it will be possible
to perceive nonverbal language. These systems can be applied to new
generations of human-computer interfaces.
KEY TERMS

Characterization or Representation Domains: These are the different
spaces into which a signal can be transformed, where certain
characteristics of this signal (levels of regularity, levels of noise,
similarities, etc.) are especially pronounced.

Diagnosis Automatic Intelligent Systems: These are systems which
enable the identification of pathological states without the presence
of a clinical expert. These systems are oriented to preventive
medicine or first screening.

Dysphonia: This is the alteration of voice quality. It is mainly
caused by laryngeal pathologies. Causes other than medical ones, such
as factors related to mood, can also produce changes in voice quality.



GRBAS: Perceptual rating of speech quality (Grade, Roughness,
Breathiness, Asthenia, Strain) by means of multidimensional clinical
measurements based on auditory methods.

Help Systems for Diagnosis: These are systems that help clinical
professionals to identify certain situations that need special
attention. They are generally used in clinical monitoring tasks.

Laryngeal Pathology: Pathology of the larynx due to different organic
injuries (such as malformations, benign injuries, inflammations,
infections, precancerous and cancerous injuries, traumas, or
endocrine, neurological and auditory injuries), to different
functional dysphonias (in spoken and sung voice), or of psychiatric
origin.

Pitch: Vibration frequency of vocal folds. In fact, 
there is not complete periodicity in the vibration of 
vocal folds. That is why it is said that vocal folds have 
a quasiperiodic movement. 
INTRODUCTION

Today, advances in Computer Science and the prolifera- 
tion of computers in modern society are an unquestion- 
able fact. Nevertheless, the continuing importance of 
orthography and the hand-written document are also 
beyond doubt. 

New technologies permit us to collect information on-line, but a large
quantity of information in our society still requires algorithms that
work off-line on existing samples. Security in certain applications
requires biometric systems for identification; in particular, bank
checks, wills, postcards, invoices, medical prescriptions, etc.,
require the identity of the person who has written them to be
verified. The only way to do this is with writer recognition
techniques.

Furthermore, many hand-written documents are 
vulnerable to possible forgeries, deformations or cop- 
ies, and generally, to illicit misuse. Therefore, a high 
percentage of routine work is carried out by experts and 
professionals in this field, whose task is to certify and 
to judge the authenticity or falsehood of handwritten 
documents (for example: wills) in a judicial procedure. 
Therefore nowadays research on writer identification 
is an active field. 

At present, some software tools enable certain characteristics to be
displayed and visualised by experts and professionals, but these
experts need to devote a great deal of time to such investigations
before they are able to draw conclusions about a given body of
writing. Therefore, these tools are not time-saving, nor do they
provide a meticulous analysis of the writing. The experts have to work
with graph paper and templates in order to obtain parameters (angles,
dimensions of the line, directions, parallelisms, curvatures,
alignments, etc.), and they have to use a magnifying glass in order to
measure angles and lines. This research aims to lighten this arduous
task.



BACKGROUND 

Writer identification is possible because each person's writing is
different and has intrinsic characteristics. The scientific basis for
this idea lies in the human brain. If we attempt to write with our
less skilful hand, some parts or strokes will be very similar to the
writing we produce with our skilful hand. This is because the brain,
and not the hand, sends the commands for carrying out the writing.
Generally, this effect is projected onto the writing by two types of
forces, which are:

Conscious or known: because it is controlled by
the individual's own free will.
Unconscious: because it escapes the control of
the individual's own free will. These forces are divided
into mechanical and emotional ones, the latter
reflecting feelings.

Everybody writes using their brain. At the same time, the handwriting
impulse, which encodes the use of space that determines the dimensions
of the writing, is adapted proportionally: the size of the text is
maintained or modified depending on whether the individual is forced
to write in a reduced space.

Nowadays, writer identification is a great challenge because this
research has not been developed as fully as identification based on
fingerprints, hands, face or iris (other biometric techniques), mainly
because the operation of the brain is very difficult to parameterize.
The above-mentioned techniques, on the other hand, use widely
researched biometric information.

Most of the characteristics implemented offer in- 
formation in the vertical and horizontal plane (Zhenyu, 
Bin, Jianwei, Yuan, & Xinge, 2005) (Zhenyu, Yuan, & 
Xinge, 2005) (Schlapbach, & Bunke, 2006) (Bulacu, 
& Schomaker, 2005). We have introduced a new pa- 
rameter, the proportionality index, which projects in 
all directions, depending on the selected points. 



OFF-LINE WRITER IDENTIFICATION 
SYSTEM 

As with the majority of the works proposed to date on biometric
recognition, the framework of the system follows the basic steps shown
in Figure 1. Image acquisition is a step prior to this system;
therefore, this is an off-line system. The data have to be scanned or
photographed in order to build our database.

Data Acquisition 

The forensic analysis of handwritten documents requires an extensive
database of handwritten samples from known writers. Therefore, samples
of different writers' writing are gathered, and several samples are
taken from each writer to account for variation over time.



The conditions under which the database is created have to be
normalized with respect to the type of paper, the pen and the writing
support, because our work is centred on the writing itself and on the
efficiency of the proposed parameters. In these off-line systems the
documents have already been produced; therefore, to build the
database, the documents have to be scanned or photographed at high
resolution. A resolution of 300 dpi in grey scale (8 bits) is a
reasonable minimum.

Image Pre-Processing and Segmentation 

The first step of the image pre-processing consists of 
utilizing Otsu's method (or another method), which 
permits us to determine the necessary grey threshold 
value to carry out the binarization of the samples 
(Otsu, 1979). 
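
As an illustration of this binarization step, the following is a minimal pure-Python sketch of Otsu's method for 8-bit grey levels. It assumes the image is given as a flat list of integer grey values and that ink is darker than paper; the function names are chosen here for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0         # sum of grey values in the dark class so far
    best_t, best_var = 0, -1.0

    for t in range(levels):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [1 if p <= threshold else 0 for p in pixels]
```

Here ink is coded as 1 and paper as 0; the opposite convention works equally well as long as the later steps agree with it.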

As a result of the binarization, in most cases the line of writing is
left with an irregular appearance. For this reason, another
pre-processing step is carried out, which smooths the line so that it
remains well defined. This also eliminates the noise remaining in the
images after the scanning process.

As a step prior to the separation of words or connected components,
the punctuation marks (full stops, accents, commas, etc.) are detected
and eliminated.

Figure 1. System of writer identification

Finally, the words which compose the lines of writing (baselines) are
segmented, and for this it is necessary to establish limits for each
of the words. For this estimation, the "Enclosed Boxes" method (Ha,
Haralick & Phillips, 1995) was used, which provides the coordinates
that allow us to segment the words. An enclosed box is defined as the
minimum rectangle that contains a connected component.
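
The definition of an enclosed box as the minimum rectangle containing a connected component can be illustrated as follows. This sketch only computes bounding rectangles of 8-connected components in a binary image (ink = 1); it is not the full procedure of Ha, Haralick & Phillips, and the image layout (a list of rows) is an assumption made for the example.

```python
from collections import deque


def bounding_boxes(image):
    """Return (top, left, bottom, right) for each 8-connected component of 1s."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first traversal of one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Boxes that lie close together on the same line could then be merged into word-level boxes, using a gap threshold that would have to be tuned on the data.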

Feature Extraction 

The calligraphic expert's task is usually to compile a statistical
list of different quantitative and qualitative measurements carried
out on the document in question, and to present this as evidence in a
trial. These include order, legibility, construction of letters,
connection, dimension, slant of the letters, spacing between words and
between characters, alignment and skew of the baseline, initial and
final strokes, continuity of the stroke, punctuation, and control and
movement of the ball-point pen.

The developed systems compare the document being tested with the
samples in the database, using digital image processing in order to
extract the features defined by the system.

We can define three different kinds of features, local, 
global and texture features (see figure 1). 

The local features examine the construction of the 
individual characters to identify details of certain let- 
ters, since it is considered that it is very difficult for 
a writer to change the way of writing her/his letters. 
One of the techniques consists of dividing into regions 
the images of the segmented letters and then for each 
region to calculate the direction of the gradient; also 
a geometric description can be obtained by analysing 
the presence of corners, diagonal, vertical and horizon- 
tal lines, direction and angle of the edges and hinges 
(Zhang, Srihari, & Lee, 2003). 

Another way to describe letters is through the coordinate pairs (x, y)
of the contour of the connected components. Each writer is considered
a generator of a finite number of basic patterns formed by these
connected components, and can therefore be characterized by the
discrete probability density function of the emission of a basic
stroke pattern (Schomaker, & Bulacu, 2004). Another, similar method
detects the morphological invariants using an automatic grapheme
classifier; in (Bensefia, Pasquet, & Heutte, 2002), the authors have
shown that the variability of the writing can be measured through
these invariants because each writer writes the same letters using
such patterns or graphemes.



The global features try to describe the properties of the writing.
They are statistical measurements extracted from the whole sample of
the handwritten document, its paragraphs, lines and words (Grening,
Sagar, & Leedham, 2005) (Tomai, Zhang, & Srihari, 2004) (Marti,
Messerli, & Bunke, 2001). In (Wirotius, Seropia, & Vicent, 2003) a
study was carried out on the distribution of gray levels in the pixels
of the stroke. The curve of evolution of these levels along sections
of the stroke was calculated, and it was observed that the symmetry
with respect to the minimum of the curve varies greatly with the
writer and with the way in which the ball-point pen is placed on the
paper.

In (Srihari, Cha, Arora, & Lee, 2002) the variation 
of gray levels is detected by means of its entropy, 
giving an idea of the pressure applied when writing. 
Another measurement that provides information of 
pressure, thickness of the stroke and size of the writing 
is to count the number of black pixels of the binarized 
image, which can also allow the movement of the ball- 
point pen when writing to be estimated indirectly, by 
means of the quantity average of internal and external 
contours. 

As the contours consist of connected pixel segments, they can be
stored as a chain code representation, whose vertical, horizontal and
diagonal components represent the formation of the stroke.
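
A chain code of this kind can be sketched as below. The code assumes an ordered contour given as (x, y) pixel coordinates with y growing downwards, where each point is an 8-neighbour of the previous one, and it uses the common Freeman numbering (0 = right, counted counter-clockwise as seen on screen). Both the numbering and the function names are assumptions made for the example.

```python
# Freeman 8-direction codes for a step (dx, dy), with y growing downwards.
DIRECTION_CODES = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}


def chain_code(contour):
    """Convert an ordered list of 8-connected contour points into chain codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTION_CODES:
            raise ValueError(f"points {(x0, y0)} and {(x1, y1)} are not 8-neighbours")
        codes.append(DIRECTION_CODES[step])
    return codes


def component_counts(codes):
    """Count horizontal, vertical and diagonal steps in a chain code."""
    counts = {"horizontal": 0, "vertical": 0, "diagonal": 0}
    for code in codes:
        if code in (0, 4):
            counts["horizontal"] += 1
        elif code in (2, 6):
            counts["vertical"] += 1
        else:
            counts["diagonal"] += 1
    return counts
```

Dividing the three counts by the total number of steps gives a simple, size-independent description of the dominant stroke directions.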

Other global features are the average slant (Bonzi- 
novic, & Srihari, 1989), localization of the baselines 
and their skew, height of the ascending, descending 
and middle body of writing (Marti, Messerli, & Bunke, 
2001) (Romero, Travieso, Alonso, & Ferrer, 2007), 
average width of the characters, behavior of the mar- 
gins, length of the words and distance between lines 
and words. 

In order to obtain the texture features, the writing sample is viewed
simply as an image and not as a manuscript. Each person's writing can
therefore be considered as a different texture, to which Gabor filters
and co-occurrence matrices, for example, can be applied (Said, Peake,
Tan & Baker, 1998).

In order for features to represent the writing style, they must fulfil
the following requirement: the fluctuations in an individual's writing
must be as small as possible, while the fluctuations among different
writers must be as great as possible. Each of these features is
evaluated to determine its discrimination index, which allows the
utility of each feature for the identification of writers to be
measured.
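
The text does not give a formula for the discrimination index. One plausible reading of the requirement above, shown here purely as an assumption, is a Fisher-style ratio: the variance of the per-writer means (fluctuation among writers) divided by the average variance within each writer (fluctuation in an individual's writing).

```python
from statistics import mean, pvariance


def discrimination_index(samples_by_writer):
    """Ratio of between-writer variance to mean within-writer variance.

    samples_by_writer maps a writer label to the list of values that one
    feature takes over that writer's samples. This definition is an
    illustrative assumption, not the formula used in the cited works.
    """
    writer_means = [mean(values) for values in samples_by_writer.values()]
    between = pvariance(writer_means)
    within = mean([pvariance(values) for values in samples_by_writer.values()])
    if within == 0:
        return float("inf")
    return between / within
```

Under this definition a larger value indicates a more useful feature, since writers differ from one another more than each writer differs from sample to sample.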

One of the biggest difficulties for automatic identification is
handling the great variability of writing styles. There is therefore
still some work to be done in the feature extraction stage, since the
purpose of this stage is to detect the discriminative features that
characterize the styles of people's writing. Up to now, the majority
of the characteristics used by the experts have not yet been
implemented algorithmically.

In this present work, a list has been created of 
geometrical parameters of different measurements to 
analyse documents. In order for the characteristics to 
represent the style of writing, they should comply with 
the following requirement: the fluctuations in the writ- 
ing of a person should be as small as possible, while 
the fluctuations among different writers should be as 
large as possible. 

This characteristic is added to the following list of characteristics
already developed (Romero, Travieso, Alonso, & Ferrer, 2006) (Hertel,
& Bunke, 2003):

length of the words,
quantity of black pixels,
estimation of the width of the letters,
height of the medium body of writing,
heights of the ascenders and descenders,
height relation between the ascenders and the medium body,
height relation between the descenders and the medium body,
height relation between the descenders and the ascenders,
height relation between the medium body and the width of writing.

The quantity of black pixels and the length of the words give us an
estimation of the dimension and thickness of the line, the width of
the letters and the height of the medium body. In addition, these are
distinctive characteristics of the style of writing.

The estimation of the width of letters is carried out by seeking the
row with the greatest number of transitions from black to white (0 to
1). The number of white pixels between each transition is counted and
this result is averaged.

To measure the height of the medium body of the words, the goal is to
determine the upper and lower baselines through maximum and minimum
values, and to measure the distance between them.

To approximate the baselines of each word, a minimum mean square error
fit was used, which is based on finding the equation (see expression
1) that best fits a set of n points (Chin, Harvey, & Jennings, 1997).
The equation is the following:

$$ y = ax + b \qquad (1) $$

where the coefficients a and b determine the linear polynomial
regression over the points (x_i, y_i), i = 1, ..., n, by means of the
following expressions:

$$ a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \qquad (2) $$

$$ b = \frac{1}{n} \left( \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i \right) \qquad (3) $$



Figure 2. Zones and baselines 
These values of a and b, computed from the coordinates of the minima
or maxima detected in the contour of the word, define the different
baselines: the minima are used to approximate the lower baseline and
the maxima the upper baseline.
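
The least-squares fit of the line y = ax + b to a set of contour extrema can be sketched as follows. The point format, a list of (x, y) tuples, and the function name are assumptions made for the example.

```python
def fit_baseline(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) points.

    Returns the slope a and the intercept b.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("need at least two points with different x values")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - a * sum_x) / n
    return a, b
```

Calling the function once with the contour minima and once with the contour maxima gives the lower and upper baselines; the slope a of either line also provides an estimate of the skew of the word.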

The proportionality index is a new parameter, both in our system and
with respect to the references. The points are selected freely but
following some guidelines; we have therefore located the most
representative sites, such as ascenders, descenders, terminations,
etc. In this paper the most representative points (red points) are
displayed in Figure 3. For the same word, the same red points are
marked for each writer.

The marked points are joined (see Figure 3), and each line between two
points is considered a segment. We have measured the Euclidean length
of each segment, obtaining a mean and a standard deviation. These are
novel parameters, which provide information from every direction of a
word.
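
The computation of these two parameters can be sketched as below. The text does not state which pairs of points are joined, so this example assumes that consecutive marked points are joined in the order given; it also uses the population standard deviation, which is another assumption.

```python
from math import dist
from statistics import mean, pstdev


def proportionality_index(points):
    """Mean and standard deviation of the segment lengths between marked points.

    points is the ordered list of (x, y) coordinates marked on a word;
    consecutive points are joined to form the segments.
    """
    if len(points) < 2:
        raise ValueError("at least two marked points are needed")
    lengths = [dist(p, q) for p, q in zip(points, points[1:])]
    return mean(lengths), pstdev(lengths)
```

Because the segments can point in any direction, the two values summarize proportions of the word that purely horizontal or vertical measurements would miss.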



Classification System 

The problem of the identification of writers can be seen 
according to two different approaches (see Figure 4); 
the first approach is the verification that allows us to 
determine whether two documents were written by the 
same person or by two different people. 

The second approach is identification, which consists of recognising a
writer among a set of N candidates. This case can be seen as a
classification problem with N classes. Due to the potentially great
number of candidates, the decision is based on the nearest neighbour
measurement; its advantage is that it identifies the writer directly.

Figure 3. Segments obtained when points are united (proportionality index)

Both approaches resort to some measurement of similarity or distance
between the samples, and the system must be trained with a set of
handwritten samples belonging to each candidate (supervised
classification). The most commonly used classification methods are
k-nearest neighbours (Hertel, & Bunke, 2003), Neural Networks (Bishop,
1995), Hidden Markov Models (Juang, & Rabiner, 1992), Gaussian Mixture
Models (Schlapbach, & Bunke, 2006), etc.
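
As a minimal sketch of distance-based identification, the following k-nearest neighbours classifier assigns a feature vector to the writer whose training samples are closest in Euclidean distance. The data layout, the choice of k and the writer labels are illustrative assumptions; in practice the features would need to be normalized so that no single one dominates the distance.

```python
from collections import Counter
from math import dist


def knn_identify(query, training, k=3):
    """Identify the writer of a feature vector by majority vote of k neighbours.

    training is a list of (feature_vector, writer_label) pairs.
    """
    ranked = sorted(training, key=lambda item: dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to the nearest neighbour decision mentioned above for identification among N candidates.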

In the following table, we can see a comparison of different methods,
showing the type of samples, the number of writers and their success
rates.
Finally, off-line writer identification is an open research field, in
which new and different methods are both improving and becoming more
widely used.
KEY TERMS

Biometric System: This is a system which identifies individuals using
behavioural or physical characteristics.

Classification System: Learning algorithm which 
generates automatic results from a features input. This 
system generally has as many outputs as classes for 
classifying. 







State of the Art in Writer's Off-Line identification 



Feature Extraction: This is a process which is used to obtain certain
characteristics which are intrinsic to an object and discriminate it
from others.

Image Pre-Processing: Set of tools applied to the 
images in order to provide other improved images for 
other tasks. 

Off-Line System: A system whose operation is based on data that have
been acquired before its operation.

On-Line System: A system whose operation is based
on data which are acquired during its operation.

Supervised Classification: This is a system that 
generates a model using training samples with labels, 
and it uses that model in order to establish an evaluation 
or test with other samples without labels. 

Writer Identification: The application of biometric 
identification by handwriting. Full texts or just several 
words can be used. 
INTRODUCTION

Over the past few years, face recognition has attracted much interest
and has become a popular area of research in computer vision and
pattern recognition. The problem attracts researchers from different
disciplines such as image processing, pattern recognition, neural
networks, computer vision, and computer graphics (Zhao, Chellappa,
Rosenfeld & Phillips, 2003).

Face recognition is a typical computer vision problem. The goal of
computer vision is to understand the images of scenes, locate and
identify objects, and determine their structures, spatial arrangements
and relationships with other objects (Shah, 2002). The main task of
face recognition is to locate and identify the people in the scene.
Face recognition is also a challenging pattern recognition problem.
The number of training samples of each face class is usually so small
that it is hard to learn the distribution of each class. In addition,
the within-class difference may sometimes be larger than the
between-class difference due to variations in illumination, pose,
expression, age, etc.

The availability of the feasible technologies brings 
face recognition many potential applications, such as 
in face ID, access control, security, surveillance, smart 
cards, law enforcement, face databases, multimedia 
management, human computer interaction, etc (Li & 
Jain, 2005). 

Traditional still image-based face recognition has achieved great
success in constrained environments. However, once the conditions
(including illumination, pose, expression, age) change too much, the
performance declines dramatically. The recent FRVT2002 (Face
Recognition Vendor Test 2002) (Phillips, Grother, Micheals, Blackburn,
Tabassi & Bone 2003) shows that the recognition performance for face
images captured in an outdoor environment and on different days is
still not satisfactory. Current still image-based face recognition
algorithms are still far from the capability of the human perception
system (Zhao, Chellappa, Rosenfeld & Phillips, 2003). On the other
hand, psychology and physiology studies have shown that motion can
help people to recognize faces better (Knight & Johnston, 1997;
O'Toole, Roark & Abdi, 2002). Torres (2004) pointed out that
traditional still image-based face recognition confronts great
challenges and difficulties. There are two potential ways to address
them: video-based face recognition technology and multi-modal
identification technology. During the past several years, many
research efforts have concentrated on video-based face recognition.
Compared with still image-based face recognition, true video-based
face recognition algorithms that use both spatial and temporal
information started only a few years ago (Zhao, Chellappa, Rosenfeld &
Phillips, 2003).

This article gives an overview of most existing 
methods in the field of video-based face recognition 
and analyses their respective pros and cons. First, a 
general statement of face recognition is given. Then, 
most existing methods for video-based face recognition 
are briefly reviewed. Some future trends and conclu- 
sions are given in the end. 



BACKGROUND 

From a general point of view, a complete video-based face recognition
system includes a face detection module, a face tracking module, a
feature extraction module and a face recognition module. Face
detection is at the bottom layer. The task of face detection is to
determine the spatial position and pose of the face(s). Face tracking
is at the middle layer. It follows the continuous change of face
position over time. Feature extraction is at a higher layer. Its task
is to locate the position of facial features such as the eyes, nose,
etc., and to extract the related information. The face recognition
module is at the top layer. It identifies or verifies the input
face(s) with the help of databases. Figure 1 gives the general
framework of a video-based face recognition system, with a flowchart
and some examples.

Figure 1. A general framework of video-based face recognition system

In this article, the focus will be on the top layer of face
recognition systems: the face recognition module. The general
statement of face recognition can be defined as: given still or video
images of a scene, identify or verify one or more persons in the scene
using a stored database of faces (Zhao, Chellappa, Rosenfeld &
Phillips, 2003). Still image-based face recognition usually refers to
the process in which the input is a still image. Video-based face
recognition, on the other hand, often refers to the process in which
the input is a video shot. The database can also consist of still
image(s) or video. Therefore, according to the different modalities of
the input and the database, four different scenarios of face
recognition can be distinguished. Table 1 shows these four scenarios.
Video-based face recognition usually refers to both "Video - Image(s)"
face recognition and "Video - Video" face recognition, that is, face
recognition with video input.

Compared with still image-based face recognition, video-based face
recognition can utilize the temporal and spatial information available
in the video. It is widely believed that video-based face recognition
is more promising than still image-based face recognition. However,
there are also some difficulties in video-based face recognition, such
as low-resolution face images, large variations of scale, illumination
change, pose change, and occasional occlusion in the video. It is
worth noting that if the time information of the video is not
considered, video-based face recognition becomes face recognition with
multiple still images as input.



VIDEO-BASED FACE RECOGNITION

According to the classification shown in Table 1, four scenarios of
face recognition will be reviewed separately. The emphasis will be
placed on "Video - Image(s)" face recognition and "Video - Video" face
recognition. For simplicity, the position of the face in the video is
assumed to be known in advance.



"Image - Image(s)" Face Recognition 

"Image - Image(s)" face recognition is the tradi- 
tional still image-based face recognition. Numerous 
still image-based face recognition methods have been 
developed during the past few decades (Zhao, Chel- 
lappa, Rosenfeld & Phillips, 2003). Among them, global 
feature matching methods, such as Eigenface (Turk &
Pentland, 1991), Fisherface (Belhumeur, Hespanha &
Kriegman, 1997) and Bayesian (Moghaddam, Jebara
& Pentland, 2000); and local feature matching methods,
such as Elastic Bunch Graph Matching (EBGM)
(Wiskott, Fellous, Krüger & Malsburg, 1997), are the
most widely used face recognition approaches. More recently,
3D deformable models (Blanz & Vetter, 2003) and Local
Binary Patterns (LBP) (Ahonen, Hadid & Pietikainen,
2006) have emerged as new methods. Traditional still
image-based face recognition has been widely used in
biometric authentication, information security, etc.

Table 1. Four different scenarios of face recognition

Input \ Database    Still-Image(s)      Video
Still-Image         Image - Image(s)    Image - Video
Video               Video - Image(s)    Video - Video

"Image - Video" Face Recognition 

"Image - Video" face recognition is to identify or 
verify a given face in the stored video sequences. "Image 
- Video" face recognition is also called human face-based
video retrieval. Typical scenarios include finding
suspects in recorded surveillance video, or finding
a person in film or news video from a given face
image. In principle, video preprocessing such as shot
extraction is performed first. Then, face detection
and tracking are performed to obtain the video shot of
every face. Face recognition is conducted in the last
step. Because the scenes in such videos (film, news,
surveillance video) are complex, most of the literature
focuses on the video preprocessing phase (Arandjelovic &
Zisserman, 2005b). In recent years, some researchers have
applied 3D models to retrieving people in television
video (Everingham & Zisserman, 2004).

"Video - Image(s)" Face Recognition 

"Video - Image(s) " face recognition can be formulated 
as follows: given a shot of video, identify or verify the 
face inside by using a still-image(s) database. With
the widespread use of video acquisition hardware,
many video sequences are available in applications such as
security authentication and video surveillance. At the
same time, most existing databases contain still images
only. Therefore, making better use of the input
video is of great value in real applications.

Traditional approaches can be roughly divided into two categories.
The first performs face tracking until a facial image satisfies
certain rules (such as size and pose) and then applies traditional
still image-based face recognition methods. Its disadvantages are the
difficulty of defining the rules and the failure to use all the
information in the video. The second performs still image-based face
recognition for each tracked face and combines the recognition
results using a combining rule, for example maximum cumulative
probability or a majority vote. Its disadvantage is the ad hoc choice
of the combining rule.
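
The two combining rules mentioned above can be sketched as follows. This is a minimal illustration, not a method from the cited literature: the per-frame scores, the identity names and the reading of "cumulative probability" as a sum of log-probabilities over frames are all assumptions made for the example.

```python
import math
from collections import Counter

def majority_vote(frame_scores):
    """Return the identity that wins the largest number of frames."""
    winners = [max(scores, key=scores.get) for scores in frame_scores]
    return Counter(winners).most_common(1)[0][0]

def max_cumulative_probability(frame_scores, eps=1e-12):
    """Return the identity with the largest summed log-probability.

    Assumption: frames are treated as independent, so per-frame
    probabilities are accumulated in the log domain.
    """
    totals = {}
    for scores in frame_scores:
        for identity, p in scores.items():
            totals[identity] = totals.get(identity, 0.0) + math.log(p + eps)
    return max(totals, key=totals.get)

# Hypothetical per-frame recognition scores for two enrolled identities.
frames = [
    {"alice": 0.55, "bob": 0.45},
    {"alice": 0.60, "bob": 0.40},
    {"alice": 0.01, "bob": 0.99},
]

print(majority_vote(frames))               # identity chosen by voting
print(max_cumulative_probability(frames))  # identity chosen by accumulation
```

On this toy input the two rules select different identities, which illustrates why the choice of combining rule matters.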

In recent years, some researchers have tried to make use of both
the temporal and the spatial information in the video. Zhou et
al. (2003) proposed a Bayesian framework for face recognition
and tracking that attempts to resolve the uncertainties in
tracking and recognition simultaneously. A time series state
space model, which characterizes the kinematics using a motion
vector and the identity using an identity variable, is employed
to fuse temporal information. The joint posterior distribution
of the motion vector and the identity variable is estimated at
each time instant and then propagated to the next time instant.
Marginalization over the motion vector yields a robust estimate
of the posterior distribution of the identity variable. The
sequential importance sampling (SIS) algorithm is used to
estimate the posterior distribution. SIS approximates the
posterior density function by a set of random particles with
associated weights. The reported experimental results show the
effectiveness of the algorithm.
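
The weighting and marginalization steps described above can be sketched with a toy model. This is an illustration under strong assumptions, not the model of Zhou et al.: the motion state is a single scalar following a random walk, each identity is represented by one scalar template value, the observation is a hypothetical (position, appearance) pair with Gaussian noise, and no resampling is performed.

```python
import math
import random

def gaussian(x, mean, sigma):
    """Gaussian density, used here as the observation likelihood."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def sis_identity_posterior(observations, templates, n_particles=500, seed=0):
    """Estimate p(identity | observations) by sequential importance sampling."""
    rng = random.Random(seed)
    # Each particle carries a motion value and an identity index.
    particles = [(rng.uniform(-1.0, 1.0), rng.randrange(len(templates)))
                 for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for z_pos, z_app in observations:
        # Propagate: random-walk motion, identity fixed over time.
        particles = [(theta + rng.gauss(0.0, 0.1), n) for theta, n in particles]
        # Re-weight each particle by the likelihood of the new observation.
        weights = [w * gaussian(z_pos, theta, 0.5) * gaussian(z_app, templates[n], 0.5)
                   for w, (theta, n) in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # Marginalize over the motion variable to get the identity posterior.
    posterior = [0.0] * len(templates)
    for w, (_, n) in zip(weights, particles):
        posterior[n] += w
    return posterior

# Hypothetical data: the face drifts to the right and looks like identity 1.
templates = [0.0, 1.0, 2.0]
observations = [(0.1 * t, 1.0) for t in range(5)]
posterior = sis_identity_posterior(observations, templates)
print(posterior)
```

Because the weights are never resampled, they concentrate on few particles as frames accumulate, which is why practical particle filters usually add a resampling step.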

Li et al. (2001, 2002) applied facial features track- 
ing and face tracking for verification. The basic idea 
is that if the input is a true face (corresponding to the 
identity in the database), the tracking trajectories of 
the facial features or face appearance are basically 
the same. The corresponding mathematical model is 
that the distribution of motion vector will have a peak 
when the face is a true input. Otherwise, the input is 
an imposter. SIS is also applied to the posterior prob- 
ability distribution of state variables. However, the 
estimated probability density needs a large number of 
particles to characterize the distribution. As a result, 
the complexity of the algorithm is increased. 

"Video - Video" Face Recognition

"Video - Video" face recognition refers to the cases in
which both the input and the database are shots of video.
Depending on how the information in the video is used, the
existing literature represents a shot of video by one of the
following description methods:

1. A vector (corresponding to one frame of the video).
2. A matrix (corresponding to all frames of the video).
3. A probability density function (PDF).
4. A dynamic model.
5. A manifold.

Based on the above description methods, "Video - Video" face
recognition becomes a matter of matching two descriptions.
Table 2 shows all possible similarity measures (distances)
between two description methods. In Table 2, d stands for the
distance or similarity between two models, f(X) stands for
probability addition, M(X) stands for majority voting, and D(X)
stands for the posterior probability.

Some representative methods of "Video - Video" face recognition
are briefly introduced below. Torres et al. (2002) created a
person-specific Principal Component Analysis (PCA) subspace for
each face in the database. The residual distance between the
face in one frame and the PCA subspace is used as the similarity
measure for video indexing. McKenna et al. (1997) employed a
Gaussian Mixture Model (GMM) in the reduced PCA subspace to
describe each face class. The posterior probability of each face
in each frame is computed, and the cumulative probability is
used as the similarity measure. Yamaguchi et al. (1998)
established a PCA subspace for both the input and the database
video. The distance between the two subspaces is determined by
the angle between them. To better handle changes of
illumination, gestures, facial expressions, etc., Fukui &
Yamaguchi (2003) further proposed the constraint subspace, which
includes only the components that are effective for recognition.
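
The residual distance used as the similarity measure in the first method above can be sketched as follows. The sketch assumes that an orthonormal basis of each person-specific subspace has already been computed elsewhere (for example by PCA); the function names, the three-dimensional toy vectors and the subspace labels are hypothetical.

```python
import math

def residual_distance(x, mean, basis):
    """Distance from x to the affine subspace mean + span(basis).

    `basis` is assumed to be a list of orthonormal vectors.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    residual = centered[:]
    for b in basis:
        # Remove the component of the centered vector along each basis vector.
        coeff = sum(ci * bi for ci, bi in zip(centered, b))
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return math.sqrt(sum(r * r for r in residual))

def nearest_subspace(x, subspaces):
    """Return the label of the subspace with the smallest residual distance."""
    return min(subspaces, key=lambda name: residual_distance(x, *subspaces[name]))

# Hypothetical one-dimensional subspaces in a three-dimensional feature space.
subspaces = {
    "person_a": ([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0]]),
    "person_b": ([0.0, 0.0, 0.0], [[0.0, 1.0, 0.0]]),
}
frame = [2.0, 0.1, 0.0]
print(nearest_subspace(frame, subspaces))
```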

Arandjelovic et al. (2005a) used a GMM to learn the
face distribution. The basis of the approach is the
semi-parametric estimation of probability densities confined
to intrinsically low-dimensional, but highly nonlinear,
face manifolds embedded in the high-dimensional image
space (Arandjelovic, Shakhnarovich, Fisher, Cipolla &
Darrell, 2004). The Kullback-Leibler divergence is
adopted as the similarity measure.
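
As a simplified illustration of the Kullback-Leibler divergence as a dissimilarity between densities, the sketch below uses the closed form for two univariate Gaussians. This is an assumption made for brevity: the divergence between two Gaussian mixtures has no closed form and is usually approximated numerically, so this is not the computation used in the cited work.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

def symmetric_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Symmetrized divergence, since KL itself depends on argument order."""
    return (kl_gaussian(mu_p, sigma_p, mu_q, sigma_q)
            + kl_gaussian(mu_q, sigma_q, mu_p, sigma_p))

print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # same spread, shifted mean
print(kl_gaussian(0.0, 1.0, 0.0, 2.0))  # differs from the reversed order
print(kl_gaussian(0.0, 2.0, 0.0, 1.0))
```

The divergence is zero only for identical densities and is not symmetric, which is why a symmetrized variant is sometimes preferred when it is used as a distance.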

Zhou et al. (2003) used the probabilistic model described
in the previous section. Exemplar-based learning
is adopted to automatically select video representatives.
The exemplar index is also employed as the state vector. 
The joint probability density distribution is estimated 
by sequential importance sampling. Finally, the iden- 
tity variable is calculated by marginalization. Liu and 
Chen (2003) proposed a video-based face recognition 
algorithm based on Hidden Markov Model (HMM) 
which incorporates both the temporal and spatial in- 
formation. Lee et al. (2003, 2005) approximated face 
manifolds by a finite number of linear subspaces and 
used temporal information to robustly estimate the 
dynamics of the linear subspaces. 

Li et al. (2001a, 2001b) employed a manifold
to represent a shot of video. A 3D shape model is
built from 2D images, together with a shape-and-pose-free
texture model and an affine geometrical model. Then, Kernel
Discriminant Analysis (KDA) is performed to extract
the nonlinear discriminating features. The identity
surfaces are then constructed from these discriminating
features. Face recognition is performed by computing
the trajectory distance between the input and database
video trajectories.



Table 2. Similarity measures (distances) between two description methods

Input \ Database     Vector(x)   Matrix(X)   Probability(f)   Dynamic Model(D)   Manifold(M)
Vector(x)            d(x, x)     d(x, X)     f(x)             D(x)               M(x)
Matrix(X)            d(X, x)     d(X, X)     f(X)             D(X)               M(X)
Probability(f)       f(x)        f(X)        d(f, f)          -                  -
Dynamic Model(D)     D(x)        D(X)        -                d(D, D)            d(D, M)
Manifold(M)          M(x)        M(X)        -                d(M, D)            d(M, M)


Table 3. Typical algorithms for "Video - Video" face recognition

Authors               Input Description   Database Description                   Measure
Torres et al.         Vector(x)           PCA subspace(X)                        residual error, d(x, X)
McKenna et al.        Matrix(X)           GMM(f)                                 cumulative probability, f(X)
Yamaguchi et al.      Matrix(X)           PCA subspace(X)                        angle distance, d(X, X)
Arandjelovic et al.   PDF(f)              GMM(f)                                 K-L divergence, d(f, f)
Zhou et al.           Dynamic Model(D)    Exemplars(X)                           posterior probability, D(X)
Liu et al.            Dynamic Model(D)    HMM(D)                                 posterior probability, d(D, D)
Lee et al.            Dynamic Model(D)    Finite number of linear subspaces(M)   posterior probability, d(D, M)
Li et al.             Manifold(M)         Manifold(M)                            trajectory distance, d(M, M)

Some characteristics of the above reviewed 
algorithms are listed in Table 3. 



FUTURE TRENDS 

Video-based face recognition has been actively stud- 
ied in recent years. How to better exploit both spatial 
and temporal information in the video sequence is the 
focus point. 

An individual face manifold under various changes
(such as expression, pose, illumination, etc.) is
non-convex and nonlinear. Effective features that can
discriminate between different classes and tolerate within-class
variations are the key to both still image-based and
video-based face recognition.

Another trend is to generate a 3D face model from
video; see Zhang, Liu, Dennis, Cohen, Hanson, &
Shan (2004) for an example. A 3D face model can
overcome the problems caused by large changes of pose
and illumination. However, the complexity of a 3D
model is high.



CONCLUSION 

In this article, based on a classification of the different
scenarios of face recognition, four groups of techniques are
reviewed: "Image - Image(s)", "Image - Video", "Video - Image(s)"
and "Video - Video" face recognition. Most existing methods of
video-based face recognition are surveyed, together with their
respective advantages and disadvantages. Some trends in
video-based face recognition are summarized. In the future, these
approaches will be further investigated to drive more
applications.



KEY TERMS

Biometric Authentication: Technologies that rely on
physical characteristics unique to each person
to ascertain the identity of an individual.

Face Detection: A computer technology that
determines the locations and sizes of human faces in
digital images.

Face Recognition: Given still or video images of
a scene, identify or verify one or more persons in the
scene using a stored database of faces.

Face Tracking: A computer technology that determines
the continuous location of the face(s) in each
frame of the image sequence.

Human Face Based Video Retrieval: A process
in which video sequences are searched to find the face
shots that match a query face image.

Particle Filters: Also known as Sequential Monte Carlo
(SMC) methods, these are sophisticated model estimation
techniques based on simulation.

Sequential Importance Sampling: A very common
particle filter algorithm that approximates
probability density functions by a set of random samples
with associated weights.

Video-Based Face Recognition: Given a video
containing face(s), identify or verify one or more
persons using a stored database.



INTRODUCTION

The optimization of a cost function that has a number
of local minima is a relevant subject in many important
fields. For instance, the determination of the weights of
learning machines depends in general on the solution
of global optimization tasks (Haykin, 1999). A feature
shared by almost all of the most common deterministic
and stochastic algorithms for continuous nonlinear
optimization is that their performance is strongly
affected by their starting conditions. Depending on the
algorithm, the correct selection of an initial point or set
of points has direct consequences for the efficiency of the
search, or even for the possibility of finding the global minima.
Of course, an adequate selection of seeds implies prior
knowledge of the structure of the optimization task.
In the absence of prior information, a natural choice
is to draw seeds from a uniform density defined over
the search space. Knowledge about the problem can then be
gained through the exploration of this space.

This contribution presents a method to estimate
probability densities that describe the asymptotic
behavior of general stochastic search processes over
continuously differentiable cost functions. The relevance
of such densities is that they describe the
residence times over the different regions of the search
space after an infinitely long exploration. The preferred
regions are those that minimize the cost globally,
which is reflected in the asymptotic densities. In the first
instance, the resulting densities can be used to draw
populations of points that are consistent with the global
properties of the associated optimization tasks.



BACKGROUND

Stochastic strategies for optimization are essential
to most of the heuristic techniques used to deal with
complex, unstructured global optimization problems
(Pardalos, 2004). The roots of such methods can be
traced back to the Metropolis algorithm (Metropolis,
Rosenbluth, Rosenbluth, Teller & Teller, 1953),
introduced in the early days of scientific computing to
simulate the evolution of a physical system to thermal
equilibrium. This process is the basis of the simulated
annealing technique (Kirkpatrick, Gelatt & Vecchi,
1983), which makes use of the convergence to a global
minimum in configurational energy observed in physical
systems at thermal equilibrium as the temperature
goes to zero.

The method presented in this contribution is rooted
in physical principles similar to those on which simulated
annealing type algorithms are based. However,
in contrast with other approaches (Suykens, Verrelst &
Vandewalle, 1998) (Gidas, 1995) (Parpas, Rustem &
Pistikopoulos, 2006), the proposed method considers
a density of points instead of Markov transitions of
individual points. The technique is based on the interplay
between the Langevin and Fokker-Planck frameworks
for stochastic processes, which is well known in the
study of out-of-equilibrium physical systems (Risken,
1984) (Van Kampen, 1992). The Fokker-Planck equation
has already been proposed for use in search
algorithms in several contexts. For instance, it has been
used to directly study the convergence of populations
of points to global minima (Suykens, Verrelst &
Vandewalle, 1998), as a tool to demonstrate the convergence
of simulated annealing type algorithms (Parpas, Rustem
& Pistikopoulos, 2006) (Geman & Hwang, 1986), or
as a theoretical framework for Boltzmann type learning
machines (Movellan & McClelland, 1993)
(Kosmatopoulos & Christodoulou, 1994). In the context of
global optimization by populations of points, it has been
proposed that the populations evolve under the
time-dependent version of the Fokker-Planck equation,
following a schedule for the reduction of the diffusion
constant D (Suykens, Verrelst & Vandewalle, 1998).
In our approach, the stationary version of the
Fokker-Planck equation is used to learn the long-term
probability density of a general stochastic search
process. This is achieved using linear operations and
a relatively small number of evaluations of the given
cost function.



STATIONARY DENSITY ESTIMATION
ALGORITHM

Consider the minimization of a cost function of the
form $V(x_1, x_2, \ldots, x_n, \ldots, x_N)$ with a search space
defined over $L_{1,n} \le x_n \le L_{2,n}$. A stochastic search
process for this problem is modeled by

$$\frac{dx_n}{dt} = -\frac{\partial V}{\partial x_n} + \varepsilon(t) \tag{1}$$

where $\varepsilon(t)$ is an additive noise with zero mean.
Equation (1), known as the Langevin equation in the statistical
physics literature (Risken, 1984) (Van Kampen, 1992),
captures the basic properties of a general stochastic
search strategy. Under an uncorrelated Gaussian noise
with constant strength, Eq. (1) represents a search by
diffusion, while a noise strength that varies slowly
in time gives a simulated annealing process. Notice
that with an external noise of infinite amplitude,
the dynamical influence of the cost function on the
exploration process is lost, leading to a blind search.
The model given by Eq. (1) can be interpreted as a
nonlinear dynamical system composed of N interacting
particles. The temporal evolution of the probability
density of such a system is described by a linear
differential equation, the Fokker-Planck equation (Risken,
1984) (Van Kampen, 1992),

$$\frac{\partial p}{\partial t} = \frac{\partial}{\partial x}\left[\frac{\partial V}{\partial x}\, p\right] + D\,\frac{\partial^2 p}{\partial x^2} \tag{2}$$

where D is the diffusion constant.

The approach proposed in this article is based on the
notion of an infinitely long exploration of the search
space. In the present model setup for the search, the
process converges to a state described by the stationary
solution of Eq. (2) (Berrones, 2007). The form of this
solution is of the well-known Boltzmann type (Risken,
1984) (Van Kampen, 1992). For optimization or deviate
generation purposes, its direct use would imply
a high computational cost. Instead, a form of Gibbs
sampling is proposed in order to estimate the marginal
probability density $p(x_n)$ (the details of the following
discussion can be found in (Berrones, 2007)). The
one-dimensional projection of Eq. (2) at $t \to \infty$ leads to
the following equation for the conditional cumulative
distribution $y(x_n \mid \{x_{j \neq n}\})$:

$$\frac{d^2 y}{dx_n^2} + \frac{1}{D}\,\frac{\partial V}{\partial x_n}\,\frac{dy}{dx_n} = 0, \qquad y(L_{1,n}) = 0, \quad y(L_{2,n}) = 1 \tag{3}$$

Therefore, the estimation of the analytical form of
$y(x_n \mid \{x_{j \neq n}\})$ can be achieved by the substitution of
the expansion

$$y = \sum_{l} a_l\, \varphi_l(x_n) \tag{4}$$

into Eq. (3). The distribution obtained in this way can
be used to draw points from the conditional density
$p(x_n \mid \{x_{j \neq n}\})$. According to the principles of Gibbs
sampling (Geman & Geman, 1984), the iteration of
the previous steps over the N variables will produce a
population sampled from the corresponding marginal
densities $p(x_n)$. However, in our setup all the information
needed to characterize the densities is contained
in the coefficients of the expansion (4). In this way, the
stationary marginal densities associated with the N
variables of the optimization problem are learned through
the averages of the coefficients over the iterations of the
random deviate generation process. We call this basic
procedure a Stationary Density Estimation Algorithm
(SDEA). We have also named the method the Stationary
Fokker-Planck Machine (SFPM) in (Berrones, 2007),
in order to indicate its relation to other methods
(Suykens, Verrelst & Vandewalle, 1998) that make
use of the Fokker-Planck equation to learn statistical
features of stochastic search processes. However, in
(Suykens, Verrelst & Vandewalle, 1998) the Fokker-Planck
equation is used to study the evolution of finite
populations of points from out-of-equilibrium states.
This contrasts with our approach, which estimates the
equilibrium densities on the entire search space.

As an example, the SFPM algorithm is tested on the
Levy No. 5 function, an important benchmark problem
with about 760 local minima and one global optimum
(Parsopoulos & Vrahatis, 2002),

$$f(x) = \sum_{i=1}^{5} i\cos\big((i-1)x_1 + i\big)\,\sum_{j=1}^{5} j\cos\big((j+1)x_2 + j\big) + (x_1 + 1.42513)^2 + (x_2 + 0.80032)^2 \tag{5}$$

with a search space given by the hypercube $[-10, 10]^2$.
The direct implementation of a stochastic search through
Eq. (1) would imply the simulation of a stochastic
dynamical system composed of two particles with
highly nonlinear interactions. With our methodology, in
contrast, we are able to obtain adequate densities by
linear operations and a moderate number of
evaluations of the cost function. Fig. 1 shows the densities
generated by 10 iterations of the estimation algorithm
with parameters L = 50 and D = 200. The
obtained densities are consistent with the
global properties of the problem, since the known global
optimum at the point (-1.3068, -1.4248) is contained in
the regions with the highest probability. The computational
effort is low in terms of the required number of
cost function evaluations, given by $2(L-1)MN = 1960$
for L = 50, M = 10 iterations and N = 2 variables.
This is comparable to the effort needed by advanced
population-based techniques in order to obtain good
quality solutions for the same problem (Parsopoulos &
Vrahatis, 2002). Our approach, however, is not limited
to convergence to good solutions: it estimates
entire densities. The implications of this for, for instance,
the definition of probabilistic optimality criteria are
currently under research by us.
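
The benchmark and the Boltzmann-type conditional distribution can be illustrated with a short sketch. It evaluates the Levy No. 5 function of Eq. (5) and approximates one conditional cumulative distribution by direct numerical integration of a density proportional to exp(-V/D) on a grid. This brute-force quadrature is only an illustration and is not the basis-expansion procedure of the SFPM algorithm; the function names, the grid size and the choice of fixing the second variable at -1.4248 are assumptions.

```python
import math

def levy5(x1, x2):
    """Levy No. 5 benchmark function of Eq. (5)."""
    s1 = sum(i * math.cos((i - 1) * x1 + i) for i in range(1, 6))
    s2 = sum(j * math.cos((j + 1) * x2 + j) for j in range(1, 6))
    return s1 * s2 + (x1 + 1.42513) ** 2 + (x2 + 0.80032) ** 2

def conditional_cdf(cost_1d, lo, hi, diffusion, n_grid=2001):
    """Grid approximation of a CDF whose density is proportional to exp(-V/D)."""
    step = (hi - lo) / (n_grid - 1)
    xs = [lo + k * step for k in range(n_grid)]
    vs = [cost_1d(x) for x in xs]
    v_min = min(vs)  # shift the cost for numerical stability
    dens = [math.exp(-(v - v_min) / diffusion) for v in vs]
    cdf = [0.0]
    for k in range(1, n_grid):
        # Trapezoid rule for the running integral of the density.
        cdf.append(cdf[-1] + 0.5 * (dens[k - 1] + dens[k]) * step)
    total = cdf[-1]
    return xs, [c / total for c in cdf]

# Conditional distribution of x1 with x2 held fixed (hypothetical choice).
xs, ys = conditional_cdf(lambda x1: levy5(x1, -1.4248), -10.0, 10.0, diffusion=200.0)
print(levy5(-1.3068, -1.4248))  # cost at the known global optimum
print(ys[0], ys[-1])            # boundary values of the distribution
```

The resulting curve satisfies the boundary conditions of a cumulative distribution on the search interval, so points can be drawn from the conditional density by inverting it.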



FUTURE TRENDS 

In our opinion, the theory and results presented so far
have the potential to considerably enrich the tools for
global optimization. The characterization of optimization
problems in terms of reliable probability densities
may open the door to new insights into global
optimization through the use of probabilistic and
information-theoretic concepts. From a more practical standpoint,
the proposed methodology may be implemented in a
variety of ways in order to improve existing optimization
algorithms or to construct new ones.



CONCLUSION 

This work presents a methodology to estimate the probability
density function of optimization problems with
a continuously differentiable cost function, using linear
operations and a moderate number of evaluations of
the cost function. The generalization to constrained
problems appears to be straightforward. This is expected,
taking into account that the proposed method makes
use of linear operations only. In this way, constraints
may enter into Eq. (1) as additional nonlinear terms,
with no essential increase in computational cost. For
instance, combinations of sigmoidal functions can be
used to represent the constraints as forces
produced by energy barriers.

Figure 1. Probability densities $p(x_1)$ and $p(x_2)$, respectively, generated by 10 iterations of the stationary density
estimation algorithm for the Levy No. 5 function. The parameters of the algorithm are L = 50 and D = 200. The
global optimum is in the region of maximum probability.



KEY TERMS

Configurational Energy: The potential
energy associated with the various forces within the
elements of a physical system.

Diffusion Constant: Measures the degree of randomness
in a diffusion process. The diffusion constant
is proportional to the mean square distance moved by
particles under diffusion in a given time interval.

Diffusion Process: Random displacement of
particles in a physical system due to the action of
temperature.

Gibbs Sampler: A procedure to sample the marginal
densities of a high-dimensional distribution using
one-dimensional conditional probabilities.

Heuristic: Any algorithm that finds a good-quality
solution to a problem in a reasonable run time.

Learning Machine: This term refers to the development
of techniques for the automatic extraction of
patterns from massive data sets and to the construction
of deductive rules. In the context of this article, the
concept deals with the automatic learning of densities
in global optimization problems.

Random Deviate Generation Process: A process
that generates random numbers according to a specific
probability distribution.

Search Space: The set of all the feasible
solutions of an optimization problem.

Stochastic Search: An optimization algorithm
that incorporates randomness in its exploration of
the search space.

Thermal Equilibrium: State in which a physical
system is described by probability measures that are
time independent.



INTRODUCTION

A language model is a description of language. Although
grammar has long been the prevalent tool for modelling
language, interest has recently shifted
towards statistical modelling. This chapter refers to
speech recognition experiments, although statistical
language models are applicable to a wide range
of applications: machine translation, information
retrieval, etc.

Statistical modelling attempts to estimate the frequency
of word sequences. If a sequence of words is
$s = w_1 w_2 \ldots w_k$, its probability can be expressed as:

$$P(s) = P(w_1 w_2 \ldots w_k) = \prod_{i=1}^{k} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{k} P(w_i \mid w_{i-n+1} \ldots w_{i-1}).$$

It is reasonable to simplify this computation by
approximating the word sequence generation as an
(n-1)-order Markov process (Jelinek, 1998). Bigram
(n=2) and trigram (n=3) models are common choices.
Although the context has been limited, such models
still have a vast number of probabilities that need to be
estimated. The text available for building the model
is called the 'training corpus' and typically contains
many millions of words. Unfortunately, even in a very
large training corpus, many of the possible n-grams
are never encountered. This problem is addressed by
smoothing techniques (Chen & Goodman, 1996).
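
A bigram model with a simple form of smoothing can be sketched as follows. Add-one (Laplace) smoothing is used here only because it is the simplest technique to show; it is an illustrative choice rather than one recommended by the cited work, and the toy corpus, function names and sentence boundary tokens are assumptions.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect history and bigram counts from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # counts of each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of adjacent pairs
    # Words that can be predicted: every corpus word plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

# Hypothetical toy training corpus.
corpus = ["the cat sat", "the dog sat"]
unigrams, bigrams, vocab = train_bigram(corpus)

print(bigram_prob("the", "cat", unigrams, bigrams, vocab))  # seen bigram
print(bigram_prob("cat", "dog", unigrams, bigrams, vocab))  # unseen bigram, still non-zero
```

With smoothing, a bigram that never occurs in the training corpus still receives a small non-zero probability, while the probabilities for each history continue to sum to one.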

Which is the best modelling unit? Words are a common
choice, but units smaller (or larger) than words
can also be used. Word-based n-grams are best suited to
modelling the English language (Jelinek, 1998). Inflective
languages have several characteristics that harm
the predictive power of standard models.



In general, all Indo-European languages are inflective 
but a serious problem arises regarding languages which 
are inflected to a greater extent (e.g. Russian, Czech, 
Slovenian). Agglutinative languages (e.g. Hungarian, 
Finnish, Estonian) have even more complex inflectional 
grammar where, besides inflections, compound words 
are a big problem. Inflective languages add inflectional 
morphemes to words. Inflectional morphemes indicate 
the grammatical information of a word (for example 
case, number, person, etc.). Inflectional morphemes are 
commonly added by affixing, which includes prefixing 
(adding a morpheme before the base), suffixing (adding
it after the base) and, much less commonly, infixing
(adding it inside the base). A high degree of affixation
contributes to an explosion of different word forms,
making it difficult, or even impossible, to estimate
language model probabilities robustly. Rich morphology leads
to high OOV (Out-Of-Vocabulary) rates, so data sparsity
is the main problem.

This chapter focuses on modelling unit choice for 
inflective languages with the aim of reducing data 
sparsity. Linguistic and data-driven approaches were 
analyzed for this purpose. 



BACKGROUND 

Class-Based Language Models 

Some words are similar in their morphological, syn- 
tactic or semantic functions. In class-based language 
models, similar words are grouped into classes in order 
to improve the robustness of parameter estimation: 

$$P(w_i \mid w_{i-1}) = P(w_i \mid c(w_i))\, P(c(w_i) \mid c(w_{i-1}))$$

Here c denotes the deterministic mapping of words into
classes. A non-deterministic mapping, in which one word
can belong to several classes, can also be used.
Another applicable model conditions the word directly
on the classes of the previous words. The idea
behind class-based models is parameter-set reduction:
there are far fewer free parameters to estimate in a
class-based model than in a word-based model.
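
The class-based bigram formula above can be sketched with maximum likelihood estimates. The word-to-class mapping, the toy corpus and the function names are hypothetical, and no smoothing is applied, so the sketch fails for a history class that never has a successor in the training data.

```python
from collections import Counter

def train_class_bigram(sentences, word_class):
    """Count words, classes and class bigrams under a deterministic mapping."""
    word_counts, class_counts, class_bigrams = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        classes = [word_class[w] for w in words]
        word_counts.update(words)
        class_counts.update(classes)
        class_bigrams.update(zip(classes[:-1], classes[1:]))
    return word_counts, class_counts, class_bigrams

def class_bigram_prob(prev, word, word_class, model):
    """P(word | prev) = P(word | c(word)) * P(c(word) | c(prev))."""
    word_counts, class_counts, class_bigrams = model
    c_prev, c_word = word_class[prev], word_class[word]
    p_word_given_class = word_counts[word] / class_counts[c_word]
    # Number of times the history class is followed by any class.
    history = sum(n for (a, _), n in class_bigrams.items() if a == c_prev)
    p_class_given_class = class_bigrams[(c_prev, c_word)] / history
    return p_word_given_class * p_class_given_class

# Hypothetical mapping of words to part-of-speech-like classes.
word_class = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB", "ran": "VERB"}
corpus = ["the cat sat", "the dog ran", "the dog sat"]
model = train_class_bigram(corpus, word_class)

# "cat ran" never occurs in the corpus, yet it receives a non-zero probability.
print(class_bigram_prob("cat", "ran", word_class, model))
```

The word pair in the example is unseen, but its class pair is frequent, which is how class-based models share statistics between similar words.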

Words in the same class are similar in a certain way. 
This similarity can be defined, based on certain exter- 
nal knowledge or statistical criterion. The best known 
example of clustering using linguistic knowledge is 
clustering by POS (Part Of Speech). Eight POSs are 
defined in traditional English grammar: noun, verb, 
adjective, adverb, pronoun, preposition, conjunction, 
and interjection. This set of classes is, however, too 
small for modelling inflective languages. Those classes 
that reflect additional grammatical features (gender, 
case, number, tense, etc.) are more suitable. 

Linguistic classes have been examined for several languages
that are inflective to varying degrees. A language
model for French combined POS classes with a component
based on lemmas (El-Beze & Derouault, 1990).
In a language model for Czech, words were clustered
into 410 morpho-syntactic classes (Nouza & Nouza,
2004). In another experiment for Czech, 1300 classes
were used (Kolar, Svec & Psutka, 2004). Class-based
models with linguistic classes also proved to be successful
for Spanish (Casillas, Varona & Torres, 2004).

Data-driven classes are derived automatically
by statistical means. IBM pioneered this approach
(Brown, de Souza, Mercer, Della Pietra & Lai, 1992).
In their approach, words are clustered using a greedy 
algorithm that tries to minimize the loss of mutual in- 
formation between classes incurred during the merge. 
The number of classes must be defined in advance. The 
algorithm continues to merge pairs of classes until the 
desired number of classes has been obtained. Another 
greedy approach uses the exchange algorithm (Martin, 
Liermann & Ney, 1995). Each word is moved from its 
class to another one if it maximizes mutual information 
between classes. 

Data-driven class-based language models have been 
built for many inflective languages. For French they 
show improved performance on small and large corpora 
(Zitouni, 2002). The results have been improved by us- 
ing a hierarchical language model with variable-length 
class sequences, based on 233 grammatical classes. In 
experiments on the Russian language, the best results
were obtained by using 500 classes (Whittaker &
Woodland, 2003). The results were further improved 
when a class-based model was combined with a word- 
based model. 

A large amount of data must be available to derive classes
automatically from the data instead of using external
knowledge sources.

Language Models Based on Sub-Word 
Units 

Given the difficulties in language modelling based on 
full word forms it would be desirable to find a method 
of decomposing word forms into their morphological 
components and to build a more robust language model 
based on probabilities involving individual morphologi- 
cal components. 

Lexicons exist for some languages which contain 
information about the morphological components of 
words. In experiments on Czech, words were decom- 
posed into stems and endings using a Czech Morpho- 
logical Analyzer, and were then used as modelling 
units (Byrne, Hajic, Ircing, Krbec & Psutka, 2000). 
Morpheme-based language models were also studied 
for the Korean language, where a word-phrase is an 
agglomerate of morphemes (Kwon & Park, 2003). 
Sub-word units are also used when modelling aggluti- 
native languages where, besides inflections, compound 
words are very common (Szarvas & Furui, 2003). 
Morphological sub-word units have also been applied
to Turkish (Erdogan, Büyük & Oflazer, 2005). The
language model's constraints were represented by a 
weighted finite state machine. 

Many languages do not have developed morphologi- 
cal analysers. Data-driven discovery of a language's 
morphology is used in such cases. It is common for 
data-driven approaches to outperform linguistic ones. 
Morphemic suffixes were discovered by Minimum 
Description Length (MDL) analysis (Brent, Murthy 
& Lundberg, 1995). MDL analysis has been used for 
morphological segmentation for various European languages
(Goldsmith, 2001). An algorithm for learning
morphology using latent semantic analysis has also
been proposed (Schone & Jurafsky, 2000). This algorithm
only extracts affixes when the stem and stem-affix are 
sufficiently similar semantically. The language model 
for Russian also improves when using data-driven 
sub-word units (Whittaker & Woodland, 2000).
Language-independent algorithms for discovering word
fragments based on MDL have been presented for the
Finnish language (Hirsimaki, Creutz, Siivola, Kurimo,
Virpioja, & Pylkkonen, 2006). The authors report that
word fragments obtained using grammatical rules gave 
worse results than fragments obtained by data-driven 
algorithms. Speech recognition results were improved
further by clustering morph n-gram histories (Virpioja
& Kurimo, 2006). The same kinds of comparisons
with similar conclusions have also been done for the 
Turkish and Estonian languages. 



LANGUAGE MODEL OF INFLECTIVE 
LANGUAGE 

Our work is mainly devoted to the highly inflective 
Slovenian language. It is a South Slavic language. It 
shares its characteristics, in varying degrees, with many 
other inflective languages, especially Slavic. 

As in the case of other inflective languages, we 
concentrate on reducing the perceived data sparsity. The 
techniques we investigate are language-independent 
and, as such, also applicable to other highly inflective 
languages. 

Class-Based Language Models of the 
Slovenian Language 

In our first study, the use of data-driven classes was 
examined. In (Sepesy Maucec, Brest, Kacic & Zumer, 
2000) we described an improved algorithm for word 
clustering. The main idea was to replace the systematic 
replacement of words between classes with a random- 
ized one. Secondly, instead of replacing one word after 
another, a randomly selected group of words was re- 
placed at once. The pseudocode of the algorithm is: 

1. Set up the initial mapping
2. Compute the initial train set perplexity PP
3. while (stopping criterion is not met) do begin
4.     randomly select a set of words
5.     for each selected word randomly select a target class
6.     compute the new train set perplexity PP1
7.     if (PP1 < PP)
           keep words in new classes and PP := PP1
       else keep words in old classes
8.     goto step 3
   end
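
The randomized clustering loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a maximum-likelihood class bigram model for the train set perplexity, a fixed number of iterations as the stopping criterion, and a round-robin initial mapping. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter

def train_perplexity(tokens, word2class):
    """Train set perplexity of a maximum-likelihood class bigram model."""
    classes = [word2class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    history_counts = Counter(classes[:-1])
    class_bigrams = Counter(zip(classes, classes[1:]))
    log_prob = 0.0
    for c_prev, c, w in zip(classes, classes[1:], tokens[1:]):
        p = (word_counts[w] / class_counts[c]) * \
            (class_bigrams[(c_prev, c)] / history_counts[c_prev])
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

def randomized_clustering(tokens, num_classes, iterations, group_size, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    # Step 1: initial mapping (round-robin, an assumption of this sketch).
    word2class = {w: i % num_classes for i, w in enumerate(vocab)}
    pp = train_perplexity(tokens, word2class)                    # step 2
    for _ in range(iterations):                                  # step 3
        group = rng.sample(vocab, min(group_size, len(vocab)))   # step 4
        candidate = dict(word2class)
        for w in group:                                          # step 5
            candidate[w] = rng.randrange(num_classes)
        pp_new = train_perplexity(tokens, candidate)             # step 6
        if pp_new < pp:                                          # step 7
            word2class, pp = candidate, pp_new
    return word2class, pp
```

Because a whole group of words is moved before the perplexity is re-evaluated, each iteration costs one perplexity computation regardless of the group size. A move is kept only if it lowers the perplexity, so the final value is never worse than the initial one.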



The main bottleneck for a clustering algorithm is 
time complexity. We developed a parallelized version 
of the algorithm in order to speed it up. Using random 
selection, we achieved a 3.7% improvement in perplex- 
ity when comparing the results with the basic clustering 
algorithm, which replaces words systematically. 

With V words in the vocabulary clustered
into C classes, the space complexity of the class-based
bigram language model is $O(C^2 + V)$, in contrast to
the space complexity $O(V^2)$ of the word-based language
model. Using classes, we can enlarge the vocabulary
while keeping the language model small,
but this does not solve the problem of OOV words.
Moreover, most speech recognizers use only word-based
models. In such cases, class-based models must
be converted into word-based ones, which considerably
increases their size.
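
To give a feeling for the difference between the two space complexities, the following sketch counts table entries in the worst case. The vocabulary size of 734k distinct words is the one reported for the corpus used later in this chapter. The number of classes (1000) is a hypothetical value chosen only for illustration.

```python
def bigram_table_sizes(vocab_size, num_classes):
    """Worst-case number of entries for word-based and class-based bigram models."""
    word_based = vocab_size ** 2                   # O(V^2)
    class_based = num_classes ** 2 + vocab_size    # O(C^2 + V)
    return word_based, class_based

# V = 734,000 distinct words (from the corpus), C = 1,000 classes (assumed).
word_based, class_based = bigram_table_sizes(734_000, 1_000)
ratio = word_based / class_based
```

These are upper bounds. In practice only observed n-grams are stored, so the real gap is smaller, but the quadratic term in V is what makes the conversion back to a word-based model expensive.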

Language Models of Slovenian 
Language Based on Data-Driven 
Sub-Word Units 

Slovenian words often have many morphological units 
in common. Two constituent parts can be determined 
when a highly simplified model of a word is examined: 
a stem, which can be thought of as responsible for the 
nuclear meaning of a word, and an ending, which de- 
termines the grammatical features. Not all words can 
be decomposed into stem and ending. In this case an 
empty ending is used. 

In (Sepesy Maucec, Kacic & Horvat, 2004) we 
showed that it makes sense to model the semantic and 
grammatical features of words separately: 




$$P(w_i \mid h_i) = P(s_i e_i \mid h_i) = P(s_i \mid h_i^s) \, P(e_i \mid h_i^e)$$

The word $w_i$ is decomposed into a stem $s_i$ and an ending $e_i$.
$h_i$ denotes the previously observed units, with $h_i^s$ and $h_i^e$
the histories used in the prediction of the stem and the ending.

The prediction of a stem was subjected to topic
adaptation. It was presumed that the language in the
target environment (where the final application would be
used) is topically homogeneous. A general language
model was tuned to the specific topic by using data
at three semantic levels. The first level corresponds
to the general language, characterised by the whole
corpus. The second level corresponds to the language
characterised by a subset of similar documents. The
third level represents a finer level of language topic
similarity.

By considering the lengths of histories in predictions, 
we investigated the following trigram model: 

$$P(w_i \mid w_{i-2} w_{i-1}) = P(s_i \mid s_{i-2} s_{i-1}) \left( \lambda P(e_i \mid e_{i-2} e_{i-1}) + (1 - \lambda) P(e_i \mid s_i) \right)$$

Prediction of the stem is based on knowledge of the
two preceding stems. Prediction of the ending is based
on the knowledge of the two preceding endings, and
the current stem. In our experiments, the best results
were obtained when $\lambda = 0.1$, because a relatively small
set of endings can be appended to a particular stem.
Some information about word-ending is also contained
in the endings of neighbouring words.
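
As a small worked example of the trigram model, the sketch below combines the three component probabilities, assuming they are already available from trained stem and ending models. The function name and the numeric inputs are hypothetical. Only the interpolation weight 0.1 comes from the text.

```python
def word_trigram_prob(p_stem, p_ending_given_endings, p_ending_given_stem, lam=0.1):
    """P(w_i | w_{i-2} w_{i-1}) =
       P(s_i | s_{i-2} s_{i-1}) *
       (lam * P(e_i | e_{i-2} e_{i-1}) + (1 - lam) * P(e_i | s_i))."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p_ending = lam * p_ending_given_endings + (1.0 - lam) * p_ending_given_stem
    return p_stem * p_ending

# Hypothetical component probabilities for one word in context.
p_word = word_trigram_prob(p_stem=0.02,
                           p_ending_given_endings=0.3,
                           p_ending_given_stem=0.5)
```

With a weight of 0.1, nine tenths of the ending probability comes from the current stem and one tenth from the endings of the two preceding words.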

The model presupposes a decomposed training cor- 
pus. In (Sepesy Maucec, Rotovnik, & Zemljak Jontes, 
2003) we used a simple decomposition scheme, based 
on a preselected set of endings and the longest-match 
principle. A set of endings was automatically gener- 
ated over three steps. First, a list was created of all 
words written in reversed character order. Words were 
arranged in alphabetical order; thus, words sharing a 
common ending appear together on the list. The initial 
characters of adjacent words in the list are compared 
in order to find a match. Two restrictions were used to 
avoid over-stemming: the remaining stem should be of 
a predefined minimum length and the first character of 
a match must be a vowel. Words should be decomposed
at a consonant-vowel pair because consonants carry more
information about the meaning of a word than vowels.
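
The ending-generation scheme and the longest-match decomposition can be sketched as below. This is a simplified reading of the description, not the published procedure: shortening a match until both restrictions hold, the minimum stem length of 3, and the plain vowel set are assumptions of this sketch, and the toy word list is hypothetical.

```python
VOWELS = set("aeiou")

def candidate_endings(words, min_stem_len=3):
    """Collect endings shared by neighbours in the reversed, sorted word list."""
    reversed_words = sorted(w[::-1] for w in set(words))
    endings = set()
    for a, b in zip(reversed_words, reversed_words[1:]):
        # Common prefix of the reversed words = common suffix of the originals.
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        # Shorten the match until both restrictions are satisfied.
        while n > 0:
            ending = a[:n][::-1]
            stems_long_enough = (len(a) - n >= min_stem_len and
                                 len(b) - n >= min_stem_len)
            if stems_long_enough and ending[0] in VOWELS:
                endings.add(ending)
                break
            n -= 1
    return endings

def decompose(word, endings, min_stem_len=3):
    """Longest-match principle: try the longest admissible ending first."""
    for n in range(len(word) - min_stem_len, 0, -1):
        if word[-n:] in endings:
            return word[:-n], word[-n:]
    return word, ""   # no ending found: use the empty ending

words = ["lipa", "lipe", "lipi", "miza", "mize", "mizi", "delati", "pisati"]
endings = candidate_endings(words)
```

For the toy list, `decompose("delati", endings)` splits the word into the stem `del` and the ending `ati`, while a short word such as `in` is left whole with an empty ending.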

We further improved the decomposition of words
in an iterative manner. We searched for the decomposition
that maximizes the log-likelihood of the
training corpus, computed from sub-word trigrams.
The pseudocode of the algorithm is: 

1. Collect word bigram counts in train set
2. Set up the initial decomposition
3. Compute the initial log-likelihood of the train set LL
4. while (stopping criterion is not met) do begin
5.     randomly select a set of words
6.     for each selected word randomly set the new stem-ending boundary
7.     compute the new log-likelihood of the train set LL1
8.     if (LL1 > LL)
           accept new decompositions and LL := LL1
       else keep old decompositions
9.     goto step 4
   end

The choice of initial decomposition is very impor- 
tant, because final decompositions are only guaranteed 
to be locally optimal. The initial decomposition was
set to the one proposed in (Sepesy Maucec
et al., 2003). The stopping criterion was a predefined
number of iterations. 

Experiments have been performed using a newspa- 
per corpus named 'Vecer'. The size of the corpus was 
85M words (734k distinct words). 14M word bigram 
counts were collected from the corpus. After initializa- 
tion, we had 267k distinct sub-words (264k stems and 
3k endings) and the initial sub-word perplexity was 361. 
After 10,000 iterations the number of distinct sub-units 
increased to 497k (417k stems and 80k endings) but 
sub-word perplexity decreased to 291. Data-driven 
decompositions obtained by this algorithm have already 
been tested in speech recognition experiments (Roto- 
vnik, Sepesy Maucec & Kacic, 2006). The error rate 
decreased by 6.3% when compared with the results of 
speech recognition using word-based models. 



FUTURE TRENDS 

A lot of work has been done on modelling highly inflec- 
tive languages but there still exists a lack of knowledge 
on how to model them 'most effectively'. As an exten- 
sion of the conventional n-gram language model, a 
factored language model has been proposed and tested 
on Arabic (Bilmes & Kirchhoff, 2003). This factored 
form could also be useful for other highly inflective 
languages, because it combines information of different 
types in one general model. To our knowledge, factored 
language models have not been widely studied on other 
highly inflective languages yet, except for Arabic and, 
more recently, Estonian (Alumae, 2006). 



CONCLUSION 

This chapter gives an overview of methods applied
when modelling highly inflective languages. Considering
the characteristics of such languages,
we presented models of two types: class-based and
sub-word based. The motivation behind both of them
is data-sparsity reduction.

The main idea of class-based models is to reduce 
the number of free parameters by clustering words 
into classes. It is interesting that data-driven classes 
outperformed linguistic classes in many research 
experiments. 

Sub-word based models reduce the size of the 
vocabulary by splitting words into smaller units and 
storing these sub-word units (instead of words) in the 
vocabulary. Data-driven methods to split words into 
sub-words surpassed grammatical decompositions for 
many languages. 

The reported experiments regarding the use of 
these types of models (especially in combination with 
standard word-based) show an overall reduction of 
errors in the target applications. We draw the same 
conclusions from our experiments on the Slovenian 
language. A promising direction for further work is 
seen in the factored language model. 



REFERENCES 

Alumae, T. (2006). Sentence-Adapted Factored Lan- 
guage Model for Transcribing Estonian Speech, Pro- 
ceedings of the International Conference on Acoustics, 
Speech, and Signal Processing, 1, 429-432. Toulouse, 
France. 

Bilmes, J., & Kirchhoff, K. (2003). Factored Language 
Models and Generalized Parallel Backoff, Proceedings 
of the Human Language Technology Conference, 2, 
4-6. Edmonton, Canada. 

Brent, M., Murthy, S.K., & Lundberg, A. (1995). Dis- 
covering Morphemic Suffixes: a Case Study in MDL 
Induction. Proceedings of the International Workshop 
on Artificial Intelligence and Statistics, 482-490. Fort 
Lauderdale, Florida. 

Brown, P.F., de Souza, P.V., Mercer, R.L., Della Pietra,
V.J., & Lai, J.C. (1992). Class-Based N-gram Models 
of Natural Language, Computational Linguistics, 
18(4), 467-479. 

Byrne, W., Hajic, J., Ircing, P., Krbec, P., & Psutka, J. 
(2000). Morpheme Based Language Model for Speech 
Recognition of Czech. Lecture Notes in Artificial Intelligence,
1902, 211-216.



Casillas, A., Varona, A., & Torres I. (2003). Experi- 
ments with Linguistic Categories for Language Model 
Optimization. Lecture Notes in Computer Science, 
2588, 511-515. 

Chen, S.F., & Goodman, J. (1996). An Empirical Study 
of Smoothing Techniques for Language Modelling, 
Proceedings of the 34th Annual Meeting of the Association
for Computational Linguistics, 310-318. Santa
Cruz, California. 

El-Beze, M., & Derouault A.M. (1990). A Morpho- 
logical Model for Large Vocabulary Speech Recogni- 
tion, Proceedings of the International Conference on 
Acoustics, Speech, and Signal Processing, 577-580. 
Albuquerque, New Mexico. 

Erdogan, H., Büyük, O., & Oflazer, K. (2005). Incorporating
Language Constraints in Sub-word Based Speech
Recognition. Proceedings of the IEEE Automatic 
Speech Recognition and Understanding Workshop, 
93-103. San Juan, Puerto Rico. 

Goldsmith, J. (2001). Unsupervised Learning of 
Morphology of Natural Language. Computational 
Linguistics, 27(2), 153-189. 

Hirsimaki, T., Creutz, M., Siivola, V., Kurimo, M., Vir- 
pioja, S., & Pylkkonen, J. (2006). Unlimited Vocabulary 
Speech Recognition with Morph Language Models 
Applied to Finnish. Computer, Speech & Language, 
20(4), 515-541. 

Jelinek, F. (1998). Statistical methods for Speech Rec- 
ognition. Cambridge, Massachusetts: MIT Press. 

Kolar, J., Svec, J., & Psutka, J. (2004). Automatic Punc- 
tuation Annotation in Czech Broadcast News Speech. 
Proceedings of the International Workshop on Speech 
and Computer, 319-325. Patras, Greece. 

Kwon, O.W., & Park, J. (2003). Korean Large Vocabu- 
lary Continuous Speech Recognition with Morpheme- 
based Recognition Units. Speech Communication, 
39(3-4), 287-300. 

Martin, S., Liermann, J., & Ney, H. (1995). Algorithms
for Bigram and Trigram Clustering. Proceedings of 
the International Conference Eurospeech, 1253-1256. 
Madrid, Spain. 

Nouza, J., & Nouza, T. (2004). A Voice Dictation System
for a Million-Word Czech Vocabulary. Proceedings
of the International Conference on Computing,
Communications and Control Technologies, 149-152.
Austin, USA.

Rotovnik, T., Sepesy Maucec, M., & Kacic, Z. (2006). 
Large Vocabulary Continuous Speech Recognition of 
Inflectional Language with Stems and Endings, Speech 
Communication, 49(6), 437-452. 

Schone, P., & Jurafsky, D. (2000). Knowledge-Free
Induction of Morphology Using Latent Semantic Analy- 
sis. Conference on Computational Natural Language 
Learning, 67-72. Lisbon, Portugal. 

Schwenk, H. (2007). Continuous Space Language Mod- 
els. Computer, Speech & Language, 21(3), 492-518. 

Sepesy Maucec, M., Brest, J., Kacic, Z., & Zumer, V. 
(2000). On Solving Statistical Language Modeling for 
Speech Recognition using a Heterogeneous Comput- 
ing system (in Slovene). Electrotechnical Reviews, 
67(1), 55-61. 

Sepesy Maucec, M., Rotovnik, T., & Zemljak Jontes, 
M. (2003). Modelling Highly Inflected Slovenian 
Language. International Journal of Speech Technol- 
ogy, 6(3), 245-257. 

Sepesy Maucec, M., Kacic, Z., & Horvat, B. (2004). 
Modelling Highly Inflected Languages. Information 
Sciences, 166(1-4), 249-269. 

Szarvas, M. & Furui, S. (2003). Evaluation of the 
Stochastic Morphosyntactic Language Model on a 
One Million Word Hungarian Task. Proceedings of 
the International Conference Eurospeech, 2297-2300. 
Geneva, Switzerland. 

Virpioja, S., & Kurimo, M. (2006). Compact N-gram 
Models by Incremental Growing and Clustering of 
Histories. Proceedings of the International Conference 
Interspeech, September 17-21, Pittsburgh, USA. 

Whittaker, E.W.D., & Woodland, P.C. (2000). Particle- 
based Language Modelling. Proceedings of the Inter- 
national Conference on Spoken Language Processing, 
1, 170-173. Beijing, China. 

Whittaker, E.W.D., & Woodland, P.C. (2003). Lan- 
guage Modelling for Russian and English Using Words 
and Classes. Computer, Speech & Language, 17(1), 
87-104. 



Zitouni, I. (2002). A Hierarchical Language Model 
Based on Variable-length Class Sequences: The MC 
[v][n] Approach. IEEE Transactions on Speech and 
Audio Processing, 10(3), 193-198. 



KEY TERMS 

Corpus: A large collection of texts, usually in 
electronic form. The corpus has greater value if it is 
tokenized (segmented into sentences, words etc.) and 
linguistically annotated (for example POS-tagged and 
lemmatized). 

Inflective Language: A language characterized by
the use of inflections. Inflection is the modification of a 
word in order to reflect grammatical information, such 
as gender, number, person etc. 

Language Model: A description of language. In 
statistical language modelling it is a set of probability 
estimates. 

n-Gram Model: A model based on the statistical
properties of n-grams. An n-gram model predicts the i-th
unit based on the knowledge of the n-1 previous units. In
n-gram modelling the assumption is made that each
unit depends only on the n-1 previously observed units. This
is the main deficiency of n-gram modelling, because
it has been shown that the range of dependencies is
significantly longer.

Out-Of-Vocabulary Rate: Number of unknown
words in a new sample of language (it is called a test 
set), usually expressed in percentage. 
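
A minimal sketch of the out-of-vocabulary rate, assuming it is counted over running words (tokens) of the test set rather than over distinct word types. The vocabulary and test sentence are hypothetical.

```python
def oov_rate(test_tokens, vocabulary):
    """Percentage of test-set tokens that are not in the vocabulary."""
    unknown = sum(1 for w in test_tokens if w not in vocabulary)
    return 100.0 * unknown / len(test_tokens)

vocabulary = {"the", "cat", "sleeps"}
test_tokens = "the dog sleeps and the cat runs".split()
rate = oov_rate(test_tokens, vocabulary)   # 3 of the 7 tokens are unknown
```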

Perplexity: A measure of a language model's qual- 
ity. It can be interpreted as the geometric mean of the 
branching factor of the language model. A language
model with perplexity X has the same difficulty as an 
imaginary language in which every word can be fol- 
lowed by X different words with equal probability. 
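
The interpretation of perplexity as a geometric-mean branching factor can be checked with a short sketch. It assumes the model has already assigned a non-zero probability to every unit of the test text. The function name and the input lists are illustrative.

```python
import math

def perplexity(probabilities):
    """exp of the negative average log-probability, i.e. the inverse of the
    geometric mean of the per-unit probabilities."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

# Every word equally likely among 50 continuations.
uniform_pp = perplexity([1.0 / 50.0] * 20)
# Mixed probabilities: the geometric mean of 1/2 and 1/8 is 1/4.
mixed_pp = perplexity([0.5, 0.125])
```

In the uniform case the perplexity equals the number of equally likely continuations, which matches the "imaginary language" reading given in the definition.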

Sub-Word Unit: Modelling unit smaller than a 
word. Sub-word units are usually morphemes, stems 
and endings, roots, etc. 

Unknown Word: Vocabularies are typically fixed 
to be tens of thousands of words. All words not in the 
vocabulary are mapped to a single distinguished word, 
usually called the unknown word.

Vocabulary: A set of words (or other units) being
modelled. The same vocabulary is used by the language
model and the target application.

INTRODUCTION

In this article we compare a number of full-adder (1-bit
addition) cells regarding minimum supply voltage
and yield, when taking statistical simulations into
account. According to the ITRS Roadmap two of the
most important challenges for future nanoelectronics
design are reducing power consumption and increasing
manufacturability (ITRS, 2005).

We use subthreshold CMOS, which is regarded by
many as the most promising ultra low power circuit
technique. It is also shown that a minimum redundancy
factor as low as 2 is sufficient to make circuits maintain
full functionality in the presence of defects. This is,
to our knowledge, the lowest redundancy reported for
comparable circuits, and builds on a method suggested
a few years ago (Aunet & Hartmann, 2003).

A standard Full Adder (FA) and an FA based on
perceptrons exploiting the "mirrored gate", implemented
in a standard 90 nm CMOS technology, are shown not
to withstand statistical mismatch and process variations
for supply voltages below 150 mV. Exploiting
a redundancy scheme tolerating "open" faults, with
gate-level redundancy and shorted outputs, shows
that the same two FAs might produce adequate Sum
and Carry outputs in the presence of a defective PMOS
for supply voltages above 150 mV, for a redundancy
factor of 2 (Aunet & Otnes Berge, 2007).

Two additional perceptrons do not tolerate the
process variations, according to simulations. Simulations
suggest that the standard FA has the lowest power
consumption. Power consumption varies more than an
order of magnitude for all subthreshold FAs, due to the
statistical variations.

BACKGROUND

The first simple mathematical model of the biological 
neurons, published by McCulloch and Pitts in 1943, 
calculates the sign of the weighted sum of inputs.
Sometimes such circuits are called threshold logic 
gates or threshold elements. Perceptrons may be used 
to implement Neural Networks as well as digital signal 
processing. 
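
As an illustration of how threshold elements can perform 1-bit addition, the sketch below builds a full adder from two threshold gates with 0/1 outputs. This is a generic textbook construction (a majority gate for the carry and a second gate that subtracts twice the carry for the sum) and is an assumption of this sketch. It is not the transistor-level topology simulated in this chapter.

```python
from itertools import product

def threshold_gate(inputs, weights, threshold):
    """Return 1 when the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return int(weighted_sum >= threshold)

def full_adder(x, y, z):
    # Carry: majority of the three inputs.
    carry = threshold_gate((x, y, z), (1, 1, 1), 2)
    # Sum: x + y + z - 2 * carry is 1 exactly when the sum bit is 1.
    total = threshold_gate((x, y, z, carry), (1, 1, 1, -2), 1)
    return total, carry

truth_table = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
```

Each row of `truth_table` satisfies `total + 2 * carry == x + y + z`, which is the defining property of a full adder.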

Nanoscale CMOS technology is expected to be used
alongside other technologies in the future. A typical 
chip will fail if even a single transistor or wire on the 
chip is defective. Reducing the power consumption and 
making defect tolerant circuits have been pointed out 
as important issues (Mead, 1990), (ITRS, 2005). 

Reducing the power supply voltage is the most direct 
and dramatic means of reducing the power consumption 
(Liu & Svensson, 1993), and subthreshold circuits operating
with a supply voltage, Vdd, less than the absolute
value of the inherent threshold voltages, Vt, have been
known for decades (Swensson & Meindl, 1972).

For older technologies, where manufacturability,
including threshold voltage variability, was not such
an important issue (ITRS, 2005), (Wong, Mittal, Cao
& Starr, 2004), the minimum supply voltages have
often been estimated without taking mismatch and process
variations into account (Liu & Svensson,
1993), (Schrom & Selberherr, 1996). To get more realistic
estimates we have simulated and compared 4
different topologies for 1-bit addition under statistical
variations in the process and matching properties.

MAIN FOCUS OF THE CHAPTER

MOS Transistors in Subthreshold

For an NMOS transistor in subthreshold we have 
(Andreou, Boahen, Pouliquen, Pavasovic, Jenkins & 
Strohbehn, 1991): 






$I_{ds,n}$ expresses the current from drain to source.
$I_0$ is the zero-bias current where the pre-exponential
constants have been absorbed. This includes the channel
width ("W") and the length ("L") of the MOSFET
structure. $V_{gs}$ is the gate-to-source potential, $V_{ds}$ the
drain-to-source potential and $V_{bs}$ the substrate-to-source
potential.

$V_0$ is the Early voltage, which is proportional to
the channel length. $\kappa$ gives the effectiveness with which
the gate potential controls the channel current.
It is often approximately 0.7-0.75 (Andreou, Boahen,
Pouliquen, Pavasovic, Jenkins & Strohbehn, 1991).
The thermal voltage is expressed as $V_t = kT/q$; $V_t$ = 25.8
mV at room temperature.
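
The quantities defined above can be explored numerically with the sketch below. The thermal voltage follows directly from $V_t = kT/q$. The drain-current function uses a commonly quoted form of the subthreshold expression with exactly these symbols, $I_{ds,n} = I_0 e^{(1-\kappa)V_{bs}/V_t} e^{\kappa V_{gs}/V_t} (1 - e^{-V_{ds}/V_t} + V_{ds}/V_0)$, which is an assumption of this sketch and may differ in detail from equation (1). The default values of `i_0`, `kappa` and `v_0`, and the choice of 300 K for room temperature, are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C

def thermal_voltage(temperature_kelvin):
    """V_t = kT/q in volts."""
    return BOLTZMANN * temperature_kelvin / ELEMENTARY_CHARGE

def subthreshold_current(v_gs, v_ds, v_bs=0.0, i_0=1e-9, kappa=0.7,
                         v_0=10.0, temperature_kelvin=300.0):
    """Assumed form of the NMOS subthreshold drain current (see lead-in)."""
    v_t = thermal_voltage(temperature_kelvin)
    return (i_0
            * math.exp((1.0 - kappa) * v_bs / v_t)
            * math.exp(kappa * v_gs / v_t)
            * (1.0 - math.exp(-v_ds / v_t) + v_ds / v_0))

v_t_300 = thermal_voltage(300.0)
# Effect of raising the gate voltage by 50 mV at a 150 mV drain voltage.
gain = subthreshold_current(0.20, 0.15) / subthreshold_current(0.15, 0.15)
```

Under this assumed form, the current ratio for a gate-voltage step depends only on $\kappa$ and $V_t$. This exponential sensitivity is why small parameter variations have a large effect in subthreshold operation.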

Although equation (1) accounts for fewer physical effects,
and for less of the nonmonotonous behaviour seen in certain
cases, than for example the model reported in (Calhoun,
Wang & Chandrakasan, 2004), it does provide sufficient
insight to make a brief analysis of many subthreshold
circuits. A similar equation applies to PMOS transistors,
but with opposite polarities.

Experimental Setup for Statistical 
Simulations of Functionality and Power 
Consumption for 1-Bit Adders 

For statistical (Monte-Carlo) simulations we used a 90 
nm standard CMOS process available through CMP 
(CMP, 2007). Four different Full Adder ("FA") circuits,
having their inputs driven by inverters and themselves
driving simple inverters, were simulated. This is
illustrated in figure 1. In the case without redundancy and
faults, the lower FA in figure 1 was not included.

For each circuit, at 8 different supply voltages, 100
Monte-Carlo "runs" were done, each having the eight
possible combinations of the three inputs, for a total
simulated period (transient simulation) of 400 µs, as
illustrated in figure 3 for a case after 5 "runs". This
was far from the maximum operational speed of any
of the FAs, meaning that the resulting Sum and Carry
signals had more than enough time to settle. Each of the
100 runs represented different mismatch and process
parameters, and for each run we checked if the circuit
was able to produce correct "0" or "1" outputs for all
eight input combinations. The yield, shown in figure
4, represents the percentage of the Full Adders (FAs)
working for a given supply voltage, out of 100
Monte-Carlo "runs".
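
The yield figure can be computed from per-run pass/fail checks as sketched below. Each Monte-Carlo run is represented here by a hypothetical callable that returns the (Sum, Carry) logic levels read from the simulated waveforms. The circuit simulation itself is not reproduced, and the stuck-at-0 carry fault is only an illustrative failure.

```python
from itertools import product

def expected_outputs(x, y, z):
    """Ideal full adder: (Sum, Carry) for three input bits."""
    total = x + y + z
    return total % 2, total // 2

def run_is_functional(simulated_adder):
    """A run passes only if all eight input combinations give correct outputs."""
    return all(simulated_adder(x, y, z) == expected_outputs(x, y, z)
               for x, y, z in product((0, 1), repeat=3))

def yield_percent(simulated_adders):
    working = sum(1 for adder in simulated_adders if run_is_functional(adder))
    return 100.0 * working / len(simulated_adders)

def carry_stuck_at_zero(x, y, z):
    # Illustrative failing run: the Carry output never rises.
    return expected_outputs(x, y, z)[0], 0

runs = [expected_outputs] * 97 + [carry_stuck_at_zero] * 3
example_yield = yield_percent(runs)
```

A run with a single wrong output for any of the eight input combinations counts as a failure, which is the criterion described above.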

Redundancy using short circuited driven nodes (Aunet
& Hartmann, 2003) was exploited, duplicating each
gate for the three FAs based on threshold gates (figure
2). For the other FA only the driven nodes prior to the
inverters preceding the S and C nodes were shorted.
A total of 4 PMOS transistors were removed from the
4 FAs (one for each FA), so that each FA missed one
PMOS in one of its threshold gates. This means that
each redundant FA circuit in figure 1 had exactly 2N - 1
transistors, where N is the number of transistors in the
corresponding case with no redundancy.

The average power consumption for the eight input
combinations was also calculated. Each of the four
circuits, with no redundancy, was tested
at 8 different supply voltages.

The missing transistor was in the lowermost "min3"
gate (figure 2). For the mirrored gate the missing PMOS
was the one having the Z input. For the stacked gate
as well as the ijcnn gate the missing PMOS was the
one between the two other PMOS transistors in
figure 2.

For the FA in the upper left corner of figure 2, a
PMOS with its gate connected to the C_in input was the
one that was removed. The rest of the setup
was identical to the one in the previous subsection,
describing the case without redundancy.

The FAs put to test were a standard CMOS Full 
Adder containing 28 transistors (upper, left, in figure 
2), while the three others were based on the topology 
in the upper, right, corner of figure 2. They were based 
on, from left to right in figure 2, the "mirrored gate" 
(Hampel, Prost & Scheinberg, 1974), the "stacked" 
gate (Aunet, Berg & Beiu, 2005) and the "ijcnn" gate 
(Aunet, Oelmann, Abdalla & Berg, 2004), which are 
all threshold gates. 

Regarding transistor dimensions, all gate lengths
were 100 nm, and all NMOS widths were 220 nm.
The standard FA and the "stacked" FA had widths of
all PMOS equal to 400 nm, while the "ijcnn FA" and
the "mirrored FA" had PMOS widths of 550 nm and
650 nm, respectively. Buffers, made from two inverters,
were inserted on the Sum nodes as well as between the
two uppermost threshold gates ("min3") in figure 2.

Results 

The percentage of FA circuits that produced correct
logic levels for the Sum and Carry signals, under different
conditions, is shown in figure 4. It is clear that
the standard CMOS FA and the one based on the mirrored
gate give a larger percentage for a given supply
voltage when compared to the FAs based on the two
other threshold gates.



Power consumption as a function of supply voltage 
is shown in figure 5, for the basic circuits without any 
defect transistors or redundancy. 



DISCUSSION 

The standard Full Adder, and the threshold gate based
topology (upper right corner in figure 2) exploiting
the mirrored gate, both need supply voltages of at
least 150 mV to tolerate mismatch and process variations,
according to our simulations. This may be seen
to the left in figure 4. The threshold gates "ijcnn" and
"stacked" do not tolerate statistical variations like
the two previously mentioned solutions, at least not
when there is no redundancy and relatively small
transistors are used. Larger transistors should improve
the matching properties and make the circuits less
vulnerable to statistical variations in production, as
the spread in, for example, the inherent threshold voltages
is inversely proportional to the square root of the
product of the widths and lengths of the MOSFETs
(Croon, Decoutere, Sansen & Maes, 2004):
$\sigma(V_T) = A(V_T)/\sqrt{WL}$.

Figure 1. Experimental setup for statistical simulation of 1-bit adder

Figure 2. Schematics for the four 1-bit adders (Full Adders). The standard CMOS version is in the upper left
corner, while a topology based on perceptrons and inverters is shown in the upper right corner

Figure 3. Sum and carry as a function of X, Y and Z inputs for 5 runs
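
The matching relation $\sigma(V_T) = A(V_T)/\sqrt{WL}$ quoted in the discussion can be evaluated with a few lines of code. The NMOS dimensions (W = 220 nm, L = 100 nm) are those given in the experimental setup. The matching coefficient of 4 mV·µm is a hypothetical value chosen only for illustration and is not a parameter of the process used here.

```python
import math

def vt_sigma_mv(a_vt_mv_um, width_um, length_um):
    """sigma(V_T) = A(V_T) / sqrt(W * L), in mV for A in mV*um and W, L in um."""
    return a_vt_mv_um / math.sqrt(width_um * length_um)

A_VT = 4.0   # mV*um, assumed matching coefficient

small = vt_sigma_mv(A_VT, 0.22, 0.10)   # NMOS size used in the chapter
large = vt_sigma_mv(A_VT, 0.44, 0.20)   # both dimensions doubled
```

Doubling both the width and the length quadruples the gate area and halves the spread, which is the sense in which larger transistors improve matching.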

The mirrored threshold gate was adopted for subthreshold
operation and defect/fault tolerance using
shorted outputs (Aunet & Hartmann, 2003) in (Beiu,
Aunet, Nyathi, Rydberg & Djupdal, 2005), and underwent
statistical simulations like those presented here in
(Granhaug & Aunet, 2006). There, a redundancy factor of 2
combined with a supply voltage of at least 175 mV was
required for a single defective PMOS to be tolerated.
In (Granhaug & Aunet, 2006) transistor sizing was slightly
different, and the wells of both the PMOS and NMOS
transistors were short circuited, as opposed to our case,
where the wells were connected to the rails. For systems of
considerable size implemented in silicon, the lowest
supply voltage might be the 175 mV reported in (Miyazaki,
Kao & Chandrakasan, 2002). Exploiting redundancy,
duplicating every gate and removing one PMOS transistor
from each of the four full adders gave the results
shown to the right in figure 4. The picture resembles
the case to the left, without redundancy, but shows some
differences. The minimum Vdd that made the standard
FA and the one based on the mirrored gate function for
all 100 Monte-Carlo runs was still 150 mV. This
is a lower supply voltage than the 175 mV found in
(Granhaug & Aunet, 2006). Transistor sizing as well
as biasing of wells may have a significant impact on
the results, especially in subthreshold, with the many
exponential dependencies shown in equation 1.
From figure 4 one can also see that the FAs based on
the "mirrored" and the "ijcnn" gates often have a higher
yield for a given supply voltage when introducing
redundancy and a defective transistor, compared to
the case without redundancy and any defect. The FA
based on the "mirrored" gate was the most robust one
when there was a defective transistor, giving the highest
yield for low supply voltages, according to the simulations
in figure 4. More simulations, including other defects
and additional redundancy, could be interesting for
future research.

Figure 4. "Yield" from Monte-Carlo simulations of the FAs at different Vdd's (N=100)

Removing single transistors simulates certain "open"
faults in the redundant units. The scheme using shorted
outputs (Aunet & Hartmann, 2003), used for example in
(Beiu, Aunet, Nyathi, Rydberg & Djupdal, 2005) and
(Granhaug & Aunet, 2006), may however not withstand "close"
faults, such as outputs of redundant units shorted to one
of the supply rails. A method tolerating such defects as
well is presented in (Schmid & Leblebici, 2003). As
concluded in (Lehtonen, Plosila & Isoaho, 2005), no single
technique is enough to tolerate all fault mechanisms in
nanoscale circuits and systems, so combinations of several
methods are needed, depending on the specific design and
its proneness to different sources of defects.

The average, maximum and minimum power consumption in
the cases where the FAs were able to produce correct logic
outputs are shown in figure 5. The standard CMOS FA shows
the lowest average power consumption when the supply
voltage is above 150 mV, which is when two of the FAs give
a "yield" of 100 percent, according to our results. The FA
based on the "mirrored" gate shows a slightly higher power
consumption, while the FA based on the "ijcnn" gate displays
a power consumption up to orders of magnitude above the
others, and increasingly so for the relatively higher supply
voltages. Even the FAs showing a relatively high tolerance
to mismatch and process variations have current levels
ranging over more than an order of magnitude (a factor of
10) for a given supply voltage. Power consumption for a
given supply voltage is expected to increase linearly with
the redundancy factor.

The realism of simulations is limited, especially for
nanoscale CMOS (Nassif, 2006). Layout techniques for high
matching, including dummy structures, might therefore lead
to results different from those presented here.



FUTURE TRENDS 

The assumption that a system is composed largely of
correctly functioning units is no longer true in emerging
nanoelectronics, and reducing the overall power consumption
is also among the grand challenges for future
nanoelectronics. The low fan-in perceptrons, also called
voters, or minority gates, might be very useful candidates
for future nanoelectronics, as recently stated (Beiu &
Ibrahim, 2007). Defect tolerant subthreshold perceptron
circuits exploiting majority gates, as presented here, may
thus be useful building blocks for the future.



CONCLUSION 

Statistical Monte-Carlo simulations have been performed
on 4 Full Adder circuits. For each FA, 100 Monte-Carlo runs
were done at 8 different subthreshold supply voltages, and
the percentage of the runs providing appropriate logic
levels for the Sum and Carry outputs was calculated. A
"yield" of 100 percent meant that a certain FA would
tolerate all simulated combinations of statistical
variations. The circuits able to reach this limit were a
standard FA and an FA based on the "mirrored" threshold
gate, both needing a supply voltage, Vdd, of at least
150 mV to guarantee functionality under mismatch and
process variations.

When exploiting redundancy and shorting outputs (Aunet &
Hartmann, 2003), a supply voltage of less than 150 mV is
not enough to tolerate the statistical variations when a
PMOS is removed from the schematics and a redundancy factor
of 2 is used. The standard and mirrored-based FAs still
work at a supply voltage of at least 150 mV with one
defective MOSFET. Power consumption varies by approximately
1 order of magnitude for all 4 simulated FAs in
subthreshold, with the standard FA having the lowest power
consumption at useful supply voltages tolerating large
statistical variations.

KEY TERMS

Full Adder: Circuit that produces the binary sum 
and carry when adding two binary numbers. 

Minority-3 Gate: A minority-3 gate outputs a logic
"0" signal if, and only if, 2 or 3 out of its three binary
inputs are "1".
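
The definition above can be written down as a small truth function. The following is a minimal sketch in Python, not part of the original chapter; the function names `minority3` and `carry_out` are hypothetical. It also uses the standard fact that the carry output of a full adder is the majority of its three inputs, that is, the inverted minority-3 function.

```python
from itertools import product


def minority3(a: int, b: int, c: int) -> int:
    """Return 0 if two or three of the binary inputs are 1, otherwise 1."""
    return 0 if (a + b + c) >= 2 else 1


def carry_out(a: int, b: int, cin: int) -> int:
    """Full-adder carry: the majority of the inputs (inverted minority-3)."""
    return 1 - minority3(a, b, cin)


# Truth table of the gate, indexed by the input triple (a, b, c).
truth_table = {bits: minority3(*bits) for bits in product((0, 1), repeat=3)}
```

Only the three inputs with at most one "1" among them, plus the all-zero input, map to a logic "1" in `truth_table`.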

Mismatch: Ideally identically constructed elements
on an integrated circuit have a mismatch when they
differ in their physical properties after production of
the chip.

Monte Carlo Simulations: Computer simula- 
tions basing the results on statistical distribution of 
parameters. 

Nanoscale CMOS: CMOS technologies where
dimensions smaller than 100 nm are critical to the
functioning of the produced chip.

Neuron: Electrically excitable cells in the nervous 
system that process and transmit information. 

Parameter Variations: Parameters describing 
physical traits of integrated circuits may have variations 
due to mismatch, for example the threshold voltages 
of transistors. 

Perceptron: Type of artificial (feedforward) Neural
Network.

Yield: In this paper the term yield refers to the ratio 
of functional circuits to the total number of simulated 
circuits. Often yield refers to the ratio of functional 
chips to the total number of manufactured chips. 

INTRODUCTION

Over the past several decades, multilayer perceptrons
(MLPs) have achieved increased popularity among
scientists, engineers, and other professionals as tools for
knowledge representation. Unfortunately, there is no
universal architecture that is suitable for all problems.
Even with the correct architecture, frustrating problems
in training the connection weights still remain due to the
rugged nature of the energy landscape of MLPs. The
energy function often refers to the sum-of-squares error
function for conventional MLPs and the negative
log-posterior density function for Bayesian MLPs.

This article presents a Monte Carlo method that 
can be used for MLP learning. The main focus is on 
how to apply the method to train connection weights 
for MLPs. How to apply the method to choose the 
optimal architecture and to make predictions for future 
values will also be discussed, but within the Bayesian 
framework. 



BACKGROUND 

As known by many researchers, the energy landscape
of an MLP is often rugged. The gradient-based training
algorithms, such as back-propagation (Rumelhart et
al., 1986), conjugate gradient, Newton's method, and
the BFGS algorithm (Broyden, 1970, Fletcher, 1970,
Goldfarb, 1970, Shanno, 1970), tend to converge to
a local minimum near the starting point, so that the
training data are learned insufficiently. To reduce the
chance of converging to local minima, a number of
variants of these algorithms have been proposed based
on the idea of perturbation (von Lehmen et al., 1988,
Tang et al., 2003 and references therein). In practice,
the effects of these perturbations are usually limited:
they only delay the convergence of the learning process
to local minima by a reasonable number of iterations
(Ingman & Merlis, 1991).



To avoid the local-trap problem, simulated annealing
(SA) (Kirkpatrick et al., 1983) has been employed
by some authors to train neural networks. Amato et
al. (1991) and Owen & Abunawass (1993) show that
for complex learning tasks, SA has a better chance of
converging to a global minimum than the gradient-based
algorithms. Geman & Geman (1984) show that the
global minimum can be reached by SA with probability
1 if the temperature decreases at a logarithmic rate of
O(1/log t), where t denotes the number of iterations.
In practice, however, no one can afford to have such
a slow cooling schedule. Most frequently, people use
a linearly or geometrically decreasing cooling schedule,
which can no longer guarantee that the global energy
minimum will be reached (Holley et al., 1989).
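
To give a feeling for why the logarithmic schedule is unaffordable, the following minimal Python sketch compares it with a geometric schedule. The constants (`c = 1`, a starting temperature of 1, a decay rate of 0.95 and a target temperature of 0.05) are illustrative assumptions, not values from the cited papers.

```python
import math


def logarithmic_temperature(t: int, c: float = 1.0) -> float:
    """Temperature c / log(1 + t) for iteration t >= 1 (O(1/log t) decay)."""
    return c / math.log(1.0 + t)


def geometric_temperature(t: int, start: float = 1.0, rate: float = 0.95) -> float:
    """Temperature start * rate**t for iteration t >= 0."""
    return start * rate ** t


def iterations_logarithmic(target: float, c: float = 1.0) -> int:
    """Smallest t with c / log(1 + t) <= target."""
    return math.ceil(math.exp(c / target) - 1.0)


def iterations_geometric(target: float, start: float = 1.0, rate: float = 0.95) -> int:
    """Smallest t with start * rate**t <= target."""
    return math.ceil(math.log(target / start) / math.log(rate))


log_iters = iterations_logarithmic(0.05)   # hundreds of millions of iterations
geo_iters = iterations_geometric(0.05)     # a few dozen iterations
```

Under these assumptions the geometric schedule reaches the target temperature in a few dozen iterations, while the logarithmic one needs hundreds of millions; the price of the faster schedule is the loss of the convergence guarantee discussed above.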

Other stochastic algorithms that have been used
in MLP training include the genetic algorithm (Goldberg,
1989) and Markov chain Monte Carlo (MCMC).
Although the genetic algorithm works well for some
problems (see, e.g., van Rooij et al., 1996), there is no
theory to support its convergence to global minima.
MCMC algorithms are mainly used for Bayesian MLPs
(MacKay, 1992a, Neal, 1996, Müller & Insua, 1998, de
Freitas et al., 2000, Liang, 2003, 2005a, 2005b), which
will be discussed later.



MAIN FOCUS OF THE CHAPTER 

This article presents how the stochastic approximation 
Monte Carlo (SAMC) (Liang et al., 2007) algorithm 
can be used for MLP learning, including training, 
prediction and architecture selection. 

A Brief Review for the SAMC Algorithm 

Suppose that we are working with the Boltzmann
distribution,

$$p(x) = \frac{1}{Z} \exp\{-U(x)/\tau\}, \quad x \in \Omega, \qquad (1)$$

where $Z$ is the normalizing constant, $U(x)$ is the energy
function, $\tau$ is the temperature, and $\Omega$ is the sample
space. Without loss of generality, we assume that $\Omega$ is
compact. For MLPs, $x$ denotes the vector of connection
weights, and $\Omega$ can be restricted to a hyper-rectangle
$[-B_\Omega, B_\Omega]^{\dim(\Omega)}$, where $B_\Omega$ is a large
number such that $\Omega$ includes at least a global minimum of
$U(x)$. Furthermore, we assume that the sample space can be
partitioned according to the energy function into $m$
disjoint subregions: $E_1 = \{x: U(x) \le u_1\}$,
$E_2 = \{x: u_1 < U(x) \le u_2\}$, ...,
$E_{m-1} = \{x: u_{m-2} < U(x) \le u_{m-1}\}$, and
$E_m = \{x: U(x) > u_{m-1}\}$, where $u_1, \ldots, u_{m-1}$ are
pre-specified real numbers. SAMC seeks to draw samples from
each subregion with a pre-specified frequency. If this goal
can be achieved, then the local-trap problem can be
avoided successfully. Let $x_{t+1}$ denote a sample simulated
from the distribution

$$p_{\theta_t}(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{e^{\theta_{ti}}} I(x \in E_i), \qquad (2)$$

using the Metropolis-Hastings (MH) algorithm (Metropolis
et al., 1953, Hastings, 1970), where $\psi(x) = \exp\{-U(x)/\tau\}$
and $\theta_t = (\theta_{t1}, \ldots, \theta_{tm})$ is an $m$-vector in a
space $\Theta$. For simplicity, we assume that $\Theta$ is compact,
e.g., $\Theta = [-B_\Theta, B_\Theta]^{\dim(\Theta)}$ with $B_\Theta$ being
a large number. Since adding a constant to, or subtracting
a constant from, $\theta_t$ will not change $p_{\theta_t}(x)$,
$\theta_t$ can be kept in the compact set in simulations by
adjusting it with an additive constant. Let the proposal
distribution, $q(x, y)$, of the MH moves satisfy the
minorisation condition (Mengersen & Tweedie, 1996), i.e.,

$$\sup_{\theta \in \Theta} \sup_{x, y \in \Omega} \frac{p_\theta(y)}{q(x, y)} < \infty. \qquad (3)$$

Since $\Omega$ is compact, a sufficient design for the
minorisation condition is to choose $q(x, y)$ as a global
proposal distribution. A proposal distribution is said to
be global if $q(x, y) > 0$ for all $x, y \in \Omega$. For MLPs,
$q(x, y)$ can be chosen as a random walk Gaussian proposal,
$y \sim N(x, \sigma^2 I)$, where $I$ is an identity matrix and
$\sigma^2$ is calibrated such that the MH moves have a desired
acceptance rate. As discussed later, restricting the
proposal distribution to be global ensures the convergence
of the annealing SAMC algorithm to the global energy minima.



Let $\{\gamma_t\}$ be a positive non-increasing sequence
satisfying the conditions

$$\text{(i)}\ \sum_{t=0}^{\infty} \gamma_t = \infty, \qquad \text{(ii)}\ \sum_{t=0}^{\infty} \gamma_t^{\delta} < \infty,$$

for some $\delta \in (1, 2)$. For example, one can set

$$\gamma_t = \left( \frac{t_0}{\max(t_0, t)} \right)^{\eta}, \qquad (4)$$

for some values of $t_0 > 1$ and $\eta \in (\tfrac{1}{2}, 1]$.
A large value of $t_0$ will allow the sampler to reach all
subregions very quickly, even in the presence of multiple
local minima. Let $\pi = (\pi_1, \ldots, \pi_m)$ be an $m$-vector
with $0 < \pi_i < 1$ and $\sum_{i=1}^{m} \pi_i = 1$, which defines
a desired sampling frequency distribution on the
subregions. With the above notations, an iteration of SAMC
can be described as follows.

SAMC Algorithm

a. Generate $x_{t+1} \sim K_{\theta_t}(x_t, \cdot)$ with a single MH step:

   1. Generate $y$ according to the proposal distribution
      $q(x_t, y)$.

   2. Calculate the ratio

      $$r = e^{\theta_{tJ(x_t)} - \theta_{tJ(y)}} \frac{\psi(y)\, q(y, x_t)}{\psi(x_t)\, q(x_t, y)},$$

      where $J(x)$ denotes the index of the subregion that
      the sample $x$ belongs to.

   3. Accept the proposal with probability $\min(1, r)$. If
      it is accepted, set $x_{t+1} = y$; otherwise, set
      $x_{t+1} = x_t$.

b. Set $\theta^* = \theta_t + \gamma_t (e_{t+1} - \pi)$, where
   $\gamma_t$ is called the gain factor,
   $e_{t+1} = (e_{t+1,1}, \ldots, e_{t+1,m})$, and $e_{t+1,i} = 1$ if
   $x_{t+1} \in E_i$ and 0 otherwise.

c. If $\theta^* \in \Theta$, set $\theta_{t+1} = \theta^*$; otherwise,
   set $\theta_{t+1} = \theta^* + c^*$, where $c^*$ is a constant
   vector chosen such that $\theta^* + c^* \in \Theta$. The
   existence of $c^*$ is obvious, since $B_\Theta$ has been set
   to a large number and it is reasonable to assume that
   $\max_i \theta^*_i - \min_i \theta^*_i \ll B_\Theta$ holds at each
   iteration.

A remarkable feature of SAMC is its self-adjusting
mechanism. If a proposal is rejected, the weight of
the subregion that the current sample belongs to will
be adjusted to a larger value, and thus a proposal to
jump out of the current subregion will be less likely
to be rejected in the next iteration. This mechanism
effectively prevents the system from getting trapped in
local minima. This is very important for MLP training,
as its energy landscape is often rugged.
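
The iteration described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the sample space is a one-dimensional interval, the energy `double_well` and all parameter values are hypothetical, the desired frequencies are uniform, and step (c) is simplified by subtracting a common constant from the log-weights at every iteration (which, as noted in the text, does not change the sampling distribution).

```python
import math
import random


def samc(energy, cuts, x0, n_iter, tau=1.0, sigma=1.0,
         t0=100.0, eta=1.0, lower=-3.0, upper=3.0, seed=0):
    """Run a toy SAMC sampler on the interval [lower, upper]."""
    rng = random.Random(seed)
    m = len(cuts) + 1                 # number of energy subregions
    pi = [1.0 / m] * m                # desired sampling frequencies (uniform)
    theta = [0.0] * m                 # log-weights theta_t
    visits = [0] * m                  # how often each subregion was sampled

    def region(u):
        # J(x): number of cut points strictly below the energy value u
        return sum(1 for c in cuts if u > c)

    x, ux = x0, energy(x0)
    for t in range(1, n_iter + 1):
        gamma = (t0 / max(t0, t)) ** eta          # gain factor, equation (4)
        # (a) one MH step; the symmetric Gaussian proposal cancels in r
        y = rng.gauss(x, sigma)
        if lower <= y <= upper:                   # outside the sample space: reject
            uy = energy(y)
            log_r = theta[region(ux)] - theta[region(uy)] - (uy - ux) / tau
            if rng.random() < math.exp(min(0.0, log_r)):
                x, ux = y, uy
        # (b) theta* = theta_t + gamma_t * (e_{t+1} - pi)
        j = region(ux)
        theta = [th + gamma * ((i == j) - pi[i]) for i, th in enumerate(theta)]
        # (c) simplified: shift by a common constant so that max(theta) is 0
        top = max(theta)
        theta = [th - top for th in theta]
        visits[j] += 1
    return theta, visits


def double_well(x):
    """Hypothetical energy with two minima, at x = -1 and x = 1."""
    return 5.0 * (x * x - 1.0) ** 2


theta, visits = samc(double_well, cuts=[1.0, 2.0, 4.0, 8.0], x0=1.0, n_iter=5000)
```

The list `visits` records how often each energy subregion was sampled; in a well-behaved run the counts of the non-empty subregions become roughly proportional to the desired frequencies, which is the self-adjusting behaviour described above.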

SAMC falls into the category of stochastic approximation
algorithms (Robbins & Monro, 1951, Andrieu et al., 2005
and references therein). The convergence of SAMC can be
extended from a theorem presented in Liang et al. (2007).
Under mild conditions and as $t \to \infty$,

$$\theta_{ti} \to \begin{cases} C + \log\left( \int_{E_i} \psi(x)\,dx \right) - \log(\pi_i + \varpi), & E_i \neq \emptyset, \\ -\infty, & E_i = \emptyset, \end{cases} \qquad (5)$$

where $\varpi = \sum_{j \in \{i: E_i = \emptyset\}} \pi_j / (m - m_0)$,
$m_0 = \#\{i : E_i = \emptyset\}$ is the number of empty
subregions, and $C$ is an arbitrary constant. A subregion
$E_i$ is said to be empty if $\int_{E_i} \psi(x)\,dx = 0$. In SAMC,
the sample space partition can be made blindly by simply
specifying some values $u_1, \ldots, u_{m-1}$. This may result in
some empty subregions. The constant $C$ can be determined by
imposing a constraint on $\theta_t$, say, $\sum_{i=1}^{m} e^{\theta_{ti}}$
is equal to a known number. In addition, Liang (2007)
shows that $\theta_t$ can converge in the $L^2$ form at a rate of
$O(1/t)$. Let $\pi_{ti} = P(x_t \in E_i)$ be the probability of
sampling from the subregion $E_i$ at iteration $t$. Equation
(5) implies that as $t \to \infty$, $\pi_{ti}$ will converge to
$\pi_i + \varpi$ if $E_i \neq \emptyset$ and 0 otherwise. This further
implies that as the number of iterations goes to infinity,
SAMC can approximately draw samples from each of the
subregions with a pre-specified probability. With an
appropriate specification of $\pi$, sampling can be biased to
the low energy regions to increase the chance of finding
the global minimum.

Annealing SAMC for MLP Learning

In theory, SAMC is able to find the global energy
minima if the run is long enough. However, due to
the broadness of the sample space, the process may
be slow even when sampling is biased to low energy
subregions. To accelerate the search process, one can
iteratively shrink the sample space in simulations. As
argued below, this modification preserves the theoretical
property of SAMC when a global proposal distribution
is used.

Suppose that the subregions $E_1, \ldots, E_m$ have been
arranged in ascending order by energy; that is, if $i < j$
then $U(x) < U(y)$ for any $x \in E_i$ and $y \in E_j$. Let
$\kappa(u)$ denote the index of the subregion that a sample $x$
with energy $u$ belongs to. Let $\Omega_t$ denote the sample space
at iteration $t$. Annealing SAMC, which will be abbreviated
as ASAMC hereafter, starts with $\Omega_1 = \bigcup_{i=1}^{m} E_i$
and then iteratively sets

$$\Omega_t = \bigcup_{i=1}^{\kappa(U_{\min}^{(t)} + \Delta)} E_i, \qquad (6)$$

where $U_{\min}^{(t)}$ is the minimum energy value obtained by
iteration $t$, and $\Delta > 0$ is a user specified parameter. The
sample space $\Omega_t$ shrinks iteration by iteration. In this
sense, the modified algorithm is called ASAMC.
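
The shrinking rule in equation (6) amounts to a simple index test. A minimal Python sketch, assuming 0-based subregion indices and a sorted list `cuts` holding the energy cut points; the function names are hypothetical:

```python
from bisect import bisect_left


def region_index(u, cuts):
    """kappa(u): 0-based index i with cuts[i-1] < u <= cuts[i]."""
    return bisect_left(cuts, u)


def active_regions(u_min, delta, cuts):
    """Indices of the subregions that form the current sample space."""
    return list(range(region_index(u_min + delta, cuts) + 1))


def in_sample_space(u, u_min, delta, cuts):
    """True if a state with energy u lies in the current sample space."""
    return region_index(u, cuts) <= region_index(u_min + delta, cuts)


cuts = [1.0, 2.0, 3.0, 4.0]
before = active_regions(0.5, 1.0, cuts)   # best energy found so far is 0.5
after = active_regions(0.0, 1.0, cuts)    # a lower energy 0.0 has been found
```

In an ASAMC run, a proposal `y` would be rejected outright whenever `in_sample_space(energy(y), u_min, delta, cuts)` is false, and `u_min` would be updated each time a lower energy is found, so the set of active subregions can only shrink.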

Since the proposal distribution is global, the convergence
property of SAMC still holds for ASAMC on the limiting
space $\Omega_\infty = \lim_{t \to \infty} \Omega_t$, although
$\Omega_\infty$ may contain some separated regions. The existence
of $\Omega_\infty$ follows from the monotonicity of the sequence
$\Omega_1 \supseteq \Omega_2 \supseteq \cdots$. It follows from
Scheffé's theorem (Scheffé, 1947) that as $t \to \infty$, $x_t$
will converge in distribution to a random variable with
density

$$p_\infty(x) \propto \sum_{i=1}^{\kappa(U_{\min} + \Delta)} \frac{(\pi_i + \varpi)\, \psi(x)}{\int_{E_i} \psi(x)\,dx} I(x \in E_i), \qquad (7)$$

where $U_{\min}$ denotes the global minimum of the energy
function $U(x)$ and $\varpi$ is as in equation (5). Again, as in
SAMC, the convergence can be attained in the $L^2$ form at a
rate of $O(1/t)$. If we let $\Delta$ go to zero, then the ASAMC
samples will converge in distribution to the global minima
of $U(x)$.

For an effective implementation of ASAMC, several
issues need to be considered.

Sample space partitioning. Since within the same
subregion ASAMC is reduced to sampling from the
unnormalized density $\psi(x)$, we suggest that the maximum
energy difference in each subregion should be
bounded by a reasonable number, say, $2\tau$, to ensure that
the local Metropolis-Hastings moves within the same
subregion have a reasonable acceptance rate.
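
One simple way to follow this suggestion is to place equally spaced cut points whose spacing never exceeds $2\tau$. The sketch below is an illustration only; the helper `energy_cuts` and the energy range used in the example are hypothetical.

```python
import math


def energy_cuts(u_low, u_high, tau, width_in_tau=2.0):
    """Equally spaced cut points from u_low to u_high, spacing <= width_in_tau * tau."""
    width = width_in_tau * tau
    n_bins = max(1, math.ceil((u_high - u_low) / width))
    step = (u_high - u_low) / n_bins
    return [u_low + k * step for k in range(n_bins + 1)]


cuts = energy_cuts(0.0, 10.0, tau=1.0)   # spacing 2 * tau between neighbouring cuts
```

States with energy below the first cut point form the lowest subregion and states above the last cut point form the highest one, so only the interior subregions have a bounded energy range.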

Choice of $\Delta$. The performance of ASAMC depends
on the value of $\Delta$ to some extent. If $\Delta$ is too large,
ASAMC may take a long time to locate the global
minimum due to the broadness of the sample space.
If $\Delta$ is too small, ASAMC may also take a long time
to locate the global minimum. In this case, the sample
space may contain only a few separated regions, and
most of the proposed transitions will be rejected. In our
experience, a value of $\Delta$ between 5 and 10 works well
for most MLP problems.



Desired sampling distribution. The choice of $\pi$ is
not critical to the efficiency of ASAMC, since its sample
space shrinks with the iterations. In SAMC, by contrast,
$\pi$ should be chosen carefully to bias sampling to low
energy regions and so improve the ergodicity of the
simulation.

Gain factor. To estimate the integrals

$$\int_{E_1} \psi(x)\,dx, \ldots, \int_{E_m} \psi(x)\,dx$$

accurately, $\gamma_t$ should be very close to 0 at the end of
simulations. Otherwise, the resulting estimates may
have a large variation. The decreasing speed of $\gamma_t$ can
be controlled by $t_0$ and $\eta$. In practice, we often fix
$\eta = 1$ and vary the value of $t_0$ according to the
complexity of the problem. The more complex the problem is,
the larger the value of $t_0$ one should choose.
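
The effect of $t_0$ and $\eta$ on the gain factor of equation (4) can be checked numerically. A minimal Python sketch; the specific values of $t_0$ and of the run length are illustrative assumptions:

```python
def gain_factor(t, t0, eta=1.0):
    """Gain factor (t0 / max(t0, t)) ** eta of equation (4)."""
    return (t0 / max(t0, t)) ** eta


# The gain stays at 1 during the first t0 iterations and then decays.
final_small_t0 = gain_factor(10**6, t0=100.0)      # close to 0 at the end of the run
final_large_t0 = gain_factor(10**6, t0=10000.0)    # decays later, ends 100 times larger
```

A larger $t_0$ keeps the sampler in its fast-adapting phase for longer, at the cost of a larger gain factor, and hence noisier estimates, at the end of a run of fixed length.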

Convergence diagnostic. A formal diagnostic for
the convergence of ASAMC should be based on multiple
runs. A rough diagnostic for a single run can be done
by comparing the observed sampling frequencies with
the desired sampling frequencies of the different
subregions. If they match each other very well, we may
regard the run as converged. Otherwise, one may re-run
the algorithm with a larger number of iterations or a
larger value of $t_0$.
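
The rough single-run diagnostic can be sketched as follows in Python. This is an illustration under stated assumptions: subregions that were never visited are treated as empty, their desired frequency is spread evenly over the visited subregions (as in the convergence result for SAMC), and the function name is hypothetical. What counts as a good match is left to the user.

```python
def frequency_mismatch(visits, desired):
    """Largest relative gap between observed and expected sampling frequencies."""
    total = sum(visits)
    m = len(visits)
    empty = [i for i in range(m) if visits[i] == 0]
    # desired mass of the unvisited subregions, shared among the visited ones
    extra = sum(desired[i] for i in empty) / (m - len(empty))
    gaps = []
    for i in range(m):
        if visits[i] > 0:
            expected = desired[i] + extra
            gaps.append(abs(visits[i] / total - expected) / expected)
    return max(gaps)


good_run = frequency_mismatch([250, 250, 250, 250, 0], [0.2] * 5)
poor_run = frequency_mismatch([900, 100], [0.5, 0.5])
```

Here `good_run` is essentially zero, while `poor_run` signals that one subregion was sampled far more often than desired, which would suggest a longer run or a larger $t_0$.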

ASAMC has been compared in Liang (2007) with
simulated annealing, SAMC, and the BFGS algorithm
on a number of examples, including the famous N-parity
and two-spiral problems. The numerical results for
the two-spiral problem are reproduced in Table 1 and
Figure 1. Refer to Liang (2007) for the settings of the
respective algorithms. The results for the other examples
are similar. In summary, ASAMC outperforms the
other algorithms in both training and test errors. Like
other stochastic algorithms, ASAMC requires a longer
training time than the gradient-based algorithms. It
provides, however, an efficient approach to training MLPs
for which the energy landscape is rugged.

Table 1. Comparison of ASAMC, SAMC, SA and BFGS for the two-spiral problem. Notations: let $z_i$ denote the
minimum energy value obtained in the $i$th run. "Mean" $= \sum_{i=1}^{20} z_i / 20$, "SD" is the standard deviation of "Mean",
"Minimum" $= \min_{i=1}^{20} z_i$, "Maximum" $= \max_{i=1}^{20} z_i$, "Proportion" $= \#\{i : z_i < 0.2\}$, "Iteration" is the average number
of iterations performed in each run, and "Time" is the average CPU time cost by each run.

Algorithm   Mean     SD      Minimum   Maximum   Proportion   Iteration (x10^6)   Time
ASAMC        0.620   0.191    0.187      3.23        15              7.1           94m
SAMC         2.727   0.208    1.092      4.09         0             10.0          132m
SA-1        17.845   0.706    9.020     22.06         0             10.0          123m
SA-2         6.433   0.450    3.030     11.02         0             10.0          123m
BFGS        15.500   0.899   10.00      24.00         0              -              3s

Figure 1. Classification maps learned for the two-spiral problem by ASAMC with an MLP of 30 hidden units. The
black and white points show the training data for the two different spirals, respectively. (a) Classification map
learned in a run. (b) Classification map averaged over 20 runs.

Bayesian MLP Learning 

SAMC can also be used for training Bayesian MLPs.
Let $\psi(x)$ denote the posterior density of an MLP (up to
a normalizing constant), and $g_i = \lim_{t \to \infty} e^{\theta_{ti}}$.
Thus, the following density

$$\tilde{p}(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{g_i} I(x \in E_i) \qquad (8)$$

can work as a trial density for sampling from $\psi(x)$. As
a trial density, it possesses two nice properties. First,
the importance weight is bounded above by $\max_i g_i$,
assuming that $g_i$ has been normalized by an additional
constraint, e.g., $\sum_{i=1}^{m} g_i$ is a known constant.
Second, sampling from $\tilde{p}(x)$ will lead to a random walk
in the space of nonempty subregions if we regard each
subregion as a point. Hence, the whole sample space can be
well explored.

Suppose that the importance samples $(x_1, w_1), \ldots, (x_n, w_n)$
have been drawn from $\tilde{p}(x)$ using an MCMC sampler, where
$w_i$ denotes the importance weight of $x_i$. Let $f(z|x)$
denote the output of the MLP with input $z$. For a new
input $z_0$, the Bayesian point prediction is then

$$\hat{f}(z_0) = \frac{\sum_{i=1}^{n} w_i f(z_0 | x_i)}{\sum_{i=1}^{n} w_i}. \qquad (9)$$
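
Equation (9) is a weighted average of the network outputs. The sketch below is a minimal Python illustration; it assumes, as follows from the form of the trial density in equation (8), that the importance weight of a sample is the estimate $g_i$ of the subregion it falls into. The function names and the numbers in the example are hypothetical.

```python
def importance_weights(regions, g):
    """Weight of each sample: the value g_i of the subregion it belongs to."""
    return [g[j] for j in regions]


def bayes_point_prediction(weights, outputs):
    """Weighted average of MLP outputs f(z0 | x_i), as in equation (9)."""
    return sum(w * f for w, f in zip(weights, outputs)) / sum(weights)


# Two samples: one in subregion 0 (g = 1.0), one in subregion 1 (g = 3.0).
weights = importance_weights([0, 1], g=[1.0, 3.0])
prediction = bayes_point_prediction(weights, outputs=[0.0, 1.0])
```

With these numbers the second sample carries three quarters of the total weight, so the prediction is 0.75.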



Evidence Evaluation for Bayesian MLPs 

In addition to MLP learning, SAMC also provides a
convenient way of evaluating the evidence of Bayesian
MLPs. As pointed out by MacKay (1992b), the Bayesian
evidence can be used as a guideline for architecture
selection for Bayesian MLPs. Let $f(D|x)$ denote the
likelihood function of a given MLP model, and let $f(x)$
denote the prior density imposed on $x$. As before, we
suppose that $\Omega$ has been restricted to a compact set.
Define the function

$$\psi(x, k) = \begin{cases} f(D|x) f(x), & k = 1, \\ 1/|\Omega|, & k = 0, \end{cases} \qquad (10)$$

on the product space $\Omega \times \{0, 1\}$, where $|\Omega|$ denotes
the hypervolume of the space $\Omega$. Partition the product
space as follows: $E_0 = \{(x, k) : k = 0, x \in \Omega\}$,
$E_1 = \{(x, k) : k = 1, U(x) \le u_1\}$, ...,
$E_m = \{(x, k) : k = 1, U(x) > u_{m-1}\}$. If SAMC is run with
this partition, the evidence of the MLP can then be
estimated by

$$\hat{P}(D) = \frac{\sum_{i=1}^{m} (\pi_i + \varpi)\, e^{\theta_{ti}}}{(\pi_0 + \varpi)\, e^{\theta_{t0}}}\; g_0, \qquad (11)$$

where $g_0 = \int_\Omega \psi(x, 0)\,dx$, $0 < \pi_0 < 1$, and $\varpi$
is as in equation (5). We note that $\psi(x, 0)$ can be any
non-negative function with $g_0$ being analytically available.

FUTURE TRENDS 

In the future, we need to carry out a series of comparisons 
to assess the ability of SAMC in different aspects. For 
example, we need to compare SAMC with advanced 
MCMC samplers, such as parallel tempering (Geyer, 
1991) and evolutionary Monte Carlo (Liang & Wong, 
2001), to assess its ability in Bayesian prediction; and 
to compare SAMC with the Gaussian approximation 
method (MacKay, 1992b) to assess its ability in evi- 
dence evaluation. 



CONCLUSION 

This article proposes an innovative method for MLP
training, prediction, and architecture selection. The
strength of SAMC comes from its self-adjusting
mechanism, which enables it to overcome the local-trap
problem. Like simulated annealing and genetic
algorithms, SAMC does not require gradient information
about the objective function. Hence, it can be used as a
general optimization, simulation, and integration tool
in many other problems, such as combinatorial
optimization, model selection, and statistical
simulations.

KEY TERMS

Genetic Algorithm: A search heuristic used in 
computing to find true or approximate solutions to 
global optimization problems. 

Markov Chain Monte Carlo (MCMC): A class 
of algorithms for sampling from probability distribu- 
tions by simulating a Markov chain that has the desired 
distribution as its stationary distribution. The state of 
the Markov chain after a large number of steps is then 
used as a sample from the desired distribution. 

Metropolis-Hastings Algorithm: A popular MCMC
algorithm with the acceptance probability
min{1, [f(y)q(y,x)]/[f(x)q(x,y)]} for a new state y given
the current state x, where f(·) is the target distribution
and q(·,·) is the proposal distribution.

Model Evidence: The log-marginal likelihood of
the data obtained by integrating out the parameters over
the space of models. Its value expresses the preference
shown by the data for different models.

Multiple Layer Perceptron (MLP): An important
class of neural networks, which consists of a set
of source nodes that constitute the input layer, one or
more layers of computational nodes, and an output
layer of computational nodes. The input signal
propagates through the network in a forward direction,
on a layer-by-layer basis.

Simulated Annealing: A generic probabilistic
meta-algorithm used to find true or approximate solutions
to global optimization problems.

Stochastic Approximation Algorithm: A probabilistic
meta-algorithm suggested by Robbins and Monro
(1951) for solutions of regression equations.

INTRODUCTION

An Artificial Neural Network (ANN) is a computational
structure inspired by the study of biological
neural processing. Although neurons are considered
very simple computation units, the huge number of
widely interconnected neurons inside the nervous system
can process large amounts of data by working in
parallel. There are many different types of
ANNs, from relatively simple to very complex, just
as there are many theories on how biological neural
processing works. However, execution of ANNs is
always a heavy computational task. Important kinds
of ANNs are those devoted to pattern recognition, such
as Multi-Layer Perceptron (MLP), Self-Organizing
Maps (SOM) or Adaptive Resonance Theory (ART)
classifiers (Haykin, 2007).

Traditional implementations of ANNs used by
most scientists have been developed in high level
programming languages, so that they can be executed
on common Personal Computers (PCs). The main
drawback of these implementations is that, though
neural networks are intrinsically parallel systems,
simulations are executed on a Central Processing
Unit (CPU), a processor designed for the execution
of sequential programs on a Single Instruction Single
Data (SISD) basis. As a result, these heavy programs
can take hours or even days to process large input data.
For applications that require real-time processing, it
is possible to develop small ad-hoc neural networks
on specific hardware like Field Programmable Gate
Arrays (FPGAs). However, FPGA-based realization
of ANNs is somewhat expensive and involves extra
design overheads (Zhu & Sutton, 2003).

Using dedicated hardware to do machine learning
was typically expensive; results could not be shared
with other researchers and hardware became obsolete
within a few years. This situation has changed recently
with the popularization of Graphics Processing Units
(GPUs) as low-cost and high-level programmable
hardware platforms. GPUs are being increasingly
used for speeding up computations in many research
fields following a Stream Processing Model (Owens,
Luebke, Govindaraju, Harris, Krüger, Lefohn & Purcell,
2007).

This article presents a GPU-based parallel implementation
of a Fuzzy ART ANN, which can be used
both for training and testing processes. Fuzzy ART is
an unsupervised neural classifier capable of incremental
learning, widely used in a wide range of applications
such as medical sciences, economics and finance,
engineering and computer science. CPU-based
implementations of Fuzzy ART lack efficiency and cannot
be used for testing purposes in real-time applications.
The GPU implementation of Fuzzy ART presented in this
article speeds up computations more than 30 times with
respect to a CPU-based C/C++ implementation when executed
on an NVIDIA 7800 GT GPU.

BACKGROUND

Biological neural networks are able to learn and adapt
their structure based on the external or internal
information that flows through the network. Most types of
ANNs present the problem of catastrophic forgetting. Once
the network has been trained, if we want it to learn
from new inputs, it is necessary to repeat the whole
training process from the beginning. Otherwise, the
ANN would forget previously acquired knowledge. S.
Grossberg developed the Adaptive Resonance Theory
(ART) to address this problem (Grossberg, 1987). Fuzzy
ART is an extension of the original ART 1 system that
incorporates computations from fuzzy set theory into
the ART network, thus making it possible to learn
and recognize both analog and binary input patterns
(Carpenter, Grossberg & Rosen, 1991).

GPUs are being considered in many fields of
computation and some researchers have made efforts
to implement different kinds of ANNs on the GPU.
Most research has been done on implementing the
Multi-Layer Perceptron (MLP), taking advantage of the GPU
performance in matrix-matrix products (Rolfes, 2004)
(Oh & Jung 2004) (Steinkraus, Simard & Buck 2005).
Other researchers have used the GPU for Self Organizing
Maps (SOM) with great results (Luo, Liu & Wu,
2005) (Campbell, Berglund & Streit, 2005). Bernhard
et al. achieved a speed increase of between 5 and 20
times simulating large networks of Spiking Neurons
on the GPU (Bernhard & Keriven, 2006). Finally,
Martinez-Zarzuela et al. developed a generic Fuzzy
ART ANN on the GPU, achieving a speed-up higher
than 30 over a CPU (Martinez-Zarzuela, Diaz, Diez
& Anton, 2007).

Commodity graphics cards provide tremendous
computational horsepower. NVIDIA's GeForce 7800
GTX GPU is able to sustain 165 GFLOPS against the
25.6 GFLOPS theoretical peak for the SSE units of
a dual-core 3.7 GHz Intel Pentium Extreme (Owens,
Luebke, Govindaraju, Harris, Krüger, Lefohn & Purcell,
2007). The newest generation of graphics cards, like
the NVIDIA GeForce 8800 Ultra or the AMD (ATI) Radeon
HD 2900 XT, can give a peak performance higher than
500 GFLOPS and 100 GB/s peak memory bandwidth.
Graphics card manufacturers have recently discovered
high performance computing as a target market for their
products and are providing specific hardware and
software to cope with the heavy computational
requirements of enterprises and researchers.



FUZZY ART NEURAL NETWORK 
STREAM PROCESSING 

This article describes a parallel implementation of a
Fuzzy ART ANN using a stream processing model. In
this uniform parallel processing paradigm, a series of
computations, defined by one function or kernel, is
performed over an ordered set of data or stream on a
Single Instruction Multiple Data (SIMD) basis. The main
restriction of the model is also one of the reasons it can
provide large increases in performance and a simplified
programming model: operations on each stream element
are independent, allowing the execution of the kernel
on different hardware processing units simultaneously,
and avoiding stalls that could occur because of data
sharing between units.
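
The stream processing idea can be imitated on a CPU for illustration. The following minimal Python sketch is an analogy, not GPU code: a thread pool stands in for the hardware processing units, and the kernel `scale_and_clamp` and the four-component "texels" are hypothetical. The point is only that the kernel touches one stream element at a time and shares no data between elements.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(kernel, stream, workers=4):
    """Apply `kernel` independently to every element; results keep stream order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, stream))


def scale_and_clamp(texel):
    """Hypothetical kernel: double each of the four components, clamp to 1.0."""
    return tuple(min(1.0, 2.0 * c) for c in texel)


stream = [(0.1, 0.2, 0.3, 1.0), (0.6, 0.7, 0.8, 1.0)]
result = run_kernel(scale_and_clamp, stream)
```

Because no element depends on another, the same result is obtained whatever the number of workers, which is what allows a GPU to spread the elements over many processing units.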

GPUs used to have two types of programmable
processors, namely vertex and fragment processors.
Both kinds of processors were devised to operate on
four-component vectors, as the basic primitives of 3D
computer graphics are 3D vertices in projected space
(x, y, z, w) and four-component colors (red, green,
blue, alpha). Both vertex and fragment units could be
used to execute a kernel over a stream of data (Stream
Processing) and are programmed using shaders that can
be written in high level languages such as Cg (Randima
& Kilgard, 2003), GLSL or HLSL. The latest generation
of GPUs, like the NVIDIA GeForce 8800 GTX, does not
include fragment or vertex processors, but unified
Stream Processors (SPs): generalized floating-point
scalar processors capable of operating on vertices,
pixels, or any manner of data. These new GPUs can
be programmed using the CUDA (Compute Unified Device
Architecture) Toolkit from NVIDIA. CUDA is
a promising new software development solution for
programming GPUs, simplifying software development
by using the standard C language. Before CUDA was
launched, programming GPUs for General Purpose
computation (GPGPU) involved translating algorithms
into graphics terms (Harris, 2005). Other companies like
Rapidmind are developing easy-to-program APIs that
use just-in-time (JIT) compilers for translating source
code into a format that will work on several hardware
platforms (GPU, Cell or an x86 CPU). Arrays of data
can be uploaded from the CPU to the GPU memory
and stored in textures. RGBA textures can be used
to store 4 floating point values per texture unit (texel).
Data is modified along the graphics pipeline and then
written to the frame-buffer memory or rendered to a
new texture, allowing a direct feedback of the output
to the pipeline's entry.



Fuzzy ART Equations

Fuzzy ART systems are comprised of three layers or
fields of nodes. The first layer receives the input vector,
denoted by $I = (I_1, \ldots, I_M)$. Nodes in the output layer
represent the active code or category of the input pattern
being selected. For each output neuron, a choice
function $T_j$ ($j = 1 \ldots N$) is defined by:

$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|} \qquad (1)$$

where $w_j = (w_{j1}, \ldots, w_{jM})$ denotes the associated
Long-term Memory (LTM) trace, the fuzzy MIN operator
$\wedge$ is defined by $(p \wedge q)_i = \min(p_i, q_i)$ and the
norm $|\cdot|$ is defined by $|p| = \sum_{i=1}^{M} |p_i|$.

Category choice is indexed by $J$, where
$T_J = \max(T_j : j = 1 \ldots N)$, and the system enters in
resonance if the match function meets the vigilance criterion:

$$\frac{|I \wedge w_J|}{|I|} \geq \rho \qquad (2)$$

When this occurs, vector $w_J$ is updated using (3).
Otherwise, node $J$ is inhibited making $T_J = 0$. If no node
$j$ is found to meet the vigilance criterion, a new output
neuron is committed.

$$w_J^{new} = \beta (I \wedge w_J^{old}) + (1 - \beta) w_J^{old} \qquad (3)$$
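
Equations (1) to (3) can be traced on a small example. The following Python sketch is only a CPU illustration, not the GPU implementation described in this article. All function names (`fuzzy_and`, `norm`, `choice`, `match`, `update`, `present`) are hypothetical, and the sketch assumes that an uncommitted node starts with all of its weights equal to one.

```python
def fuzzy_and(p, q):
    """Component-wise fuzzy MIN operator: (p ^ q)_i = min(p_i, q_i)."""
    return [min(a, b) for a, b in zip(p, q)]


def norm(p):
    """Norm |p| = sum_i |p_i|."""
    return sum(abs(x) for x in p)


def choice(I, w, alpha):
    """Choice function, equation (1)."""
    return norm(fuzzy_and(I, w)) / (alpha + norm(w))


def match(I, w):
    """Match function, left-hand side of equation (2)."""
    return norm(fuzzy_and(I, w)) / norm(I)


def update(I, w, beta):
    """Weight update, equation (3)."""
    return [beta * m + (1.0 - beta) * old
            for m, old in zip(fuzzy_and(I, w), w)]


def present(I, weights, rho, alpha, beta):
    """Present one pattern and return the index of the category that learns it.

    `weights` is the list of committed LTM traces; it is modified in place.
    """
    # Visit committed nodes from the highest to the lowest activation.
    order = sorted(range(len(weights)),
                   key=lambda j: choice(I, weights[j], alpha),
                   reverse=True)
    for j in order:
        if match(I, weights[j]) >= rho:          # resonance
            weights[j] = update(I, weights[j], beta)
            return j
    # No committed node met the vigilance criterion: commit a new node.
    # Assumption: an uncommitted node has all weights equal to one.
    weights.append(update(I, [1.0] * len(I), beta))
    return len(weights) - 1
```

With $\rho = 0.9$, $\alpha = 0.05$ and $\beta = 1$, presenting (0.2, 0.8, 0.8, 0.2) to an empty network commits category 0. The pattern (0.25, 0.8, 0.75, 0.2) then reaches a match value of 1.95 / 2 = 0.975, so category 0 resonates and its weights shrink to (0.2, 0.8, 0.75, 0.2). The pattern (0.9, 0.1, 0.1, 0.9) only reaches 0.6 / 2 = 0.3 and commits a second category.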



Fuzzy ART Training Process on the GPU 

Learning is not a parallel but a sequential process.
Different input patterns cannot be learned at the same time,
because they would all generate different categories.
Optimization of the training process for parallel execution
must therefore focus on the search for the category that
most resembles the input pattern. Fuzzy ART implementations
on the CPU sequentially compute the activity
for every output node. Then, a sort operation is performed
to find which neuron is most fired by the
input pattern (1). If the category stored in this neuron
resembles the input pattern (2), its associated weights are
updated with the new information (3); otherwise, the
next most fired neuron must be analyzed. In a parallel
stream processing implementation, we can compute
the choice function for every output neuron (1) in a
parallel fashion. Moreover, we can obtain the match
function (2) for every node simultaneously.

In a GPU implementation, the weights $w_j$ of every
committed neuron are stored as rows in a texture W.
The input pattern $I$ is rendered to every row of a texture
I with the same dimensions as W, so that during category
choice it can be compared to every LTM trace at once,
as shown in Fig. 1a). Global operations over the
elements of a stream of data, such as calculating its
maximum or its sum, are tricky to perform on a GPU
and must be accomplished by doing several render
passes. A ping-pong technique consists in using the
output of a rendering pass as the input of the next one. In
each pass a local operation is made between neighboring
elements in a texture and the results are written
to a smaller texture. After a series of reductions, the
final result is obtained (Horn, 2005). The norms
$|I \wedge w_j|$ and $|w_j|$ are calculated using a column
reduction operation along textures W and $I \wedge W$, as
shown in Fig. 1b).

Figure 1. Training process of a Fuzzy ART ANN on a GPU
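
The multi-pass reduction described above can be emulated on the CPU to see why several passes are needed. The Python sketch below is an analogy only: lists stand in for textures, the names `reduce_pass` and `ping_pong_reduce` are hypothetical, and a non-empty input is assumed.

```python
def reduce_pass(src, op):
    """One pass: combine neighbouring pairs into a buffer of half the size."""
    dst = [op(src[i], src[i + 1]) for i in range(0, len(src) - 1, 2)]
    if len(src) % 2:               # odd length: carry the last element over
        dst.append(src[-1])
    return dst


def ping_pong_reduce(values, op):
    """Feed the output of each pass back in as the input of the next one.

    Returns the reduced value and the number of passes that were needed.
    """
    buf = list(values)
    passes = 0
    while len(buf) > 1:
        buf = reduce_pass(buf, op)
        passes += 1
    return buf[0], passes


# Sum of a 4-component vector: 2 passes.
total, n_sum = ping_pong_reduce([0.2, 0.8, 0.75, 0.2], lambda a, b: a + b)
# Maximum of 5 activations: 3 passes.
best, n_max = ping_pong_reduce([3, 1, 4, 1, 5], max)
```

Each pass only combines neighbouring elements, so every output element of a pass can be computed independently; the number of passes grows with the logarithm of the stream length.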

The use of RGBA textures allows running MIN
and SUM operations on 4-component vectors in one
clock cycle on every fragment shader unit, making the
process faster. If the dimension of the input patterns is
not a multiple of 4, unused channels of the RGBA textures
must be padded with zeros. Reduced textures are then
used to store the activity of each neuron satisfying
the match criterion on the R channel of a texture T; the
G channel is used to store the category index; the A
channel takes the value 1 if the match criterion
is satisfied and 0 otherwise; finally, channel B can be
used for storing the match rate, which can be very
useful for debugging purposes.
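
The packing of patterns into RGBA texels and the role of the zero padding can be illustrated with a short Python sketch. It is a CPU analogy with hypothetical names (`pack_rgba`, `texel_min_sum`), in which a tuple of four numbers stands in for one texel.

```python
def pack_rgba(pattern):
    """Split a pattern into 4-component texels, zero-padding the last one."""
    padded = list(pattern) + [0.0] * (-len(pattern) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def texel_min_sum(a, b):
    """Fuzzy MIN of two texels followed by the sum of the four channels."""
    return sum(min(x, y) for x, y in zip(a, b))


# A 6-component pattern needs two texels; the last two channels are padding.
texels = pack_rgba([0.1, 0.9, 0.4, 0.6, 0.3, 0.7])
```

Because all values lie in [0, 1], a padded channel contributes min(0, w) = 0 to $|I \wedge w_j|$, and it contributes nothing to $|w_j|$ as long as the weight texture is padded with zeros as well.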

The $J$th neuron is found using a row reduction
operation over texture T, in which those fragments
not satisfying the match criterion are discarded. If the
system enters in resonance, the weights of the selected
category are updated by rendering into the corresponding
sub-region of texture W. If not, the new pattern is
learned by rendering to an unused row of weights in
W according to equation (3).

Fuzzy ART Testing Process on the GPU 

The Fuzzy ART testing algorithm is easier and much
more profitable to implement on the GPU. In this
process, several input patterns can be categorized in a
parallel fashion when learning mode is switched off.
The best data configuration takes advantage of every
stream processor available on the GPU for categorizing
each pattern in several shader passes. Fig. 2 shows the
organization of the data on the GPU. In the proposed
system, for every (x, y) coordinate pair on the input
data, a pattern is stored along the z direction. A single
RGBA texture can store 4-component input vectors,
and several RGBA textures can be used to store longer
patterns. After N shader passes, where N is the number of
categories committed by the network, an output RGBA
texture containing classification information for every
pattern is obtained. Fig. 2b) shows the output after
shader pass 30.

A texture W is used for storing the weights of the
$F_2$ field neurons on the GPU. Each row stores an LTM
trace $w_j$, just as in the training implementation. Input
vector components stored in RGBA input textures are
compared with the corresponding column of weights in W.
In each shader pass, the activation of the $k$th output
neuron and the match function are computed for every input
pattern. These values are rendered into an RGBA output
texture, which is used as input for the next iteration,
again using the ping-pong technique. If the activation
in pass $k$ is bigger than the activation computed in pass
$k - 1$ and the match criterion is satisfied, then the
category index is updated on the output texture. Rendering
both the index of the selected category and the match
function to the output texture allows the expert to
visually analyze the result: different levels on channel
R represent different categories and the alpha channel
shows the level of resemblance of the input pattern to
the selected category.

Figure 2. Testing process of a Fuzzy ART ANN on a GPU:
a) classification between N categories through N shader
passes; b) output classification results after shader pass 30
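
The pass-by-pass testing scheme can be emulated on the CPU as follows. This Python sketch is an illustration under stated assumptions, not the shader code of the article: the name `classify` is hypothetical, each tuple of the output list stands in for one texel of the output texture, all values are assumed to lie in [0, 1], and an index of -1 marks patterns for which no category met the vigilance criterion.

```python
def classify(patterns, weights, rho, alpha):
    """Emulate N passes over a batch of patterns, one pass per category.

    Returns one (category_index, activation, match) tuple per pattern.
    """
    out = [(-1, 0.0, 0.0)] * len(patterns)      # initial output "texture"
    for k, w in enumerate(weights):             # pass k tests category k
        nxt = []
        for I, (idx, best_t, best_m) in zip(patterns, out):
            inter = sum(min(a, b) for a, b in zip(I, w))   # |I ^ w_k|
            t = inter / (alpha + sum(w))                   # choice function
            m = inter / sum(I)                             # match function
            if t > best_t and m >= rho:
                nxt.append((k, t, m))           # category k wins so far
            else:
                nxt.append((idx, best_t, best_m))
        out = nxt                               # ping-pong: output becomes input
    return out


weights = [[0.2, 0.8, 0.75, 0.2], [0.9, 0.1, 0.1, 0.9]]
patterns = [[0.25, 0.8, 0.75, 0.2], [0.9, 0.1, 0.1, 0.9], [0.5, 0.5, 0.5, 0.5]]
result = classify(patterns, weights, rho=0.9, alpha=0.05)
```

The inner loop over patterns has no dependencies between its iterations, which is what allows the GPU to process all patterns of a pass at once; only the loop over categories is sequential.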



EXPERIMENTAL RESULTS 

In order to measure the performance of the implementation,
several tests were run on a CPU with a self-written
C++ implementation of Fuzzy ART and on a GPU
using the previously described C++/OpenGL/Cg
implementation. Timings were taken on a 3.2 GHz Pentium
4 with 1 GB RAM and a GeForce 7800GT with 256 MB.
The performance of Fuzzy ART relies on several factors:
the length M of the input pattern $I$, the number P of
input patterns presented to the network and the number N
of committed categories. During the learning process, N
varies depending both on the degree of similarity between
patterns and on the vigilance parameter $\rho$ (2). For the
training tests, a synthetic benchmark comprised of
several sets of patterns was generated. In each set, the
length of the input vectors M and the number of expected
categories N vary (see Table 1). In order to guarantee that
N was not too influenced by M and P, a multivariate
normal distribution was used for pattern generation.
With N being the number of categories in a set of P
patterns $I_p = (a, a^c)$, $p = 1 \ldots P$, the k patterns
belonging to each category within the set were generated by
drawing each vector from a normal distribution,
$a \sim N_N(\mu, \Sigma)$, and then obtaining its complement
coding $a^c$. In vector $\mu$, the mean of every component is
selected to be in the (0,1) range, and the covariances in
the covariance matrix $\Sigma$ were set to zero. Finally, the
network parameters were chosen to be $\rho = 0.9$,
$\alpha = 0.05$ and $\beta = 1$.

Table 1 reveals that the training process takes
more time to execute on the GPU than on the CPU.
As stated before, learning is a sequential process, so
the Fuzzy ART learning algorithm cannot be rewritten for
optimal parallel execution. However, the proposed
design proved faster than a Matlab implementation
of Fuzzy ART, in which even a collection of
$50 \times 10^3$ patterns of dimension 4 takes 380 s to train.
Training performance is expected to improve in
applications where the number of committed nodes is very
large, so that fragment processors are in use for longer
periods of time.

For measuring the time taken by the testing process, 
a different collection of benchmarks was generated and 
the ANN was tested using previously stored LTM traces. 



Table 1. Times for training and testing on a CPU and on a GPU

                        TRAIN                TEST
 M   P (x10^3)    N    CPU (s)   GPU (s)    CPU (s)   GPU (s)   SPEEDUP
 4      10        15   0.0582     4.2128    0.0535    0.0014      38.5
 4      50        59   0.4606    25.2468    0.4704    0.0145      32.4
 4     100       119   1.4212    53.8550    1.4954    0.0563      26.6
 8      10         3   0.0595     4.7471    0.0545    0.0012      46.2
 8      50        50   0.5706    30.8801    0.5919    0.0157      37.8
 8     100       100   1.8734    65.3028    1.9809    0.0605      32.7
16      10        10   0.0743     6.0570    0.0702    0.0018      38.9
16      50        55   0.9131    35.1509    0.9075    0.0300      30.3
16     100       111   3.3745    70.2651    3.3425    0.1181      28.3
32      10        10   0.0961     6.2251    0.0932    0.0029      32.7
32      50        50   1.4913    35.3596    1.4449    0.0523      27.6
32     100       100   5.3758    74.8078    5.3725    0.2135      25.2
MEAN                                                              33.1

In this case, the GPU proved many times
faster than the CPU (see Table 1). In the GPU-based
testing implementation several input patterns can
be categorized in parallel, fully exploiting the GPU
streaming programming model. As shown in Table 1,
the testing process can classify 32-component patterns
among 100 different categories at a rate of
$4.68 \times 10^5$ patterns per second and 4-component
patterns among 15 different categories at a rate of
$7.14 \times 10^6$ patterns per second.
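
The rates quoted above follow directly from the test columns of Table 1. As a worked check, the short Python sketch below recomputes them; the helper names `speedup` and `throughput` are hypothetical, and the inputs are the tabulated times.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU time to GPU time for the same workload."""
    return cpu_seconds / gpu_seconds


def throughput(n_patterns, seconds):
    """Patterns classified per second."""
    return n_patterns / seconds


# M = 32, P = 100 x 10^3, N = 100: test times 5.3725 s (CPU), 0.2135 s (GPU)
print(round(speedup(5.3725, 0.2135), 1))        # 25.2
print(f"{throughput(100_000, 0.2135):.3g}")     # 4.68e+05 patterns per second

# M = 4, P = 10 x 10^3, N = 15: GPU test time 0.0014 s
print(f"{throughput(10_000, 0.0014):.3g}")      # 7.14e+06 patterns per second
```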



FUTURE TRENDS 

The described implementation of the Fuzzy ART training
algorithm on the GPU is still slower than a high-level
programmed implementation on the CPU. In the proposed
implementation, the patterns to be learned
by the network are uploaded from the CPU to the
GPU one by one, causing the GPU to stall while waiting
for new data. This represents a serious bottleneck.
Furthermore, when the number of committed categories is
not very high, the arithmetic intensity of the design is
very low, because only a limited number of operations
can be performed on the uploaded data. Future research
tasks can include the use of Pixel Buffer Objects (PBOs),
an OpenGL extension, to achieve fast asynchronous
transfers from CPU to GPU memory.



CONCLUSION 

A GPU implementation of a Fuzzy ART Neural Network
following a stream processing model was introduced in
this paper. This design successfully addresses the problem
of integrating both training and testing processes on a
commodity graphics card.

The Fuzzy ART testing process runs up to 46 times faster
on the GPU than on a CPU, allowing its use in
real-time applications that involve pattern recognition
and decision making. The training process, though, is still
slower on the GPU than on the CPU.

GPUs are evolving quickly, and every 6-9 months
a new generation of improved processors is made
publicly available. Forward compatibility of the
presented implementation with future hardware releases is
guaranteed and greater performance can be expected
with newer cards.



KEY TERMS

ART (Adaptive Resonance Theory): Learning
theory developed by S. Grossberg that is used in
competitive neural systems and includes short-term-memory
(STM) and long-term-memory (LTM) processes.

Fuzzy ART: Evolution of the ART1 neural network
capable of learning normalized analog input patterns
in an unsupervised way through the use of fuzzy
operators.

Fuzzy Logic: Mathematical method derived
from fuzzy set theory, which allows the partial
membership of elements in a set and deals with
approximate reasoning rather than the exact deduction
of classical logic.

GPGPU (General-Purpose computation on
GPUs): A recent trend in computer science consisting
in the use of the Graphics Processing Unit (GPU) for
expensive computational tasks rather than just
computer graphics.

GPU (Graphics Processing Unit): A dedicated
graphics rendering device very efficient at manipulating
and displaying computer graphics, thanks to its highly
parallel structure.

Neural Classifier: An artificial neural network
used to identify input patterns as members of a
predefined class (supervised classification) or as members
of an unknown class (unsupervised classification).

Stream Processing: A paradigm for the execution
of parallel processing operations that exploits
data-level parallelism rather than task-level parallelism
and can provide large performance gains with a
simplified programming model.



INTRODUCTION

This article presents a real-time Fuzzy ART neural
classifier for skin segmentation implemented on a Graphics
Processing Unit (GPU). GPUs have evolved into
powerful programmable processors and are increasingly
used in time-dependent research fields such as dynamics
simulation, database management, computer vision
or image processing. GPUs are designed following a
Stream Processing Model, and each new generation of
commodity graphics cards incorporates more
powerful and flexible GPUs (Owens, 2005).

In recent years General Purpose GPU (GPGPU)
computing has become established as a well-accepted
application acceleration technique. The GPGPU phenomenon
belongs to larger research areas: homogeneous and
heterogeneous multi-core computing. Research in these
fields is driven by factors such as Moore's Gap. Today's
uni-processors follow a 90/10 rule, where 90 percent
of the processor is passive and 10 percent is doing
active work. By contrast, multi-core processors try to
follow the same general rule but with 10 percent
passive and 90 percent active processors when working at
full throughput. Single-processor Central Processing
Units (CPUs) were designed for executing general
purpose programs comprised of sequential instructions
operating on single data. Designers tried to optimize
complex control requirements with minimum latency,
so many transistors in the chip are devoted to branch
prediction, out-of-order execution and caching.



In the article Stream Processing of a Neural
Classifier I, several terms and concepts related to GPGPU
were introduced, and a detailed description was given of
the Fuzzy ART ANN implementation on a commodity graphics
card, exploiting the GPU's parallelism and vector
capabilities. In this article, the aforementioned
GPU implementation of Fuzzy ART is
configured for robust real-time skin recognition. Both
learning and testing processes are done on the GPU
using chrominance components in TSL (Tint, Saturation
and Luminance) color space. The Fuzzy ART ANN
implementation recognizes skin tone pixels at a rate of
270 fps on an NVIDIA GeForce 7800GTX GPU.



BACKGROUND 

The detection of human body parts has important
applications as a first step in many high-level computer
vision tasks such as personal identification, video
indexing systems and Human-Machine Interfaces (HMI).
HMI needs real-time video processing while consuming as
few system resources as possible. Skin color is widely used
as a cue for detecting and tracking targets containing
skin, such as faces and hands in an image. The final
objective of skin color detection is to build a decision
rule that efficiently separates skin from non-skin pixels
in an image. The simplest solution defines skin colors as
those whose coordinates in a color space fall within a
certain range of values. OpenVidia was one of the first
computer-vision oriented developments able to run skin
tone segmentation on the GPU (Fung, 2005). For this
purpose OpenVidia uses RGB (Red, Green and Blue)
to HSV (Hue, Saturation and Value) color conversion
and threshold filtering.

Statistical approaches for skin segmentation are 
based on the assumption that skin colors follow a certain 
distribution which can be estimated. These approaches 
normally make use of the chrominance components in 
a color space, thresholds and tunable parameters. 

Neural network approaches have been proposed to
learn the skin color distribution. Karlekar et al. used an
MLP neural network to classify pixels into skin and
non-skin colors (Karlekar & Desai, 1999). More complex
models have been proposed to deal with changing conditions,
such as varying illumination in the images. Sahbi et
al. used an ANN for coarse level skin detection, and
then the areas found were subjected to Gaussian color
modeling with a fuzzy clustering approach (Sahbi
& Boujemaa, 2000). Martinez-Zarzuela et al. used a
GPU-based Fuzzy ART ANN implementation to learn
skin colors in TSL (Tint, Saturation and Luminance)
color space (Martinez-Zarzuela, Diaz, Gonzalez, Diez
& Anton, 2007). In their system, the Fuzzy ART
categorization process takes advantage of every fragment
processor available in the GPU, so that several pixels
can be tested simultaneously by the network, allowing
recognition at high frame rates.

Other researchers have worked on
integrating different kinds of ANNs on the GPU to
speed up specific applications. Oh et al. developed a
GPU-based MLP for text area classification in an image,
achieving an almost 20-fold speedup over a CPU (Oh &
Jung, 2004). Luo et al. implemented an MLP on the GPU
for real-time ball recognition and tracking in a soccer
robot contest (Luo, Liu & Wu, 2005). Steinkraus et al.
proposed using graphics cards for OCR and on-line
handwriting recognition (Steinkraus, Simard & Buck,
2005). Finally, Bernhard et al. developed two image
segmentation algorithms using spiking neural networks
on the GPU (Bernhard & Keriven, 2006).



STREAM PROCESSING FOR
ANN-BASED SKIN RECOGNITION

TSL Color Space

Color filtering is a powerful tool in computer vision
applications including the detection and tracking of human
body parts. Color processing has low computational
cost and is robust against geometrical transformations
(e.g. rotation, scaling, translation and shape changes).
However, factors such as non-idealities in color cameras
and illumination conditions can spoil the performance
of filtering-based applications.

Color can be decomposed into three different
components, one luminance and two chrominance
components. Several studies have shown that skin
colors have a certain invariance regarding chrominance
components. Skin tone and lighting mainly affect the
luminance value (Hsieh, Fan & Lin, 2005).

Different color spaces separating chrominance and
luminance components have been used for skin color
segmentation: YIQ, YCbCr, CIE-Lab, CIE-Luv, HSV,
IHS and TSL (Phung, Bouzerdoum & Chai, 2005). In
TSL color space (Terrillon, David & Akamatsu, 1998),
a color is specified in terms of Tint (T), Saturation (S)
and Luminance (L) values. TSL has been selected as
the best color space to extract skin color from complex
backgrounds (Duan-sheng & Zheng-kai, 2003) because
it has the advantage of extracting a given color robustly
while minimizing illumination influence. The equations
to obtain the T, S and L components in normalized
TSL space are:

$$T = \frac{1}{2\pi} \arctan\left(\frac{r'}{g'}\right) + \frac{1}{2}, \qquad (1)$$

$$S = \sqrt{\frac{9}{5}\left(r'^2 + g'^2\right)}, \qquad (2)$$

$$L = 0.299R + 0.587G + 0.114B, \qquad (3)$$

where $r' = (r - 1/3)$ and $g' = (g - 1/3)$, with $r$ and $g$
being the chrominance components of the normalized rgb
color model. The values of T, S and L are normalized
in the range [0,1]. For R = G = B (achromatic colors),
T = 5/8 and S = 0 are taken.
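
As an illustration of the TSL conversion, the following Python sketch converts one RGB pixel. It assumes the forms $T = \arctan(r'/g') / 2\pi + 1/2$ and $S = \sqrt{9/5\,(r'^2 + g'^2)}$ for the tint and saturation, input components in [0, 1], and normalized chrominances $r = R/(R+G+B)$ and $g = G/(R+G+B)$. The function name `rgb_to_tsl` is hypothetical, and returning T = 0 when $g' = 0$ is an assumption of this sketch that the text does not state.

```python
import math


def rgb_to_tsl(R, G, B):
    """Convert an RGB pixel (components in [0, 1]) to normalized (T, S, L)."""
    L = 0.299 * R + 0.587 * G + 0.114 * B
    if R == G == B:
        return 5.0 / 8.0, 0.0, L            # achromatic convention from the text
    total = R + G + B
    r_p = R / total - 1.0 / 3.0             # r' = r - 1/3
    g_p = G / total - 1.0 / 3.0             # g' = g - 1/3
    if g_p == 0.0:
        T = 0.0                             # assumption: arctan(r'/g') undefined
    else:
        T = math.atan(r_p / g_p) / (2.0 * math.pi) + 0.5
    S = math.sqrt(9.0 / 5.0 * (r_p ** 2 + g_p ** 2))
    return T, S, L


T, S, L = rgb_to_tsl(0.8, 0.5, 0.4)         # a reddish, skin-like colour
```

Under these assumptions the achromatic value T = 5/8 is also what the tint formula yields whenever $r' = g'$, since $\arctan(1)/2\pi + 1/2 = 5/8$, and a pure red pixel reaches the maximum saturation S = 1.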



Fuzzy ART Off-Line Training on the GPU 
for Skin Recognition 

Adaptive Resonance Theory (ART) systems are
comprised of three layers or fields of nodes. Fuzzy ART
is an extension of the original ART 1 system that
incorporates computations from fuzzy set theory into
the ART network, thus making it possible to learn
and recognize both analog and binary input patterns
(Carpenter, Grossberg & Rosen, 1991). The first field $F_0$
represents the input pattern; the upper field $F_2$ represents
the active code or category of the input pattern being
selected; the middle layer $F_1$ receives both bottom-up
inputs from $F_0$ and top-down inputs from $F_2$. The $F_0$
activity vector is denoted by $I = (I_1, \ldots, I_M)$, where each
component $I_i$ is within the [0,1] interval. A useful rule
for avoiding proliferation of categories is complement
coding. If $a$ represents the on-response of the pattern,
each component of the off-response $a^c$ is defined as
$a_i^c = 1 - a_i$. Then the complement coded input becomes
$I = (a, a^c) = (a_1, \ldots, a_M, a_1^c, \ldots, a_M^c)$ and
$|I| = M$ for every input pattern.

In order to train the ANN for skin recognition,
complement coded TS features can be chosen, so that input
patterns are defined as $I = (a, a^c) = (T, S, 1 - T, 1 - S)$.
This way, in a GPU implementation, each feature vector can
be stored using a single texel in an RGBA texture.
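
Complement coding of the TS features can be sketched in a few lines of Python. The function name `complement_code` is hypothetical and the feature values used below are arbitrary examples, not measured skin colors.

```python
def complement_code(a):
    """Return I = (a, a^c), where each a^c_i = 1 - a_i."""
    return list(a) + [1.0 - x for x in a]


# Example (T, S) pair; the values are arbitrary.
I = complement_code([0.3, 0.6])     # -> (T, S, 1 - T, 1 - S)
norm_I = sum(I)                     # |I| = M = 2 for any (T, S) in [0, 1]
```

Whatever the values of T and S, each component and its complement add up to one, so the norm of the coded vector equals the number of original features and the four components fit exactly into one RGBA texel.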

Each node of the $F_2$ field has an associated weight
vector or Long-term Memory (LTM) trace
$w_j = (w_{j1}, \ldots, w_{jM})$, which subsumes information from
both bottom-up and top-down weight vectors. Initially,
all weights are set to one, and each category is said to
be uncommitted. When a category is first selected it
becomes committed and the corresponding node in $F_2$
re-adapts its associated weights $w_j$. For each input $I$
and $F_2$ node $j$, the choice function $T_j$ is defined by:

$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|} \qquad (4)$$

where the fuzzy MIN operator $\wedge$ is defined by
$(p \wedge q)_i = \min(p_i, q_i)$ and the norm $|\cdot|$ is
defined by $|p| = \sum_{i=1}^{M} |p_i|$.
The system is said to make a category choice when at
least one $F_2$ node becomes active when an input pattern
is presented at the $F_0$ entrance.

The category choice is indexed by $J$, where
$T_J = \max(T_j : j = 1 \ldots N)$. Then, $w_J$ is said to be a
fuzzy subset of $I$ and it is fed down from $F_2$ in order to
measure its resemblance to the input pattern $I$. The system
enters in resonance if the match function meets the vigilance
criterion:

$$\frac{|I \wedge w_J|}{|I|} \geq \rho \qquad (5)$$

Fuzzy ART implementations on the CPU sequentially
compute the activity for every node in field $F_2$
(4). Then, a sort operation is executed in order to find
which neuron is most fired by the input pattern. If the
category stored in this neuron sufficiently resembles the
input pattern (5), its associated weights are updated
with the new information; otherwise, the next most fired
neuron must be analyzed. Fuzzy ART implementations
following a stream programming model can compute the
activity of every output neuron simultaneously. Moreover,
on a GPU it is possible to take advantage of processing
units devised to operate on vector data, and thus to select
in a single step the most fired neuron whose match rate is
bigger than the vigilance parameter $\rho$. By using complement
coding we drastically reduce the proliferation of categories
and force $|I|$ to be constant ($|I| = M = 2$) for every input
pattern. This also avoids extra computation
when calculating the match rate (5). If the vigilance
criterion is met and training is switched on, vector $w_J$
must be updated using:

$$w_J^{new} = \beta (I \wedge w_J^{old}) + (1 - \beta) w_J^{old} \qquad (6)$$



In a GPU implementation of a Fuzzy ART ANN
devised for skin recognition, LTM traces have 4
components and can be stored in a one-dimensional
RGBA texture W. This texture should be long enough
to contain as many categories as could be committed
during the training process. However, only the first N
texels, which contain information from committed neurons,
must participate in the training process when computing
$T_j$. This can be done on the GPU using scissoring, which
allows rendering a quad of dimensions 1xN that does
not cover the whole texture. Scissoring can also be used
for updating just those texels that should change
during the training process (6).

Training patterns can be extracted from images
containing skin regions. For the experimental results shown
in this paper, skin regions were carefully selected from
the 3056-image Faces96 database (Spacek, 1996). The skin
color distribution was estimated as a normal distribution
through the Minimum Covariance Determinant
(MCD) estimator (Rousseeuw & Driessen, 1999), and
a total of 671438 input vectors were selected to train
the ANN depending on their Mahalanobis distance to
the mean color of the modeled distribution. The ANN
was trained fixing the $\alpha$ parameter to 0.001 and varying
the vigilance parameter $\rho$ in different training tests.
Table 1 shows the number of committed categories depending
on the value of $\rho$.

Table 1. Number of committed categories varying $\rho$

$\rho$   0.90   0.93   0.95   0.97
N        15

Figure 1. Skin regions belonging to different committed
categories varying $\rho$: a) RGB; b) $\rho$ = 0.90;
c) $\rho$ = 0.93; d) $\rho$ = 0.95; e) $\rho$ = 0.97
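
The selection of training vectors by their Mahalanobis distance to the modeled skin color distribution could be sketched as follows. This Python example is hypothetical in several respects: it works on two-dimensional (T, S) vectors, it takes the mean and covariance as given instead of estimating them with MCD, and the distance threshold `max_dist` and the numeric values are placeholders, since the text does not report them.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance for 2-D vectors and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def select_training(samples, mean, cov, max_dist):
    """Keep the samples whose Mahalanobis distance is at most max_dist."""
    limit = max_dist ** 2
    return [s for s in samples if mahalanobis_sq(s, mean, cov) <= limit]


# Placeholder statistics, not taken from the Faces96 experiment.
mean = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))
kept = select_training([(0.0, 0.0), (3.0, 0.0), (0.0, 0.5)], mean, cov, 1.0)
```

With the placeholder covariance above, the sample (3, 0) lies 1.5 standard deviations from the mean along the first axis and is rejected, while the other two samples are kept.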

The larger the value of the vigilance parameter,
the larger the number of skin categories committed by
the network, as the level of resemblance between patterns
belonging to different categories increases. Figures
1b) to 1e) show the different regions identified as
skin categories by the network with $\rho$ increasing from
0.90 to 0.97.

Fuzzy ART Real-Time Skin Recognition 
on the GPU 

Once the ANN has been trained, the computed LTM traces
contain all the information that is needed for skin
recognition. Video sequences to be processed
can be acquired using a conventional USB webcam, and
every new frame can be uploaded to the GPU memory
and stored in an RGBA texture. Then, a shader can
be used to convert (R,G,B,A) color space pixels into
(T,S,1-T,1-S) feature vectors, which will be the input
for the Fuzzy ART ANN.

During skin recognition, several input patterns can
be categorized in a parallel fashion using every fragment
processor available on the GPU. Category choice
occurs through the execution of a shader N times,
where N is the number of categories in field $F_2$. In each
pass, the activation of the $j$th output neuron (4) and the
match rate (5) are computed for every input pattern and
rendered into an RGBA output texture, which will also
contain the category index associated to each pattern.
This RGBA texture and the texture containing the feature
vectors are used as inputs for the next iteration, using
the ping-pong technique. If the activation in pass $j$ is
bigger than the activation computed in pass $j - 1$ and
the match criterion is satisfied, then the category index
is updated in the output texture. Finally, a post-processing
stage can be used to generate an image where
those pixels not belonging to any skin category are not
rendered to the screen. Fig. 2 shows a global scheme
of the system and the evolution of the skin recognition
process through different shader passes.

Rendering both the index of the selected category
and the match rate to the output texture is useful for
analyzing the results. Different gray levels on
channel R represent different skin categories committed
during the training process; on channel A, a value in
the range [0,1] represents the level of resemblance of
every pixel in the original image to the selected skin
category.

Figure 3 shows two images categorized by the ANN
using different $\rho$ values. As $\rho$ increases, both the hit
rate and the false alarm rate decrease. With $\rho$ = 0.90
almost every skin pixel is correctly recognized, but several
non-skin pixels (e.g. from the purple glasses) are included
in some skin category by the network. With $\rho$ = 0.97 these
pixels are correctly rejected as non-skin.
Table 2 shows the performance of the system for
different resolutions running on a dual-core 3.2 GHz
Pentium 4 with 1 GB RAM, a GeForce 7800GTX 256
MB GPU (containing 24 fragment processors) and a
generic webcam able to capture up to 90 fps at
resolutions of 640x480 pixels. As the value of $\rho$ and the
resolution increase, the frame rate decreases. The number
of frames that can be processed by the network strongly
depends on the number of input vectors and the number of
committed categories every pixel has to be tested against.
The best performance is 270 fps, for a resolution of
320x240 pixels and $\rho$ = 0.90.

Figure 2. Global system architecture

Figure 3. Skin recognition performance varying $\rho$:
(a) RGB; (b) $\rho$ = 0.90; (c) $\rho$ = 0.95;
(d) RGB; (e) $\rho$ = 0.93; (f) $\rho$ = 0.97

Table 2. Frames per second for different resolutions and $\rho$ values

Resolution   320x240 pixels       352x288 pixels       640x480 pixels
$\rho$       0.90   0.95   0.97   0.90   0.95   0.97   0.90   0.95   0.97
fps          270    89     42     212    68     32     71     23     11



FUTURE TRENDS

The implementation of the GPU-based skin
recognition system described in this article was developed
using a combined C++ / OpenGL (Shreiner, Woo, Neider
& Davis, 2005) / Cg solution (Randima & Kilgard,
2003), and the algorithm had to be translated into
graphics terms so that it could be mapped to the GPU
(Harris, 2005). However, newer graphics cards from
NVIDIA can be programmed using the CUDA (Compute
Unified Device Architecture) software development
kit. Before CUDA was available, GPGPU required
the use of a graphics API, which presents the wrong
abstraction for general-purpose parallel computation,
making GPGPU applications difficult to write, debug,
and optimize. CUDA enables direct implementation
of parallel computations in the C language using an
API designed for general-purpose computation. It also
includes standard FFT and BLAS libraries that will
help researchers from different areas to exploit the
computational performance of GPUs.



CONCLUSION 

An implementation of a GPU-based Fuzzy ART Neural
Network for real-time skin recognition was introduced
in this paper. This design successfully addresses the problem
of using a neural network for pattern classification when
time is a major requirement. A robust and complete set
of skin colors and a good selection of input features
(chrominance components of TSL color space) are
necessary to train the network so that it can recognize
skin under real, changing conditions.

Experimental results show that the system reaches up to
270 fps on an NVIDIA GeForce 7800GTX GPU
video card, which includes 24 fragment processors in
the pipeline. Fuzzy ART skin recognition on the GPU
can be the first stage in a complex computer vision
application, such as a human-machine interface or a video
surveillance system.
KEY TERMS

ART (Adaptive Resonance Theory): Learning 
theory developed by S. Grossberg that is used in com- 
petitive neural systems and includes short-term-memory 
(STM) and long-term-memory (LTM) processes. 



CUDA: A GPGPU technology that allows a programmer
to use the C programming language to code
algorithms for execution on the GPU. CUDA requires an
NVIDIA GPU and special stream processing drivers.

Fuzzy ART: Evolution of the ART1 neural network 
capable of learning normalized analog input patterns 
in an unsupervised way through the use of fuzzy op- 
erators. 

GPGPU (General-Purpose Computation on 
GPUs): A recent trend in computer science consisting 
in the use of the Graphics Processing Unit (GPU), for 
doing expensive computational tasks rather than just 
computer graphics. 

GPU (Graphics Processing Unit): A dedicated 
graphics rendering device very efficient at manipulating 
and displaying computer graphics, thanks to its highly 
parallel structure. 

Heterogeneous Multi-Core Computing: Design 
and analysis of algorithms and applications for hetero- 
geneous multi-core processor architectures (e.g. IBM 
Cell processor). 

Homogeneous Multi-Core Computing: Design 
and analysis of algorithms and applications for ho- 
mogeneous multi-core processor architectures (e.g. 
GPUs). 

Moore's Gap: Refers to the relatively modest 
incremental performance gains brought about by the 
increased number of transistors on current uni-proces- 
sor dies despite increases in clock speeds. 

Stream Processing: A paradigm for the execution 
of parallel processing operations exploiting data- 
level parallelism rather than task-level parallelism 
that provides incredible performance with minimal 
programming effort. 

TSL Color Space: Color space based on the Intensity
Hue Saturation (IHS) color model. A color in this space
is specified by Tint (T), Saturation (S) and Luminance
(L) values.
INTRODUCTION

The performance of genetic algorithms (GAs) is
determined by several factors. Not only do the genetic
operators affect the performance of a GA to varying
degrees, but the parameter settings for these operators
also interact with each other in a complicated manner
in influencing a GA's performance. Though many
studies have been conducted on this issue, they have
failed to converge on consistent conclusions regarding
the importance of different genetic operators and their
parameter settings for the performance of GAs. Actually,
optimizing the combinations of different strategies
and parameters for different problem types is an
NP-complete problem in itself, and is still an open
research problem for GAs (Mitchell, 1996).

Recognizing the intrinsic difficulties in finding uni- 
versally optimal parameter configurations for different 
classes of problems, we advocate the experience-based 
approach to discovering generalized guiding rules for 
different problem domains. To this end, it is necessary 
for us to gain a better understanding about how differ- 
ent genetic operators and their parameter combinations 
affect a GA's behavior. In this research, we systemati- 
cally investigate, through a series of experiments, the 
effect of GA operators and the interaction among GA 
operators on the performance of the GA-based batch 
selection system as proposed in Deng (2007). This 
paper intends to serve as an initial inquiry into the 
research of useful design guidelines for configuring 
GA-based systems. 



PARAMETER CONFIGURATION FOR 
GENETIC OPERATORS 

It is commonly believed that crossover is the major
operator of GAs, with mutation preventing the population
from converging early to a certain solution before
an extensive exploration of other candidate solutions
is made (Holland, 1992a). Crossover enables GAs
to focus on the most promising regions in a solution
space; however, mutation alone does not advance the
search for a solution. Crossover is also a more robust
constructor of new candidate solutions than mutation
(Spears, 1993).

However, Muhlenbein (1992) argues that the power
of mutation has been underestimated in traditional
GAs. According to Mitchell (1996), it is not a choice
between crossover and mutation but rather the balance
among crossover, mutation, and other factors, such as
selection, that is all important. The correct balance also
depends upon the details of the fitness function and the
encoding. Furthermore, crossover and mutation vary in
relative usefulness over the course of a run. Actually,
the theoretical analysis of crossover is still to a large
extent an open problem (Back et al., 1997).

In addition to the GA operators, the population size 
also affects the performance of GAs. The specification 
of the population size affects the diversification of the 
population body and the implicit parallelism of a GA, 
and will thus affect the quality of the generated solu- 
tions and the performance of the solution-generating 
process. Choosing an appropriate population size for a 
GA is a necessary but difficult task for GA users. Usu- 
ally, the parameter settings for most GA applications 
are based on De Jong's recommendations (De Jong, 
1975). According to De Jong's experiments with five 
problems in function minimization, the best population 
size was 50~100, the best crossover rate was about 0.6, 
and the best mutation rate was 0.001. In a later study, 
Spears & De Jong (1991) suggested a wider range for 
the crossover rate as 0.5~0.8. Mitchell (1996) also 
observed that it was common in GA applications to 
set crossover rate at 0.7~0.8. 

However, Schaffer et al. (1991) asserted that the
best settings for population size, crossover rate, and
mutation rate were independent of the problems. In
their study of a small set of numerical optimization
problems, a very small population of size 20~30 with
a large crossover rate ranging from 0.75 to 0.95, and
with a very small mutation rate ranging from 0.005 to
0.01, produced the best performance. Grefenstette
(1993) also reached similar conclusions in his study of
parameter optimization for GAs, and suggested the
following settings: population size 30, crossover rate 0.95,
and mutation rate 0.01. While Schaffer et al. (1991) and
Grefenstette (1993) advocated a very small population
size, Goldberg (1989) and Liao & Sun (2001) argued
for a much larger population size.

From the above discussion, the diverse recommenda- 
tions on population size seem to indicate that population 
size interacts with some other factors not included in 
the previous research. In this paper, we investigated 
the effect of interaction among different parameters 
on a GA's performance. However, the choice of muta- 
tion rate needs to take into account, at least, the task 
complexity of an application. According to Mitchell 
(1996), it is impossible to specify an optimal setting 
for parameters in all different applications. 



EXPERIMENTAL DESIGN 

We focus mainly on investigating the effects of different
combinations of parameter settings for genetic
operators on our GA's performance, and compare our
results with the claims made by previous research. The
factors and parameter settings examined in this study
are discussed below:

Task complexity: Since the length of the solution
string is usually a function of the complexity of the
problem, we experiment with different parameter
settings for two batch selection tasks of different
complexity levels. One task has 30 products to
be manufactured, 10 available tools, and 8 available
machines. The other, less complex task has
12 products, 6 available tools, and 4 available
machines.

Representation scheme: In this paper, we consider
a common situation in FMSs in which, if a
product is selected in a batch for manufacturing,
the entire quantity specified in the production
table must be produced in the shift. Under this
assumption, our batch selection task becomes
a pseudo-Boolean optimization problem. This
enables us to use a single binary bit to represent
a component in a candidate solution. Therefore,
each candidate solution to the batch selection task
can be encoded as a binary string of fixed length
P, where P is the cardinality of the entire set of
products under consideration.

Population size: If the population size is too
small, the GA will converge too quickly to find the
optimal solution; however, if the population size
is too large, the computation cost will be prohibitive.
In this research, we investigate the effect of
population sizes 10, 100, and 200, representing
Small, Medium, and Large, on generating solutions
for our batch selection problem.

Selection strategy: We adopted the elitism strategy
so that the best candidate solutions at each
generation could be retained for the next generation.
Though elitism is used to prevent the elite
solution strings in a population from being altered
by crossover or mutation, retaining too many elite
individuals might cause the domination of the
entire population by suboptimal, though highly fit,
solution strings. This might eventually lead to
degeneration of the population. The usual practice
is to retain a small number of elite candidate
solutions (Goldberg, 1989). In this research, our
system preserves the two fittest candidate solutions
on each iteration of forming a new population.

Crossover parameter: We adopt the standard
crossover operator, i.e., the one-point crossover.
The crossover rate is the probability that the
crossover operator will be applied to a pair of
candidate solutions selected for reproduction.
In order to re-examine the different claims by
previous research on the importance of different
crossover rates, we experiment with three different
crossover rates: 0.1, 0.5, and 0.9, representing
three levels: Low, Medium, and High.

Mutation parameter: The mutation rate
is used to control the rate of diversification
via probabilistic conversion of each bit value
in a candidate solution. However, a mutation
rate approaching 1 will theoretically lead to a
completely stochastic search with no succession
from generation to generation. The usual practice
is to apply an occasional mutation to make
a random change in the elements of a solution
string. There are also various conclusions from
previous research regarding the mutation rate. In
this research we experiment with three different
mutation rates: 0.001, 0.01, and 0.5, representing
three levels: Low, Medium, and High.

Termination criteria: A termination criterion can
be a specified maximum number of generations,
a target objective function value, a convergence
threshold, or a lack of improvement in the best
solution over a specified number of generations.
In this research, our system terminates when
there is no improvement in the best solution over
50 consecutive generations.
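
The design choices listed above can be illustrated with a minimal Python sketch. It is not the batch selection system of Deng (2007): the fitness function, the capacity constraint and the binary tournament used to pick parents are hypothetical stand-ins chosen only to make the example runnable. What it does take from the text is the binary string encoding, one-point crossover applied with a crossover rate, bit-wise mutation, retention of the two fittest strings, and termination after 50 generations without improvement.

```python
import random

# Hypothetical toy problem (NOT the batch selection task of the article):
# pick items with maximal total value subject to a capacity limit.
VALUES = [6, 5, 8, 9, 6, 7, 3, 4, 5, 8, 2, 7]
WEIGHTS = [2, 3, 6, 7, 5, 9, 4, 3, 4, 6, 1, 5]
CAPACITY = 25


def toy_fitness(bits):
    """Total value of the selected items; infeasible strings score 0."""
    weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
    if weight > CAPACITY:
        return 0
    return sum(v for v, b in zip(VALUES, bits) if b)


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(fitness, length, pop_size=100, crossover_rate=0.9,
           mutation_rate=0.01, n_elite=2, patience=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    stale = 0  # generations without improvement of the best solution
    while stale < patience:
        ranked = sorted(pop, key=fitness, reverse=True)
        new_pop = [ind[:] for ind in ranked[:n_elite]]  # elitism
        while len(new_pop) < pop_size:
            # Assumption: binary tournament to choose each parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            new_pop.append(mutate(c1, mutation_rate, rng))
            if len(new_pop) < pop_size:
                new_pop.append(mutate(c2, mutation_rate, rng))
        pop = new_pop
        champion = max(pop, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0
        else:
            stale += 1
    return best


if __name__ == "__main__":
    solution = run_ga(toy_fitness, length=12, pop_size=20)
    print(solution, toy_fitness(solution))
```

Varying `pop_size`, `crossover_rate` and `mutation_rate` over the three levels given above reproduces the structure of the experiment, though not its results, since the toy fitness function differs from the one used in the study.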



PERFORMANCE ANALYSIS FOR 
COMPUTATIONAL EXPERIMENT 

The optimal-batch search process is conducted for each
parameter combination until 50 feasible solutions are
generated. We experimented with the combinations
of three population sizes, three levels of the crossover
rate, and three levels of the mutation rate for two tasks,
i.e., 3 x 3 x 3 x 2 = 54 parameter combinations.
Altogether, we conducted 4381 simulation runs to
generate 2700 feasible solutions (54 combinations x 50
solutions). Performance analysis for each parameter
setting is discussed for each of the two tasks under study.

Performance Analysis for the Task with 
Higher Complexity 

The result of our experiment for the higher-complexity
task is shown in Table 1. As suggested by Mitchell
(1996), a GA's behavior is better understood and
described by macroscopic statistics, such as mean
fitness in the population. Therefore, we report the
average performance and standard deviation in Table
1. The average performance of each parameter
combination is obtained by averaging the best results over
50 feasible solutions.



From Table 1, we find that for all combinations
of crossover rates and mutation rates,
the average performance for Pop Size = 200 is always
the best. This implies that there is no strong interaction
among the three parameters. In addition, the standard
deviation column indicates that when the population
size is larger, the fluctuation of performance across
different runs of generating optimal solutions tends to be
smaller. Though populations of size 200 yield
the best performance in our experiment, the number of
simulation runs needed to generate 50 feasible solutions
is also the largest. Populations of size 10 are most likely
to generate feasible solutions, although these solutions
tend to have the lowest performance.

If we look across all population sizes, it
seems that when the crossover rate is set at a low value,
e.g., 0.1, the mutation rate should be set at a very small
value for the best result. When the crossover rate is
set at a medium or high level, the mutation rate 0.01
favors the performance the most. Overall, there seems
to be a tendency that, across all levels of population
size, Mutation Rate = 0.01 and a medium- or high-level
crossover rate will generate the best result. This
implies that there might be an interaction between
the crossover rate and the mutation rate.

Across all different levels of the crossover rate, the 
combination of Mutation Rate = 0.01 and Pop Size = 
200 tends to consistently yield the best result. This 
implies that there is a strong interaction between the 
population size and the mutation rate. Across all levels 
of the mutation rate, there is also a consistent pattern 
of effects on the system performance among different 
combinations of population sizes and crossover rates: 
the population size 200 combines with high crossover 
rates in generating the best result in Table 1. This seems 



Table 1. Performance of different population sizes under different mutation rates and crossover rates

Crossover  Pop.   Mutation Rate = 0.001     Mutation Rate = 0.01      Mutation Rate = 0.5
Rate       Size   Ave     Stdev   #F/#T     Ave     Stdev   #F/#T     Ave     Stdev   #F/#T
0.1        10     88.25   3.37    50/50     87.60   2.69    50/52     85.54   2.57    50/53
0.1        100    93.02   1.48    50/56     93.25   1.69    50/74     91.41   2.13    50/79
0.1        200    93.76   1.45    50/63     93.70   1.30    50/98     92.65   2.10    50/93
0.5        10     89.43   2.69    50/50     89.64   3.05    50/51     85.86   2.75    50/53
0.5        100    93.21   1.58    50/51     93.37   1.49    50/54     91.68   2.03    50/89
0.5        200    93.79   1.21    50/52     93.86   1.21    50/60     92.19   1.74    50/115
0.9        10     88.73   3.27    50/50     89.78   2.40    50/50     85.63   3.45    50/52
0.9        100    93.39   1.32    50/50     93.58   1.64    50/50     91.48   1.94    50/64
0.9        200    93.81   1.27    50/50     94.30   1.39    50/53     92.40   1.59    50/89

(Note: #F/#T: the ratio of the number of feasible solutions generated to the total number of simulation runs.)

Table 2. Performance of different population sizes under different mutation rates and crossover rates

Crossover  Pop.   Mutation Rate = 0.001     Mutation Rate = 0.01      Mutation Rate = 0.5
Rate       Size   Ave     Stdev   #F/#T     Ave     Stdev   #F/#T     Ave     Stdev   #F/#T
0.1        10     88.24   3.13    50/50     89.30   2.51    50/51     90.25   1.60    50/50
0.1        100    91.25   0.68    50/57     91.15   0.81    50/57     91.87   0.25    50/124
0.1        200    91.32   0.56    50/55     91.39   0.62    50/69     91.89   0.21    50/309
0.5        10     89.37   2.36    50/50     89.65   1.99    50/50     89.90   1.58    50/51
0.5        100    91.22   0.60    50/50     91.22   0.60    50/50     91.74   0.41    50/108
0.5        200    91.29   0.66    50/50     91.64   0.48    50/50     91.91   0.15    50/431
0.9        10     89.00   2.48    50/50     89.55   1.87    50/50     90.17   1.64    50/51
0.9        100    91.17   0.68    50/50     91.26   0.71    50/50     91.76   0.40    50/99
0.9        200    91.45   0.56    50/50     91.49   0.53    50/50     91.89   0.21    50/518

(Note: #F/#T: the ratio of the number of feasible solutions generated to the total number of simulation runs.)



to indicate there is a strong interaction between the 
population size and the crossover rate. 

Performance Analysis for the Task with 
Lower Complexity 

The result of our experiment for the other task is shown
in Table 2. From Table 2, Mutation Rate = 0.5 tends to
produce the best result with the smallest deviation. The
same observation also holds for Pop Size = 200.
This indicates that there is no strong interaction among
these three parameters. However, Mutation Rate = 0.5
and Pop Size = 200 also have the lowest ratio of
feasible solutions to simulation runs. On the other hand,
Pop Size = 10 and Mutation Rate = 0.001 are most likely
to generate feasible solutions, although these solutions
tend to have the lowest performance.
From Table 2, we cannot identify any consistent pattern
of performance for the crossover rate. Moreover, the
interaction between the crossover
rate and the mutation rate does not have a consistent
pattern of influence on the system performance across
different population sizes. This implies the lack of a
strong interaction between the crossover rate and the
mutation rate for the current case.

Overall, the combination of Pop Size = 200 and
Mutation Rate = 0.5 seems to give the best result for all
levels of the crossover rate. This implies that
there is a significant interaction between the population
size and the mutation rate. However, we cannot identify
a consistent pattern for the combination of the population
size and the crossover rate, or of the mutation rate and
the crossover rate. This implies that there is a lack of
interaction within these two pairs of parameters.



FUTURE TRENDS AND CONCLUSION 

Though Schaffer et al. (1991) and Grefenstette (1993)
advocate a very small population size, our analyses 
for both tasks of high complexity and low complexity 
indicate that larger populations will generally favor 
the performance of our batch selection system more 
than smaller populations. Our result is consistent with 
Liao & Sun (2001). With the availability of a larger 
pool of diverse schemata in a larger population, our 
GA system will have a broader view of the "landscape" 
(Holland, 1992b) of the solution space, and is thus 
more likely to contain representative solutions from 
a large number of hyperplanes. This advantage gives 
a GA more chances of discovering better solutions in 
the solution space. However, Davis (1991) argues that 
the most effective population size is dependent upon 
the nature of the problem, the representation formal- 
ism, and the GA operators. We plan to analyze the GA 
performance for another application domain so that we 
can be more conclusive on the issue of the effective 
population size. 

Though the solution performance of small populations
is lower than that of large populations, the
efficiency of small populations in generating feasible
solutions, i.e., the ratio of the number of feasible solutions
to the total number of runs required to generate a certain
number of feasible solutions, is indeed better than that of
large populations, especially when the mutation rate is
high. This is evidenced by the #F/#T columns of
Tables 1 and 2. In this sense, Schaffer et al. (1991) and
Grefenstette (1993) are correct in their recommendation.
This might be due to the fact that small populations
have a higher probability of developing the premature
convergence problem.
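
As a small worked illustration of this efficiency measure, the following Python sketch computes #F/#T for the three population sizes of the lower-complexity task at crossover rate 0.9 and mutation rate 0.5, using the counts reported in Table 2. The variable and function names are illustrative only.

```python
# (population size, feasible solutions, total simulation runs),
# taken from Table 2 for crossover rate 0.9 and mutation rate 0.5.
ROWS = [(10, 50, 51), (100, 50, 99), (200, 50, 518)]


def efficiency(feasible, total):
    """Ratio of feasible solutions to the runs needed to obtain them."""
    return feasible / total


ratios = {pop: efficiency(f, t) for pop, f, t in ROWS}

for pop, ratio in ratios.items():
    print(f"Pop Size = {pop}: #F/#T = {ratio:.3f}")
```

The ratio falls from about 0.98 for a population of 10 to below 0.1 for a population of 200, which is the pattern referred to in the paragraph above.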
Our analysis shows that our two tasks do not agree on
the recommendation for the mutation rate. The task with
higher complexity prefers a very small mutation rate,
especially 0.01, while the less complicated task prefers
a very large mutation rate, such as 0.5. In addition, high
crossover rates are better for complex tasks, while
no conclusion can be drawn for simple tasks. Contrary to
the general belief regarding the major role of crossover, we
did not find crossover to be as decisive a factor
as population size or mutation rate in influencing the
system performance. Part of our findings is similar to
that of Pendharkar & Rodger (2004), who compared the
performance of different types of crossover operators,
including arithmetic, uniform, and one-point operators,
for the design of GA-based artificial neural networks
and found no significant difference among them. In
addition, our findings on the role of mutation rate for
tasks of different complexity complement Muhlenbein
(1992), who contends that the power of mutation has
been underestimated in traditional GAs.

Our analysis also shows that mutation and crossover
interact with the population size in different ways.
The effect of mutation is strongly influenced by the
population size in both tasks. For the task with higher
complexity, the combination of a very large population
size, such as 200, and a small mutation rate, such as 0.01,
tends to generate a very good result. However, the less
complex task needs a very large population and a very
large mutation rate, such as 0.5, in order to yield the
best results. On the other hand, the interaction between
crossover and the population size is found only with the
task of higher complexity, and the interaction between
mutation and crossover is barely found, and again only
with the task of higher complexity. More research needs to be
performed in order to understand better how the effects
of crossover and mutation depend upon other details
of a GA, such as the population size, the application
domain, the fitness function, encoding, and selection.
KEY TERMS

Batch Selection: Selecting the optimal set of prod- 
ucts to produce, with each product requiring a set of 
resources, under the system capacity constraints. 

Fitness Function: The objective function of the 
GA for evaluating a population of solutions. 

Genetic Operators: Selection, crossover, and 
mutation, for combining and refining solutions in a 
population. 

Implicit Parallelism: A property of the GA which 
allows a schema to be matched by multiple candidate 
solutions simultaneously without even trying. 

Landscape: A function plot showing the state as 
the "location" and the objective function value as the 
"elevation". 

NP-Complete Problems: The hardest problems in 
the class NP — the class of nondeterministic polynomial 
problems. 

Schemata: A general pattern of bit strings that is 
made up of 1, 0, and #, used as a building block for 
solutions of the GA. 
INTRODUCTION

Conventionally, the approach to modelling and simulation
of complex nonlinear systems has been to construct a
mathematical model and examine the system's evolution or its
control. This kind of approach can fail for many of
the very large nonlinear and complex systems
currently being studied. With the invention of new advanced
high-speed computers and the application of artificial
intelligence paradigms, new techniques have become
available. In particular, neural networks and fuzzy
logic for nonlinear modelling, and genetic algorithms
[Goldberg, D. (1989)] and evolutionary algorithms for
optimisation, have created new opportunities
to solve complex systems [Bai, Y., Zhuang H. and
Wang, D. (2006)].

This paper considers issues in the design of multi-layer
and hierarchical fuzzy logic systems. It proposes a
technique for decomposing complex systems into
hierarchical and multi-layered fuzzy logic sub-systems.
The learning of fuzzy rules and internal parameters
in a supervised manner is performed using genetic
algorithms. The decomposition of complex nonlinear
systems into hierarchical and multi-layered fuzzy logic
sub-systems greatly reduces the number of fuzzy rules
to be defined and improves the learning speed for such
systems. In this paper a method for combining subsystems
to create a hierarchical and multi-layer fuzzy logic
system is also described. The application areas considered
are interest rate prediction, unemployment rate
prediction and electricity usage prediction.

Genetic algorithms can be used as a tool for the design
and generation of fuzzy rules for a fuzzy logic system.
This automatic design and generation of fuzzy rules, via
genetic algorithms, can be categorised into two learning
techniques, namely supervised and unsupervised.
In supervised learning there are two distinct phases
to the operation. In the first phase each individual is
assessed based on the input signal that is propagated
through the system, producing an output response. The
actual response produced is then compared with a desired
response, generating error signals that are then used
as the fitness for the individual in the population of
genetic algorithms. Supervised learning has been
successfully applied to solve some difficult problems. In this
paper the design and development of genetic algorithm
based supervised learning for fuzzy models, with
application to several problems, is considered. A hybrid
integrated architecture incorporating fuzzy logic and
a genetic algorithm can generate fuzzy rules that can be
used in a fuzzy logic system for modelling, control
and prediction.

Fuzzy logic systems typically have a knowledge
base consisting of a set of rules of the form

If ($x_1$ is $A_1^l$ and $x_2$ is $A_2^l$ and ... and $x_n$ is $A_n^l$)

Then ($z_1$ is $B_1^l$ else $z_2$ is $B_2^l$ else ... else $z_m$ is $B_m^l$)

where $A_k^l$, $k = 1, \ldots, n$ are normalised fuzzy sets for
n input variables $x_k$, $k = 1, \ldots, n$, and where $B_k^l$,
$k = 1, \ldots, m$ are normalised fuzzy sets for m output
variables $z_k$, $k = 1, \ldots, m$. The heart of the fuzzy logic
system is the inference engine that applies principles
of intelligent human reasoning to interpret the rules
to output an action from inputs. There are many types
of inference engines in the literature, including the
popular Mamdani inference engine [Bai, Y., Zhuang
H. and Wang, D. (2006)].

Given a fuzzy rule base with M rules and n antecedent
variables, a fuzzy controller that uses a singleton
fuzzifier, Mamdani product inference engine and centre
average defuzzifier to determine its output variables
has the general form given in Equation 1 for a single
output variable, say $z_1$:

$$z_1 = \frac{\sum_{l=1}^{M} \bar{y}_1^l \left( \prod_{i=1}^{n} \mu_{A_i^l}(x_i) \right)}{\sum_{l=1}^{M} \left( \prod_{i=1}^{n} \mu_{A_i^l}(x_i) \right)} \qquad (1)$$

where $\bar{y}_k^l$ are the centres of the output sets $B_k^l$ and the
membership function $\mu$ defines for each fuzzy set $A_i^l$ the
value of $x_i$ in the fuzzy set, namely $\mu_{A_i^l}(x_i)$. Common
shapes of the membership function are
triangular, trapezoidal and Gaussian. A first step in the
construction of a fuzzy logic system is to determine
which variables are fundamentally important. It is
known that the total number of rules in a system is an
exponential function of the number of system variables
[Raju G. V. S. and Zhou, J. (1993), Kingham, M.,
Mohammadian, M., and Stonier, R. J. (1998)]. In order
to design a fuzzy system with the required accuracy,
the number of rules increases exponentially with the
number of input variables and their associated fuzzy
sets. A way to avoid the explosion
of fuzzy rule bases in fuzzy logic systems is to consider
hierarchical fuzzy logic systems [Raju G. V. S. and
Zhou, J. (1993)]. Hierarchical fuzzy logic systems
have the property that the number of rules needed to
construct the fuzzy system increases only linearly with
the number of variables in the system.
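
The controller of Equation 1, i.e. a weighted average of the output-set centres in which each rule is weighted by the product of its antecedent membership values, can be sketched in a few lines of Python. The Gaussian membership function, its parameters and the two example rules below are assumptions made purely for illustration; the text only names Gaussian as one of several common shapes.

```python
import math


def gaussian(x, centre, width):
    """Gaussian membership value of x (assumed membership shape)."""
    return math.exp(-((x - centre) / width) ** 2)


def fuzzy_output(x, rules):
    """Centre-average defuzzified output with product inference.

    x     : list of n crisp input values
    rules : list of (antecedents, y_centre) pairs, where antecedents
            is a list of n (centre, width) tuples, one per input,
            and y_centre is the centre of the rule's output set.
    """
    numerator = 0.0
    denominator = 0.0
    for antecedents, y_centre in rules:
        # Firing strength: product of the membership values.
        strength = 1.0
        for xi, (centre, width) in zip(x, antecedents):
            strength *= gaussian(xi, centre, width)
        numerator += y_centre * strength
        denominator += strength
    # Note: the denominator can underflow to zero if x lies far
    # from every rule; a practical system would guard against this.
    return numerator / denominator


# Two hypothetical rules over two inputs.
example_rules = [
    ([(0.0, 1.0), (0.0, 1.0)], 10.0),
    ([(1.0, 1.0), (1.0, 1.0)], 20.0),
]
print(fuzzy_output([0.5, 0.5], example_rules))
```

At the midpoint between the two rule centres both rules fire equally, so the output is the mean of the two output centres.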

The idea of hierarchical fuzzy logic systems is to put
the input variables into a collection of low-dimensional
fuzzy logic systems, instead of creating a single
high-dimensional rule base for a fuzzy logic system. Each
low-dimensional fuzzy logic system constitutes a level
in the hierarchical fuzzy logic system. Assume that
there are n input variables $x_1, \ldots, x_n$; then the hierarchical
fuzzy logic system is constructed as follows [Raju G.
V. S. and Zhou, J. (1993)]:

• The first level is a fuzzy rule base for a fuzzy system
with $n_1$ input variables $x_1, \ldots, x_{n_1}$, which is constructed
from the rules

If $x_1$ is $A_1^l$ and ... and $x_{n_1}$ is $A_{n_1}^l$ Then $y_1$ is $B_1^l$

where $2 \le n_1 < n$, and $l = 1, 2, \ldots, M_1$.

• The $i$th level ($i > 1$) is a fuzzy rule base for a fuzzy
system with $n_i + 1$ ($n_i \ge 1$) input variables, which
is constructed from the rules

If $x_{N_i+1}$ is $A_{N_i+1}^l$ and ... and $x_{N_i+n_i}$ is $A_{N_i+n_i}^l$ and $y_{i-1}$ is $C_{i-1}^l$ Then $y_i$ is $B_i^l$

where $N_i = \sum_{j=1}^{i-1} n_j$ and $l = 1, 2, \ldots, M_i$.

• The construction of fuzzy rule bases for fuzzy
systems continues until $i = L$ such that
$\sum_{j=1}^{L} n_j = n$,
that is, until all the input variables are used in one
of the levels.

The first level has $n_1$ input variables $x_1, \ldots, x_{n_1}$ with
one output variable $y_1$, which is then sent to the second
level as input. In the second level another $n_2$ variables
$x_{n_1+1}, \ldots, x_{n_1+n_2}$ and the variable $y_1$ are combined to
produce the output variable $y_2$, which is then sent to
the third level. This procedure continues until all the
variables $x_1, \ldots, x_n$ are used [Raju G. V. S. and Zhou, J.
(1993), Kingham, M., Mohammadian, M., and Stonier,
R. J. (1998), Magdalena, L. (1998), Cordon, O., Herrera,
F., Hoffmann, F. and Magdalena, L. (2001)]. The
number of rules in a hierarchical fuzzy logic system is
a linear function of the number of input variables and
their associated fuzzy sets [Kingham, M., Mohammadian,
M., and Stonier, R. J. (1998)]. Other ways to reduce
the number of fuzzy rules of a fuzzy logic system are:

1. Fusing variables before input into the inference 
engine, thereby reducing the number of rules in 
the knowledge base, 

2. Grouping the rules into prioritised levels to design 
hierarchical or multi-layered structures, 

3. Reducing the size of the inference engine directly 
using notions of passive decomposition of fuzzy 
relations, 

4. Decomposing the system into a finite number of 
reduced-order subsystems, eliminating the need 
for a large-sized inference engine. 

5. Reducing the number of fuzzy sets of each input 
variable, thereby reducing the number of rules in 
the knowledge base of fuzzy logic system. 
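
The difference between exponential and linear growth of the rule base can be checked with a short Python sketch. It assumes a complete rule base (one rule for every combination of fuzzy sets) and the same number of fuzzy sets for every variable, including the intermediate outputs passed between levels; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def flat_rule_count(n_inputs, n_sets):
    """Rules in a single-level system: one per combination of sets."""
    return n_sets ** n_inputs


def hierarchical_rule_count(level_inputs, n_sets):
    """Rules in a hierarchical system.

    level_inputs[0] is the number of inputs of the first level;
    every later entry is the number of new input variables of that
    level, which additionally receives the previous level's output.
    """
    total = n_sets ** level_inputs[0]
    for n_new in level_inputs[1:]:
        total += n_sets ** (n_new + 1)
    return total


# Six input variables, five fuzzy sets each: two inputs in the first
# level, then one new input per level.
print(flat_rule_count(6, 5))
print(hierarchical_rule_count([2, 1, 1, 1, 1], 5))
```

With two inputs per level the hierarchical count is (n - 1) times the square of the number of fuzzy sets, which grows linearly in the number of variables n, whereas the flat count grows as a power of n.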

In hierarchical fuzzy logic systems, typically
the most influential parameters are chosen as
the system variables in the first level, the next most
important parameters are chosen as the system variables
in the second level, and so on [Raju G. V. S. and
Zhou, J. (1993)]. In this hierarchy, the first level gives
an approximate output which is then modified by the
second level rule set; this procedure can be repeated
in succeeding levels of the hierarchy. The number of rules
in a complete rule set is thus reduced to a linear function
of the number of variables, but this number may
still be high. Further, given that different hierarchical
and multi-layered structures can exist, how can the
fuzzy knowledge base and associated parameters in
each layer be effectively learnt? A learning approach
based on genetic algorithms is discussed in this paper
for the determination of these knowledge bases and
associated parameters.



VARIABLE SELECTION, RULE BASE 
LEARNING AND DECOMPOSITION 

Interest Rate Prediction 

In [Kingham, M., Mohammadian, M. and Stonier, R.
J. (1998)], the authors used hierarchical fuzzy logic
structures and multi-layered neural network structures
for modelling and prediction of the Australian interest
rate with 14 input variables, using a limited data set of
actual key economic indicators. Using expert knowledge
from an economist, the following input variables were
chosen and placed into 5 different groupings, namely,

1. Employment (Job Vacancies, Unemployment Rate)

2. Country (Gross Domestic Product, Consumer Price Index)

3. Savings (Household Saving Ratio, Home Loans,
Average Weekly Earnings)

4. Foreign (Current Account, RBA Index, Trade
Weighted Index)

5. Company (All Industrial Index, Company Profit,
New Motor Vehicles)

which were then formed into a two-layered fuzzy
system, see Figure 1.

The current interest rate was input into each of the
five fuzzy systems in the first layer, and the final output
of the second layer was the predicted interest rate. It
is assumed that each first-layer system gives a first
approximation of the new interest rate, and these
approximations are input into the second layer. However,
the output variables of the first layer do not necessarily
have to be identified with the interest rate. Assuming
there are five membership sets for all variables, including
those entering the second layer, there are 5250 fuzzy rules
in this structure. If all fourteen variables were input into
a single-layer fuzzy logic system there would be some
6 billion rules ($5^{14}$). Hence there is a considerable
reduction in the number of rules for this simple two-layered
hierarchical fuzzy logic system structure. But it is clear
that this is not the only decomposition that could have been
formed, either in the grouping of the variables or in the
number of levels of the multi-layered structure. A genetic
algorithm was used to learn the rules in this fuzzy system,
and it was found that the hierarchical fuzzy logic system
structure was accurate [Kingham, M., Mohammadian, M. and
Stonier, R. J. (1998)]. Further research on this problem,
covering different hierarchical fuzzy structures of three,
four and five layers and the learning of the fuzzy rule
bases, can be found in
[Mohammadian, M. and Kingham, M. (2004)].
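The rule counts quoted for this structure can be checked with a few lines of arithmetic. The sketch below assumes a complete rule base (one rule per combination of input fuzzy sets) and five fuzzy sets per variable; the helper name `rules` and the dictionary layout are illustrative only, not part of the cited work.

```python
# Rule counts for the two-layer interest-rate structure.
# Assumption: every variable, including the first-layer outputs, has 5 fuzzy sets.
FUZZY_SETS = 5

# Number of economic indicators in each first-layer group (from the grouping list).
groups = {"Employment": 2, "Country": 2, "Savings": 3, "Foreign": 3, "Company": 3}


def rules(n_inputs, n_sets=FUZZY_SETS):
    """A complete rule base has one rule per combination of input fuzzy sets."""
    return n_sets ** n_inputs


# Each first-layer system also receives the current interest rate (+1 input).
first_layer = sum(rules(n + 1) for n in groups.values())
# The second layer takes the five first-layer outputs as its inputs.
second_layer = rules(len(groups))
hierarchical_total = first_layer + second_layer

# A single-layer system fed with all fourteen variables at once.
single_layer_total = rules(14)

print(first_layer, second_layer, hierarchical_total)  # 2125 3125 5250
print(single_layer_total)                             # 6103515625
```

Under these assumptions the hierarchical total reproduces the 5250 rules stated in the text, while a single layer over fourteen variables needs $5^{14}$, roughly 6.1 billion, rules.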



Figure 1. Interest rate prediction

However, a question remains: does a two-layer
hierarchical fuzzy logic system structure provide the
best solution? To answer this question, one can build
three- and four-layer hierarchical fuzzy logic system
structures by trial and error to find the number of
layers required. This can be cumbersome
[Mohammadian, M. and Kingham, M. (2004)].
Genetic algorithms can be used to solve this problem
by determining the number of layers in the hierarchical
fuzzy logic system and the correct combination of
fuzzy knowledge bases for each layer.

A genetic algorithm was developed to find the best
possible architecture for a hierarchical fuzzy logic
system for the prediction of the interest rate in
Australia [Mohammadian, M. (2002)]. Using the economic
indicators, five fuzzy logic systems were developed as
described above. Genetic algorithms were then used to
design and develop a hierarchical fuzzy logic system,
which was then used to predict the interest rate. For
each of these groups (as described earlier), the current
quarter's interest rate is included in the indicators used.

For encoding and decoding of the hierarchical fuzzy
logic system, a number is first allocated to each fuzzy
logic system developed from a group of indicators. For
this simulation the number allocated to each group is
shown below:

1 = Employment, 2 = Country, 3 = Savings, 4 =
Foreign, 5 = Company

The number of layers and the fuzzy logic system/s
for each layer are determined by genetic algorithms.
Genetic algorithms randomly encode each fuzzy logic
system into a number ranging from 1 to 5 for all possible
combinations of the fuzzy logic systems. The level in
the hierarchy to which a fuzzy logic system is allocated
is also encoded in each string representing an individual
in a population of genetic algorithms. A string
encoded this way can be represented as in Figure 3.

Figure 2. A three-layer hierarchical fuzzy logic system - 3125 fuzzy rules
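One possible way to picture such a string is sketched below. The chromosome layout, a list of (system number, level) pairs, is a hypothetical choice made for illustration; the source does not give the exact encoding, and the function names `decode` and `is_valid` are likewise assumptions.

```python
# Hypothetical chromosome layout (an assumption, not the authors' exact encoding):
# a list of (system_number, level) genes, one gene per fuzzy logic system.
GROUPS = {1: "Employment", 2: "Country", 3: "Savings", 4: "Foreign", 5: "Company"}


def decode(chromosome):
    """Turn a list of (system, level) genes into {level: [group names]}."""
    layers = {}
    for system, level in chromosome:
        layers.setdefault(level, []).append(GROUPS[system])
    return dict(sorted(layers.items()))


def is_valid(chromosome):
    """Each system appears exactly once and the levels used are 1..k without gaps."""
    systems = sorted(system for system, _ in chromosome)
    levels = sorted({level for _, level in chromosome})
    return systems == sorted(GROUPS) and levels == list(range(1, len(levels) + 1))


# Example individual: two systems in level 1, two in level 2, one in level 3.
example = [(1, 1), (3, 1), (2, 2), (4, 2), (5, 3)]
layers = decode(example)
print(layers)
```

A validity check of this kind is useful because crossover and mutation can produce strings that skip a level or repeat a system; how such strings are repaired or penalised is a design decision not specified here.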

Each individual string is then decoded into a
hierarchical fuzzy logic system that defines the fuzzy
logic system/s for each level of the hierarchy. The above
string, once decoded, provides a hierarchical fuzzy logic
system as shown in Figure 2. The set of hierarchical fuzzy
logic systems thus developed is evaluated and a fitness
value is given to each string. We define a satisfactory
hierarchical fuzzy logic system as one whose output
(the predicted interest rate) differs from the desired
output of the system (in this case the actual interest rate)
by a very small value. The average error of the system was
calculated for the training and test sets using the
following formula [Mohammadian, M. and
Stonier, R. J. (1998)]



$$E = \frac{\sum_{i=1}^{n} \mathrm{abs}(P_i - A_i)}{n}$$



where $E$ is the average error, $P_i$ is the predicted interest
rate at time period $i$, $A_i$ is the actual interest rate for
that quarter and $n$ is the number of quarters predicted.
Good results were obtained by using genetic algorithms to
design and develop hierarchical fuzzy logic systems. The
systems developed in this way predict the interest rate to
different degrees of accuracy. It is interesting to see that
genetic algorithms are capable of providing different
hierarchical fuzzy logic system structures for predicting
the interest rate, and also of finding the number of layers
in the hierarchical fuzzy logic system.
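The average error measure is the mean of the absolute differences between predicted and actual values. A minimal sketch follows; the function name `average_error` and the sample numbers are made up for illustration and are not data from the cited study.

```python
def average_error(predicted, actual):
    """Mean absolute difference between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up quarterly interest rates, for illustration only.
predicted = [5.0, 5.5, 6.0, 6.5]
actual = [5.5, 5.5, 5.5, 7.0]
error = average_error(predicted, actual)
print(error)  # 0.375
```

A genetic algorithm that maximises fitness would typically use a decreasing function of this error; the exact mapping is not stated in the text.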

Prediction of Unemployment Rate 

In [Mohammadian, M., Nainar, I. and Kingham, M.
(1997)] a fuzzy logic system was developed, using
supervised learning, for predicting the quarterly
unemployment rate in Australia. The following economic
indicators were used as input to the fuzzy logic system.



The Unemployment Rate is the percentage of 
the labour force actively looking for work in the 
country. 

Interest Rate which is the indicator we are aim- 
ing to predict. The Interest Rate used here is the 
Australian Commonwealth government 10-year 
treasury bonds. 

Job Vacancies is where a position is available for 
immediate filling or for which recruitment action 
has been taken. 

Household Saving Ratio is the ratio of household
income saved to household disposable income.

Each input was split into five fuzzy sets, giving
a total of 625 rules. These rules form the fuzzy
knowledge base of the system. A supervised learning
strategy using genetic algorithms [Mohammadian,
M., Nainar, I. and Kingham, M. (1997)] was used to
find the fuzzy knowledge base for the system. Simulations
showed that the fuzzy logic system is able to predict the
quarterly unemployment rate with a great deal of success.
The results showed that the supervised learning strategy
accurately predicted fluctuations in the unemployment rate,
and that any small errors in the prediction could be reduced
by increasing the training data and allowing the learning
algorithm to run longer.

Electricity Load Prediction 

In [Mohammadian, M. and Jentzsch, R. (2005)] a
hierarchical fuzzy logic system using genetic algorithms
was developed for the prediction and modelling of daily
electricity load fluctuations. The system is further trained
to model and predict the daily peak electricity consumption.
There are a number of possible indicators that could be used
to predict the electricity load. The indicators used in this
hierarchical fuzzy logic system are:

Electricity load is the past electricity consumption (hourly),
Predicted Minimum Temperature is the predicted minimum temperature,
Predicted Maximum Temperature is the predicted maximum temperature,
Actual Minimum Temperature is the actual minimum temperature,
Actual Maximum Temperature is the actual maximum temperature,
Season is one of the four seasons in the year,
Day of the week is one of the seven days of the week,
Holiday is one of several public holidays in the year,
Time of day is divided here into 48 parts, each consisting
of 30 minutes.

The current electricity load is included in the input
indicators to the system because the predicted electricity
load is highly dependent on the current load: the load is
only likely to fluctuate around its current value. Related
indicators (inputs) are grouped together because of the
common connection among them, such as temperature, time of
day etc. These groups are as follows:

Predicted Temperature Group - This group contains
Electricity Load, Predicted Minimum Temperature,
Predicted Maximum Temperature, Time of day.

Actual Temperature Group - This group contains
Electricity Load, Actual Minimum Temperature,
Actual Maximum Temperature, Time of day.

Season Day Group - This group contains Electricity
Load, Season (a value from 1 to 4 representing
each season), Day of the week (two values, one
representing a weekday and zero representing the
weekend), Public Holiday (two values, one representing
a public holiday and zero representing a working
day), Time of day.

Using a hierarchical fuzzy logic system structure, it
is possible to overcome the rule explosion that arises when
all of these inputs are fed to a single fuzzy system. The
three groups created for the electricity load prediction
each produce a predicted electricity load. These are then
fed into the next layer of the hierarchy, where the final
predicted electricity load is found (see Figure 4).

The total number of rules for the hierarchical fuzzy
logic system is 1455. From simulation results it was
found that the hierarchical fuzzy logic system is capable
of making accurate predictions of the electricity load
[Mohammadian, M. and Jentzsch, R. (2005)].



FUTURE TRENDS 

The grouping of input parameters of the systems
considered above was performed using expert knowledge.
It would be interesting to use genetic algorithms to
find the relationships between the input parameters
of such systems and to compare the results obtained in
this way with the grouping of parameters suggested
by experts.



CONCLUSION 

In this paper, issues in the construction of a fuzzy logic
system to model a complex (nonlinear) system are considered,
namely the decomposition into hierarchical/multi-layered
fuzzy logic sub-systems and the learning of fuzzy rules and
internal parameters. Whilst the decomposition into
hierarchical/multi-layered fuzzy logic sub-systems greatly
reduces the number of fuzzy rules to be defined and learnt,
other issues arise: the decomposition is not unique, and it
may give rise to variables with no physical significance.
For a problem with a large number of input variables, for
example the problem of interest rate prediction, the
non-uniqueness of the decomposition yields numerous
different structures to examine in order to find one
which, in some sense, is the 'best' structure.

Figure 4. Hierarchical fuzzy logic system for electricity load prediction

KEY TERMS

Fusing Variables: Fusing variables is a method for 
reducing the number of rules in a fuzzy rule base. The 
variables are fused (combined) together before input 
into the inference engine, thereby reducing the number 
of rules in the knowledge base. 

Fuzzy Logic: Fuzzy sets and Fuzzy Logic were
introduced in 1965 by Lotfi Zadeh as a new way to
represent vagueness in applications. They are a
generalisation of sets in conventional set theory. Fuzzy
Logic (FL) aims at modelling imprecise modes of reasoning,
such as common sense reasoning for uncertain complex
processes. It provides a system for representing the
meaning of lexically imprecise propositions in natural
language by expressing each proposition as a fuzzy
constraint on a variable. Fuzzy logic controllers have been
applied successfully to many nonlinear control systems.
Linguistic rather than crisp numerical rules are used to
control the processes.

Fuzzy Rule Base (Fuzzy If-Then rules): Fuzzy If-Then
rules, or fuzzy conditional statements, are expressions
of the form "If A Then B", where A and B are labels
of fuzzy sets characterised by appropriate membership
functions. Due to their concise form, fuzzy If-Then
rules are often employed to capture the imprecise
modes of reasoning that play an essential role in the
human ability to make decisions in an environment of
uncertainty and imprecision. The set of If-Then rules
relating to a fuzzy logic system, stored together,
is called a Fuzzy Rule Base.
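As an illustration of how the antecedent of such a rule can be evaluated, the sketch below uses triangular membership functions and the minimum operator for "and". Both are common choices but are assumptions here, as are the fuzzy set names `low` and `high` and the 0-10 scale; the source does not prescribe them.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets on a 0-10 scale (illustrative values only).
def low(x):
    return triangular(x, -5.0, 0.0, 5.0)


def high(x):
    return triangular(x, 5.0, 10.0, 15.0)


def rule_strength(memberships):
    """Firing strength of 'If A1 and A2 ... Then B', using min as the AND operator."""
    return min(memberships)


# Rule: If first input is High and second input is Low Then ...
strength = rule_strength([high(8.0), low(1.0)])
print(strength)
```

With these sets, an input of 8.0 belongs to High with degree 0.6 and an input of 1.0 belongs to Low with degree 0.8, so the rule fires with the smaller of the two degrees.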

Genetic Algorithms: Genetic Algorithms (GAs) are 
algorithms that use operations found in natural genetics 
to guide their way through a search space and are increas- 
ingly being used in the field of optimisation. The robust 
nature and simple mechanics of genetic algorithms make 
them inviting tools for search, learning and optimiza- 
tion. Genetic algorithms are based on computational 
models of fundamental evolutionary processes such as 
selection, recombination and mutation. 



Genetic Algorithms Components: In its simplest
form, a genetic algorithm has the following components:

1. Fitness - A positive measure of utility, called fitness,
is determined for individuals in a population.
This fitness value is a quantitative measure of how
well a given individual compares to others in the
population.

2. Selection - Population individuals are assigned a
number of copies in a mating pool that is used to
construct a new population. The higher a population
individual's fitness, the more copies in the
mating pool it receives.

3. Recombination - Individuals from the mating pool
are recombined to form new individuals, called
children. A common recombination method is
one-point crossover.

4. Mutation - Each individual is mutated with some
small probability (much less than 1.0). Mutation is a
mechanism for maintaining diversity in the population.
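The four components can be sketched on a toy problem. The example below evolves bit strings towards all ones; the problem, the parameter values and the function names are illustrative assumptions, and fitness-proportionate selection is only one of several ways to fill the mating pool.

```python
import random


def fitness(bits):
    """Positive utility: number of ones plus one, so every weight is above zero."""
    return sum(bits) + 1


def select(population, rng):
    """Fitness-proportionate selection: fitter individuals tend to get more copies."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=len(population))


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=20, length=16, generations=40, rate=0.02, seed=0):
    """Run the loop: fitness -> selection -> recombination -> mutation.

    pop_size is assumed to be even so that the mating pool pairs up exactly.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pool = select(population, rng)
        children = []
        for i in range(0, pop_size - 1, 2):
            children.extend(one_point_crossover(pool[i], pool[i + 1], rng))
        population = [mutate(child, rate, rng) for child in children]
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

In the systems discussed in this article the individuals would encode fuzzy rule bases or hierarchy layouts rather than bit strings, and the fitness would be derived from the prediction error.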



Hierarchical Fuzzy Logic Systems: The idea of
hierarchical fuzzy logic control systems is to put the
input variables into a collection of low-dimensional
fuzzy logic control systems, instead of creating a single
high-dimensional rule base for a fuzzy logic control
system. Each low-dimensional fuzzy logic control
system constitutes a level in the hierarchical fuzzy
logic control system. Hierarchical fuzzy logic control
is one approach to avoiding the rule explosion problem. It
has the property that the number of rules needed to
construct the fuzzy system increases only linearly with
the number of variables in the system.

Supervised Learning: A learning method in which
there are two distinct phases to the operation. In the first
phase each possible solution to a problem is assessed
based on the input signal that is propagated through the
system, producing an output response. The actual response
produced is then compared with a desired response,
generating error signals that are then used as a guide
to solve the given problem using supervised learning
algorithms.

INTRODUCTION

Support Vector Machines (SVMs) are learning machines,
originally designed for bi-classification problems, that
implement the well-known Structural Risk Minimization (SRM)
inductive principle to obtain good generalization from a
limited number of learning patterns (Vapnik, 1998). The
optimization criterion for these machines is to maximize the
margin between two classes, i.e. the distance between two
parallel hyperplanes that split the vectors of the two
classes: the larger the margin separating the classes, the
smaller the VC dimension of the learning machine, which
theoretically ensures good generalization performance
(Vapnik, 1998), as has been demonstrated in a number of real
applications (Cristianini, 2000). The kernel trick is
applicable to this formulation; it improves the capacity of
these algorithms, since learning is performed not directly in
the original data space but in a new space called the feature
space. For this reason this algorithm is one of the most
representative of the so-called Kernel Machines (KMs).

The main theory was originally developed in the sixties
and seventies by V. Vapnik and A. Chervonenkis
(Vapnik et al., 1963, Vapnik et al., 1971, Vapnik, 1995,
Vapnik, 1998), on the basis of a separable binary
classification problem; however, widespread use of these
learning algorithms did not take place until the nineties
(Boser et al., 1992). SVMs have been used extensively in all
kinds of learning problems, mainly in classification
problems, although also in other problems such as
regression (Schölkopf et al., 2004) or clustering
(Ben-Hur et al., 2001).

The fields of Optical Character Recognition (Cortes
et al., 1995) and Text Categorization (Sebastiani, 2002)
were the most important initial applications where
SVMs were used. With the extended application of
new kernels, novel applications have appeared in
the field of Bioinformatics; in particular, many works
concern the classification of gene expression data
(Microarray Gene Expression) (Brown et al., 1997) and
the detection of structures between proteins and
their relationship with the chains of DNA (Jaakkola et
al., 2000). Other applications include image identification,
voice recognition, prediction in time series, etc.
A more extensive list of applications can be found in
(Guyon, 2006).



BACKGROUND 

Regularization Networks (RNs), obtained from the
penalization inductive principle, are algorithms based
on a deep theoretical background, but their purely
asymptotic approximation properties and the expansion
of the solution function over a large number of vectors
make them an impractical choice in their original
definition. Looking for a sparser expansion of the
solution, some researchers noted the good behaviour
of the SVM, which can take a finite training
set as a hypothesis in its theoretical framework and
builds the final solution by considering nested
approximation spaces.

Both the regularization inductive principle and
structural risk minimization insert 'a priori'
information on the shape of the solution without
making any assumption about the unknown
probability density function relating the working spaces.
The regularization principle uses a regularizer, or
regularization operator, that ensures a good solution is
found asymptotically on nested function spaces as
the number of elements in the training set tends to
infinity. The SRM principle is also based on
nested spaces, but the solution is found by ensuring an
upper bound on the risk functional considering only
a finite set of empirical data.

The two inference processes are obviously not
equivalent, but their similarities carry over to the
learning methods that have their theoretical background
in these principles, to the point that a number of
researchers approaching the learning problem from
different perspectives are involved in establishing
a common framework that treats SVMs and
RNs as particular cases of a more general learning
methodology. This methodology is called Kernel Methods
(Campbell, 2000) when the key role played by
the kernel function generating the feature space is
emphasized, or Large Margin Classifiers (Cristianini, 2000)
when the measure to be optimized to ensure maximal
generalization is emphasized.

Both the results obtained and the theoretical framework
of SRM seem to offer better theoretical guarantees than
previous approaches when looking for solutions
with good generalization for problems based on a finite
empirical set. Hence, the integration of machine learning
methods in mixed models is a state-of-the-art research
field. The use of Bayesian inference in these
mixed models is being avoided because the user must
define a probability density function beforehand.



SUPPORT VECTOR MACHINE 

Let us consider a bi-classification problem (other kinds
of problems are analyzed in the cited references). Thus,
let $Z = \{z_i = (x_i, y_i),\ i = 1, 2, \ldots, n\}$ be a training
set with $x_i \in X \subset \mathbb{R}^d$ (the input space) and
$y_i \in Y = \{\theta_1, \theta_2\}$ (the output space). Let us
initially suppose that the classes are linearly separable (two
sets are linearly separable in $d$-dimensional space if they can
be separated by a $(d-1)$-dimensional hyperplane); then a
hyperplane, denoted by $\pi : w \cdot x - b = 0$ (where $b$ is
called the bias), is sought which separates the two classes,
that is, $w \cdot x_i - b > 0$ if $y_i = \theta_1$, and
$w \cdot x_i - b < 0$ if $y_i = \theta_2$. Nevertheless, there
are many hyperplanes satisfying this condition (see Figure 1),
so a new condition is imposed: the distance between the optimal
hyperplane and the nearest training pattern (the margin) must be
maximal. Let us look at this condition in detail. In the first
place, without loss of generality, let us suppose that
$\theta_1 = 1$ and $\theta_2 = -1$. Hence, let $\beta$ and
$\alpha$ be the minimum (class +1) and the maximum (class -1)
values of the unbiased hyperplane function $w \cdot x$
effectively attained for some patterns $z_1 \in Z_1$ and
$z_2 \in Z_2$, i.e.

$$\alpha = \max_{z_i \in Z_2} w \cdot x_i \quad \text{and} \quad \beta = \min_{z_i \in Z_1} w \cdot x_i$$

where $Z_1$ and $Z_2$ are the patterns belonging to the classes
labelled as $+1$ and $-1$ respectively. It is considered that
$\alpha < \beta$; otherwise the vector $-w$ is chosen. Thus,
given a vector $w$, the margin is defined as the distance between
the parallel hyperplanes $\pi_\alpha : w \cdot x - \alpha = 0$ and
$\pi_\beta : w \cdot x - \beta = 0$, that is

$$\text{margin} = d(\pi_\alpha, \pi_\beta) = \frac{\beta - \alpha}{\|w\|}$$

Figure 1. Type A denotes the class +1 ($\theta_1$) and Type B denotes the class -1 ($\theta_2$).



Support Vector Machines 



(see Figure 1). The natural choice for the bias, ensuring
positive and negative outputs for the patterns in
the respective classes, is

$$b = \frac{\alpha + \beta}{2}.$$

The maximization of the margin aims to improve the
generalization of the resulting learning machine
(Vapnik, 1995, Schölkopf et al., 2002).
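For a fixed direction vector, the quantities defined above (the class extremes, the margin and the bias) can be computed directly. The sketch below is a worked illustration on made-up two-dimensional points; the function name `margin_and_bias` and the data are assumptions, and no optimization over the direction vector is performed.

```python
import math


def dot(w, x):
    """Inner product of two vectors given as sequences of numbers."""
    return sum(wi * xi for wi, xi in zip(w, x))


def margin_and_bias(w, positives, negatives):
    """Return (margin, bias) for direction w, following the definitions above.

    beta  = min of w.x over the class +1 patterns
    alpha = max of w.x over the class -1 patterns
    """
    beta = min(dot(w, x) for x in positives)
    alpha = max(dot(w, x) for x in negatives)
    if alpha >= beta:
        raise ValueError("w does not separate the classes in this orientation")
    norm = math.sqrt(dot(w, w))
    return (beta - alpha) / norm, (alpha + beta) / 2.0


# Made-up, linearly separable patterns.
class_pos = [(2.0, 0.0), (3.0, 1.0)]
class_neg = [(-1.0, 0.0), (0.0, 2.0)]
margin, bias = margin_and_bias((1.0, 0.0), class_pos, class_neg)
print(margin, bias)  # 2.0 1.0
```

Training an SVM amounts to searching for the direction that makes this margin as large as possible; the sketch only evaluates one candidate.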

The extension to non-linear decision functions is carried
out by mapping the input space $X \subset \mathbb{R}^d$ into
another space $F$, usually of higher dimension, called the
feature space, which is endowed with an inner product, through
a non-linear injection $\phi : X \to F$ (this procedure is
called the kernel trick), such that the optimal hyperplane

$$f(x, w) = \langle \phi(x), w \rangle_F - b$$

is sought in the feature space $F$. Nevertheless, in order
to define the sought hyperplane uniquely (canonical form),
the following restrictions should be added:

$$y_i f(x_i) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n$$

on the training set $Z$, where the slack variables $\xi_i \ge 0$
are introduced to allow some examples to violate the constraint
imposed by the margin (soft margin), since the classes to be
separated may overlap or the patterns may contain noise, that
is, the set $Z$ may not be linearly separable. Hence, the
function $f(x, w)$ allows the decision function to be defined as

$$h(x) = \mathrm{sign}(f(x, w))$$

that is, given a new input $x$ the label assigned by the
machine is $\theta_1$ if $h(x) = 1$ and $\theta_2$ otherwise.

Thus the optimal hyperplane solves the following
constrained optimization problem:

$$\begin{aligned} \min \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i f(x_i) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n \\ & \xi_i \ge 0, \quad i = 1, 2, \ldots, n \end{aligned}$$

where $C > 0$ is the constant weighting the slack penalty.
The solution vector can be written as

$$w = \sum_{i=1}^{SV} \alpha_i y_i \phi(x_i) \qquad (1)$$

where $SV$ is the number of training vectors whose
corresponding Lagrange multiplier $\alpha_i$ is non-zero
(these vectors are called support vectors) (see Figure 1).
Many other approaches for defining the SVM exist
(González et al., 2006); nevertheless, this formulation is
the most usual.

From equation (1), the optimal hyperplane can
be written as:

$$f(x) = \sum_{i=1}^{SV} \alpha_i y_i k(x_i, x) - b$$

where $k(x_i, x) = \langle \phi(x_i), \phi(x) \rangle_F$ is a
kernel (a bivariate function satisfying Mercer's theorem) and
$b$ is calculated by using the Karush-Kuhn-Tucker (KKT)
conditions.
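Once the multipliers, the support vectors and the bias are known, evaluating the machine on a new input only requires kernel evaluations. The sketch below illustrates this evaluation step; the support vectors, multipliers and bias are hand-picked for a two-point toy problem rather than obtained from a solver, and the function names and the `gamma` parameter of the RBF kernel are assumptions.

```python
import math


def linear_kernel(u, v):
    """k(u, v) = <u, v>, i.e. the feature map is the identity."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel; gamma is an illustrative default, not a recommended value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) - b over the support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) - b


def classify(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """h(x) = sign(f(x)): +1 for a positive decision value, -1 otherwise."""
    return 1 if decision_value(x, support_vectors, labels, alphas, b, kernel) > 0 else -1


# Hand-picked toy solution: two support vectors, one per class.
svs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(decision_value((2.0, 3.0), svs, ys, alphas, b))  # 2.0
print(classify((-1.0, 5.0), svs, ys, alphas, b))       # -1
```

With the linear kernel these multipliers give $w = (1, 0)$, so both support vectors lie exactly on the margin, $y_i f(x_i) = 1$. Swapping in `rbf_kernel` changes only the kernel argument, which is the practical appeal of writing the hyperplane in kernel form.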

For multi-classification problems, a set of possible
labels $Y = \{\theta_1, \ldots, \theta_\ell\}$ with $\ell > 2$ is
considered. There are two main SVM-based approaches to solving
these problems. The first is the "all the classes at once"
approach, which considers all the instances from all the classes
in a single optimization formulation, whereas the other is the
"decomposition-reconstruction" architecture approach
(multi-classification in two phases), using binary SVMs.

In the first case, several formulations exist (Vapnik,
1998, Crammer, 2001, Aiolli, 2005); however, among
all the proposed approaches to the maximal margin
problem, the one presented in Shashua et al. (2002) is the
only one that maximizes the exact expression
of the margin between instances with different labels, so
the multi-classification problem is interpreted as an
ordinal regression problem where the objective function
is the sum of the inverses of the margins between
classes.

In the case of multi-classification in two phases, the
most usual multi-classification SVM approaches are 1-v-1
(one-versus-one) SVM and 1-v-r (one-versus-rest)
SVM. In both approaches, a first decomposition phase
generates several learning machines in parallel and a
reconstruction scheme obtains the overall
output by merging the outputs from the decomposition
phase.

In the first phase of 1-v-r SVM, each machine takes
all the classes into consideration; $\ell$ binary classifiers
are trained to generate hyperplanes $f_k$ ($k = 1, 2, \ldots, \ell$)
separating training vectors with label $\theta_k$ from the
remaining vectors. In the reconstruction phase (second phase),
the label distribution generated by the machines trained
in the parallel decomposition is combined through a
merging scheme. All the information provided by the
training vectors is considered; the main drawback is that
the scheme is not well designed to separate specific classes.

In the first phase of 1-v-1 SVM, each machine takes
only two classes into consideration. In this approach,
$\ell(\ell - 1)/2$ binary classifiers are trained to generate
hyperplanes $f_{kh}$, $k, h = 1, 2, \ldots, \ell$, $k < h$,
separating training vectors with label $\theta_k$ from training
vectors with label $\theta_h$. The remaining training vectors
are not considered in the optimization problem. In the
reconstruction phase, the label distribution generated by the
machines trained in the parallel decomposition is combined
through a merging scheme. The main drawback is that only data
from two classes are considered for each machine in the
decomposition procedure, so the output variance is high and any
information from the rest of the classes is ignored.
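The two decomposition schemes differ mainly in which training vectors each binary machine sees and in how the outputs are merged. The sketch below shows both; to stay self-contained it uses a nearest-centroid rule as a stand-in binary learner, which is NOT an SVM, and the merging rules (largest output for 1-v-r, majority vote for 1-v-1) are common choices assumed here rather than taken from the source.

```python
from collections import Counter
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def train_binary(pos, neg):
    """Stand-in binary learner (not an SVM): compares distances to class centroids."""
    cp, cn = centroid(pos), centroid(neg)

    def f(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn - dp  # positive when x is closer to the positive class

    return f


def one_vs_rest(data):
    """One machine per class (l machines): class k against all remaining vectors."""
    machines = {}
    for k in data:
        rest = [p for h in data if h != k for p in data[h]]
        machines[k] = train_binary(data[k], rest)
    return machines


def one_vs_one(data):
    """One machine per unordered pair of classes: l(l-1)/2 machines."""
    return {(k, h): train_binary(data[k], data[h])
            for k, h in combinations(sorted(data), 2)}


def predict_ovr(machines, x):
    """Merge by choosing the class whose machine gives the largest output."""
    return max(machines, key=lambda k: machines[k](x))


def predict_ovo(machines, x):
    """Merge by majority vote over all pairwise machines."""
    votes = Counter(k if f(x) > 0 else h for (k, h), f in machines.items())
    return votes.most_common(1)[0][0]


# Made-up data: four well-separated classes in the plane.
data = {
    "a": [(0.0, 0.0), (0.0, 1.0)],
    "b": [(5.0, 5.0), (5.0, 6.0)],
    "c": [(10.0, 0.0), (10.0, 1.0)],
    "d": [(0.0, 10.0), (1.0, 10.0)],
}
ovr = one_vs_rest(data)
ovo = one_vs_one(data)
print(len(ovr), len(ovo))  # 4 6
print(predict_ovr(ovr, (5.0, 5.5)), predict_ovo(ovo, (5.0, 5.5)))  # b b
```

Replacing `train_binary` with a real binary SVM trainer leaves the decomposition and reconstruction code unchanged, which is why these schemes are described as architectures rather than as new optimization problems.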



The 1-v-1 scheme is usually preferred because it
takes less training time (Kressel, 1999), although some
researchers favour the 1-v-r scheme since it
has some advantages (Rifkin et al., 2004). Nevertheless,
according to (Hsu et al., 2002) it would be difficult to say
which one gives better accuracy.



FUTURE TRENDS AND CONCLUSION 

A recurrent problem when considering SVMs is how to
reduce the computational cost of solving the QP
problem.

SVM for classification is the most studied approach;
however, other problems such as regression are not yet
developed enough to be competitive with other established
research areas, like artificial neural networks. So, a long
way still remains to be walked in this area.

A particular open research problem in the case of
multi-classification is the implementation of the
tri-class scheme. In this approach, one class is labelled
as +1, another class as -1 and the rest of the classes as 0,
which is forced to be encapsulated into a $\delta$-tube,
$0 \le \delta < 1$, along the separation hyperplanes. The
tri-class SVM improves on standard algorithms treating 2-class
classification problems during the decomposition phase of a
general multi-class scheme by focusing the learning on
two classes while using all the available information on
the patterns (see Figure 2), so this approach can be seen
as a mix between the 1-v-r and 1-v-1 SVM. A second
theoretical advantage of the "third-class approach" is
the robustness of the reconstruction procedure (Angulo
et al., 2003), which leads one to expect empirically
a higher performance of the new approach in terms of
accuracy (Angulo et al., 2006). Research should also
include the study of theoretical generalization bounds
for this kind of machine.

Figure 2. The circles are the class +1, the rectangles are the class 0 and the triangles are the class -1.



REFERENCES 

Angulo, C. & Parra, X. & Català, A. (2003). K-SVCR.
A Support Vector Machine for Multi-Class Classification.
Neurocomputing, 55(1-2), 57-77.

Angulo, C. & Ruiz, F. & González, L. & Ortega, J. A.
(2006). Multi-classification by using Tri-class SVM.
Neural Processing Letters, 23(1), 89-101.

Aiolli, F. & Sperduti, A. (2005). Multiclass Classification
with Multi-Prototype Support Vector Machine. Journal
of Machine Learning Research, 6, 817-850.

Ben-Hur, A. & Horn, D. & Siegelmann, H. & Vapnik, V.
(2001). Support Vector Clustering. Journal of Machine
Learning Research, 2, 125-137.

Boser, B.E. & Guyon, I. & Vapnik, V. (1992). A training
algorithm for optimal margin classifiers. Proceedings
of the 5th Annual ACM Workshop on Computational
Learning Theory, 144-152.

Brown, M. & Grundy, W. & Lin, D. & Cristianini, 
N. & Sugnet, C. & Furey, T. & Ares, M. & Haussler, 
D. (1997). Knowledge-based Analysis of Microarray 
Gene Expression Data Using Support Vector Machines. 
Proceedings of the National Academy of Sciences, 
97(1), 262-267. 

Campbell, C. (2000). An introduction to kernel meth- 
ods. In Howlett, R. and Jain, L. editors, Radial Basis 
Function Networks: Design and Applications, Berlin, 
Springer Verlag. 

Cortes, C. & Vapnik, V. (1995). Support vector
networks. Machine Learning, 20, 273-297.

Crammer, K. & Singer, Y. (2001). On the Algorithmic
Implementation of Multiclass Kernel-based Vector
Machines. Journal of Machine Learning Research,
2, 265-292.

Cristianini, N. & Shawe-Taylor, J. (2000). An introduc- 
tion to Support Vector Machines and other kernel-based 
learning methods. Cambridge University Press. 

Hsu, C. & Lin, C. (2002). A comparison of methods for 
multiclass support vector machine. IEEE Transactions 
on Neural Networks, 13(2), 415-425. 

González, L. & Angulo, C. & Velasco, F. & Català,
A. (2006). Dual unification of bi-class Support Vector
Machine formulations. Pattern Recognition, 39(7),
1325-1332.

Guyon, I. (2006). SVM application list. http://www. 
clopinet.com/isabelle/projects/svm/applist.html 

Jaakkola, T. & Diekhans, M. & Haussler, D. (2000). A 
discriminative framework for detecting remote protein 
homologies. Journal of Computational Biology, 7(1- 
2), 95-114. 

Kressel, U. (1999). Pairwise classification and support 
vector machine. Advances in Kernel Methods: Sup- 
port Vector Learning. MIT Press. Cambridge, MA, 
255-268. 

Rifkin, R. & Klautau, A. (2004). In defense of one- 
vs-all classification, Journal of Machine Learning 
Research, 5, 101-141. 

Shashua, A. & Levin, A. (2002). Taxonomy of Large 
Margin Principle Algorithms for Ordinal Regression 
Problems. Neural Information Processing Systems, 
16. 

Scholkopf, B. & Smola, A.J. (2002). Learning with 
Kernels. The MIT Press. Cambridge, MA. 

Scholkopf, B. & Smola, A.J. (2004). A Tutorial on 
Support Vector Regression, Statistics and Computing, 
14, 199-222. 

Sebastiani, F. (2002). Machine learning in automated 
text categorization. ACM Computing Surveys, 34(1), 
1-47. 

Vapnik, V (1995). The nature of statistical learning 
theory. Springer. New York. 

Vapnik, V (1998). Statistical Learning Theory. John 
Wiley & Sons, Inc. 






Vapnik, V. & Chervonenkis, A. (1971). On the uniform 
convergence of relative frequencies of events to their 
probabilities. Theory of Probability and Its Applica- 
tions, 16, 264-280. 

Vapnik, V. & Lerner, A. (1963). Pattern recognition 
using generalized portrait method. Automation and 
Remote Control, 24. 



KEY TERMS 

Generalization: It is the process of formulating
general concepts by abstracting common properties of
instances. In the specific context of SVMs, it means
that if a decision function h(x) is obtained from x ∈
X ⊂ R^d, this function is considered valid for all x ∈
R^d. It is the basis of all valid deductive inference, and
a process of verification is necessary to determine
whether a generalization holds true for any new given
instance.

Kernel Machine or Kernel Methods: Kernel
machines owe their name to the use of kernel functions that
enable them to operate in the feature space without ever
computing the coordinates of the data in that space, but
rather by simply computing the inner products between
the images of all pairs of data in the feature space. This
operation is often computationally cheaper than the
explicit computation of the coordinates. Kernel functions
have been introduced for sequence data, text, images,
as well as vectors (Scholkopf et al., 2002).



Kernel Trick: This procedure consists of substituting
the inner product in the space of input variables by
an appropriate function k(x,x') that associates a real
number with each pair of inputs, that is, k(x,x') ∈ R for
any x,x' ∈ X. Thus, this is a method for converting a
linear classifier algorithm into a non-linear one by using
a non-linear function to map the original observations
into a higher-dimensional space; this makes a linear
classification in the new space equivalent to non-linear
classification in the original space.
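As an illustration of the kernel trick, the following minimal Python sketch compares an inner product computed after an explicit feature map with the value of the corresponding kernel computed directly in the input space. The degree-2 polynomial kernel k(x,x') = (x·x')^2 and the map phi are illustrative assumptions chosen for simplicity; they are not taken from the works cited in this article.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Homogeneous polynomial kernel k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product computed in the feature space
implicit = poly_kernel(x, z)    # same quantity, computed in the input space
print(explicit, implicit)
```

Both quantities agree up to rounding, although the kernel never builds the three-dimensional feature vectors.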

Mercer's Kernel: A function k: X × X → R is a
Mercer's kernel if it is a continuous, symmetric and
positive semi-definite function (Cristianini, 2000).

Regularization: It is any method of preventing
overfitting of data by a model and it is used for solving
ill-conditioned parameter-estimation problems.

Structural Risk Minimization (SRM) Inductive 
Principle: The main idea of the principle is to minimize 
a test error by controlling two contradictory factors: a 
risk functional from empirical data and a capacity for 
the set of real-valued functions (Vapnik, 1998). 

The VC Dimension (for Vapnik-Chervonenkis
Dimension): It is a number defined as the
cardinality of the largest set of points that an algorithm
can shatter; that is, it is a measure of the capacity of
a statistical classification algorithm (Vapnik et al.,
1971).

INTRODUCTION

Automated negotiation is a very challenging research 
field that is gaining momentum in the e-business domain. 
There are three main categories of automated negotia- 
tions, classified according to the participating agent 
cardinality and the nature of their interaction (Jennings, 
Faratin, Lomuscio, Parsons, Sierra, & Wooldridge, 
2001): the bilateral, where each agent negotiates with a
single opponent; the multilateral, which involves many
providers and clients in an auction-like framework; and
the argumentation/persuasion-based models, where the
involved parties use more sophisticated arguments to
establish an agreement. In all these automated negotiation
domains, several research efforts have focused on
predicting the behaviour of negotiating agents. This 
work can be classified in two main categories. The 
first is based on techniques that require strong a-priori 
knowledge concerning the behaviour of the opponent 
agent in previous negotiation threads. The second 
uses mechanisms that perform well in single-instance 
negotiations, where no historical data about the past 
negotiating behaviour of the opponent agent is avail- 
able. One quite popular tool that can support the latter 
case is Neural Networks (NNs) (Haykin, 1999). 

NNs are often used in various real world applica- 
tions where the estimation or modelling of a function 
or system is required. In the automated negotiations 
domain, their usage aims mainly to enhance the per- 
formance of negotiating agents in predicting their 
opponents' behaviour and thus, achieve better overall 
results on their behalf. This paper provides a survey of
the most popular automated negotiation approaches that
use NNs to estimate elements of the opponent's
behaviour.

The rest of this paper is structured as follows. The
second section elaborates on the state-of-the-art bilateral
negotiation frameworks that are based on NNs. The
third section briefly presents the multilateral negotiation
solutions that exploit NNs. Finally, the last section
provides a brief discussion of the survey.



NEURAL NETWORKS IN BILATERAL 
NEGOTIATIONS 

In (Zhang, Ye, Makedon, & Ford, 2004) a hybrid bilat- 
eral negotiation strategy mechanism is described that 
supplies negotiation agents with more flexibility and 
robustness in an automated negotiation system. The 
framework supports the dynamic assignment of an
appropriate negotiation strategy to an agent according 
to the current environment, along with a mechanism 
to create new negotiation rules by learning from past 
negotiations. These learning capabilities are based on 
feedforward back-propagation neural networks and 
multidimensional inter-transaction association rules. 
However, the framework is not adequately described 
and defined and the neural networks are not specifically 
instantiated. Additionally, there are neither quantitative 
nor qualitative experimental results for real world cases. 
Finally, the format of the input to the generic network 
that is presented is ambiguously described. 

In (Zeng, Meng, & Zeng, 2005), the authors employ 
a neural network to assist the negotiation over very 
specific issues from a real world example. The network 
is trained online on the past offers made by the
opponent, while both the buyer and the seller agent have
the ability to employ the proposed network. However,
the experimental data sets are very restrictive and do
not address the diversity of those that can arise in
real scenarios. Additionally, the authors do not report
the actual size of the hidden layer, a parameter that is
crucial to deciding whether such a network is appropriate
for use in a real-time negotiation procedure
by an agent with limited resources.

Furthermore, in (Rau, Tsai, Chen, & Shiang, 2006),
the authors studied the negotiation process between a
shipper and a forwarder using a learning-based
approach, which employed a feedforward back-propagation
neural network with two input data models and the
negotiation decision functions. Issues of the negotiation
were the shipping price, delay penalty, due date, and 
shipping quantity. The proposed mechanism was ap- 
plicable to both parties at the same time and the network 
architecture was chosen based on past similar attempts, 
following a very restrictive pattern for the number of 
the hidden layer's neurons. The conducted experiments 
showed an overall improvement of the results for both 
negotiating parties, while the framework was proven 
stable and with small deadlock probability. However, as
its authors acknowledge, further experimentation is required,
especially with regard to a wider variety of strategies
and possibly more suitable network architectures for
the hidden layer.

In (Carbonneau, Kersten, & Vahidov, 2006), a neural 
network based model is presented for predicting the 
opponent's offers during the negotiation process. The 
framework was tested over a specific set of experi- 
mental data collected from other existent frameworks 
and it is highly adjusted to these data. The purpose of 
this solution is not only to predict the opponent's next 
offer, but also the perception for the specific procedure, 
i.e. an overall vision on why everything is happening 
and where the procedure is led. Thus, the prediction 
of the opponent's next round offer is only a part of the 
network's output. However, the chosen experiment set
is constrained and does not examine the effectiveness of
the framework on diverse strategies, such as those proposed
in the early days of the field and still mainly
used today (Faratin, Sierra, & Jennings, 1998). Additionally,
although the authors maintain that their
framework is suitable for real-time environments, the
resulting network is difficult to train online,
mainly because of its size and the resources
that such training requires. Thus, this network
architecture is probably inappropriate for mobile agents'
environments, and something smaller and more specific
should be designed, due to the limitations that these
environments share.

Moreover, in (Oprea, 2003), the author presents 
a shopping agent, which is capable of negotiating in 
online bilateral, multi-issue procedures using an offline 
created and trained feedforward neural network in order 
to increase its profitability by adapting its behaviour 
according to its opponent's. The purpose of the neural 
network's application on each procedure is to predict 
the opponent's next offer on a round by round basis 
and thus, model its behaviour and intentions in order 
to finally achieve a better or even the best possible 
deal. With the exploitation of the neural network the 
shopping agent can decide during the online phase of 
negotiation, which is the opponent's strategy and esti- 
mate its reservation value. Concerning the experiments 
conducted, the author uses the well-justified negotiation 
tactics presented in (Faratin, Sierra, & Jennings, 1998) 
in order to test the proposed solution and concludes that 
the framework is working well in case of medium or long 
term agents' deadlines. However, the results presented 
are not thoroughly justified and more extreme opponent 
strategies should be tested in order to decide on the 
network's adequacy for such environments. The three
hidden-layer neurons are probably not sufficient
for such cases and long-term estimations.

Finally, Papaioannou et al. have recently designed 
and evaluated several single-issue bilateral negotiation 
approaches, where the Client agent is enhanced with 
Neural Networks. More specifically, in (Roussaki, 
Papaioannou, & Anagnostou, 2006), the Client agent 
uses a lightweight feedforward back-propagation NN 
coupled with a fair relative tit-for-tat imitative tactic, 
and attempts to estimate the Provider's price offer upon 
the expiration of the Client's deadline. This approach
increases the number of agreements reached by one third
on average. In (Papaioannou, Roussaki, & Anagnostou,
2006), the performance of MLP and RBF NNs in
predicting the Provider's offers at the last round
has been compared. The experiments indicate that the
number of agreements is increased by ~38% on average
via both the MLP- and the RBF-assisted strategies.
Nevertheless, the overall time and the number of neurons
required by the MLP are considerably higher than those
required by the RBF. In (Roussaki, Papaioannou, &
Anagnostou, 2007), MLP and GR NNs have been used
by the Client agent in order to identify the unsuccessful
negotiation threads (UNTs) at an early stage, thus
terminating them long before the deadlines expire. It
has been observed that the MLP NN detects more than
90% of UNTs on average, slightly outperforming the GR
NN. Finally, in (Papaioannou, Roussaki, & Anagnostou,
2007), the performance of MLP and RBF NNs has been 
compared with cubic splines, least-square-based poly- 
nomial approximators, exponential approximators and 
Gaussian approximators, in order to predict the future 
offers of the negotiating Provider Agent. The wide 
experimental evaluation conducted indicates that both 
the MLP- and the RBF-assisted negotiation strategies 
perform almost equally well and outperform the other 
four approximator-assisted strategies. In this paper, the 
proposed framework is extended to address multi-issue 
negotiations considering the significance of the issues 
under negotiation for the negotiating party, as well as 
their degree of interdependency. A disadvantage of the
aforementioned NN-based negotiation frameworks is
that they have only been evaluated for the case where the
Provider agent adopts a time-dependent strategy.



NEURAL NETWORKS IN 
MULTILATERAL NEGOTIATIONS 

In (Oprea, 2001), the use of a small-scale feedforward
neural network is attempted in order to predict
the opponent agent's behaviour. In this framework
the enhanced agent negotiates against an opponent
that is not equipped with any learning or other intelligent
mechanism. The neural network is
constructed and trained at every round to respond with
the opponent's next value at each negotiation step, using
only the three prior offers issued by the opponent.
This makes the step-by-step computation feasible
in real-time procedures, but not necessarily reliable.
Moreover, the proposed approach proved adequate
only in cases where both agents (or at least the
opponent agent) have long-term deadlines.
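The idea of predicting the opponent's next offer from its three most recent offers can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of (Oprea, 2001): the class TinyMLP, the helper make_windows, the three tanh hidden neurons, the learning rate and the synthetic concession curve used as the opponent's offer history are all hypothetical.

```python
import math
import random


def make_windows(offers, width=3):
    """Turn an offer history into (previous `width` offers, next offer) pairs."""
    return [(offers[i:i + width], offers[i + width])
            for i in range(len(offers) - width)]


class TinyMLP:
    """One hidden layer of tanh units and a linear output, trained by back-propagation."""

    def __init__(self, n_in=3, n_hidden=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.05):
        """One stochastic gradient step on the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Gradient reaching hidden unit j, back-propagated through tanh.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err
        return 0.5 * err * err


# Hypothetical opponent: a seller conceding from 1.0 down to 0.5 over 20 rounds.
offers = [1.0 - 0.5 * (t / 19.0) ** 2 for t in range(20)]
pairs = make_windows(offers)

net = TinyMLP()
losses = []
for epoch in range(300):
    losses.append(sum(net.train_step(x, y) for x, y in pairs) / len(pairs))

next_offer = net.predict(offers[-3:])  # estimate of the opponent's next offer
```

Because each training pair only needs the last three offers, the data set can be rebuilt and the network retrained at every round, which is what keeps the per-round cost small.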

A different usage of the neural networks' potential is
presented in (Shibata & Ito, 1999), where the authors
are mainly concerned with the communication between
agents. In principle, they divide the agents' communication
into two classes with respect to its meaning.
The first one covers the cases where the agent
transmits the observed information, while the second
covers those where the agent's intention is transmitted. The
framework exploits an Elman recurrent neural network
with feedback loops, especially for the latter class of
cases. The network helps the agents to avoid possible
negotiation deadlocks, although nothing is known
a priori with regard to their strategy or resources. The
network keeps the past information and adapts
its corresponding agent's behaviour online in
order to avoid collisions. The proposed framework
was also tested with four agents, leading to promising
results. However, the authors do not propose or apply
techniques for higher profitability of the participating
agents, but only for collision avoidance by learning the
opponent's intention. Additionally, a recurrent neural
network is a complex structure and seems inappropriate
for application in low-resource agent environments.

Furthermore, in (Abreu, Canuto, & Santana, 2005) 
the authors present a comparative analysis of some ne- 
gotiation methods used in a multi-neural agent system, 
called NeurAge. This system is composed of several 
neural classifiers, called neural agents, and its main 
aim is to overcome some drawbacks of multi-classi- 
fier systems and, as a consequence, to improve their 
performance. These neural agents provide a common 
output, which results after negotiation among them 
and it is the system's output. For this purpose, three 
different negotiation methods are evaluated: the game 
theoretic, the auction based and the confidence based 
ones. The results prove that the proposed approach is 
valuable for such classifier systems and might end up 
being valuable in cases where tactic classification should 
be conducted. However, the system is inappropriate 
for online procedures, requires cooperation between 
multiple neural agents and has not been tested on real 
negotiation tactics' numerical data. Therefore, the re- 
sults might be valuable when a classification scheme 
is required, but are probably inappropriate as a future 
prediction pattern. 

On the other hand, (Veit, & Czernohous, 2003) 
present the results of enhancing consumer agents with 
several machine-learning algorithms in a properly de- 
signed electronic market with one static supplier. The 
results prove that under very specific circumstances 
the neural network assisted agent performs worse than 
a simple Q-learning assisted agent that maintains a 
specific set of values for the learning procedure in an 
a-priori instantiated matrix. However, the scenarios 
are very restrictive and in no case address the
characteristics of real world ones, where the application of
similar table-based agents would fail mainly due to
the diversity of the potential solution spaces for each
negotiation. The authors themselves acknowledge this
limitation and include it in their future plans.

In (Park & Yang, 2006), the authors propose a
negotiation agents system based on the incremental learning
of a feedforward neural network in order to increase
the efficiency of bilateral negotiations and to improve
the applicability towards multilateral negotiations. The
network is fed with values that are extracted after
a utility evaluation procedure, and at each round its
output forms the next counter-offer of the party.
With regard to the generalization to the multilateral
case, the proposed approach is based on matching
sellers and buyers in pairs among all possible ones,
following practical criteria such as a common negotiation
range. The experimental results
show that the proposed system achieves up to 2% more
agreements and carries out the negotiations at least
twice as fast as others with similar settings.

In (Wang, Chai, & Huang, 2005), the authors attempt
to solve the problem of selecting a selling agent that
meets the buyer's requirements as well as the utility
constraints represented by the corresponding
intelligent agent. The problem is solved by choosing
the seller before the negotiation and thus, the accuracy
of the negotiation and the buyer's utility are improved.
In order to fully utilize negotiation history, the authors
transform the problem of choosing a seller into a
K-armed bandit problem. The utility function is a joint
summation of the utilities of both the buyers and the 
sellers, while the buyer uses a properly learned neural 
network in order to learn its opponents' preferences 
and finally choose the one that will lead to the best 
agreement. The advantage of this framework is that the 
buyer's neural network learns off-line and only uses 
the results for the online procedure. Thus, there is not 
substantial impact on the real procedure. 
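To make the K-armed bandit view of seller selection concrete, the sketch below uses an epsilon-greedy strategy, which is one common way of handling bandit problems. It is an assumption made for illustration only and is not claimed to be the method of (Wang, Chai, & Huang, 2005): the function name, the Gaussian utility noise and the mean utilities of the three sellers are hypothetical.

```python
import random


def epsilon_greedy_seller_choice(true_mean_utility, rounds=2000,
                                 epsilon=0.1, seed=1):
    """Repeatedly pick a seller (arm) and keep a running mean of observed utility.

    `true_mean_utility` is hidden from the buyer in practice; here it only
    drives the simulated, noisy negotiation outcomes.
    """
    rng = random.Random(seed)
    k = len(true_mean_utility)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_mean_utility[arm], 0.1)       # simulated utility
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy_seller_choice([0.2, 0.5, 0.8])
best_seller = counts.index(max(counts))
```

In this simulated setting the seller with the highest mean utility ends up being selected most often, while the occasional random choice keeps the estimates of the other sellers up to date.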

Finally, in (Liu, & You, 2003), a fuzzy neural 
network is proposed to deal with the uncertainties 
in real world shopping activities, such as consumer 
preferences, product specification, product selection, 
price negotiation, purchase, delivery, after-sales service 
and evaluation. The fuzzy neural network manages to 
achieve an automatic and autonomous product clas- 
sification and selection scheme to support fuzzy deci- 
sion-making by integrating fuzzy logic technology and 
the back-propagation feedforward neural network. In 
addition, a visual data model is introduced to overcome 
the limitations of the current web browsers that lack 
flexibility for customers to view products from different
perspectives. The experimental results demonstrate
the feasibility of the proposed approach for web-based
business transactions.



CONCLUSION AND DISCUSSION 

In this paper, a brief survey of the most popular re- 
search efforts in the field of NN-assisted automated 
negotiations is presented. An important observation
that can easily be made is that there is a substantial
diversity in the purposes that the NNs are used
for in this domain. For instance, in some cases they
aim to estimate the opponent's future offers, whereas 
in other cases they assist the negotiating agent on 
selecting the best tactic that should be used in order 
to increase its potential utility. Even though the usage 
of NNs in automated negotiations may enhance vari- 
ous aspects of their performance and results, there are 
some cases where they are not suitable. For example, 
they perform far better when they are trained off-line, 
thus being less suitable when no a-priori knowledge 
is available. In general, it is preferable that relatively 
small NNs that are trained off-line are used, but if this 
is not possible, it is better to use NNs of minimal size 
that are trained on-line, risking however that they will 
eventually not be suitable enough. Furthermore, if the 
negotiation strategy of the opponent is not consistent, 
thus frequently demonstrating sharp changes in the 
type or configuration of the tactic used, the NNs often 
fail to adjust. In case the opponent employs imitative 
negotiation strategies, the usability of NNs in estimat- 
ing the opponent's behaviour is questionable. Finally, 
if the agent has low storage and processing resources 
available, the NNs that can be employed need to be 
so lightweight that they considerably lack flexibility. 
Despite these shortcomings, it is expected that NNs
will gain a considerable share in the learning-enabled
negotiating agents in the electronic marketplace.

KEY TERMS

Automated Negotiation: It is the process by which
a group of actors communicate with one another, aiming
to reach a mutually acceptable agreement on some
matter, where at least one of the actors is an autonomous
software agent.

Bilateral Negotiation: A negotiation procedure
where exactly two parties are involved, i.e. a client
and a provider.

Multilateral Negotiation: A negotiation procedure
where more than two parties are involved, i.e. multiple
clients and/or providers negotiate simultaneously.

Multi-Layer Perceptron (MLP): A fully con- 
nected feedforward NN with at least one hidden layer 
that is trained using back-propagation algorithmic 
techniques. 

Neural Network (NN): A network modelled after 
the neurons in a biological nervous system with multiple 
synapses and layers. It is designed as an interconnected 
system of processing elements organized in a layered 
parallel architecture. These elements are called neu- 
rons and have a limited number of inputs and outputs. 
NNs can be trained to find nonlinear relationships in 
data, enabling specific input sets to lead to given target 
outputs. 

Radial Basis Function (RBF): Function that 
involves a distance criterion with respect to a centre, 
such as a circle, ellipse or Gaussian. 
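As a minimal sketch of the Gaussian case, the code below defines a radial basis function and the output of a small single-output RBF network built from it. The function names, the shared width parameter and the weighted-sum output layer are assumptions made for illustration.

```python
import math


def gaussian_rbf(x, centre, width=1.0):
    """Gaussian radial basis function: depends only on the distance to `centre`."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_network(x, centres, weights, bias=0.0, width=1.0):
    """Single-output RBF NN: weighted sum of the hidden RBF activations."""
    hidden = [gaussian_rbf(x, c, width) for c in centres]
    return bias + sum(w * h for w, h in zip(weights, hidden))


at_centre = gaussian_rbf((0.0, 0.0), (0.0, 0.0))
output = rbf_network((0.0, 0.0), [(0.0, 0.0)], [2.0], bias=0.5)
```

The activation is largest at the centre and takes the same value for all inputs at the same distance from it, which is the defining property of a radial function.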

RBF NN: It is an artificial NN, the activation functions
of which are radial basis functions. It has two
layers of processing, where the first maps the input onto
each RBF neuron in the other (hidden) layer.

INTRODUCTION

Wireless ad-hoc networks are infrastructureless: they
consist of nodes that come together and start communicating
dynamically without requiring any backbone
support. The nodes can enter and leave the network at
will and can move about in the network at will.

Ad-hoc networks present the perfect test-beds 
for bio-inspired computing algorithms. Both ad-hoc 
networks and bio-inspired computing approaches are 
characterized by self-organization, feedback and struc- 
tural and functional complexity (Toh, 2002) (deCastro 
& Von Zuben, 2005). Hence, bio-inspired algorithms 
often provide us an opportunity to solve the most 
complex problems of ad-hoc networks in a satisfactory 
manner. In this chapter, we present the works done in 
the field of ad-hoc networks using bio-inspired Swarm 
Intelligence (SI). In particular, we look at how we can 
use Ant Colony Optimization (ACO) technique, a SI 
technique, for optimal routing in ad-hoc networks. 



BACKGROUND 

Most major bio-inspired algorithms have found
implementations in the field of ad-hoc networks. Before
delving into the details of the applications of ACO 
techniques for solving problems in ad-hoc networks, 
for contextual alignment, let us broadly review some of 
the applications of the different classes of bio-inspired 
algorithms to ad-hoc networks. Barolli, Koyama, & 
Shiratori (2003) presented a Genetic Algorithm to 
solve QoS routing for ad-hoc networks, while Di Caro, 
Ducatelle, & Gambardella (2004) used ACO technique 
to develop a nature inspired routing algorithm for ad-hoc 



networks. On similar lines, Wedde & Farooq (2005) 
presented BeeAdHoc - a routing algorithm inspired 
from foraging behavior of honey-bees. Neural Networks 
too have been extensively used in ad-hoc networks for 
providing solution to the problems of routing (Vicente, 
Mujica, Sisalem, & Popescu-Zeletin, 2005), intrusion 
detection (Zhang & Lee, 2000) and clustering (Ai-bin, 
Zi-xing, & De-wen, 1993). 

ACO is one of the most popular techniques among 
different bio-inspired techniques and has been exten- 
sively studied and deployed for solving problems as 
varied as Vehicular Routing Problem (VRP) (Toth & 
Vigo, 2001) to Single Machine Total Weighted Tardi- 
ness Problem (SMTWTP) (Abdul-Razaq, Potts & Van 
Wassenhove, 1990) to Graph Colouring Problem (Vesel 
& Zerovnik, 2000). 

ACO was introduced by Marco Dorigo in his PhD
thesis as Ant System (AS). It was initially aimed at
solving the popular Travelling Salesperson Problem.
Though the solution provided by AS was suboptimal
when compared with other specialized solutions, it
outlined a method which models the foraging
behaviour of ants to solve complex problems of computer
science.



MAIN FOCUS OF THE CHAPTER 

As mentioned earlier, the primary focus of this Chapter 
is to illustrate the applications of bio-inspired ACO 
techniques in the field of ad-hoc networks. We start by 
introducing the concepts of SI. Then, ACO concepts and 
their implementations in ad-hoc networks are discussed 
in detail. We first present and explain the properties of ant
colonies that enable them to find the shortest path between
the source of the food of the ants and their colony using
the concept of pheromone. We explain the concept of
artificial ants and then present the Random Proportional
Transition Rule (Dorigo & Stützle, 2003). We then
describe in detail the AntHocNet algorithm (Di Caro,
Ducatelle, & Gambardella, 2004), which uses the ACO
technique for routing data in ad-hoc networks.



SWARM INTELLIGENCE 

Social insects such as ants, bees, wasps and termites 
and organisms such as fishes and birds rely on local 
communication to achieve distributed control. While 
insects such as ants, bees and termites rely on indi- 
rect communication through environment (also often 
referred to in the literature as Stigmergy), birds are 
dependent on direct but localised communication. 

Nonetheless, all of these techniques aim at devel- 
oping a system in which each element of the system 
works together to establish autonomy. The elements 
co-operate with each other locally to make the system 
much more adaptable and robust to changes and errors. 
Since these are the main aims of the design of ad-hoc
networks, SI algorithms are effectively employed for
solving routing and Quality-of-Service (QoS) problems
in ad-hoc networks.

ACO 

Natural ants have a property that they always find the 
shortest path to the food source from their nests. This 
property can be illustrated by the experiments explained 
in (Goss, Aron, Deneubourg, & Pasteels, 1989) and 
(Deneubourg, S., S., & J., 1990). The set up of the 
experiments is illustrated in Figure 1 and Figure 2. 

In Figure 1, the two paths between the nest and the food
source are of equal length. It was found that roughly 50% of the
ants were using each path. On the other hand, in the
set up shown in Figure 2, where the paths are unequal,
it was found that after some time nearly all the ants
were using the shorter path. This phenomenon can be
explained using the following argument.

Figure 1. Same path lengths 1

Figure 2. Different path lengths 2

It was found that the ants mark the path that they
take with a chemical called pheromone, thereby guiding
other ants to take that path. The implication is that
the ants choose a path on the basis of the amount of
pheromone lying on that path. In Figure 2, when the
first group of ants start from their nest, they choose each
path with equal probability. So, about half of the ants
start moving on each path. The ants using the shorter
path reach point B first. Since no pheromone is present,
the group again divides into two, with half of the
ants turning towards the nest and the other half going
back via the longer path. When the ants coming from
the longer path (which remain unperturbed by the ants
coming back from that path) reach point B, they face a
choice between going towards the food source or going
back towards point A using the shorter path. Since the
pheromone on the path to the food source is smaller than
that on the path going back to point A, more ants use
this path, thereby increasing the pheromone quantity
on the shorter path even further. When the second group
of ants start from the nest and reach point A, more ants
choose the shorter path since the pheromone value on
that path is higher, thereby increasing the pheromone
quantity even further. This can be considered as a special
type of reinforcement learning mechanism. Also, the
chemical pheromone evaporates easily, thus reducing
the pheromone level on both the paths. Ultimately the
pheromone level on the longer path drops to nearly zero
and hence all the ants use the shorter path after some
time. Note that the path from point B to the food source
is negligibly short and hence the pheromone level on
it too increases rapidly and this becomes the selected
path at point B instead of the longer path going back.
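The positive feedback described above can be reproduced with a very small model. The sketch below is a simplified, deterministic (expected-value) version of the two-path experiment, not the original experimental procedure: it assumes that the fraction of ants on a path equals that path's share of the total pheromone, and that the pheromone deposited per time step is inversely proportional to the path length. The function name and all parameter values are hypothetical.

```python
def double_bridge(short_len=1.0, long_len=2.0, evaporation=0.1, steps=200):
    """Return the share of ants on the first path at every time step."""
    tau_short, tau_long = 1.0, 1.0            # equal pheromone at the start
    history = []
    for _ in range(steps):
        share_short = tau_short / (tau_short + tau_long)
        history.append(share_short)
        # Evaporation on both paths, then deposits by the ants using each path.
        tau_short = (1 - evaporation) * tau_short + share_short / short_len
        tau_long = (1 - evaporation) * tau_long + (1 - share_short) / long_len
    return history


unequal = double_bridge()              # Figure 2 set up: one path twice as long
equal = double_bridge(1.0, 1.0)        # Figure 1 set up: equal path lengths
```

With unequal lengths the share of the shorter path grows from one half towards one, whereas with equal lengths it stays at one half, in line with the two observations reported above.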

As mentioned earlier, Marco Dorigo modelled
this behaviour of ants mathematically using small agents
called Artificial Ants. These ants take probabilistic decisions
over a set of possible solutions as a function of the
pheromone value associated with each solution and heuristic
information which is based on the input data. During
each iteration, a solution is chosen with a probability
which is given by (deCastro & Von Zuben, 2005):






$$p_{ij} = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l \in N_i} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}} \qquad (1)$$

In Equation (1), p_{ij} is the probability of choosing
option j from state i, N_i is the set of options
available at i, τ_{ij} refers to the pheromone level
and η_{ij} is the heuristic information.

Equation (1) is called the Random Proportional
Transition Rule (deCastro & Von Zuben, 2005). The
pheromone and heuristic information can be weighted
using the exponents α and β, where α, β ∈ R+. The pheromone
value of a solution can also be updated on the basis
of its performance by a process of reinforcement. The
pheromone levels for good solutions can be reinforced
by increasing the pheromone values on such solutions. This
leads to faster convergence on optimal solutions.



In addition to this, pheromones are evaporated
proportionally on all solutions. If $\rho$ denotes the evaporation
rate, the new pheromone value of a solution is found
by multiplying its current value by a factor of $(1 - \rho)$. The purpose of
evaporation is opposite to that of reinforcement. It aims
at avoiding too rapid convergence, which
might lead to a sub-optimal solution (Dorigo & Stützle,
2003). Evaporation implements a convenient form of forgetting,
leading to exploration of a bigger solution space.
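
The selection, reinforcement and evaporation steps described above can be illustrated with a minimal Python sketch. It assumes the usual form of the Random Proportional Transition Rule, in which each candidate is weighted by its pheromone raised to $\alpha$ times its heuristic value raised to $\beta$. The function names, the dictionary-based data layout and the additive form of the reinforcement deposit are illustrative assumptions, not part of the original text.

```python
import random


def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=1.0):
    """Selection probability of each candidate solution.

    pheromone and heuristic map each candidate to a positive value;
    alpha and beta weight the two kinds of information.
    """
    weights = {c: (pheromone[c] ** alpha) * (heuristic[c] ** beta)
               for c in pheromone}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def choose(probabilities, rng=random):
    """Draw one candidate according to its selection probability."""
    candidates = list(probabilities)
    weights = [probabilities[c] for c in candidates]
    return rng.choices(candidates, weights=weights)[0]


def evaporate(pheromone, rho):
    """Multiply every pheromone value by (1 - rho)."""
    return {c: (1.0 - rho) * tau for c, tau in pheromone.items()}


def reinforce(pheromone, candidate, deposit):
    """Add a deposit to one candidate (assumed additive reinforcement)."""
    updated = dict(pheromone)
    updated[candidate] += deposit
    return updated
```

With equal heuristic values, a candidate carrying three times as much pheromone as its only rival is selected with probability 0.75 when $\alpha = 1$; evaporation then shrinks both values by the same factor, so that only repeated reinforcement keeps a path attractive.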

AntHocNet 

AntHocNet (Di Caro, Ducatelle, & Gambardella,
2004) is an Ant Colony-based routing technique for
ad-hoc networks. It is a hybrid algorithm which relies
on reactive route set up combined with proactive
route probing, maintenance and improvement. It is an
on-demand routing algorithm, which means that the
paths between the source and destination are set up as
and when required and are not computed beforehand.
AntHocNet is also a table-based routing protocol, which
means that tables are used by each node to keep
track of all the paths from that node to other nodes in
the network. Every on-demand routing algorithm in
ad-hoc networks has the following three functionalities
(Toh, 2002):

1. Route Discovery: To find a route between the 
source and the destination. 

2. Route Selection: If multiple routes are present, an 
algorithm selects the route(s) that are used for the 
purpose of routing. This might include selecting 
routes from a given table of routes. 

3. Route Maintenance: To take suitable actions when 
routes break due to movement of nodes or link 
failure. 

Though an algorithm may offer functions related to
security, fault tolerance and other application-specific
functionalities, these are the three basic functionalities
present in almost every routing algorithm. We
now discuss how AntHocNet implements all three
functionalities.

Route Discovery 

As mentioned earlier, AntHocNet is an on-demand
routing protocol. Routes are found 'on-the-go'. In
AntHocNet, small control packets called ant agents
are used for transferring control information within the
network. When a source node $s$ wants to communicate
with a destination node $d$, it looks for a path in its
pheromone table. If it does not find any path, it broadcasts
a reactive forward ant $F^s_d$. Ant agents that are copies
of each other (like the ones that are broadcast) are
said to belong to the same generation of ants.
A node $i$ that receives the ant, in turn, looks for a path
to the destination $d$ in its own pheromone table. A
pheromone table stores entries of the form $T^i_{nd}$, where
$T^i_{nd}$ represents the suitability of going over to the
neighbouring node $n$ to reach $d$. If pheromone information
is available, the next node is chosen on the basis of
the Random Proportional Transition Rule given in Equation (1) (Di
Caro, Ducatelle, & Gambardella, 2004). A node $n$ is
chosen with a probability $P_{nd}$ (Di Caro, Ducatelle, &
Gambardella, 2004):



$$P_{nd} = \frac{(T^i_{nd})^{\beta_1}}{\sum_{j \in N^i_d}(T^i_{jd})^{\beta_1}} \qquad (2)$$

Here, $N^i_d$ represents the set of neighbours of $i$ over which a
path to the node $d$ is known. On comparing Equation (1)
with Equation (2), the reader will notice that the
heuristic information is not used here (its weight is zero),
and that $\beta_1$ is the parameter which can be
used to control the exploratory behaviour of the ants.

If the node $i$ does not have any entry in its pheromone
table, it rebroadcasts the ant. To prevent flooding of the
network by these broadcasts, when a node receives an
ant which it had received before (that is, several ants
of the same generation), it compares it with the ant
with the best performance. Only if the number of hops $h$ and
the travel time $T_P$ are within the acceptance factor $a_1$
is the ant rebroadcast by that node. Otherwise, the
ant is discarded. An exception to this rule is made for
ants which differ from each other in the first hop. For
such ants, a higher acceptance factor $a_2$ is used. This
prevents kite-shaped paths (pseudo multiple paths) (Di
Caro, Ducatelle, & Gambardella, 2004) as shown in
Figures 3 and 4.
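
The acceptance test for repeated ants of one generation can be sketched as follows. This is only an illustration under stated assumptions: "within the acceptance factor" is read here as "no more than the factor times the corresponding value of the best ant", and the record layout, the function name and the factor values used in any call are hypothetical rather than taken from the text.

```python
from dataclasses import dataclass


@dataclass
class AntRecord:
    hops: int           # number of hops travelled so far
    travel_time: float  # travel time accumulated so far
    first_hop: str      # first node visited after the source


def should_rebroadcast(ant, best, a1, a2):
    """Decide whether a repeated ant of the same generation is forwarded.

    best is the best ant of this generation seen so far at the node.
    The looser factor a2 applies when the first hops differ, the
    stricter factor a1 otherwise (assumed reading of the rule).
    """
    factor = a2 if ant.first_hop != best.first_hop else a1
    return (ant.hops <= factor * best.hops
            and ant.travel_time <= factor * best.travel_time)
```

Under this reading, an ant that is slightly worse than the best one is dropped when it shares the best ant's first hop, but may still be forwarded when its first hop differs, which is what allows genuinely different paths to be discovered.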

When an ant reaches the destination $d$, it is converted
into a backward ant. A backward ant traverses the
route $P$ in the reverse direction, updating the value of
pheromone at each node. The time taken to transport
a data packet over $P$ is estimated as (Di Caro, Ducatelle,
& Gambardella, 2004):

$$\hat{T}_P = \sum_{i=1}^{n-1} \hat{T}_{i \to i+1} \qquad (3)$$

where $\hat{T}_{i \to i+1}$ is the total time required to transfer $Q^i_{mac}$ (length
of queue) + 1 packets to the MAC layer. This is found
as (Di Caro, Ducatelle, & Gambardella, 2004):

$$\hat{T}_{i \to i+1} = (Q^i_{mac} + 1)\,\hat{T}^i_{mac} \qquad (4)$$

$\hat{T}^i_{mac}$ is estimated as (Di Caro, Ducatelle, &
Gambardella, 2004):

$$\hat{T}^i_{mac} = \alpha\,\hat{T}^i_{mac} + (1 - \alpha)\,t^i_{mac} \qquad (5)$$

where $\alpha \in [0, 1]$ and $t^i_{mac}$ is the time measured for sending one packet from node $i$.

Figure 3. Pseudo multiple path (kite-shaped path) (endnote 3)

Figure 4. Paths with 1st hop different (accepted multiple paths) (endnote 4)

If $\hat{T}^i_d$ is the travelling time estimated by an ant and $h$ is the number of hops,
the pheromone value entry $T^i_{nd}$ in table $T^i$ at node $i$
is updated using the value $\tau^i_d$, given by (Di Caro, Ducatelle, & Gambardella,
2004):

$$\tau^i_d = \left(\frac{\hat{T}^i_d + h\,T_{hop}}{2}\right)^{-1} \qquad (6)$$

$T_{hop}$ is a fixed value representing the single hop time
in unloaded conditions. The value of $T^i_{nd}$ is updated as
(Di Caro, Ducatelle, & Gambardella, 2004):

$$T^i_{nd} = \gamma\,T^i_{nd} + (1 - \gamma)\,\tau^i_d \qquad (7)$$

where $\gamma \in [0, 1]$.
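
As a rough illustration of the backward-ant bookkeeping in Equations (3) to (7), the sketch below assumes the following readings: the per-hop time is (queue length + 1) times the running-average MAC sending time, the path time is the sum of the per-hop times, the pheromone sample is the inverse of the average of the estimated travel time and the hop count times the single-hop time, and the table entry is a running average of these samples. All function and parameter names are hypothetical.

```python
def hop_time(queue_length, mac_time):
    """Equation (4): time to pass the queued packets plus one to the MAC layer."""
    return (queue_length + 1) * mac_time


def update_mac_time(previous, measured, alpha):
    """Equation (5): running average of the per-packet MAC sending time."""
    return alpha * previous + (1.0 - alpha) * measured


def path_time(hop_times):
    """Equation (3): sum of the per-hop estimates along the path."""
    return sum(hop_times)


def pheromone_sample(travel_time, hops, t_hop):
    """Equation (6): inverse of the mean of estimated and hop-based time."""
    return 1.0 / ((travel_time + hops * t_hop) / 2.0)


def update_pheromone(old_value, sample, gamma):
    """Equation (7): running average of the pheromone table entry."""
    return gamma * old_value + (1.0 - gamma) * sample
```

A larger $\gamma$ makes the table entry change more slowly, so a single optimistic or pessimistic ant has less influence on subsequent routing decisions.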

Data Routing 



Once the paths have been set up, routing is done
stochastically. This means that when a node receives a
data packet, it can choose to forward it to one of the
several nodes in its list. A node chooses a next node
from the table with a probability (Di Caro, Ducatelle,
& Gambardella, 2004):

$$P_{nd} = \frac{(T^i_{nd})^{\beta_2}}{\sum_{j \in N^i_d}(T^i_{jd})^{\beta_2}} \qquad (8)$$

If one compares Equation (8) with Equation (2), one
finds that, using AntHocNet, data packets are forwarded
in a manner similar to control packets (reactive forward
ant packets in this case). However, the value of $\beta_2$ is
chosen much higher than $\beta_1$ so as to give preference to
better paths while routing the data. On the other hand, by
lowering the value of $\beta_1$ we can give liberty to reactive
forward ants to be more explorative in nature.
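
The effect of the exponent on next-hop selection can be seen in a small sketch. It assumes that both the ant rule and the data rule normalise the pheromone entries for destination $d$ after raising them to an exponent; the function name and the example table are hypothetical.

```python
def next_hop_probabilities(pheromone_to_d, beta):
    """Probability of each neighbour being chosen as next hop towards d.

    pheromone_to_d maps each neighbour with a known path to d to its
    pheromone entry; beta is the exponent (beta_1 for reactive forward
    ants, beta_2 for data packets).
    """
    powered = {n: tau ** beta for n, tau in pheromone_to_d.items()}
    total = sum(powered.values())
    return {n: w / total for n, w in powered.items()}


# Hypothetical table at some node: neighbour n1 lies on the better path.
table = {"n1": 2.0, "n2": 1.0}
ant_probs = next_hop_probabilities(table, beta=1)   # more exploratory
data_probs = next_hop_probabilities(table, beta=2)  # favours the better path
```

With these illustrative values, the better neighbour is chosen with probability 2/3 when the exponent is 1 and 4/5 when it is 2, which shows how a higher exponent concentrates data traffic on better paths while a lower one leaves ants freer to explore.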

Even though a node prefers to forward packets to a 
next node which lies on a better path, as the load on a 
network rises, the pheromone value on this particular 
path starts decreasing and hence the probability of 
selecting other paths increases. This kind of stochastic 
data routing leads to automatic load balancing. 

Route Maintenance 

As mentioned earlier, AntHocNet uses proactive route
probing and maintenance. The source node unicasts proactive
ants to the destination. These ants are forwarded
by each node in a manner similar to the other control
packets. However, they also have a small probability
of being broadcast. Hence, an ant which reaches the
destination node after a small number (say, one or two)
of broadcasts finds a fresh route to the destination. If
an ant is not broadcast, it gathers fresh information
regarding the existing route.

A node uses hello packets to track its neighbours. If
a node finds a new neighbour, it adds it to its routing
table. However, if a node discovers that a neighbour has
moved out of its neighbourhood (for example, if it does
not respond to the hello messages), it removes that
neighbour from its routing table and
broadcasts a link failure notification, so that the other
nodes in the network update their tables to record that
the node no longer has a route to the destination.

If, however, the node realizes that the problem is
due to link failure and not node movement, it tries to
repair the path by sending a route repair ant to the
destination. This ant tries to find an alternate path to the
destination, and if it does not return within a specified
time, the node assumes a failure and broadcasts a link
failure notification.
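
The hello-based neighbour tracking described above might be organised as in the following sketch. The class, its method names and the `hello_timeout` parameter are hypothetical; the text does not specify how long a node waits before declaring a neighbour lost, nor how the notification is encoded.

```python
class NeighbourTable:
    """Tracks neighbours from hello packets (illustrative sketch only)."""

    def __init__(self, hello_timeout):
        # Assumed parameter: silence longer than this means the neighbour is lost.
        self.hello_timeout = hello_timeout
        self.last_heard = {}

    def on_hello(self, neighbour, now):
        """Record a hello packet; return True if the neighbour is new."""
        is_new = neighbour not in self.last_heard
        self.last_heard[neighbour] = now
        return is_new

    def expire(self, now):
        """Remove silent neighbours and return them.

        The caller would broadcast a link failure notification for each
        returned neighbour.
        """
        lost = [n for n, heard in self.last_heard.items()
                if now - heard > self.hello_timeout]
        for n in lost:
            del self.last_heard[n]
        return lost
```

A node would call `on_hello` whenever a hello packet arrives and `expire` periodically; the route repair step described above would sit between detecting the loss and broadcasting the notification.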



CONCLUSION 

This chapter presents an example of how ACO techniques
can be applied to solve problems in ad-hoc
networks. We first introduced the food foraging behaviour
of ants and explained the phenomenon using the
concept of pheromone. We then explained the concept
of artificial ants and other ACO techniques, including
the Random Proportional Transition Rule.

We then introduced the AntHocNet algorithm (Di
Caro, Ducatelle, & Gambardella, 2004), which uses ACO
techniques for routing in mobile ad-hoc networks. We
explained how different routes are discovered using
reactive forward ants. We also explained how the Random
Proportional Transition Rule is used in the case of AntHocNet. We
then presented how data can be routed stochastically
using these routes and the automatic load balancing
that this achieves. In the end, we explained how these
routes are maintained using hello packets and link
failure notifications.
KEY TERMS

Ant Colony Optimization: Ant Colony Optimi- 
zation involves a set of algorithms modelled on the 
foraging behaviour of a colony of natural ants. 

AntHocNet: An Ant Colony Optimization-based 
algorithm for routing in ad-hoc networks using reactive 
route set up combined with proactive route probing, 
maintenance and improvement. 

Heuristic Information: Static value associated with 
a solution that represents the relative suitability of a 
solution among its peers based on intuition, previous 
experience or common sense. 

Pheromone: Chemical secreted by natural ants, the
presence of which is indicative of the number of ants
that have followed a particular path. This chemical is
modelled to represent the historical preference that is
associated with a path in Ant Colony Optimization.
Proactive Forward Ant: Control ant agents that
are unicast to the destination node and are responsible for
finding fresh information about existing routes or for
finding fresh routes to the destination.

Reactive Forward Ant: Ant agents responsible for
discovering paths to the destination nodes.

Stigmergy: Method of indirect communication
between simple agents by altering their environment.
Ants use a chemical called pheromone to communicate
with each other, which is an example of stigmergy.

Swarm Intelligence: Group of bio-inspired algorithms
modelled on the collective behaviour
of groups of social organisms such as ants, termites,
bees, fish and birds.

ENDNOTES

1&2 Based on Di Caro, Ducatelle, & Gambardella, 2004

3&4 Based on Di Caro, Ducatelle, & Gambardella,
INTRODUCTION

Swarm Robotics is a biologically inspired approach 
to the organisation and control of groups of robots. 
Its biological inspiration is mainly drawn from social 
insects, but also from herding and flocking phenomena 
in mammals and fish. The promise of emulating some 
of the efficient organisational principles of biological 
swarms is an alluring one. In biological systems such 
as colonies of ants, sophisticated cooperative behav- 
iour emerges despite the simplicity of the individual 
members, and the absence of centralised control and 
explicit directions. Such societies are able to maintain 
themselves as a collective, and to accomplish coordi- 
nated actions such as those required to construct and 
maintain nests, to find food, and to raise their young. 
The central idea behind swarm robotics is to find 
similar ways of coordinating and controlling collec- 
tions of robots. 



BACKGROUND 

The mechanisms that underlie social insect behaviour 
have inspired an approach that emphasises autonomy, 
emergence and distributed functioning, and avoids a 
reliance on centralised control and communication. 
This approach underlies both swarm robotics and the
closely related notion of artificial "swarm intelligence".
The term "swarm intelligence" was first coined in the
context of cellular robotic systems, on the basis of the
features that the simulated robotic collections shared
with social insects: namely "decentralised control, lack
of synchronicity, simple and (quasi) identical members"
and size (Beni and Wang, 1989). Bonabeau et al (1999)
describe as swarm intelligence "any attempt to design
algorithms or distributed problem-solving devices
inspired by the collective behaviour of social insect
colonies and other animal societies" (pg 7, Bonabeau
et al, 1999). The key ingredients of swarm intelligence
that they emphasise are self-organisation and stigmergy
(indirect communication via the environment). Martinoli
(2001) similarly describes the swarm intelligence
approach as emphasising "parallelism, distributedness,
and exploitation of direct (agent-to-agent) or indirect
(via the environment) local interactions among relatively
simple agents".

Swarm robotics has been described as the application 
of swarm intelligent principles to collective robotics 
(Sharkey and Sharkey 2006). The same principles of 
decentralised local control and communication are 
applied to physically instantiated robots. In swarm 
robotics, the emphasis is on using a number of simple 
robots that are autonomous, not subject to global con- 
trol, and that have limited communication abilities. 
The reliance on local communication means that the 
potential problems of communication bottlenecks, or 
centralised failure, are avoided. The system benefits 
from the redundancy of using several robots: if indi- 
vidual robots were to fail, others could take over, and 
new ones could be added without the need for recali- 
bration of communicative systems. In the same way, 
the activities of an ant colony need not be affected by 
the removal of some of its members. The simplicity 
of the individual robots means that they are able to 
respond quickly to the environment. There are also 
several tasks, such as exploring an environment, that 
can be accomplished more efficiently if a number of 
robots are used. 

Of course, using a collection of robots creates some 
new problems itself (Bonabeau et al, 1999). There is 
the possibility of stagnation: without global knowledge, 
a group of robots can find themselves in a deadlock 
situation. Too many robots trying to reach the same 
location, or perform the same task could obstruct each 
other. Another problem is finding a solution to a task: 
how can situations be engineered in order that a desired 
solution can emerge? Nonetheless, the promise of being 
able to send a number of autonomous robots to perform 
a task, particularly in sites that are remote and inhospi- 
table to humans, outweighs the disadvantages. 
SWARM ROBOTICS

Early work in swarm robotics can be illustrated by 
describing a series of studies in which simple robots 
are shown to be able to collect a number of objects 
in one place, and even to sort them. This work was 
initiated by a paper by Deneubourg et al (1991), and 
observations of the ability of ants to work together to 
sort their brood into clusters of eggs, larvae and cocoons, 
despite the insects' limited communicative abilities. 
In their simulations, "ant-like robots" (ALRs) moved
randomly in a two dimensional environment populated
by objects, and showed a greater probability of picking
up the isolated items they encountered, and a greater
probability of dropping them at locations where more
items of that type were present. Their simulations
demonstrated that the model eventually resulted in
clustering and sorting of objects. Beckers et al (1994)
applied these ideas to actual robots. Their robots had
IR sensors for obstacle avoidance, a gripper to pick
up the objects, and a microswitch that was activated
when they pushed three pucks or more. They could
(i) travel in a straight line until (ii) an obstacle was
detected, whereupon they would turn to avoid it, or
(iii) until their microswitch was activated, whereupon
they would drop the pucks they were carrying, and turn
away. Since the robots' grippers would automatically
collect up pucks they encountered, these behaviours
were sufficient to result in the eventual collection of all
the objects in a single cluster. Holland and Melhuish
(1999) extended these results, augmenting the robots'
behaviours with a "pull-back" rule that required robots
to pull pucks of one colour back for some distance 
before releasing them. Its effect was that (after several 
hours), pucks scattered across the arena were collected 
up by the robots, and sorted into clusters of different 
colours. More recently, Wilson et al (2004) reported 
further investigations of different minimalist solutions 
to 'ant-like annular sorting' using simple robots and 
simple mechanisms. 
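
The pick-up and drop tendencies of the ant-like robots described above can be illustrated with a short sketch. The text gives no formulas, so the squared-ratio response functions below, which are a form commonly used for this kind of clustering model, are an assumption, as are the constants `k1` and `k2` and the idea of summarising the neighbourhood by a single local density between 0 and 1.

```python
def pick_up_probability(local_density, k1=0.1):
    """Chance of picking up an item: high when few similar items are nearby.

    local_density is the (assumed) fraction of nearby sites holding items
    of the same type; k1 is an illustrative threshold constant.
    """
    return (k1 / (k1 + local_density)) ** 2


def drop_probability(local_density, k2=0.3):
    """Chance of dropping a carried item: high where similar items abound."""
    return (local_density / (k2 + local_density)) ** 2
```

An isolated item (density 0) is always picked up and never dropped, while in dense regions the opposite tendency dominates, so small clusters tend to grow at the expense of scattered items.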

Other swarm robotic studies have also explored the 
behaviours that can be accomplished by robots that re- 
spond in a fixed manner to environmental stimuli, and 
that do not directly communicate with each other. A 
number of studies were designed to investigate explic- 
itly cooperative tasks, (tasks that have been designed 
to require cooperation), such as pushing a box that is 
too heavy to be pushed by a single robot (Kube and 
Zhang, 1996; Kube and Bonabeau, 2000). Stick pulling
(Ijspeert et al, 2001) is a similarly explicitly cooperative
task that involved locating sticks in a circular arena 
and pulling them out of the ground in circumstances 
where the length of the stick means that a single robot 
cannot pull it out by itself, but must collaborate with a 
second robot. Ijspeert et al (2001) used reactive robots 
with minimal sensing abilities. Their results show that 
collaboration can still be obtained despite the absence 
of signalling, planning, or direct communication. 

These studies share a number of features. They all 
involve a number of robots. The robots are autono- 
mous, and not controlled centrally; the control methods 
used could be scaled up to larger numbers of robots, 
or scaled down to smaller numbers since each robot 
performs a set number of fixed behaviours in response 
to certain stimuli. The individual robots are certainly 
simple - they have no knowledge of the environment 
they are in, or even of the other robots in it. They 
are essentially reactive: they have no knowledge or 
map of their environment, and they have no ability to 
communicate directly with other robots, or to receive 
instructions. Nonetheless, they exhibit apparently co- 
operative behaviour. Many of the studies make use of 
the concept of stigmergy, a term introduced by Grassé
(1959) in the context of his observations of termite
building behaviour. He noted that termite workers
were stimulated to further constructive activity in the
presence of particular features of a construction. The
behaviour of the termite is affected by changes in the
environment created either by itself, or by other termites:
a form of indirect communication, where environmental 
changes have a signalling function. All of the examples 
discussed here explicitly draw analogies and parallels 
to living biological systems. Together, they illustrate 
some of the potential of swarm robotics: despite the 
simplicity of the individual robots, their interactions 
with the environment result in the performance of tasks 
in the physical world, and demonstrate that cooperation 
between such simple entities can emerge in the absence 
of any planning, centralised coordination, or even any 
direct communication between the robots. 

Nonetheless, as research in swarm robotics has 
developed, so has a certain lack of clarity and agree- 
ment about the terms to be used and about what their 
defining features are (see also Dorigo and Sahin, 2004). 
There is agreement that swarm robotics implies the use 
of control and communication methods that are decen- 
tralised and scalable, so that communication bottlenecks 
are avoided, the robots operate autonomously, and the 
same approach could be applied unchanged to varying
sizes of robot collection. It is less clear whether swarm 
robotics necessarily implies the use of reactive control 
and constraints on the kinds of communication involved. 
Bonabeau et al (1999) suggest that swarm-based robot- 
ics may be "loosely defined" as "reactive collective 
robotics" (pg 19), and that "swarm-based robotics relies 
on the anti-classical AI idea that a group of robots may 
be able to perform tasks without explicit representations 
of the environment and of the other robots". Should 
swarm robotics be restricted to the use of robots with 
such minimal representational abilities? 

Arguments in favour of restricting swarm robotics 
to the use of reactive robots can be made on the basis 
of parsimony. Pfeifer and Scheier (2001) argue that 
a more parsimonious model should be preferred over 
more complicated ones, and that, for instance, a model 
that can explain "the clustering behaviour of ants based 
on simple reflexes, for example, is to be preferred over 
one that postulates some sort of internal representation 
of clustering". Of course, there are practical advantages 
to be gained from attempting to accomplish a given task 
with the simplest possible mechanism. Minimalist unit 
design and a reliance on reactive robots that respond 
rapidly to stimuli in the environment can facilitate a 
rapid response to changing situations, and lead to the 
use of robots that are relatively cheap and expendable. 
Wilson et al (2004) for example, provide a practical 
justification for their minimalist approach that relies 
on robots built from simple mechanical components 
and sensors, claiming, "Potentially, this allows for the 
production of more robust and cheaper robot units.... 
Simple behavioural rules are employed so the robots 
are less complex. Rules have to be embodied and 
realized in a machine, and the more complicated the 
rules, the more complicated the hardware and software 
in the machine is likely to be. The more complicated 
the hardware and software required, the more there is 
to go wrong." The problem is that there is likely to 
be a limit to the number and kinds of task that can be 
accomplished using reflexive behaviours, and avoiding 
internal representation. 

Another reason for an emphasis on reactive robots in 
swarm robotics is based on its inheritance from behav- 
iour-based robotics. Rodney Brooks and his associates 
introduced the idea of behaviour-based robotics, and 
the advantages to be gained from departing from the 
traditional emphasis on reasoning and representation 
in Artificial Intelligence (Brooks 1999). They showed
that certain tasks could be solved more easily by robots
that were situated in the world, and could react to and 
exploit characteristics of the environment, than by 
robots such as Shakey (Nilsson, 1984) that depended 
on a representationally intensive, and slow approach 
to modelling the world. However, again the range of 
tasks to which robots without representations can be 
applied is limited, and more recent formulations of 
behaviour-based robotics (Mataric 1997) incorporate 
the idea of action-centred representations. 

A final reason for preferring to use reactive robots 
in swarm robotics can be found in assumptions about 
the limited abilities of the social insects that inspire 
them. Debates over parsimonious explanation have 
always occurred in biology. For instance, Griffin (1992)
has argued that there has been a long held view that 
social insects are little more than "genetically pro- 
grammed clockwork". Under such a view, insects can 
only react to stimuli, and are not able to represent the 
world, or to communicate amongst themselves. This 
view of insects as clockwork (albeit clockwork with 
sensors that enable it to respond to the world) is one 
that can be traced back to behaviourism's response to 
the anthropomorphism that preceded it. The synopsis 
of a book on insect learning (Papaj and Lewis, 1992) 
claims that "until recently, insects were viewed as 
rigidly programmed automatons: now however, it is 
recognised that they can actually learn and that their 
behaviour is plastic". 

There is a gradually accumulating body of evi- 
dence that shows that social insects do have some 
representational and learning abilities, and that their 
communicative abilities are more extensive than was 
once supposed. For example, Collett and Collett (2002) 
review evidence for memory use in insect visual naviga- 
tion, describing evidence that shows reliable recogni- 
tion of visual landmarks, and reliable performance of 
learned routes. Franks and Richardson (2006) report 
evidence that the ant Temnothorax albipennis can use 
tandem running to lead another ant from nest to food, 
and make use of bi-directional feedback, as the leader 
ant modifies its behaviour when being followed - the 
leader teaches the route to the follower. Robinson et al
(2005) have shown that as well as laying a pheromone
trail to guide others to a food source, Pharaoh's ants
can also lay a negative "no entry" signal to mark an
unrewarding trail path.

Such findings could justify extending the capabili- 
ties of individual robots in a swarm beyond those of 
reactive control. Robots with some ability to learn a
route, to recognise landmarks, or to keep track of the 
number of encounters they have with others would be 
able to perform a wider range of tasks. It would be 
interesting to explore the ways in which such abilities 
could be incorporated into swarm robotics. An approach 
of "biologically plausible minimalism", in which the 
representational and communicative abilities of the ro- 
bots were restricted to those plausible for social insects 
would ensure that any such approach still preserved the 
swarm advantages of decentralisation and scalability 
shown in their biological counterparts. 



FUTURE TRENDS 

An avenue that could be explored in future swarm 
robotic research is that of incorporating simple forms 
of memory and representational ability, without com- 
promising the swarm-related benefits of local control 
and communication and scalability. Relatively simple 
robots could, for instance, be given some minimal 
representational abilities: the ability for instance to 
learn a route, or to recognise landmarks. Similarly, it 
would be interesting to explore the use of some further 
communicative abilities other than that of pheromone 
trail laying. For example, robots could be given the 
ability to convey, and to sense, the tasks that they and 
other robots are involved in, and to keep account of 
the frequency of their encounters. This would enable 
some distributed decision making abilities, and dynamic 
switching between tasks based on their local records 
of the numbers performing each task. These limited 
cognitive abilities would still depend on entirely local 
control, and would be scalable, but such extensions 
could be used to extend the range and complexity of 
tasks to which swarm robotics could be applied. 



CONCLUSION 

In this article, we have surveyed swarm robotics research 
and discussed the source of its biological inspiration 
- the self-organised behaviour of social insects. Some 
representative studies have been described, and their 
common characteristics noted. These include the ideas 
that the robots in a swarm should be simple, autonomous, 
and subject to local control and communication. The 
expected benefits of using such robots are that they
should be able to provide a robust and flexible solution
for practical applications in inaccessible areas; one that 
benefits from an inherent redundancy, since robots could 
fail or be replaced without the need for recalibration of 
the control and communication methods. 

The approach is of interest, but still in its early 
stages. There is still some disagreement about the
use of the term 'swarm robotics', and the constraints
it implies. In particular, it is not clear whether swarm 
robotics necessarily involves the use of reactive robots 
with effectively no representational ability. There are 
reasons to prefer the simplest possible solution for a 
given task, but the argument is made here that there 
is evidence that social insects do have some ability to 
represent the environment, and that incorporating such 
abilities into swarm robotics would extend the range of 
tasks to which the approach could be applied, without 
compromising its swarm-related advantages. 
KEY TERMS

Behaviour-Based Robotics (BBR): A paradigm 
initiated by Brooks (1999) that stressed the importance 
of studying robots situated in the world, and responding 
to information directly gathered by their sensors. BBR 
robots make minimal use of internal representations. 

Emergent Behaviour: Results from the unsuper- 
vised interaction of a number of simpler processes. 
The complex behaviour of an ant colony is a good 
example of emergent behaviour: the individual ants 
carry out their tasks on a local level, but the combined 
effect is of a colony that is able to maintain its own 
organisation. 

Reactive Robotics: An approach to robot control
in which there is a direct mapping from the robot's
sensor input to its motor output. No use is
made of internal representations of the world. The
approach dates from the reactive robots developed by
Grey Walter (1954).

Self-Organisation: Pattern-formation processes in 
physical and biological systems that occur as a result of 
interactions internal to the system, without intervention
by external influences. 

Stigmergy: A method of indirect communication 
that occurs when one individual modifies the environ- 
ment, and another responds to that environment at a 
later time. 

Swarm Intelligence: Describes attempts to design 
algorithms and to solve problems, using methods in- 
spired by observations of the collective behaviour of 
biological groups such as insect colonies. 

Social Insects: Insects that live cooperatively in 
colonies and exhibit a division of labour among distinct 
castes. E.g. termites, ants, bees, some wasps. 
INTRODUCTION

The topic of representation acquisition, manipulation 
and use has been a major trend in Artificial Intelligence 
since its beginning and persists as an important matter 
in current research. Particularly, due to initial focus on 
development of symbolic systems, this topic is usually 
related to research in symbol grounding by artificial 
intelligent systems. Symbolic systems, as proposed by 
Newell & Simon (1976), are characterized as a high- 
level cognition system in which symbols are seen as 
"[lying] at the root of intelligent action" (Newell and 
Simon, 1976, p. 83). Moreover, they stated the Physi- 
cal Symbol Systems Hypothesis (PSSH), making the 
strong claim that "a physical symbol system has the 
necessary and sufficient means for general intelligent 
action" (p. 87). 

This hypothesis, therefore, sets up an equivalence between
symbol systems and intelligent action, in such a way
that every intelligent action would originate in a
symbol system and every symbol system is capable
of intelligent action. The symbol system described by
Newell and Simon (1976) is seen as a computer program
capable of manipulating entities called symbols, 'physical
patterns' combined in expressions, which can be
created, modified or destroyed by syntactic processes.
Two main capabilities of symbol systems were said to
provide the system with the properties of closure and
completeness, and so the system itself could be built
upon symbols alone (Newell & Simon, 1976). These
capabilities were designation - expressions designate
objects - and interpretation - expressions could be processed
by the system. The question was, and much of
the criticism about symbol systems came from it, how
these systems, built upon and manipulating just symbols,
could designate something outside their domain.

Symbol systems lack 'intentionality', stated John
Searle (1980), in an important essay in which he
described a widely known thought experiment
(Gedankenexperiment), the 'Chinese Room Argument'. In this
experiment, Searle places himself in a room where he
is given correlation rules that permit him to determine
answers in Chinese to questions, also in Chinese, given
to him, although Searle as the interpreter knows no
Chinese. To an outside observer (who understands
Chinese), the man in this room understands Chinese
quite well, even though he is actually manipulating
non-interpreted symbols using formal rules. For an
outside observer the symbols in the questions and
answers do represent something, but for the man in
the room the symbols lack intentionality. The man in
the room acts like a symbol system, which relies only
on the manipulation of symbolic structures by formal
rules. For such systems, the manipulated tokens are not
about anything, and so they cannot even be regarded
as representations. The only intentionality that can be
attributed to these symbols belongs to whoever uses
the system, sending in inputs that represent something
to that user and interpreting the output that comes out
of the system (Searle, 1980).

Therefore, intentionality is the important feature
missing in symbol systems. The concept of intentionality
is one of aboutness, a "feature of certain mental states by
which they are directed at or about objects and states of
affairs in the world" (Searle, 1980), as in a thought being
about a certain place (see endnote 1). Searle (1980) points
out that a 'program' itself cannot achieve intentionality,
because programs involve formal relations and
intentionality depends on causal relations. Along these
lines, Searle leaves open a possibility to overcome the
limitations of mere programs: 'machines' - physical
systems causally connected to the world and having
'causal internal powers' - could reproduce the necessary
causality, an approach in the same direction as situated
and embodied cognitive science and robotics. It is
important to notice that these 'machines' should not be
just robots controlled by a symbol system as described
before. If the input came not from a keyboard but from
a video camera, and the output went not to a monitor
but to motors, it would make no difference, since
the symbol system is not aware of this change. Even
in this case, the robot would not have intentional
states (Searle, 1980).

Symbol systems should not depend on formal rules 
only, if symbols are to represent something to the 
system. This issue brought in another question, how 
symbols could be connected to what they represent, 
or, as stated by Harnad (1990) defining the Symbol 
Grounding Problem: 

"How can the semantic interpretation of a formal 
symbol system be made intrinsic to the system, rather 
than just parasitic on the meanings in our heads? How 
can the meanings of the meaningless symbol tokens, 
manipulated solely on the basis of their (arbitrary) 
shapes, be grounded in anything but other meaningless
symbols?"

The Symbol Grounding Problem, therefore, reinforces
two important matters. First, symbols do not
represent anything to a system, at least not what they
were said to 'designate'. Only someone operating the
system could recognize those symbols as referring to
entities outside the system. Second, the symbol system
cannot maintain its closure by relating symbols only to
other symbols; something else is necessary
to establish a connection between symbols and what
they represent. An analogy made by Harnad (1990) is
with someone who knows no Chinese but tries to learn
Chinese from a Chinese/Chinese dictionary. Since terms
are defined by using other terms and none of them is
known beforehand, the person is kept on a
'dictionary-go-round' without ever understanding those symbols.

The great challenge for Artificial Intelligence
researchers, then, is to connect symbols to what they
represent, and also to identify the consequences that
implementing such a connection would have for
a symbol system; e.g., many of the descriptions of
symbols by means of other symbols would be unnecessary
when descriptions through grounding are available. It
is important to notice that the grounding process is not
just about giving sensors to an artificial system so that it
would be able to 'see' the world, since this 'trivializes' the
symbol grounding problem and ignores the important
issue of how the connection between symbols and
objects is established (Harnad, 1990).



BACKGROUND 

The symbol grounding problem arose from the observation
that symbol systems manipulated structures that could
be associated with things in the world by an observer
operating the system, but not by the system itself. The
quest for symbol grounding processes is concerned with
understanding processes which could enable the
connection of these purely symbolic representations with
what they in fact represent, either directly or
by means of other grounded representations.

This represents a technological challenge as much as
a philosophical and scientific one, and there is a strong
interrelation between them. On one side there is the
concern with the technological design and engineering
of symbol grounding processes in artificial systems.
On the other side, grounding is a process
present in natural systems and therefore precedes
artificial systems. Theories and models are developed
to explain grounding and, if consistent and detailed
enough, may in principle be implemented in artificial
systems, which in turn serve as a laboratory
for these theories, where their hypotheses are tested and
new questions are raised, allowing further refinement
and experimentation.

A first proposal for symbol grounding was made
by Harnad (1990) in the same paper where he gave a
definition for the 'symbol grounding problem'. Harnad
proposed that symbolic representations should be
grounded bottom-up by means of non-symbolic
representations: iconic representations - sensory projections
of objects - and categorical representations - invariant
features of objects. Neural networks were pointed out
as feature learners and discriminators, which could link
sensory data with symbolic representations after being
trained to identify the invariant features. This would
causally connect symbols and sensory data, but this
proposal describes just a tagging system that gives
names to sensed objects but does not use this to take
actions and interact with its environment. A 'mental
theater' is formed, as Dennett (1991) defined it, where
images are projected internally and associated with
symbols, but no one is watching it. Besides, the symbols
and the iconic representations are probably given by
the system's operator and the system must learn them
all, making no distinction between them and attributing
no functionality to them.

Another approach to deal with the limitations of
symbol systems was presented by Brooks (1990).
Instead of modeling artificial systems as symbol
systems, Brooks rejected the symbolic approach to
cognition modeling and the need for representations
to this end: "[r]epresentation is the wrong unit of
abstraction in building the bulkiest parts of intelligent
systems" (Brooks, 1991b, p. 139), with representations
seen as centralized, explicit and pre-defined structures.
He proposed the Physical Grounding Hypothesis
(Brooks, 1990), holding that intelligent systems
should be embedded in the real world, sensing and
acting in it, establishing causal relations with perceptions
and actions, and being built in a bottom-up manner with
higher levels depending on lower ones. There was no
need for representations because the system is already
in touch with the objects and events it would need to
represent. Moreover, Brooks called attention to the fact
that the most important aspect of intelligence was left
out: to deal with the world and its dynamics. Instead of
studying sophisticated high-level processes operating on
simple domains, research in the so-called Nouvelle AI
should focus on simpler processes dealing with greatly
complicated domains (such as the real world) and work
its way from there to higher-level ones (Brooks, 1990).
Brooks (1991a) also stated principles for this new
approach, such as situatedness and embodiment, which
are the mottos of situated and embodied cognition
studies (Clark, 1997).

Symbolic representations are in fact not incompatible
with the Physical Grounding Hypothesis and
with the Situated and Embodied Cognition approach.
Brooks (1990) himself pointed out that high-level
abstractions should be made 'concrete' by means of lower-level
processes; thus symbolic representations should be
causally constructed from the situated/embodied
interaction dynamics through the artificial agent's history.
This approach was followed by several researchers
dealing with symbol grounding when building artificial
systems in which representations emerge from the agent's
interactions, when learning processes take place (e.g.
Ziemke 1999, Vogt 2002, Cangelosi 2002, Roy 2005;
see also Christiansen & Kirby 2003; Wagner, 2003, for
a review of experiments about language emergence).

In most of these new systems, artificial agents are
situated in an environment, in which they can sense
and act, and are allowed to interact with other agents
- either artificial or biological ones. By means of
associative learning mechanisms, agents are able to
gradually establish relations between representations
and what they represent in the world, using
communication as the basis to guide this learning process. And
remarkably, when explicitly discussing the 'symbol
grounding problem' (Vogt 2002, Cangelosi et al. 2002,
Roy 2005), the sign theory of Charles Sanders Peirce,
particularly his definition of a symbol, is brought forth
as the theoretical background for a new view into this
problem.



SEMIOTICS AND SYMBOL GROUNDING 
PROBLEM 

The symbol grounding problem is fundamentally a 
matter of how certain things can represent other things 
to someone. Although symbol systems were said to
have 'designation' properties (Newell & Simon, 1976),
which would allow symbols manipulated by the sys- 
tem to stand for objects and events in the world, this 
property should actually be attributed to an outside 
observer who was the only one able to make this con- 
nection. The artificial system itself did not have this 
capability, so the symbols it manipulated were said to 
be ungrounded. Building artificial systems based on the 
hypothesis that symbolic processes were autonomous 
and no other process was required proved to be flawed, 
and the quest to understand representation processes 
came up as a major issue. 

Representation is the focus of semiotics, the 'formal
science of signs' as defined by Charles Sanders Peirce.
His definition of semiotics and his pragmatic notion of
meaning as the 'action of signs' (semiosis) have had a
deep impact on philosophy, psychology, theoretical
biology, and the cognitive sciences. Peirce developed his
sign model and classification from his
logical-phenomenological categories. His definition of a sign as
"something which stands to somebody for something in
some respect or capacity" (Peirce 1931-1958, §2.228)
interrelates three distinct elements: a sign, an object,
which the sign represents in some respect, and an effect
(interpretant) on an interpreter. The nature of the
relation between sign and object establishes the 'most
fundamental division of signs': signs can be
icons, indexes or symbols. Icons stand for the object
through resemblance or similarity, since they carry
properties in common with the object. The drawing
of an object, a diagram and 'sensory projections' are
regarded as icons. An index has a spatio-temporal
physical relation with its object; both occur as events,
and the interpreter is not responsible for the connection
between them but merely notices it once it is established
(Peirce 1931-1958, §2.299). Examples of indexes are
smoke, which is related to fire, a scream that calls
our attention, or a bullet hole. According to Peirce,
a symbol is a sign because it is interpreted as such, due
to a natural or conventional disposition, regardless of the
origin of this general interpretation rule (Peirce
1931-1958, §2.307). A word, a text and even the red light of a
traffic light alerting drivers to stop are symbols. In this
symbolic process, the object which is communicated
to the interpretant through the sign is a lawful
relationship between a given type of sign and a given type of
object. Generally speaking, a symbol communicates a
law to the interpretant as a result of a regularity in the
relationship between sign and object.

Furthermore, it is important to remark that symbols,
indexes and icons are not mutually exclusive classes;
they are interrelated and interdependent classes. "A
Symbol is a law, or regularity of the indefinite future.
[...] But a law necessarily governs, or "is embodied
in" individuals, and prescribes some of their qualities.
Consequently, a constituent of a Symbol may be an
Index, and a constituent may be an Icon" (Peirce
1931-1958, §2.293). Symbols require indexes, which require
icons. Harnad (1990) already noticed that symbols need
non-symbolic representations and proposed that symbols
are to be connected with sensory projections and
categorical features, both regarded as icons in Peirce's
theory. A symbol is a sign and as such it involves an
object which it refers to and an interpretant, the effect
of the sign, so a symbol can only represent something
to someone and when someone is interpreting it. A
symbol distinguishes itself from other signs since it
holds no resemblance or spatio-temporal relation
with the object and thus depends on a general rule
or disposition of the interpreter. Finally, symbols
incorporate indexes (and icons, consequently), and one
way symbols can be acquired is by exploiting indexical
relations between signs and objects, establishing
regularities between them.



FUTURE TRENDS 

The discussion around the symbol grounding problem
has an important theoretical component, since
it involves issues such as representation and cognitive
modeling. Nevertheless, it is also a technological
concern if researchers in Artificial Intelligence intend to
model and build artificial systems which are capable
of handling symbols in the appropriate way. The most
evident consequence of discussing the problem of
symbol grounding and ways of solving it is related
to language and, more generally, to communication
systems between agents (artificial ones or not). If we
expect a robot to act appropriately when we say 'bring
me that cup', we should expect it to know what these
symbols represent so it will act accordingly. Moreover,
we expect an artificial agent to learn and establish
symbol-object connections autonomously, without
the need for programming everything prior to the robot's
execution or reprogramming it every time a new symbol is
to be learned.

The employment of a strong theoretical basis that
thoroughly describes the process of interest can
certainly contribute to the endeavor of modeling and
implementing it in artificial systems. The semiotics of
Charles S. Peirce is recognized as a strongly consistent
theory, and has been brought forth by diverse
researchers in Artificial Intelligence, though in a fragmentary
way. We expect that Peirce's description of sign processes will
shed light on the intricate problem of symbol grounding.
Particularly, Peirce's conception of meaning as sign
action can open up perspectives on the implementation
of semiotic machines, which can produce, transmit,
receive, compute, and interpret signs of different
kinds, meaningfully (Fetzer 1990). According to the
pragmatic approach of Peirce, meaning is not an
infused concept, but a power to engender interpretants
(effects on interpreters). According to Peirce's
pragmatic model of the sign, meaning is a context-sensitive
(situated), interpreter-dependent, materially extended
(embodied) dynamic process. It is a social-cognitive
process, not merely a static system. It emphasizes
process and cannot be dissociated from the notion of
a situated (and actively distributed) communicational
agent. It is context-sensitive in the sense that it is
determined by the network of communicative events within
which the interpreting agents are immersed with the
signs, such that they cooperate with one another. It is
both interpreter-dependent and objective because it
triadically connects sign, object, and an effect in the
interpreter.



CONCLUSION 

The original conception of artificial intelligent systems
as symbol systems brought forth a problem known as the
symbol grounding problem. If symbol systems manipulate
symbols, these symbols should represent something
to the system itself and not only to an external observer,
but the system has no way of grounding these symbols
in its sensory and motor interaction history, since it
does not have one. Many researchers pointed out this key
flaw, particularly John Searle (1980) with his Chinese
Room Argument and Stevan Harnad (1990), whose
definition of the problem became well known.

A direction towards modeling artificial systems as
embodied and situated agents instead of
symbol-manipulating systems was pointed out, urging the need to
implement systems that could autonomously interact
with their environment and with the things they should have
representations for. But the topic of symbol grounding
also needs a description of how certain things come
to represent other things to someone, which is the topic of
study of semiotics. The semiotics of C. S. Peirce has
been used as a theoretical framework in the discussion
of the symbol grounding problem in Artificial
Intelligence. The application of his theory in dealing
with the symbol grounding problem should further
contribute to the development of computational models
of cognitive systems and to the construction of ever
more meaningful machines.
KEY TERMS

Icon: A sign that represents its object by means of 
similarity or resemblance. 

Index: A sign spatio-temporally (physically) connected
with its object.

Representation: The same as a sign. 

Sign: Something that stands for something else in 
a certain aspect to someone. 

Symbol: A sign that stands for its object by means 
of a law, rule or disposition. 

Symbol Grounding Problem: The problem related
to the requirement that symbols be grounded in
something other than other symbols, if a symbol is to
represent something to an artificial system.

Symbol System: A system that models intelligent
action as symbol manipulation alone.



ENDNOTE 

1 See also Dennett & Haugeland 1987, Searle 1983,
Jacob 2003.
INTRODUCTION

Symbolic search solves state space problems consist- 
ing of an initial state, a set of goal states, and a set of 
actions using a succinct representation for state sets. 
The approach lessens the costs associated with the 
exponential memory requirements for the state sets 
involved as problem sizes get bigger. 

Symbolic search has been associated with the term
planning via model checking (Giunchiglia and Traverso
1999). While initially applied to model checking of hardware
verification problems (McMillan 1993), symbolic
search now features in many modern action planning systems
(Ghallab et al. 2000).

Symbolic search algorithms explore the underly- 
ing problem graph by using functional expressions to 
represent sets of states and actions. Compared with the 
space requirements induced by standard explicit-state 
search algorithms, symbolic representations addition- 
ally save space by sharing parts of the state vector. 
Algorithm designs change, as not all search algorithms 
adapt to the exploration of state sets. 



BACKGROUND 



Binary decision diagrams or BDDs are one option for
a space-efficient representation of state sets.

A BDD (Bryant 1992; see Figure 1) is a data structure
for manipulating Boolean functions efficiently. BDDs
are finite state machines over the alphabet {0,1} with a
1-sink that operates as an accepting state. Each internal
node is labelled with the variable (index) that selects
the outgoing transition (either 1 or 0, see figure) for a
given variable assignment. To evaluate a BDD, a
path is traced from the root to one of the sinks (all paths
obey the same variable ordering). What distinguishes BDDs
from decision trees is the use of reduction rules, which
detect unnecessary variable tests and repeated subgraphs.
This leads to a unique representation that is polynomial in
the number of input variables for many interesting
functions. The uniqueness of the reduced and ordered BDD
representation is a clear benefit for the satisfiability test
of Boolean formulas, which by virtue of Cook's
Theorem (1971) is an NP-hard problem.

Figure 1.

In symbolic search, BDDs accept the state vector
representation. Accordingly, a function is satisfied if the
state vector given as the input assignment is a member of
the represented set. The characteristic function can be
identified with the state set it represents.

The transition relation Trans represents the actions
(see Figure 2). It refers to current state variables x
and next state variables x' and is satisfied if there is
an action that transforms a state vector into one of its
successors. The transition relation for the entire problem
decomposes into the disjunction of the transition
relations for singleton actions. The order of variables
in the state vector crucially influences the size of
the BDD. Unfortunately, the problem of finding the
ordering that minimizes the BDD size is NP-hard
(Wegener 2000). The interleaved representation of
Trans(x,x') that alternates between x and x' variables
often leads to small BDDs.

Figure 2.

The image of a state set States with respect to the transition
relation Trans is computed as Image(x') := ∃x (Trans(x,x')
∧ States(x)), where x and x' are vectors of Boolean
state variables. The result of this image operation is
a characteristic function of all states reachable from
States in one step. In order to repeat the process, x' has
to be substituted by x for the next iteration by computing
the relational product States(x) := ∃x' ((x=x')
∧ Image(x')). In an interleaved variable ordering with
alternating indices for x and x', this operation reduces
to a mere textual replacement of node labels.
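
The image step can be illustrated with explicit Python sets standing in for BDDs. This is only a sketch under that assumption: a real implementation would keep States and Trans as BDDs and use the relational product of a BDD library. The function name image and the toy two-bit counter are hypothetical and chosen for illustration.

```python
# Explicit sets stand in for BDDs here (an assumption for illustration):
# a state set is a set of bit-vectors, Trans is a set of (x, x') pairs.

def image(trans, states):
    """Image(x') := exists x. Trans(x, x') and States(x)."""
    return {x_next for (x, x_next) in trans if x in states}

# Toy 2-bit counter: 00 -> 01 -> 10 -> 11 -> 00 (hypothetical example).
trans = {
    ((0, 0), (0, 1)),
    ((0, 1), (1, 0)),
    ((1, 0), (1, 1)),
    ((1, 1), (0, 0)),
}

states = {(0, 0), (1, 0)}
successors = image(trans, states)  # a set over the primed variables x'
states = successors                # renaming x' back to x for the next step
```

With explicit sets the renaming of x' to x is a plain assignment; with BDDs it is the relabeling of nodes described above.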



SYMBOLIC SEARCH ALGORITHMS 

In state space problems, numbers of finite domain can
be encoded via atomic propositions. A binary encoding
is more efficient than a unary one, so most
BDD libraries include finite domain variable support.
For basic arithmetic, relations are pre-computed. For
example, the binary relation Inc(a,b) for a+1=b is
the disjunction of all possible value assignments of a
to j and all possible value assignments of b to j+1 for
all j counted from 1 to the domain size minus 1. For
constructing the ternary relation Add(a,b,c), denoting
a+b=c, the enumeration of all possible assignments
for a, b and c is less efficient than computing the term
Add(a,b,c) := (b=0 ∧ a=c) ∨ ∃b',c' (Inc(b',b) ∧ Inc(c',c)
∧ Add(a,b',c')) recursively. Starting with the first clause,
the second clause is applied until convergence.

Symbolic Breadth-First Search 

In iteration i of the symbolic variant of breadth-first
search, the set of states States[i] reachable from the
initial state s in i steps is computed. The search is
initialized with States[0] set to the initial state set. In
order to terminate the search, the algorithm checks
whether or not a state is represented in the intersection
of the set States[i] with the set of goal states. Since
States[0],...,States[i-1] have been computed without
success, given a non-empty intersection, i is the optimal
solution length. To avoid infinite search behaviour
in the absence of a solution, Reach = States[0]
∨ ... ∨ States[i-1] is omitted from States[i] by setting
States[i] := States[i] ∧ ¬Reach before updating Reach
:= Reach ∨ States[i]. For some problem classes (like
undirected or acyclic graphs) the duplicate elimination
scope {0,...,i-1} can be reduced to a limited number
of breadth-first search levels.

By keeping the intermediate BDDs in
memory, a legal sequence of states linking the initial
state to any goal state g in States[i] ∩ G can be
reconstructed as a solution. The state preceding a goal g in
layer i on an optimal path must be located in the second-to-last
breadth-first search layer i-1. All states that are contained in the
intersection of the predecessors of the goal g and
States[i-1] are reachable in an optimal number of steps
and reach the goal in one step. Any of these states can
be chosen to continue solution reconstruction. Eventually
the initial state is found. If layers have been eliminated
to recover main memory, divide-and-conquer
solution reconstruction methods are required (Jensen
et al. 2006). Variants of symbolic breadth-first search
compute cost-optimal solutions subject to general cost
functions (Edelkamp 2006).
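
The layered search and the backward solution reconstruction can be sketched as follows. Explicit Python sets replace the BDDs, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def image(trans, states):
    """Successors of a state set under a set of (x, x') pairs."""
    return {y for (x, y) in trans if x in states}

def preimage(trans, states):
    """Predecessors of a state set under a set of (x, x') pairs."""
    return {x for (x, y) in trans if y in states}

def symbolic_bfs(trans, init, goals):
    """Return an optimal state sequence from init to a goal, or None."""
    layers = [set(init)]   # layers[i] plays the role of States[i]
    reach = set()          # union of all earlier layers
    while layers[-1]:
        frontier = layers[-1]
        hit = frontier & goals
        if hit:
            # Solution reconstruction: walk back through stored layers.
            state = next(iter(hit))
            path = [state]
            for i in range(len(layers) - 2, -1, -1):
                candidates = preimage(trans, {state}) & layers[i]
                state = next(iter(candidates))
                path.append(state)
            return path[::-1]
        reach |= frontier
        # States[i] with Reach removed (duplicate elimination).
        layers.append(image(trans, frontier) - reach)
    return None            # frontier ran empty: no solution exists
```

For example, with transitions {(0,1), (1,2), (2,3), (0,2), (3,0)}, initial set {0} and goal set {3}, the search stops in layer 2 and reconstructs the path 0, 2, 3.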

Backward breadth-first search exploits the relational
representation of the actions to compute the preimage
according to the formula Preimage(x) := ∃x' (States(x')
∧ Trans(x,x')). Consequently, the search starts with the
goal state set and iterates until it hits the start state.
Bidirectional symbolic breadth-first search executes 
concurrent iterations of forward and backward breadth- 
first search until the two search frontiers meet. 

Symbolic Dijkstra's Single Source 
Shortest Paths Algorithm 

Action costs are a natural search concept. In many ap- 
plications, costs can only be bounded integers. Examples 
for such discrete cost actions are macros as exploited 
in the macro-problem solver by Korf (1985). 

Let the weighted transition relation Trans(c,x,x')
evaluate to 1 if the step from x to x' has cost c ∈ {1,...,C},
encoded in binary. The symbolic version of Dijkstra's
single-source shortest paths algorithm (1959) then
works as follows. The priority relation Queue(f,x) is
initialized with the representation of the start state and
f-value 0. Until a goal state is reached, in each iteration
the algorithm determines the minimum f-value
min, the relation Min(x) of all states in the priority
queue with value min, and the relation Rest(f,x) of
the remaining set of states Queue(f,x) \ Min(x). The
transition relation Trans(c,x,x') is then applied to Min
to determine the relation for the successor state set.
To attach the new values f=min+c to this set, the relation
Add mentioned above is applied. Finally, the priority
relation for the next iteration is obtained by merging
(taking the disjunction of) the evaluated successor set
and the remaining queue.
The algorithm mimics the execution of Dijkstra's
algorithm on a 1-level bucket data structure (Dial 1969).
As f increases monotonically, the first goal extracted
from the priority queue has optimal cost.
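
A minimal set-based sketch of this loop follows. Explicit sets again stand in for BDDs, plain integer addition replaces the Add relation, and the closed set is an addition of this example (an assumption, not part of the description above) that keeps already expanded states from being expanded again.

```python
def symbolic_dijkstra(wtrans, init, goals):
    """wtrans: set of (c, x, x') triples with integer cost c >= 1.
    Returns the optimal cost of reaching a goal, or None."""
    queue = {(0, x) for x in init}     # priority relation Queue(f, x)
    closed = set()                     # states already expanded
    while queue:
        f_min = min(f for (f, _) in queue)
        # Min(x): states with minimum f-value, not yet expanded.
        min_states = {x for (f, x) in queue if f == f_min} - closed
        # Rest(f, x): everything else in the queue.
        rest = {(f, x) for (f, x) in queue if f != f_min}
        if min_states & goals:
            return f_min
        closed |= min_states
        # Successors of Min with the new value f = min + c attached.
        succ = {(f_min + c, y) for (c, x, y) in wtrans
                if x in min_states and y not in closed}
        queue = rest | succ            # merge for the next iteration
    return None
```

Grouping the queue by f-value corresponds to the bucket view: all states sharing the minimum f-value are expanded in one step.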

Symbolic Pattern Databases 

State space abstraction is the key aspect for the auto- 
mated design of search heuristics. Applying abstractions 
simplifies a problem, and exact distances in relaxed 
problems then serve as lower bound for the concrete 
base-level search. Proper abstractions preserve path 
existence. Pattern databases, introduced by Culberson 
and Schaeffer (1998), completely evaluate the abstract 
search space prior to the concrete search. The limitation 
for applying pattern databases in search practice is the 
restricted amount of (main) memory. 



More than one pattern database can be combined 
by either taking the maximum (always applicable), or 
the sum of individual pattern database entries (only ap- 
plicable if the pattern databases are disjoint). Disjoint 
pattern databases (Korf and Felner 2002) belong to the 
best known techniques to construct effective search 
heuristics. In order to select patterns automatically, 
Edelkamp (2000) as well as Haslum et al. (2005) use 
greedy pattern packing to divide the state vector into 
disjoint parts for constructing pattern databases that 
respect a pre-specified memory limit. 

Symbolic pattern databases (Edelkamp 2002) are
functional pattern databases for later use either in
symbolic or explicit-state heuristic search. In contrast
to an a posteriori compression of the state set (Felner et al.
2007), the construction itself works on a compressed
data structure. Symbolic pattern databases are relations
of pairs (f,x), which are satisfied if the heuristic estimate
of the state encoded in x matches the heuristic value
encoded in f. Such relations can be represented as a
BDD for the entire problem space or kept partitioned
in the form of breadth-first search layers Heur[0], ... ,
Heur[k]. This list is initialized with the abstracted goal
and, as long as there are newly encountered states, the
set of predecessors with respect to the abstract
transition relation is generated. For constructing symbolic
shortest-path pattern databases, Dijkstra's algorithm
can be adapted.
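
The layered construction can be sketched as a backward breadth-first search in the abstract space. As before, explicit sets stand in for BDDs, and both function names are hypothetical; unit action costs are assumed.

```python
def build_pattern_database(abstract_trans, abstract_goals):
    """Backward BFS in the abstract space; returns layers Heur[0..k]."""
    heur = [set(abstract_goals)]       # Heur[0]: the abstracted goal
    seen = set(abstract_goals)
    while True:
        # Predecessors with respect to the abstract transition relation,
        # restricted to newly encountered states.
        pred = {x for (x, y) in abstract_trans if y in heur[-1]} - seen
        if not pred:
            return heur
        heur.append(pred)
        seen |= pred

def lookup(heur, abstract_state):
    """Heuristic value = index of the layer containing the state."""
    for h, layer in enumerate(heur):
        if abstract_state in layer:
            return h
    return None   # the abstract goal is unreachable from this state
```

The index of the layer that contains an abstract state is its exact abstract goal distance, which serves as the lower bound in the concrete search.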

Symbolic Version of A* 

Given a consistent heuristic h with c(n,n')-h(n)+h(n')
≥ 0 for all states n and n', A* (Hart et al. 1968) is
in fact a variant of Dijkstra's algorithm using an
initial offset f(s):=h(s) and a refined update f(n') :=
min{f(n'),f(n)+c(n,n')-h(n)+h(n')}.

BDDA* integrates symbolic and A* search 
(Edelkamp and Reffel 1998). It determines the suc- 
cessors of the set of states with minimum f-value in 
one evaluation step. Optimality and completeness of 
BDDA* are inherited from explicit-state A*. Starting 
from solving single-agent challenges BDDA* has been 
applied to hardware verification (Reffel and Edelkamp 
1999) and planning problems (Edelkamp 2002). The 
efficiency has been reproduced by Hansen et al. (2002) 
and Qian and Nymeyer (2003). 

To avoid finite domain arithmetic with BDDs,
Jensen et al. (2002) have suggested a two-dimensional
layout of state sets, one for each possible g- and h-value
pair (see Figure 3). The advantage is that each state set
already has the g- and the h-value attached to it, and the
computation of the f-values for the set of successors
is no longer needed. In the extension of BDDA* to
weighted actions, all successors of the set of states with
minimum f-value, current cost g and action cost c can
be determined individually.

For computing the image, constructing a monolithic
relation Trans is not mandatory. Given that sub-relations
Trans[a] are linked to every action a ∈ {1,...,k}, the image
of the state set partitions into {∃x (Trans[1](x,x') ∧
States(x))} ∨ ... ∨ {∃x (Trans[k](x,x') ∧ States(x))}.
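
The partitioned image is simply the union of the per-action images. A short sketch with explicit sets in place of BDDs (an assumption for illustration; the function name is hypothetical):

```python
def partitioned_image(trans_by_action, states):
    """Union of per-action images; no monolithic Trans is ever built."""
    result = set()
    for trans_a in trans_by_action:    # Trans[1], ..., Trans[k]
        result |= {y for (x, y) in trans_a if x in states}
    return result
```

The result equals the image under the union of all sub-relations, while each intermediate step only touches one (typically much smaller) relation.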



FUTURE TRENDS 

Symbolic search algorithms with BDDs are also effective 
in solving non-deterministic and probabilistic search 
problems; see initial work by Cimatti et al. (1998) 
or Hoey et al. (1999). Recent trends include temporally 
extended goals as analyzed by Lago et al. (2002). 

In order to save main memory, all but the currently 
expanded BDDs can be flushed to disk. As both 
explorations work on sets of states, a combination of 
disk-based search (to save RAM for the exploration) 
and symbolic pattern databases (to save RAM for the 
estimate) turns out to be very effective. Distributed 
search may additionally save memory on the individual 
computing node. Successful explorations with up to 
16 GB main memory and 3 TB disk space have been 
reported by Edelkamp and Jabbar (2007). 

More advanced symbolic data structures can also 
cover infinite state sets. The exploration algorithms 
share similar algorithmic principles but have to be 
adapted. A recent proposal by Borowski and Edelkamp 
(2006) considers symbolic search in infinite-state 
systems with automata theory, where state sets and 
actions are represented as minimized finite state 
automata. The search repeatedly applies specialized 
image operators to compute the automata for the 
successor sets. 

Another important future application area for 
symbolic search is the classification in general game 
playing (Love et al. 2006), e.g. in the area of two-player 
games. First algorithms have been provided by Edelkamp 
and Kissmann (2007), which compute strategies in the 
form of BDDs, assuming optimal play. Once computed, 
BDDs serve as finite-state controllers. 



CONCLUSION 

Symbolic search is an apparent option for the 
space-efficient traversal of state spaces, bypassing the 
explicit memory-consuming representation of the state 
sets. The essentials of symbolic exploration in finite 
state systems with state sets that are represented as 
Boolean functions in the form of BDDs have been 
presented, and various algorithms as well as their 
refined implementations, including symbolic uni- and 
bidirectional breadth-first search, single-source shortest 
paths, and heuristic search with pattern databases, have 
been discussed. 




KEY TERMS 

Action Planning: Refers to a world description in 
logic, where a number of atomic propositions describe 
what can be true or false in each state of the world. By 
applying operators to a world, one arrives at another 
world, where different atoms might be true or false. 
Usually, only few atoms are affected by an action, and 
most of them remain the same. 

Duplicate Elimination Scope: The number of layers 
that a back edge in a breadth-first (or best-first) search 
graph can cross. It is an important parameter for the 
design of memory-limited frontier search algorithms 
and can be applied to improve the efficiency of both 
symbolic and disk-based search. 

Model Checking: For a system model together with 
a formal description of a property, model checking is 
a push-button decision procedure. In case the desired 
property is not satisfied by the model, it returns a coun- 
ter-example in form of a trace. Among the options for 
the specification of the model, there are Kripke struc- 
tures and labelled transition systems. Valid choices for 
property specifications are linear and branching time 
logics, or the propositional μ-calculus. 



Pattern Database: Given that a state in a search 
problem is described as a vector of state variables, 
pattern variables denote a subset of them. They define 
an abstraction such that any path in the concrete state 
space induces a path in the abstract one. A pattern is a 
specific assignment of values to the pattern variables. 
A pattern database completely evaluates the abstract 
search space prior to the base level search, in the form 
of a lookup table that is indexed by the abstract state 
and contains the shortest goal distance. 

Pattern Packing: Solves the pattern selection 
problem for constructing pattern database search heu- 
ristics. One bin represents a container for the abstract 
state space and approximates the memory usage for 
pattern database construction. Multiple bins apply for 
disjoint pattern database construction. In contrast 
to standard bin packing, the effect of the selection of 
patterns is multiplicative. 

Relational Product: Specialized procedure that 
combines conjunction and variable quantification in 
one specialized BDD operation. 

INTRODUCTION 

Many different synthetic neuron implementations 
exist that include a variety of traits associated with 
biological neurons and our understanding of them. An 
important motivation behind the study, modelling 
and implementation of different synthetic neurons 
is that nature has provided the most efficient ways of 
doing important types of computation, which we are 
trying to mimic. 

Whether it is Artificial Neural Networks (ANNs) 
or other mixed-signal systems, technology has always 
evolved in the direction of lower energy per unit 
computation (Mead, 1990). Simple neuron models 
such as threshold elements, or perceptrons, are 
promising candidates for implementing future signal 
processing systems in technologies including CMOS and 
SET (Schmid & Leblebici, 2003), (Beiu & Ibrahim, 2007). 

In this article a small number of published sub- 
threshold, ultra low power, perceptrons / threshold 
elements are compared regarding power consumption, 
operational speed and defect tolerance. The "mir- 
rored" gate operating in subthreshold and combined 
with redundancy, might be an interesting candidate 
for implementing artificial neural networks as well as 
other mixed-signal processing circuitry. 

Previously unpublished results demonstrate the 
mirrored gate producing appropriate binary outputs at a 
180 mV supply voltage, even when a transistor was cut 
off from the supply voltage, for a redundancy factor of 
2, using shorted outputs, as in (Aunet & Hartmann, 2003). 



BACKGROUND 



CMOS has been the dominant technology for 
implementing signal processing systems for decades, and 
will probably live alongside other nanotechnologies 
for a long time (ITRS, 2005). Because nearly any future 
signal processing technology will need low power 
operation, and because CMOS and similar technologies 
will probably remain mainstream for the foreseeable 
future, the scope of this paper is limited to simple 
CMOS, ultra low power circuit topologies. Subthreshold 
circuits (Swansson & Meindl, 1972), using a supply 
voltage below the inherent threshold voltage of the 
transistors, consume less power than other low power 
circuits (Soeleman, Roy & Paul, 2001). Therefore we look 
at subthreshold neuron ("perceptron") implementations 
in this paper, and concentrate on different metrics 
including circuit complexity, operational speed, power 
consumption and defect tolerance. 

Reducing the power supply voltage through using 
ever more modern CMOS technologies and subthreshold 
operation reduces the number of inputs one could 
use for the threshold elements, depicted in Figure 1 
(Aunet, 2002). Also, since using only 2 inputs is optimal 
for implementing an arbitrary neural network (Beiu & 
Makaruk, 1998), we have restricted the treatment to basic 
building blocks having a maximum fan-in of 3. 

The first simple mathematical model of the 
biological neuron, published by McCulloch and Pitts in 
1943, calculates the sign of the weighted sum of inputs. 
Sometimes such circuits are called threshold logic gates 
or threshold elements, as illustrated in Figure 1. Such 
perceptrons may be used to implement neural networks 
as well as digital signal processing. For a review of a 
wide range of VLSI implementations the reader might 
consult (Beiu, Avedillo & Quintana, 2003). 



Figure 1. The binary output, Y, depends on whether the 
weighted sum of inputs X1, X2, ..., Xn exceeds a certain 
threshold, T 



ULTRA LOW POWER NEURONS, 
SPEED AND RELIABILITY 

The main focus is on different subthreshold ultra low 
power perceptrons and how they compare regarding 
power consumption, operational speed and reliabil- 
ity. 

MOS Transistors in Subthreshold 

For an NMOS transistor in subthreshold we have 
(Andreou, Boahen, Pouliquen, Pavasovic, Jenkins & 
Strohbehn, 1991): 

$$I_{ds,n} = I_{0}\, e^{\kappa V_{gs}/V_t}\, e^{(1-\kappa) V_{bs}/V_t} \left(1 - e^{-V_{ds}/V_t} + \frac{V_{ds}}{V_0}\right)$$

$I_{ds,n}$ expresses the current from drain to source. 
$I_0$ is the zero-bias current, where the pre-exponential 
constants have been absorbed. This includes the channel 
width ("W") and the length ("L") of the MOSFET 
structure. $V_{gs}$ is the gate-to-source potential, 
$V_{ds}$ the drain-to-source potential and $V_{bs}$ the 
substrate-to-source potential. 

$V_0$ is the Early voltage, which is proportional to 
the channel length. $\kappa$ gives the effectiveness with 
which the gate potential controls the channel current. 
It is often approximately 0.7-0.75 (Andreou, Boahen, 
Pouliquen, Pavasovic, Jenkins & Strohbehn, 1991). 
The thermal voltage is expressed as $V_t = kT/q$, with 
$V_t$ = 25.8 mV at room temperature. 
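
To make the exponential dependence on the terminal voltages concrete, the subthreshold current model described above can be evaluated numerically. The following Python sketch assumes the functional form given by Andreou et al. (1991); the default values for the zero-bias current `i_0`, the Early voltage `v_early` and the temperature are purely illustrative assumptions and are not taken from any of the circuits discussed here.

```python
import math

BOLTZMANN = 1.380649e-23           # Boltzmann constant k in J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # elementary charge q in C


def thermal_voltage(temperature_kelvin=300.0):
    """Thermal voltage V_t = kT/q in volts."""
    return BOLTZMANN * temperature_kelvin / ELEMENTARY_CHARGE


def subthreshold_current(v_gs, v_ds, v_bs=0.0, i_0=1e-12, kappa=0.7,
                         v_early=10.0, temperature_kelvin=300.0):
    """Drain-to-source current of an NMOS transistor in subthreshold.

    i_0 and v_early are illustrative placeholder values (assumptions).
    """
    v_t = thermal_voltage(temperature_kelvin)
    gate_term = math.exp(kappa * v_gs / v_t)
    bulk_term = math.exp((1.0 - kappa) * v_bs / v_t)
    drain_term = 1.0 - math.exp(-v_ds / v_t) + v_ds / v_early
    return i_0 * gate_term * bulk_term * drain_term


# Ratio of the currents at two gate voltages, all else equal.
ratio = subthreshold_current(0.18, 0.18) / subthreshold_current(0.10, 0.18)
```

Under these assumptions, raising the gate-to-source voltage by only 80 mV increases the current several-fold, which is why speed and power in subthreshold circuits are so sensitive to the supply voltage.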

A similar equation applies to PMOS transistors, but 
with opposite polarities. The exponential relationships 
between the voltages at several nodes and the current 
level mean that subthreshold circuits also have an 
operational speed and power consumption that are 
extremely dependent on the supply voltage, Vdd. For 
example, when operated at 10 kHz a subthreshold circuit 
used four orders of magnitude less power than a regular 
strong inversion circuit implementing the same function 
(Soeleman, Roy & Paul, 2001). 

Low Fan-In Subthreshold Threshold 
Element ("Neuron") Circuit 
Implementations 

Recently published circuits are shown in Figure 2. The 
"mirrored gate" is a static CMOS solution (Beiu, Aunet, 
Nyathi, Rydberg III & Djupdal, 2005), based on 
(Hampel, Prost & Scheinberg, 1974). The floating-gate 
solution P3N3 (Aunet, 2002) might not be well suited 
to future standard CMOS due to gate leakage, while 
the "ijcnn" (Aunet, Oelmann, Abdalla & Berg, 2004) 
and "stacked" (Aunet, Berg & Beiu, 2005) gates are 
CMOS. 

Metrics Regarding Power Consumption 
and Maximum Operational Speed 

Recently published results are shown in Figure 3 ( 
Granhaug & Aunet, 2006 ). The "mirrored", "ijcnn" 
and "stacked" gates were used for implementing 1-bit 
addition, Full Adders, in a 90 nm CMOS technology, 
and compared to a standard CMOS implementation 
(upper right corner in Figure 4). 



Figure 2. Experimental setup for statistical simulation of 1-bit adder. From left to right they are called 
"mirrored", "P3N3", "IJCNN" and "Stacked" threshold elements. 

In the upper left corner of Figure 3 one can see that 
exploiting the "ijcnn" gate led to the highest power 
consumption, while the "stacked" gate gave the lowest 
power consumption among the four implementations. 
SPICE simulations were performed using a Cadence 
software environment. The "mirrored" gate 
implementation resulted in a power consumption of 
1160 pW, while the standard CMOS implementation had 
a slightly lower power consumption of 932 pW according 
to the simulations. 

Regarding the maximum operational speed, shown 
upper right in Figure 3, the implementation based on the 
"stacked" gate (Figure 2) could not compete, while the 
standard, "mirrored" and "ijcnn" gate implementations 
had delays of 162 ns, 159 ns and 174 ns, respectively. 
That the "mirrored" gate was slightly faster than the 
"ijcnn" gate differs from the findings in another 
publication (Aunet & Berg, 2005), where the "ijcnn" 
gate led to a delay of 5.2 µs, nearly twice as fast as 
the "mirrored" gate with its 9.15 µs, implemented in 
a 120 nm CMOS technology and simulated operating 
at a supply voltage of 100 mV. 

Manufacturability including Defect 
Tolerance 

According to the ITRS Roadmap (ITRS, 2005), 
reducing the overall power consumption and increasing 
the manufacturability are the two most important 
among five grand challenges for future nanoelectronics. 
Included in manufacturability are the possibilities to 
cope with a drastically increasing number of on-chip 
defects, as well as parameter variations. We have 
therefore included some data on the yield, meaning 
the expected percentage of circuits working under 
statistical variations in process parameters (Aunet & 
Otnes Berge, 2007). 

A redundancy scheme for defect tolerance (Aunet & 
Hartmann, 2003) was used, exploiting shorted driven 
nodes and a redundancy factor of only 2 (R = 2), 
instead of a redundancy factor of 3 and a majority voter. 
Full Adders based on the three threshold elements 
depicted in Figure 2 were used, as well as the standard 
Full Adder ("FA") in the upper right corner of Figure 4. 
A typical digital chip will fail if even a single transistor 
is defective (Mead, 1990). The results shown in the 
lower left corner of Figure 3 (Aunet & Otnes Berge, 
2007) indicate that the solution based on R = 2 and 
the standard CMOS FA, as well as the solution based 
on the "mirrored" gate, should have a supply voltage 
above 150 mV if a 100% yield is to be expected for the 
implemented circuitry under the 90 nm CMOS 
production process variations. 

When small size transistors were used here, for all 
four solutions, the "ijcnn" and "stacked" gates could not 
be expected to give a satisfying yield at supply voltages 
of 125 mV and 150 mV, since the vast majority of the 
circuits would not be expected to work. 



Figure 3. Comparisons regarding power consumption, delay, tolerance to certain defects and complexity of 
circuitry when different threshold elements are used for 1-bit addition. A standard CMOS implementation is 
included for comparisons. 

Interconnect Challenges, and Neurons 
Leading to Simpler Circuitry 

The number of defects in future nanotechnologies, 
including CMOS, will increase drastically (Fortes, 
2003), (ITRS, 2005). Defect tolerance must be part of 
nearly any system design (Lehtonen, Plosila & Isoaho, 
2005). This includes defects in interconnect and 
contacts. Few internal nodes generally reduce the 
amount of wiring and interconnect. From this viewpoint 
it might be preferable to have relatively few (driven) 
nodes as well as transistors. In this respect the "ijcnn" 
circuit is favorable among the four, as may be seen from 
Figure 3 (Granhaug & Aunet, 2006). 

Boolean functions that can be realized by neurons 
are called linear threshold functions, and such elements 
can be combined to implement any Boolean function 
(Siu & Bruck, 1990). Using the linear threshold elements 
(perceptrons) for implementing such functions may save 
wires, contacts and transistors, as the number of gates 
necessary to implement these important functions 
grows linearly when using threshold elements, but 
exponentially if Boolean logic is used (Siu & Bruck, 
1990). Moreover, regardless of the number of bits, 
Boolean gates will never lead to a lower number of gates 
than threshold elements. 

This is also illustrated in Figure 4, showing that 
the lower implementations, which depend on threshold 
elements / perceptrons, are simpler and more regular 
than the traditional implementations, especially the 
floating-gate implementation based on pure Boolean 
logic, in the upper left corner. 

As device and interconnect parameter variations 
impose more problems in future nanoscale technologies, 
it can also be of importance to have increasingly 
regular structures on chip to reduce, for example, the 
effect of dopant fluctuations, which degrade performance 
through transistor threshold voltage variations. In this 
respect perceptron implementations can sometimes be 
favorable, as illustrated in Figure 4, where the two 
lowermost schematics have a considerably higher 
regularity than the two others. This will reflect on the 
layout and the physical parameters. 



Figure 4. Four schematics showing circuits implementing SUM' and CARRY' for binary addition 

"Mirrored" Gate in 90 nm CMOS 
Computing the CARRY' Function at Vdd = 
180 mV 

Figure 5 shows measured results from an 
implementation of the "mirrored" gate computing the 
minority-3 function at 180 mV, both with and without a 
stuck-open fault, for a redundancy factor of R = 2. 
Sixteen binary input vectors [X,Y,Z] = 000, 000, 001, 
001, 010, 010, ..., 111, 111 were applied. The circuits 
computed the correct logic levels in all cases, producing 
a low output if, and only if, 2 or 3 out of three inputs 
were high. A low signal is less than 0.25 times the supply 
voltage of 180 mV, while a high signal should be at 
least 0.75 times the supply voltage. 
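
The logic function and the level criteria used in the measurements can be written down compactly. The Python sketch below is a behavioural model only, under the definitions stated in the text; it does not model the circuit, the redundancy scheme or the stuck-open fault, and the names `minority3`, `logic_level` and `TRUTH_TABLE` are hypothetical.

```python
from itertools import product


def minority3(x, y, z):
    """Return 0 if two or three of the binary inputs are 1, otherwise 1.

    This is the complement of the full adder carry (CARRY').
    """
    return 0 if (x + y + z) >= 2 else 1


def logic_level(voltage, v_dd=0.180):
    """Classify a measured voltage using the 0.25 / 0.75 * Vdd criteria.

    Returns 0 for a low signal, 1 for a high signal and None in between.
    """
    if voltage < 0.25 * v_dd:
        return 0
    if voltage >= 0.75 * v_dd:
        return 1
    return None


# Truth table over all eight input vectors [X, Y, Z].
TRUTH_TABLE = {bits: minority3(*bits) for bits in product((0, 1), repeat=3)}
```

A measured output would count as correct when `logic_level` of the output voltage equals the `TRUTH_TABLE` entry for the applied input vector.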

This supply voltage is comparable to the low voltage 
of 175 mV published in (Miyazaki, Kao & 
Chandrakasan, 2002). Reducing the power supply voltage 
is the most direct and dramatic means of reducing the 
power consumption. 



Discussion 

Though the "stacked" gate had the lowest power 
consumption, it is the slowest. It does not have a low 
number of internal nodes and wires, which in addition 
to a relatively low yield according to the statistical 
Monte Carlo simulations might not make it the best 
candidate among the 4 implementations. 

The "ijcnn" gate competes well regarding opera- 
tional speed, but is the most power-hungry. On one 
hand it has the simplest topology, but the standard 
CMOS implementation as well as the "mirrored" gate 
show far better yields. 

The "mirrored" gate implementation of the Full 
Adder competes well with the standard CMOS imple- 
mentation overall ( Granhaug & Aunet, 2006 ). These 
solutions might be interesting for further comparisons, 
as recent results have pointed out the "mirrored gate" as 
an interesting alternative, exemplified by the following 
title: Why Inverters and Small Fan-In Voters are The 
Most Promising Gates for Future Nanoelectronics 
(Beiu & Ibrahim, 2007). 




Figure 5. Measured results 



FUTURE TRENDS 

The assumption that a system is composed largely of 
correctly functioning units is no longer true in emerg- 
ing nanoelectronics, and reducing the overall power 
consumption is also among the grand challenges for the 
future. The low fan-in perceptrons, also called voters, 
or minority gates, might be very useful candidates for 
future nanoelectronics, which has been recently stated 
( Beiu & Ibrahim, 2007), and is not in disagreement 
with results presented here. Defect tolerant subthresh- 
old perceptron circuits exploiting majority gates, as 
presented here, may thus become very useful. 

Perceptrons may also prove to be useful building 
blocks for other future nanotechnologies, including 
SET ( Beiu, Avedillo & Quintana, 2003 ). 



CONCLUSION 

We have argued that future needs for low power 
consumption and defect tolerance make subthreshold 
neuron implementations useful for artificial neural 
networks. From the results and discussion presented 
here we conclude that the "mirrored" gate may be 
particularly useful in the mentioned respects. It may 
be combined with redundancy for defect tolerance and 
increased manufacturability, which could be useful for 
future nanotechnologies. 

The ability of the "mirrored" gate to function under 
the presence of defects, when exploiting redundancy, 
was demonstrated by chip measurements for a 90 nm 
CMOS technology. 

KEY TERMS 

Full Adder: Circuit that produces the binary sum 
and carry when adding two binary numbers. 

Mismatch: Ideally identical elements on an 
integrated circuit have a mismatch when they 
differ in their physical properties after production of 
the chip. 

Minority-3 Gate: A minority-3 gate outputs a logic 
"0" signal if, and only if, 2 or 3 out of its three binary 
inputs are "1". 

Monte Carlo Simulations: Computer simula- 
tions basing the results on statistical distribution of 
parameters. 

Nanoscale CMOS: CMOS technologies where 
dimensions smaller than 100 nm are critical to the 
functioning of the produced chip. 

Neuron: Electrically excitable cells in the nervous 
system that process and transmit information. 

Parameter Variations: Parameters describing 
physical traits of integrated circuits may have variations 
due to mismatch, for example the threshold voltages 
of transistors. 

Perceptron: Type of artificial (feedforward) Neural 
Network. 

SET: Single Electron Transistor. 

Yield: In this paper the term yield refers to the ratio 
of functional circuits to the total number of simulated 
circuits. Often yield refers to the ratio of functional 
chips to the total number of manufactured chips. 

INTRODUCTION 

In the field of Natural Language Processing, Named 
Entity Recognition (NER) is one of the most important 
research areas of Information Extraction (IE). NER 
is a subtask of IE that seeks to identify named entities 
in text documents and classify them into predefined 
categories. A considerable amount of work has been done 
on NER in recent years due to the increasing demand for 
automated text processing and the wide availability of 
electronic corpora. While it is relatively easy and natural 
for a human reader to read and understand the context of 
a given article, getting a machine to understand and 
differentiate between words is a big challenge. For 
instance, the word 'brown' may refer to a person called 
Mr. Brown, or to the colour of an item which is brown. 
Human readers can easily discern the meaning of the 
word by looking at the context of that particular 
sentence, but it would be almost impossible for a 
computer to interpret it without any additional 
information. 

To deal with the issue, researchers in the NER field 
have proposed various rule-based systems (Wakao, 
Gaizauskas & Wilks, 1996; Krupka & Hausman, 1998; 
Maynard, Tablan, Ursu, Cunningham & Wilks, 2001). 
These systems are able to achieve high accuracy in 
recognition with the help of lists of known named 
entities called gazetteers. The problem with the 
rule-based approach is that it lacks robustness and 
portability. It incurs steep maintenance costs, especially 
when new rules need to be introduced for new 
information or new domains. 

A better option is thus to use a machine learning 
approach that is trainable and adaptable. Three 
well-known machine learning approaches that have been 
used extensively in NER are the Hidden Markov Model 
(HMM), the Maximum Entropy Model (MEM) and the 
Decision Tree. Many of the existing machine 
learning-based NER systems (Bikel, Schwartz & 
Weischedel, 1999; Zhou & Su, 2002; Borthwick, Sterling, 
Agichten & Grisham, 1998; Bender, Och & Ney, 2003; 
Chieu & Ng, 2002; Sekine, Grisham & Shinnou, 1998) 
are able to achieve near-human performance for named 
entity tagging, even though the overall performance is 
still about 2% short of the rule-based systems. 

There have also been many attempts to improve the 
performance of NER using a hybrid approach with the 
combination of handcrafted rules and statistical models 
(Mikheev, Moens & Grover, 1999; Srihari & Li, 2000; 
Seon, Ko, Kim & Seo, 2001). These systems can achieve 
relatively good performance in the targeted domains ow- 
ing to the comprehensive handcrafted rules. Nevertheless, 
the portability problem still remains unsolved when it 
comes to dealing with NER in various domains. 

As such, this article presents a hybrid machine 
learning approach using MEM and HMM successively. 
The reason for using two statistical models in succession 
instead of one lies in the distinctive nature of the two 
models. HMM is able to achieve better performance than 
any other statistical model, and is generally regarded as 
the most successful one in the machine learning approach. 
However, it suffers from the sparseness problem, which 
means that a considerable amount of data is needed for 
it to achieve acceptable performance. On the other hand, 
MEM is able to maintain reasonable performance even 
when there is little data available for training. 
The idea is therefore to walk through the testing corpus 
using MEM first in order to generate a temporary 
tagging result, while this procedure simultaneously 
serves as a training process for HMM. During the 
second walkthrough, the corpus is tagged using HMM 
for the final tagging. In this process, the temporary 
tagging result generated by MEM is used as a reference 
for subsequent error checking and correction. In the 
case when there is little training data available, the final 
result can still be reliable based on the contribution of 
the initial MEM tagging result. 



BACKGROUND 

Message Understanding Conference 

In 1987, the Naval Ocean Systems Center (NOSC), 
which is presently known as the Naval Command, 
Control and Ocean Surveillance Center, initiated the 
first Message Understanding Conference (MUC). 
Subsequently, a series of MUCs was held, designed 
to promote and evaluate research in IE. The 
evaluations achieved through these MUCs have led the 
research program in IE to its present state. 

In 1995, goals and tasks were set up for MUC-6 
to make IE systems more practical, with the aim of 
achieving automatic performance with high accuracy. 
The "Named Entity" task was then developed to help 
identify the names of persons, organizations, and 
geographic locations in a text. Since then, the NER tasks 
have become a central theme in MUC (see Chinchor, 1995 
and Chinchor, 1998 for more details). 

According to the specifications defined by MUC, 
the NER tasks generally work on seven types of named 
entities as listed below with their respective markup: 

PERSON (ENAMEX) 
ORGANIZATION (ENAMEX) 
LOCATION (ENAMEX) 
DATE (TIMEX) 
TIME (TIMEX) 
MONEY (NUMEX) 
PERCENT (NUMEX) 

From the list above, three subtasks are derived from 
these seven types of named entities and assigned with 
three respective SGML tag elements, namely ENA- 
MEX, TIMEX and NUMEX. As TIMEX and NUMEX 
are fairly easy to predict with some effective finite state 
methods (Roche & Schabes, 1997), most of the current 
research deals only with ENAMEX which are highly 
variable and ambiguous. 
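
The relation between the seven entity types and the three SGML tag elements can be captured in a small lookup table. The Python sketch below is illustrative only: it assumes the MUC-style annotation form in which the entity type is carried by a `TYPE` attribute, and the names `MUC_TAG_ELEMENT` and `markup` are hypothetical.

```python
# Mapping from the seven named entity types to their SGML tag elements.
MUC_TAG_ELEMENT = {
    "PERSON": "ENAMEX",
    "ORGANIZATION": "ENAMEX",
    "LOCATION": "ENAMEX",
    "DATE": "TIMEX",
    "TIME": "TIMEX",
    "MONEY": "NUMEX",
    "PERCENT": "NUMEX",
}


def markup(text, entity_type):
    """Wrap a text span in the tag element assigned to its entity type.

    The TYPE attribute form is an assumption about the annotation format.
    """
    element = MUC_TAG_ELEMENT[entity_type]
    return f'<{element} TYPE="{entity_type}">{text}</{element}>'


# The three subtasks are the distinct tag elements in the table.
SUBTASKS = sorted(set(MUC_TAG_ELEMENT.values()))
```

For example, `markup("Brown", "PERSON")` produces an ENAMEX element, whereas a date or a percentage would be wrapped in TIMEX or NUMEX respectively.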

Previous Approaches 

Since MUC-6 and MUC-7, many NER systems have 
been proposed and proven to be successful in their 
targeted domains. In general, NER systems that use 
handcrafted rules still lead the way, with the highest 
F-measure score up to 96.4% achieved in MUC-6 as 
compared to the statistical approaches that were able 
to achieve 94.9% (Zhou & Su, 2002). 

In the rule-based approach, a set of rules or patterns 
is defined to identify the named entities in a text. These 
rules or patterns rely on distinctive word formats, 
such as capitalization or a particular preposition prior 
to a named entity. For instance, a capitalized string 
behind titles such as 'Mr', 'Dr', etc. will be identified 
as the name of a person, whereas a capitalized word after 
a preposition such as 'in', 'at', 'near', etc. is most likely 
to be a location. By implementing a finite set of carefully 
predefined pattern matching rules, the named entities 
within a text can be found systematically. 
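
The two example rules above can be expressed as regular expressions. The following Python sketch is a deliberately tiny illustration of the rule-based idea and is not taken from any of the cited systems; the rule names, the title and preposition lists, and the function `tag_entities` are hypothetical.

```python
import re

# Rule 1: a capitalized string behind a title is taken to be a person name.
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+)")
# Rule 2: a capitalized word after a preposition is taken to be a location.
PREPOSITION_RULE = re.compile(r"\b(?:in|at|near)\s+([A-Z][a-z]+)")


def tag_entities(sentence):
    """Apply both rules and return (entity, category) pairs."""
    entities = [(m.group(1), "PERSON") for m in TITLE_RULE.finditer(sentence)]
    entities += [(m.group(1), "LOCATION")
                 for m in PREPOSITION_RULE.finditer(sentence)]
    return entities


EXAMPLE = tag_entities("Dr Brown lives near London.")
```

Rules of this kind are easy to read but brittle: every new domain or naming convention requires further patterns, which is the maintenance cost discussed earlier.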

A substantial amount of work has been done 
using the rule-based approach. One of the very well 
documented systems that followed the direction of 
this approach was the framework of the LaSIE System 
reported by Wakao et al. (1996). Another well-known 
example of a rule-based system can be found in 
IsoQuest's NetOwl Text Extraction System presented 
by Krupka and Hausman (1998). Meanwhile, Diana 
Maynard et al. (2001) also built an NER system 
based on handcrafted rules that is able to achieve an 
average of 93% precision and 95% recall across 
diverse text types. 
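
Precision and recall are commonly combined into a single F-measure, which is the score quoted for the systems in this section. The Python sketch below assumes the standard weighted harmonic mean with beta = 1 (equal weight on precision and recall); the function name `f_measure` is hypothetical, and the value computed from the 93% and 95% averages is an arithmetic illustration, not a figure reported by the cited authors.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    With beta = 1 this reduces to 2PR / (P + R).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_squared = beta * beta
    return ((1.0 + beta_squared) * precision * recall
            / (beta_squared * precision + recall))


# Illustration with the average precision and recall quoted above.
ILLUSTRATION = f_measure(0.93, 0.95)  # roughly 0.94
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F-measure by maximizing precision or recall alone.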

The statistical approach, on the other hand, works by 
using a probabilistic model whose features of the 
data are similar to the rules of the rule-based approach. 
The features of the data, which can be understood as 
a rule set for the probabilistic model, are produced by 
learning from corpora with correctly marked 
named entities. The probabilistic model then uses the 
features to calculate and identify the most probable 
named entities. As such, if the annotated features of 
the data are truly reliable, the model would have a 
high probability of finding almost all the named 
entities within a text. 

In the last decade, a large amount of work in NER 
has been done using the statistical approach based on 
some very large corpora. The MEM, one of the most 
popular statistical models, has been applied frequently 
in various NER tasks. One significant account of MEM 
is the MENE system reported by Borthwick et al. 
(1998). In their system, they used four main feature 
types to identify the named entities, which they referred 
to as binary features, lexical features, section features 
and dictionary features. 

The binary features in the MENE system basically 
deal with capitalization in the text. Meanwhile, lexical 
features are concerned with lexical terms such as lists 
of words and their types which are used with a grammar. 
Section features indicate the current section of the text, 
whereas the dictionary features make use of a broad 
array of dictionaries of single or multiple terms such 
as first names, organization names, corporate suffixes, 
etc. The dictionary features are similar to the gazetteers 
used for rule-based systems, except that dictionaries in 
the MENE system require no huge maintenance effort. 

Nevertheless, using the MENE system alone on the 
MUC-7 test data, as reported in Borthwick et al. (1998), 
achieved an F-measure of only 84.22%. For the MENE 
system to work better, Borthwick et al. combined 
MENE with other rule-based approaches in order to 
achieve superior results. 

Besides Borthwick et al., Bender et al. (2003) also 
reported on an NER system that was able to achieve an 
F-measure score of 89.58% by using MEM. With an 
annotated corpus and a set of features, they first built 
a baseline named entity recognizer which was then 
used to extract the named entities and their contextual 
information from non-annotated data. The accuracy of 
their system was further improved with a final recog- 
nizer that made use of the trained data. 

Another MEM-based system can be found in Chieu 
and Ng (2002). They presented a system called MEN- 
ERGI that made use of global information with just 
one classifier, and showed that their system was able to 
achieve performance comparable to the best machine 
learning-based systems in MUC-6 and MUC-7. 

Apart from MEM, HMM is another well-known 
statistical model that has been used frequently in various 
NER systems. The IdentiFinder reported by Bikel et al. 
(1999), using a modified HMM, was the best performer 
on the official MUC-6 and MUC-7 test data among 
all the machine learning-based systems. IdentiFinder 
employed features similar to those of the MENE system, 
and depended on statistics to make decisions in 
identifying the named entities. It differs in that 
it has a complete probabilistic model that governs all 
decisions in classifying the named entities, and that 
models both the categories of interest and the residual 
input that is not of interest. 

The modified HMM used by IdentiFinder was 
subsequently adopted by Zhou and Su (2002). In their 
work, they were able to increase the performance of 
their NER system dramatically by introducing four 
sub-features with back-off modelling. Using the test 
data from MUC-6 and MUC-7, their system was able 
to achieve F-measure scores of 96.6% and 94.1% 
respectively. 

Much more previous work has been done using 
statistical models other than MEM and HMM. There are 
also many NER systems that use a hybrid approach by 
combining the statistical models with some rule-based 
learning techniques. One very successful example can 
be found in the work of Mikheev et al. (1999), where 
they used substantial handcrafted rules together with 
MEM for partial matching. Observation of the previous 
approaches, however, shows that no system has ever 
tried to use MEM and HMM successively. 



THE HYBRID APPROACH 

As mentioned before, the NER system presented in this 
article uses two statistical models - MEM and HMM 
- in succession. The MEM is based on the MENE 
system reported by Borthwick et al. (1998) whereas the 
HMM is based on the IdentiFinder reported by Bikel 
et al. (1999). The system is built with Java using the 
existing implementation from the JavaNLP repository 
which is available at http://nlp.stanford.edu/javanlp/. 
For training and experimental purposes, the British National
Corpus (BNC), which contains texts that are diverse in
terms of domain, style and genre, has been chosen as
the testing corpus. This is to ensure that the proposed
NER system is domain-independent and can adequately
cope with a variety of text types.

Maximum Entropy 

By following the guidelines from MUC-6 and MUC-7
for the definition of the NER task, every word from the
corpus is tokenized and assigned to a desired category of
named entity with the tag "person" (<PER>),
"organisation" (<ORG>) or "location" (<LOC>). MEM
is first used to estimate the probability of a given word
falling into one of these three categories,
based on a set of features and some training data. Two
special conditions are taken into consideration: when
a word falls at the beginning (<START>) and at the
end (<END>) of a sentence. When a given
word does not fall into any of the desired categories,
an empty tag (< >) is placed to indicate that the word
belongs to none of them.

For the purpose of finding named entities, the maximum
entropy estimation process uses the model described
below to compute the conditional probability P for all
tags t based on the history h, in which every feature $f_i$
has an associated weighting parameter $\alpha_i$:

$$P(t \mid h) = \frac{\prod_i \alpha_i^{f_i(h,t)}}{Z_\alpha(h)}$$

It is necessary to note that the history h mentioned
in the model refers to all the conditioning data that
enable the system to make a tagging decision.
It comprises all information derivable from
the corpus relative to the token whose tag the system is
trying to determine, be it the word itself or its
features. The product of the weightings of all features
active on h is then calculated and
divided by a normalization function, $Z_\alpha(h)$.
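
As a minimal sketch of the computation just described, the following Python fragment multiplies the weights of the features that are active for a history and a candidate tag, then divides by the normalization term. The feature functions, the weights, the tag names and the dictionary layout of the history are hypothetical and chosen only for illustration; they are not the features of the system presented in this article.

```python
def maxent_probabilities(history, tags, features, weights):
    """Return P(t | h) for every tag t, using binary features.

    Each feature is a function f(history, tag) -> bool, and each weight is
    the positive parameter associated with that feature.
    """
    scores = {}
    for tag in tags:
        score = 1.0
        for feature, weight in zip(features, weights):
            if feature(history, tag):      # feature is active on (h, t)
                score *= weight
        scores[tag] = score
    z = sum(scores.values())               # normalization over all tags
    return {tag: score / z for tag, score in scores.items()}


# Hypothetical illustration: two hand-made features and weights.
tags = ["PER", "ORG", "LOC", "NONE"]
features = [
    lambda h, t: h["word"][0].isupper() and t != "NONE",   # capitalised word
    lambda h, t: h["previous"] == "Mr." and t == "PER",    # title before word
]
weights = [2.0, 5.0]
probs = maxent_probabilities({"word": "Smith", "previous": "Mr."},
                             tags, features, weights)
```

In this toy case the "PER" tag receives the product 2.0 x 5.0, the other named-entity tags receive 2.0, and the empty tag receives 1.0 before normalization.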

Hidden Markov Model 

After the MEM walkthrough, all the tagged named 
entities in the testing corpus are used as training data 
for HMM to make the final tagging. Since there will 
be sufficient training after parsing through the corpus 
using MEM, it is not necessary for the system to use 
the back-off models such as those used by Bikel et al. 
(1999) and Zhou and Su (2002). 

In this system, HMM is used mainly for global 
context checking, that is to check the occurrences 
of the same named entity in different sections of the 
same text document. Checking the context from the 
whole document is important as this will ensure the 
consistency of the tagged named entities and resolve 
some ambiguous cases. For instance, an organization's 
name is often abbreviated especially when it has al- 
ready been mentioned somewhere in a document. By 
checking the global information, the abbreviation as 
an organization can be identified. Besides that, there 
are also some entities that are highly ambiguous, and 
their categories cannot be determined without taking the 
global context into consideration. The phrase 'Honda
City' in sentences such as "Honda City is nice" or "Pro- 
motion for Honda City" could easily be misinterpreted 
as a location based on the local contextual evidence, 
unless there is another sentence that sounds like "I am 
driving Honda City". 

Similar to the previously used MEM, HMM is used 
to compute the likelihood of words occurring within a 
given category of named entity. Every tokenized word 
is now considered to be in ordered pairs. By using a 
Markov chain, the likelihood of the words is calculated 
simply based on the previous word. 

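For classifying the named entities, the system finds
the most likely tag t for a given sequence of words w,
that is, the tag that maximizes P(t|w). The occurrences of the given
events are counted throughout the whole text based on
the calculation below, where $t_{-1}$ and $w_{-1}$ denote the
previous tag and the previous word:

$$P(t \mid t_{-1}, w_{-1}) = \frac{\mathrm{count}(t, t_{-1}, w_{-1})}{\mathrm{count}(t_{-1}, w_{-1})}$$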
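
The count-based estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence-start placeholders, the tag names and the toy sentences are hypothetical, and no smoothing or back-off is applied.

```python
from collections import Counter


def build_tag_model(tagged_sentences):
    """Estimate P(tag | previous tag, previous word) from relative counts."""
    pair_counts = Counter()     # count(t_-1, w_-1)
    triple_counts = Counter()   # count(t, t_-1, w_-1)
    for sentence in tagged_sentences:
        prev_word, prev_tag = "<s>", "START"    # hypothetical start markers
        for word, tag in sentence:
            pair_counts[(prev_tag, prev_word)] += 1
            triple_counts[(tag, prev_tag, prev_word)] += 1
            prev_word, prev_tag = word, tag

    def probability(tag, prev_tag, prev_word):
        denominator = pair_counts[(prev_tag, prev_word)]
        if denominator == 0:
            return 0.0                          # unseen context, no back-off
        return triple_counts[(tag, prev_tag, prev_word)] / denominator

    return probability


# Toy data: "City" follows "Honda" once as ORG and once as LOC.
corpus = [
    [("Honda", "ORG"), ("City", "ORG"), ("is", "NONE"), ("nice", "NONE")],
    [("Honda", "ORG"), ("City", "LOC")],
]
prob = build_tag_model(corpus)
```

With this toy corpus, `prob("ORG", "ORG", "Honda")` and `prob("LOC", "ORG", "Honda")` are both 0.5, which mirrors the kind of ambiguity that the global context check is meant to resolve.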




Finally, a classifier is used to correct the errors in 
the results derived from MEM to perform the final 
tagging process using HMM. 

Experimental Results 

The proposed system has been tested with articles from 
BNC based on a wide range of domains from different 
fields. With the successive use of MEM and HMM, it is 
able to maintain a desirable performance regardless of 
the size of training data. Overall, the system achieved
F-measure scores above 95% consistently for most of 
the commonly used domains. A detailed description of 
the system and its experimental results can be found 
in Chiong and Wang (2006). 



FUTURE TRENDS 

While the preliminary results on the hybrid approach 
have been quite positive, the proposed system is 
still fairly immature. Much work needs to be done 
to make the performance of the system more robust. 
For instance, it will be interesting to see how more 
sophisticated features can be incorporated to improve 
the performance of the system. It will also be interest- 
ing to see how the system can be trained on corpora 
in foreign languages. 

In the future, it is anticipated that the proposed ap- 
proach can be valuable in various Natural Language 
applications. One immediate contribution can be seen 
in automating the arduous task of ontology building. 
Meanwhile, Automatic Text Summarization Systems 
can also be enriched by the proposed system, as named 
entities are able to provide clues for identifying relevant 
segments in text. Last but not least, the proposed ap- 
proach is expected to help in building more accurate 
Internet search engines too. 



CONCLUSION

This article presented a hybrid machine learning
approach that uses MEM and HMM successively.
With preliminary training through MEM and
an appropriate classifier for error correction in the final
recognition process through HMM, the performance
of the proposed NER system can be greatly enhanced
compared with using only a single statistical model.
Moreover, the system is able to adapt to different
domains without human intervention, and maintains
desirable performance regardless of the size of the
training corpus.



KEY TERMS

British National Corpus (BNC): A 100 million 
word collection of samples of written and spoken 
language from a wide range of sources, designed to 
represent a wide cross-section of current British English, 
both spoken and written. 

Gazetteer: A list of named entities with the names of
persons, organizations, locations, expressions of times, 
quantities, monetary values, percentages, etc. 

Hidden Markov Model (HMM): A statistical 
model for determining the hidden parameters based on 
the observable parameters using probability distribution 
in order to perform analysis and pattern recognition. 

Information Extraction (IE): A process of select- 
ing information from a dataset or text based on certain 
specifications using templates, and it is often delivered 
in the form of fragments of documents. 

Machine Learning: An area of study concerning 
the development of techniques which allow machines 
or computers to improve their performance based on 
previous results and learning experience. 



Maximum Entropy Model (MEM): A statistical
model for analyzing the available information in order
to determine a unique epistemic probability distribution
based on partial information about the probabilities of
possible outcomes of an experiment; it chooses the
probabilities so as to maximize the uncertainty about
the missing information.

Named Entity Recognition (NER): A subtask of 
IE that seeks to identify and classify named entities in 
text into predefined categories such as the names of 
persons, organizations, locations, expressions of times, 
quantities, monetary values, percentages, etc. 

Natural Language Processing (NLP): An area of 
study concerning the problems inherent in the process- 
ing and manipulation of natural language with the aim 
to make computers understand statements written in 
human languages. 



INTRODUCTION

Recent advances in the applications of ANN have
demonstrated successful cases in time series analysis,
data mining, civil engineering, financial analysis, music
creation, fishing prediction, production scheduling,
intruder detection, etc., making them an important tool
for research and development [1]. ANN and evolutionary
computation (EC) techniques have been employed
successfully in solving real-world problems, including
those with a temporal component [2]. In another
work [3], a hybrid method based on a combination of
evolutionary computation and neural networks (NN) has
been used to predict time series.

In the world of databases, various ANN-based
strategies have been used for knowledge search and
extraction [4]. Intelligent neural systems have been
constructed with the aid of genetic algorithm-based
EC techniques and these systems have been applied
in breast cancer diagnosis [5]. Genetic algorithms (GA)
have been applied to develop a general method of selecting
the most relevant subset of variables in the field of
analytical chemistry to classify apple beverages [6]. New
ANN methods enable civil engineers to use computing
in different ways. Besides serving as a tool in urban storm
drainage [7], ANN and Genetic Programming (GP) have
been implemented in the prediction and modelling
of the flow of a typical urban basin [8]. In the latter
case, it was shown that these two techniques could be
combined in order to design a real-time alarm system
for flood or subsidence warning in various types of
urban basins. ANN models for the consistency, measured
by slump, of conventional concrete have
also been developed [9]. In a time series prediction of
the quarterly values of the medical component of the
Consumer Price Index (CPI), the results obtained with
both neural and functional networks have been shown
to be quite similar [10]. Dimensionality reduction, variable
reduction, hybrid networks, normal fuzzy and ANN
have been applied to predict bond rating [11].

A recent online survey through the ISI Web of
Knowledge using keywords such as "ANN" and
"thermal design" revealed only ten relevant SCI
publications [12]. In the area of food processing, ANN
was used to predict the maximum or minimum tem- 
perature reached in the sample after pressurization 
and the time needed for thermal re-equilibration[13]. 
The accurate determination of thermophysical proper- 
ties of milk is very important for design, simulation, 
optimization, and control of food processing such as 
evaporation, heat exchanging, spray drying, and so 
forth. Generally, polynomial methods are used for 
prediction of these properties based on empirical cor- 
relation to experimental data. However, it was found 
that ANN presented a better prediction capability of 
specific heat, thermal conductivity, and density of 
milk than polynomial modeling and it was suggested 
as a reasonable alternative to empirical modeling for 
thermophysical properties of foods [14]. 

Numerical simulation of natural circulation boiling 
water reactor is important in order to study its perfor- 
mance for different designs and under various off-de- 
sign conditions. It was found that very fast numerical 
simulations, useful for extensive parametric studies 
and for solving design optimization problems, can be 
achieved by using an ANN model of the system[15]. 
ANN models and GA were applied for developing 
prediction models and for optimization of constant 
temperature retort thermal processing of conduction 
heating foods [16]. The ANN technique has been used as
a new approach to determine the exergy losses of an
ejector-absorption heat transformer (EAHT) [17]. The
results show that the ANN approach has the advantages
of computational speed, low cost for feasibility, rapid
turnaround, which is especially important during iterative
design phases, and ease of design by operators
with little technical experience.

Computational fluid dynamics is often
employed for heat transfer analysis of a ball grid
array (BGA) package, which is widely used in the modern
electronics industry. Owing to the complicated
geometric configuration of the BGA package, an ANN
was trained to establish the relationship between the
geometry input and the thermal resistance output [18].
The results of this study provide the electronic packaging
industry with a reliable and rapid method for heat
dissipation design of BGA packages. Thermal spraying
is a versatile coating manufacturing technique
involving a large variety of materials and processes.
An ANN was developed to relate processing parameters
to properties of alumina-titania ceramic coatings [19].
The predicted results show good overall agreement with
the experimental values.

It can be seen that applications of ANN in thermal
design are scarce; this article aims to explore the
application of an ANN in gas-fired cooktop burner
design.



BACKGROUND

Cooktop Design Goals

Gases that trap heat in the atmosphere are often called 
greenhouse gases. They include carbon dioxide, nitrous 
oxide, methane, and ozone. Individuals can produce 
greenhouse gas emissions directly by burning oil or gas 
for home heating and cooking or indirectly by using 
electricity generated from fossil fuel burning. In the 
last 200 years, mankind has been releasing substantial 
quantities of greenhouse gases into the atmosphere. 
These extra emissions are increasing greenhouse
gas (GHG) concentrations in the atmosphere, enhancing
the natural greenhouse effect, which is believed to be 
causing global warming. 

To combat the global warming problem, gas suppliers
and manufacturers of cooking appliances are
trying to find ways of improving energy efficiency
while reducing greenhouse gas emissions. In view of
the number of controllable factors and responses to be
studied, Design of Experiments (DOE) is often used
for this kind of empirical investigation. The authors
therefore proposed to combine the DOE technique with
the ANN approach for solving the multiple-input,
multiple-output (MIMO) design problem. To achieve
optimal thermal efficiency and greenhouse gas emissions,
a back-propagation ANN was used to simulate
the operating conditions; the implementation details
are illustrated through a real-life case.



EMPIRICAL STUDY 

A three-factor, three-level Full Factorial Design (with 3
repetitions) was employed to investigate the complex
relationships of three design parameters of a cooktop
burner, viz. the Reynolds number, the equivalence
ratio, and the load-height (distance from nozzle to
bottom surface of cookware). The Reynolds
number (Re) considered ranges from 400 to 700. A tailor-made
cooktop burner with a ring of 128 mm diameter
and circular nozzles (diameter = 6 mm) was
used in the experiment. Fuel-rich flames (corresponding
to equivalence ratios ranging from 1.4 to 1.8), similar
to real-life cooking situations, were employed in the
experiments. The load-height ranges from 24 mm to
32 mm (corresponding to an H/d ratio varying from 4
to 8). To allow for different spacing between groups
of nozzles, four configurations of the cooktop burner
were considered, viz. 2-nozzle 6-section, 3-nozzle
4-section, 4-nozzle 3-section and 6-nozzle 2-section
configurations. These nozzle configurations were labeled
1, 2, 3 and 4 respectively (Fig. 2). A total of 108
experiments were carried out for each configuration.
Experimental results showed that the 3-nozzle 4-section
configuration, under the predetermined input
conditions of Re = 550, EqR = 1.6 and H/d = 8, would give
the best thermal efficiency (62%) and acceptable CO
and NOx emissions.

Burner efficiency model 

The thermal efficiency of a burner is defined as the 
percentage of the thermal input transferred to the water 
in the loading vessel. It was determined by measuring 
the elapsed time for a standard 4 kg load of water to 
be heated from 30°C to 80°C and the corresponding 
consumption of LPG. Mathematically, the thermal efficiency
is calculated as:

$$\eta = \frac{M C_p \Delta T}{Q H_v} \times 100\% \qquad (1)$$

where M (kg) is the load mass of water, $C_p$ (kJ/kg°C) is
the specific heat of water, Q (m³) is the LPG consumption,
ΔT (°C) is the temperature rise and $H_v$ (kJ/m³)
denotes the heating value of LPG. An analysis of
variance (ANOVA) on the data collected indicated that
the quadratic regression model shown in Eqn. (2) could
adequately describe the thermal efficiency (η). A stepwise
method was used to remove the insignificant terms.

$$\eta = 0.49 - 0.05A - 0.038B + 0.17C - 0.039CD + 0.021B^2 - 0.18C^2 \qquad (2)$$

where $-1 \le A, B, C \le 1$ correspond to
$400 \le Re \le 700$, $1.4 \le EqR \le 1.8$ and $4 \le H/d \le 8$.

The adjusted multiple correlation and the adequate 
precision coefficients were found to be 0.95 and 29.63 
respectively. 
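
As a worked illustration of Eq. (1), the short Python sketch below computes the thermal efficiency from the measured quantities. The 4 kg load and the 30°C to 80°C temperature rise come from the test procedure described above; the specific heat of water is taken as approximately 4.18 kJ/kg°C, and the gas consumption and heating value used in the example call are hypothetical figures chosen only to show the arithmetic.

```python
def thermal_efficiency(mass_kg, cp_kj_per_kg_c, delta_t_c,
                       gas_volume_m3, heating_value_kj_per_m3):
    """Thermal efficiency in percent: useful heat over heat input, Eq. (1)."""
    heat_input_kj = gas_volume_m3 * heating_value_kj_per_m3
    if heat_input_kj <= 0:
        raise ValueError("gas consumption and heating value must be positive")
    useful_heat_kj = mass_kg * cp_kj_per_kg_c * delta_t_c
    return useful_heat_kj / heat_input_kj * 100.0


# Hypothetical measurement: 0.0135 m^3 of LPG at 100,000 kJ/m^3.
efficiency = thermal_efficiency(mass_kg=4.0, cp_kj_per_kg_c=4.18,
                                delta_t_c=80.0 - 30.0,
                                gas_volume_m3=0.0135,
                                heating_value_kj_per_m3=100_000.0)
```

With these assumed inputs the useful heat is 836 kJ against a heat input of 1,350 kJ, giving an efficiency of roughly 61.9%.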

Burner emission models 

Similarly, based on the values of Adjusted Multiple 
Correlation and Adequate Precision coefficients gener- 
ated from the ANOVA results, the regression models 
shown in Eqn.(3) and Eqn.(4) adequately describe the 
CO and NOx emissions. 

$$\mathrm{CO} = 1790.22 + 745.27A + 557.01B - 1721.41C + 280.57AB - 251.97AC - 39.19BC + 31.54A^2 - 67.06B^2 + 481.87C^2 \qquad (3)$$

$$\mathrm{NO_x} = 51.10 - 6.99A - 6.26B + 23.90C + 2.22AB - 3.77AC - 5.58BC \qquad (4)$$

It is noted from the efficiency and emission models 
that the combined effects of some design factors can 
have considerable impact on the responses. 
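
The emission models of Eqn. (3) and Eqn. (4) can be evaluated for a given operating point as sketched below. The sketch assumes the usual linear coding of each factor onto [-1, 1] over the ranges stated for Re, EqR and H/d; the coding formula and the function names are assumptions made for illustration, as the text does not spell them out.

```python
def code_factor(value, low, high):
    """Map a natural value in [low, high] linearly onto the coded range [-1, 1]."""
    centre = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return (value - centre) / half_range


def co_model(a, b, c):
    """CO emission regression model, Eqn. (3), in coded variables."""
    return (1790.22 + 745.27 * a + 557.01 * b - 1721.41 * c
            + 280.57 * a * b - 251.97 * a * c - 39.19 * b * c
            + 31.54 * a ** 2 - 67.06 * b ** 2 + 481.87 * c ** 2)


def nox_model(a, b, c):
    """NOx emission regression model, Eqn. (4), in coded variables."""
    return (51.10 - 6.99 * a - 6.26 * b + 23.90 * c
            + 2.22 * a * b - 3.77 * a * c - 5.58 * b * c)


def predict_emissions(reynolds, eq_ratio, h_over_d):
    """Return (CO, NOx) for natural-unit inputs within the studied ranges."""
    a = code_factor(reynolds, 400.0, 700.0)
    b = code_factor(eq_ratio, 1.4, 1.8)
    c = code_factor(h_over_d, 4.0, 8.0)
    return co_model(a, b, c), nox_model(a, b, c)


centre_co, centre_nox = predict_emissions(550.0, 1.6, 6.0)
```

At the centre of the design space all coded variables are zero, so the predictions reduce to the intercepts of the two models. The interaction terms (AB, AC, BC) are what express the combined effects noted above.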



ANN OPTIMIZATION OF COOKTOP 
DESIGN 

For this MIMO system, the architecture of the neural
network can have several layers. Each layer has a
weight matrix W, a bias vector b, and an output vector
a. Generally, a network of two layers, where the first
layer is sigmoid and the second layer is linear, can be
employed to approximate any function reasonably well,
and such a structure was adopted in this case. The two
layers of neurons with nonlinear transfer functions allow
the network to learn nonlinear and linear relationships
between input and output vectors.

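The output of the i-th neuron in the hidden layer is:

$$a1_i = f_1\left(\sum_j w1_{ij}\, p_j + b1_i\right), \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots$$

The output of the k-th neuron in the output layer is:

$$a2_k = f_2\left(\sum_j w2_{kj}\, a1_j + b2_k\right), \quad k = 1, 2, \ldots;\ j = 1, 2, \ldots$$

The error function is defined as:

$$E(W, B) = \frac{1}{2} \sum_k (t_k - a2_k)^2$$

The gradient method was used to calculate the weight
variation and the back-propagation of the error when
training the network.

The weight variation of the output layer is:

$$\Delta w2_{kj} = -\eta \frac{\partial E}{\partial w2_{kj}} = -\eta \frac{\partial E}{\partial a2_k} \frac{\partial a2_k}{\partial w2_{kj}}$$

$$\Delta b2_k = -\eta \frac{\partial E}{\partial b2_k} = -\eta \frac{\partial E}{\partial a2_k} \frac{\partial a2_k}{\partial b2_k}$$

The weight variation of the hidden layer is:

$$\Delta w1_{ij} = -\eta \frac{\partial E}{\partial w1_{ij}} = -\eta \sum_k \frac{\partial E}{\partial a2_k} \frac{\partial a2_k}{\partial a1_i} \frac{\partial a1_i}{\partial w1_{ij}}$$

$$\Delta b1_i = \eta\, \delta_i$$

Here $p_j$ is the j-th input, $t_k$ the k-th target, η the learning
rate, and $\delta_i$ the error term back-propagated to hidden neuron i.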
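
The update rules can be made concrete with a small pure-Python sketch of one gradient-descent step for a two-layer network with a sigmoid-type hidden layer and a linear output layer. It is an illustration only: the use of tanh as the hidden transfer function, the learning rate, the layer sizes and the initialization range are assumptions, not the settings used in the study.

```python
import math
import random


def init_network(n_in, n_hidden, n_out, seed=0):
    """Random initialization of weights and biases (range is an assumption)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return w1, b1, w2, b2


def forward(p, w1, b1, w2, b2):
    """Hidden outputs a1 (tanh) and network outputs a2 (linear)."""
    a1 = [math.tanh(sum(w * x for w, x in zip(row, p)) + b)
          for row, b in zip(w1, b1)]
    a2 = [sum(w * a for w, a in zip(row, a1)) + b for row, b in zip(w2, b2)]
    return a1, a2


def train_step(p, t, w1, b1, w2, b2, lr=0.05):
    """One in-place update; returns E = 1/2 * sum_k (t_k - a2_k)^2 before it."""
    a1, a2 = forward(p, w1, b1, w2, b2)
    err = [tk - ak for tk, ak in zip(t, a2)]           # equals -dE/da2_k
    # Hidden error terms, computed before the output weights are changed.
    delta = [(1.0 - a1[i] ** 2) * sum(err[k] * w2[k][i] for k in range(len(w2)))
             for i in range(len(a1))]
    for k in range(len(w2)):                            # output layer
        for i in range(len(a1)):
            w2[k][i] += lr * err[k] * a1[i]
        b2[k] += lr * err[k]
    for i in range(len(w1)):                            # hidden layer
        for j in range(len(p)):
            w1[i][j] += lr * delta[i] * p[j]
        b1[i] += lr * delta[i]
    return 0.5 * sum(e * e for e in err)


# Three inputs (Re, EqR, H/d) and three outputs (efficiency, CO, NOx), scaled.
w1, b1, w2, b2 = init_network(n_in=3, n_hidden=4, n_out=3, seed=1)
sample_in, sample_out = [0.2, -0.5, 1.0], [0.3, -0.2, 0.7]
first_error = train_step(sample_in, sample_out, w1, b1, w2, b2)
for _ in range(300):
    last_error = train_step(sample_in, sample_out, w1, b1, w2, b2)
```

Repeating the step on a single scaled sample drives the error towards zero, which is a quick sanity check on the signs of the updates; a real training run would cycle over all training samples.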

Before training, it is necessary to scale the inputs and
targets so that they always fall within a specified range.
In the neural network model above, it was found that if
the pre-processing and post-processing procedures are omitted, the
network can hardly achieve the design goal. This is
due to the large differences in magnitude between the input
and output parameters; e.g., efficiency is in the range of
0.4 to 0.68, but CO emission (in ppm) is in the range of
hundreds. All the input and output data were therefore scaled
to fall within the range [-1, 1]. The
initialization of the weights and biases for each layer is
important before training a feed-forward network. If
improper values are chosen, training time becomes
excessive and the network may not converge. As
the characteristics of the actual operating condition are
unknown, random initialization is used.
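
The scaling step can be sketched as a linear min-max mapping onto [-1, 1] together with its inverse for post-processing the network outputs. The function names are hypothetical, and the sample values are made up within the ranges quoted above.

```python
def fit_range(values):
    """Return the (minimum, maximum) of a variable, used as scaling constants."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("a constant variable cannot be scaled")
    return low, high


def scale(value, low, high):
    """Pre-processing: map [low, high] linearly onto [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def unscale(scaled, low, high):
    """Post-processing: map a network output in [-1, 1] back to natural units."""
    return (scaled + 1.0) * (high - low) / 2.0 + low


# Made-up samples: efficiency as a fraction, CO emission in ppm.
efficiency = [0.40, 0.52, 0.61, 0.68]
co_ppm = [180.0, 257.0, 640.0, 910.0]
eff_low, eff_high = fit_range(efficiency)
co_low, co_high = fit_range(co_ppm)
scaled_eff = [scale(v, eff_low, eff_high) for v in efficiency]
scaled_co = [scale(v, co_low, co_high) for v in co_ppm]
```

After scaling, both variables span the same interval, so neither dominates the error function because of its units.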

Figure 1. MSE trend





For each cooktop burner configuration, 90 sets 
of experiment results were used to train the neural 
network, and the remaining 18 were used to evalu- 
ate whether the network can represent the model of 
cooktop well. Mean square error (MSE) was used to
test the performance of the network. It was found that
the neural network performed reasonably well for the
testing data. 32 neurons were used in the hidden layer,
and after an average training time of 200 to 300 epochs,
the MSE could easily drop to $10^{-5}$ (Fig. 1).

As the NN model performed quite well after train- 
ing by using the actual experimental data, it is used to 
simulate the real environment of the experiments. After 
a simulation of all the reasonable combinations of the 
three initial factors, a maximum value of the thermal 
efficiency (corresponding to low CO and NOx emis- 
sions) was found. 
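
The search over all reasonable combinations can be sketched as a plain grid search followed by a constrained selection. The step sizes (5 for Re, 0.05 for the equivalence ratio, 1 for H/d) and the acceptance limits (efficiency above 60%, CO below 300 ppm, NOx below 100 ppm) are those quoted in this article; including both endpoints of each range is an assumption, and `predict` stands for any trained model and is hypothetical here.

```python
def grid(low, high, step):
    """Evenly spaced values from low to high inclusive."""
    n_intervals = int(round((high - low) / step))
    return [low + i * step for i in range(n_intervals + 1)]


def best_design(predict, min_eff=0.60, max_co=300.0, max_nox=100.0):
    """Return (eff, Re, EqR, H/d, CO, NOx) with the highest feasible efficiency.

    predict(re, eqr, h_over_d) must return (efficiency, co_ppm, nox_ppm).
    Returns None when no grid point satisfies the limits.
    """
    best = None
    for re in grid(400, 700, 5):
        for eqr in grid(1.4, 1.8, 0.05):
            for h_over_d in grid(4, 8, 1):
                eff, co, nox = predict(re, eqr, h_over_d)
                feasible = eff > min_eff and co < max_co and nox < max_nox
                if feasible and (best is None or eff > best[0]):
                    best = (eff, re, eqr, h_over_d, co, nox)
    return best


def toy_model(re, eqr, h_over_d):
    """Stand-in for the trained network, peaking at Re=580, EqR=1.65, H/d=8."""
    eff = 0.70 - abs(re - 580) / 1000.0 - abs(eqr - 1.65) - 0.01 * (8 - h_over_d)
    return eff, 200.0, 50.0


optimum = best_design(toy_model)
```

Counting intervals rather than grid points gives 60 x 8 x 4 = 1,920 combinations per configuration, which matches the figure quoted in the text; the sketch itself visits the endpoints as well.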

There were four nozzle configurations in the simulation,
the same as in the experiment. For each nozzle
configuration, the number of simulations was:

[(700-400)/5]*[(1.8-1.4)/0.05]*[(8-4)/1] = 1,920

After the simulation, the first step was to choose
the results that could meet the Chinese National Standard
on Gas Appliances: thermal efficiency >60%,
CO emission <300 ppm, NOx emission <100 ppm.
The second step was to find the maximum value of the
thermal efficiency; from this value the corresponding
combination of input factors and configuration type of
cooktop were found. The simulation results showed that
while the 2-nozzle and 3-nozzle configurations satisfied
the National Standards, the 3-nozzle configuration
gave slightly better results: the thermal efficiency
reaches 62.8% and the CO and NOx emissions are
257 ppm and 91 ppm respectively. Fig. 2 shows a comparison
of thermal efficiency, CO and NOx emissions for each
nozzle configuration.

Figure 2. Comparison of ANN simulation results

To confirm the validity of the NN results, three
additional experiments based on the optimum values
were carried out. Table 1 shows a comparison of the
predicted and observed responses for the 3-nozzle
burner configuration under optimal conditions. It can
be seen that the predicted values of efficiency and NOx
emission are close to the experimental values. The
differences between the predicted and the observed values
are within about 15%.



FUTURE TRENDS

Through this study one can see that artificial neural
networks, as in other real-life applications, offer cooktop
designers a highly versatile new tool for the thermal
design of gas-fired cooktop burners. The successful
utilization of this procedure requires that the ANN be
fully optimized. In this study, backpropagation, a local
search algorithm, was used for optimization and it is
likely that a local optimum rather than the global one was
obtained. One may attempt to address this problem
with ad hoc procedures such as stopping optimization
at sub-optimal solutions (not over-training) or adjusting
the neural network architecture to make optimization
easier. As shown in other engineering applications [11],
a promising alternative would be the use of a global 
optimization algorithm such as the genetic algorithm. 



It is anticipated that by using a global search algorithm, 
the objective function can be set to balance the tradeoff 
between the over parameterization of the model that 
may over fit the data and a parsimonious ANN that can 
provide a more robust solution. 

Table 1. A comparison of NN prediction and experimental results at optimal design

Re    EqR    H/d    Efficiency             CO (ppm@0%O2)          NOx (ppm@0%O2)
                    Predicted  Observed    Predicted  Observed    Predicted  Observed
580   1.65   8      0.63       0.60        257        292         91         98


CONCLUSION 

A problem facing the gas-fired cooktop burner design 
community is the determination of design parameters 
which will result in a product with the most desirable 
combination of functional outcomes. Through an 
empirical investigation on the simultaneous optimiza- 
tion of thermal efficiency and GHG emissions, ANN 
methodology has proved to be an effective empirical 
modeling tool. Both the multiple correlation ($R^2$) and the
mean square error (MSE) of ANN models for efficiency,
CO and NOx emissions indicated that the ANN results
were quite satisfactory. Through multiple regression
modeling one can evaluate the significance of the main 
and combined effects of various design parameters 
on burner efficiency and emissions. The relationship 
between the gas-fired burner design parameters and 
performance is hence further understood. To enhance the 
optimizing capabilities of the proposed ANN, a global 
search algorithm such as GA should be used instead 
of the backpropagation algorithm. It is believed that a 
GA-based ANN would provide a practical tool for the 
mechanical engineering design community. 



KEY TERMS

Backpropagation: A supervised learning technique
used for training artificial neural networks. It is most 
useful for feed-forward networks (networks that have 
no feedback, or simply, that have no connections that 
loop). The term is an abbreviation for "backwards 
propagation of errors". Backpropagation requires that 
the transfer function used by the artificial neurons (or 
"nodes") be differentiable. 

Full Factorial Design: This design allows a designer 
to adequately quantify a response with a reasonable 
number of tests. In general, full factorial designs re- 
quire three levels for each factor thus allowing one to 
evaluate second order models. 

Design of Experiments: A set of tests conducted
under controlled conditions in which multiple levels 
of a set of factors are manipulated and the resulting 
response(s) of a system or process is measured or 
observed. 

Factors: The set of (independent) variables that 
are believed to affect the response of a system or 
process. 

Levels: The sets of values for each factor. 

Multilayer Perceptrons (MLPs): They are feed- 
forward neural networks trained with the standard 
backpropagation algorithm. They are supervised 
networks so they require a desired response to be 
trained. They learn how to transform input data into a 
desired response, so they are widely used for pattern 
classification. With one or two hidden layers, they can 
approximate virtually any input-output map. 

Multiple Correlation, $R^2$: The percent of variance
in the dependent variable explained collectively by all 
of the independent variables. 

Response: The (dependent) variable(s) measured 
or observed that is the result of a test conducted at a 
specific set of factor levels. 

Reynolds Number (Re): The ratio of inertial forces
to viscous forces and consequently it quantifies the rela- 
tive importance of these two types of forces for given 
flow conditions. Thus, it is used to identify different 
flow regimes, such as laminar or turbulent flow. 



INTRODUCTION

Positron Emission Tomography (PET) is a radiotracer
imaging technique based on the administration (typically
by injection) of compounds labelled with positron-emitting
radionuclides to a patient under study. When
the radio-isotope decays, it emits a positron, which
travels a short distance before annihilating with an
electron. This annihilation produces two high-energy
(511 keV) gamma photons propagating in nearly opposite
directions, along an imaginary line called the Line
of Response (LOR).

In PET imaging, the photons emitted by the decaying
isotope are detected with gamma cameras. These
cameras consist of a lead collimator to ensure that all
detected photons are propagated along parallel paths,
a crystal scintillator to convert high-energy photons to
visible light, photo-multiplier tubes (PMT) to transform
light signals into electric signals, and associated
electronics to determine the position of each incident
photon from the light distribution in the crystal (Ollinger
& Fessler, 1997).

We have investigated how Artificial Neural Networks
(henceforth ANNs or NNs) could be used for
bias-corrected position estimation. Small-scale ANNs
like the ones considered in this work can be easily
implemented in hardware, due to their highly parallelizable
structure. Therefore, we have tried to take
advantage of the capabilities of ANNs for modelling
the real detector response.



BACKGROUND

Traditionally, Anger logic (Anger, 1958) has been the
most popular technique to obtain the position of
the centroid, or centre, of the light distribution inside
the scintillator crystal by means of a simple formula.
The solution proposed by Anger involves connecting
the PMT outputs to a simple resistor division circuit
to obtain only four signals ($X^-$, $X^+$, $Y^-$, $Y^+$). However,
Anger logic introduces some important drawbacks in
the detection process: non-uniform spatial behaviour,
differences between the PMT gains and the deformation
of the light distribution when it approaches the edge of
the scintillator. These problems are alleviated by using
correction maps.

However, the presence of all these phenomena in 
traditional detectors still reduces the intrinsic resolu- 
tion and produces non-uniform compression artifacts 
in the image and the so called border effects. The main 
consequence is an unavoidable reduction of the Useful 
Field Of View (UFOV) of the PET camera, which usu- 
ally covers up to 60% of each crystal dimension. 

With other methods such as Statistics Based Positioning
(SBP) or Maximum Likelihood (ML) positioning,
this UFOV can be increased to approximately
80% of each dimension of the crystal, but these methods
involve a heavier computational cost (Joung, Miyaoka,
Kohlmyer & Lewellen, 2001; Chung, Choi, Song, Jung,
Cho, Choe, Lee, Kim & Kim, 2004).

These drawbacks have not been fully overcome
yet. Therefore, our proposal to introduce ANNs in the
detection process as good-quality estimators is
well-grounded.

Some previous research has been carried out in this area
for PMT-based (A.M. Bronstein, M.M. Bronstein, Zibulevsky
& Zeevi, 2003) and Avalanche Photodiode (APD)
based (Bruyndockx, Leonard, Tavernier, Lemaitre &
Devroede, 2004) detectors using neural networks. In
this work, the detectors are based on continuous scintillators
and Multi-Anode PMTs (MA-PMTs) employing
charge division read-out circuits (Siegel, Silverman,
Shao & Cherry, 1996).



ANN APPROACH TO 2D POSITIONING 
IN PET 

Materials and Methods 

We have employed the GEANT4 (Agostinelli, 2002)
simulation toolkit to model the detector and to generate
realistic inputs for the NN. The electronic read-out of
the resistor circuit was performed with SPICE analysis.
The supervised training and validation of the ANNs have
been carried out with the MATLAB Neural Networks
Toolbox (The Mathworks, Inc., 2004). We have chosen
the RPROP algorithm (Riedmiller & Braun, 1993)
because it proved to converge faster than the standard
gradient descent algorithm and other variants such as
the Levenberg-Marquardt algorithm. Radial basis (RB)
networks were also considered but were discarded in
the end due to their inferior performance.

Figure 1. Siegel's DPC diagram

Detector Characteristics 

The model of the detector under study comprises a 49 x 49 x 10 mm³ continuous slab of LSO scintillator crystal coupled to a Hamamatsu H8500 Flat-Panel MA-PMT. The read-out electronics is a conventional DPC-like resistive charge division circuit that proves to model Anger's logic accurately. Taking the resistor network pattern used by Aliaga et al. (Aliaga, Martinez, Gadea, Sebastia, Benlloch, Sanchez, Pavon & Lerche, 2006) as a starting point, we have designed a new resistor network based on the architecture proposed by S. Siegel (Siegel, Silverman, Shao & Cherry, 1996) (Fig. 1) that allows us to estimate the 2D position with better results. As in the previous design, all 64 channels (one per anode of the H8500) are coded into only 4 output lines, which are then fed into current sensitive preamplifiers. The current-ratio matrices A, B, C and D corresponding to each output were obtained from the electronic read-out using SPICE analysis. The network was analyzed by applying the superposition theorem for electric circuits.
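As a rough illustration of how 64 anode signals collapse into 4 outputs, the following sketch assumes hypothetical current-ratio matrices that split each anode's charge linearly between four corner outputs. The real matrices A, B, C and D come from the SPICE analysis and are not reproduced here; the centroid formula and the corner labelling are likewise assumptions made for the example.

```python
import numpy as np

N = 8  # 8 x 8 anode grid, i.e. the 64 channels of the MA-PMT

# Hypothetical current-ratio matrices (assumption, not the SPICE result):
# each anode splits its charge bilinearly between four corner outputs.
u = np.linspace(0.0, 1.0, N)        # normalised anode coordinate
ux, uy = np.meshgrid(u, u)          # ux varies along columns, uy along rows
A = (1 - ux) * uy
B = ux * uy
C = (1 - ux) * (1 - uy)
D = ux * (1 - uy)

def dpc_outputs(charges):
    """Code an 8 x 8 array of anode charges into the 4 output signals."""
    return np.array([(charges * M).sum() for M in (A, B, C, D)])

def anger_centroid(J):
    """Centroid estimate in normalised units, each axis in [-1, 1]."""
    a, b, c, d = J
    total = a + b + c + d
    return ((b + d) - (a + c)) / total, ((a + b) - (c + d)) / total

charges = np.zeros((N, N))
charges[2, 5] = 100.0               # all charge collected on a single anode
x, y = anger_centroid(dpc_outputs(charges))
```

Because the coding is linear, the superposition principle applies: the outputs for any light distribution are the sum of the outputs produced by each anode on its own.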

Neural Networks 

Given a collimated source S of γ photons with origin at $(x_s, y_s, z_s)$ emitting perpendicularly to the detector surface, we can describe the interaction of a photon in the detector as a random variable $X \to A$, where A is a vector of elements $a_i$, the number of photoelectrons arriving at each anode of the MA-PMT. Thus, the elements of the vector J are the inputs of the NN, which can be written as

$$J_k = \sum_{i} a_i \, G_i \, R_{ik} \qquad (1)$$

where $J_k$ is the kth output of the charge division network, G the vector of pad gains of the MA-PMT (in our case randomly distributed between 1 and 3) and $R_{ik}$ the transfer function of the DPC from the ith anode to the kth output of the resistor network.

The Universal Approximation Theorem (Haykin, 1999) claims that any continuous function, defined over a determined region, can be approximated uniformly, with arbitrary precision, by a Multi-Layer Perceptron (MLP) of two hidden layers. Then, our position estimator can be expressed as

$$\hat{X} = F(J; W, b)$$

W and b being the weights and biases of each neuron of the NN. In order to adapt the NN estimator to a function f, we begin from a training set composed of pairs $(J_i, X_i)$, where $X_i = (x_i, y_i)$ is the position of the source and $J_i = f(X_i)$ is a realization of the outputs of the charge division network for an interaction with origin at position $X_i$. Thus, the weights and biases of the NN are modified following a gradient descent algorithm (backpropagation) to minimize the mean squared error

$$E = \frac{1}{N} \sum_{i=1}^{N} \left\| X_i - F(J_i) \right\|^2$$

where F is the transfer function of the NN. Initial values of weights and biases are usually determined following the Nguyen-Widrow rule (Nguyen & Widrow, 1990).

Optimization

The detector surface was partitioned into 49x49 positions of 1 mm² each. A total of 1000 valid events were generated at each position using GEANT4. Of these 1000 events, 500 were used to compose training subsets and the remaining 500 to compose test subsets, as depicted in Figure 2.

For each network architecture, the results of 20 different training runs were averaged, each one with different initial weights and arbitrary gains for each anode. The number of epochs was fixed at 800 to ensure convergence in all cases. We preferred not to use cross-validation, as there was no scarcity of patterns to train the network. The chosen activation function was the hyperbolic tangent (tanh).
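The training loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions: plain full-batch gradient descent stands in for RPROP, small Gaussian weights stand in for the Nguyen-Widrow initialization, and a synthetic bilinear mapping stands in for the GEANT4/SPICE data. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Small random weights and zero biases for layer sizes such as [4, 15, 8, 2]."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, J):
    """Activations of every layer; hidden layers use tanh, the output is linear."""
    acts = [J]
    for k, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if k == len(params) - 1 else np.tanh(z))
    return acts

def mse(params, J, X):
    return np.mean(np.sum((forward(params, J)[-1] - X) ** 2, axis=1))

def train_step(params, J, X, lr=0.01):
    """One full-batch gradient descent step on the mean squared error."""
    acts = forward(params, J)
    delta = 2.0 * (acts[-1] - X) / len(J)            # dE/d(output)
    new_params = []
    for k in reversed(range(len(params))):
        W, b = params[k]
        grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
        if k > 0:                                     # back-propagate through tanh
            delta = (delta @ W.T) * (1.0 - acts[k] ** 2)
        new_params.append((W - lr * grad_W, b - lr * grad_b))
    return new_params[::-1]

# Synthetic stand-in data: positions X and four bilinear "DPC outputs" J.
X = rng.uniform(-1.0, 1.0, (200, 2))
px, py = X[:, 0], X[:, 1]
J = np.stack([(1 - px) * (1 + py), (1 + px) * (1 + py),
              (1 - px) * (1 - py), (1 + px) * (1 - py)], axis=1) / 4.0

params = init_mlp([4, 15, 8, 2])
loss_before = mse(params, J, X)
for _ in range(500):
    params = train_step(params, J, X)
loss_after = mse(params, J, X)
```

Averaging over several runs, as done in the study, would amount to repeating this loop with different seeds and gain vectors and averaging the resulting test errors.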

We restricted our study to MLPs with two hidden layers. A third layer would only increase complexity without showing any significant improvement. Our analysis showed that increasing the number of neurons in the second hidden layer improves the linearity of the response, reducing the systematic error, while increasing the number of neurons in the first hidden layer improves the spatial resolution.

There are two different approaches for 2D positioning: a single NN with 4 inputs and 2 outputs, or two independent NNs for 1D positioning on each axis. For the first scheme, we have simulated MLPs with two hidden layers and up to 25 hidden neurons to prevent overtraining, considering N1 + N2 < 25 and N1 > N2, where N1 is the number of neurons in the first hidden layer and N2 is the number of neurons in the second one. Thus, the NN architectures are represented as: number of inputs/N1/N2/number of outputs.

Figure 2. Methodology to obtain the training/test subsets: 1000 events are generated at each of the 49x49 positions and split into 500 for training and 500 for test

Figure 3. 2D positioning histogram for (a) centroid (Anger) based estimator and (b) NN estimator, using a 49x49 grid with 1 mm spacing
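To give an idea of the size of this search space, the short sketch below enumerates the candidate architectures. It assumes the inequalities are taken literally (strict) and that both hidden layers have at least one neuron; the function name is hypothetical.

```python
def candidate_architectures(n_inputs=4, n_outputs=2, max_hidden=25):
    """List (inputs, N1, N2, outputs) with N1 + N2 < max_hidden and N1 > N2 >= 1."""
    return [(n_inputs, n1, n2, n_outputs)
            for n2 in range(1, max_hidden)
            for n1 in range(n2 + 1, max_hidden - n2)]

archs = candidate_architectures()
labels = ["/".join(map(str, a)) for a in archs]   # e.g. "4/15/8/2"
```

Under these assumptions there are 132 candidates per scheme, which is small enough to evaluate exhaustively.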

Results 

The optimum network architecture found using a single MLP was 4/15/8/2. However, the best results were achieved with the double MLP estimator, with a 4/9/6/1 architecture, reaching a mean systematic error below 0.4 mm over almost the whole detector FOV.

In Fig. 3, we can observe a 2D positioning histogram for a grid of 49x49 points with 1 mm spacing, using both a centroid estimator (a) and a NN estimator (b).

The figure clearly shows that the centroid approach introduces non-uniform compression artifacts at the borders of the crystal, while the NN estimator successfully corrects these artifacts and produces a more uniform FOV in both dimensions. This improvement is quite significant, as the UFOV increases from 3 x 3 mm² to 40x40 mm², which means approximately 90% of the MA-PMT effective area for normal incidence (Mateo, Aliaga, Martinez, Monzo & Gadea, 2007).



FUTURE TRENDS 

It would be desirable to develop a method to extend this approach to Depth Of Interaction (DOI) estimation, especially to deal with oblique incidence. With this objective, additional work is being carried out by our group (Lerche, Benlloch, Sanchez, Pavon, Gimenez, Fernandez, Gimenez, Escat, Cerda, Martinez & Sebastia, 2005). This would add a fifth input, Z, to the ANN, which would enable a very accurate and fully 3D reconstruction of the interaction point within the scintillator.

It would also be interesting to implement the ANN training on a hardware platform, to perform fast on-site training and to enable us to calibrate the PET instrumentation automatically.

And last, but not least, we are working on a high precision testbench (Fig. 4), recently presented at the Real Time Conference 2007 (Monzo, Aliaga, Herrero, Martinez, Mateo, Sebastia, Mora, Benlloch & Pavon, 2007), which allows several simulation tools to be linked, one for each part of the PET system, enabling us to model the effects of the electronics in each part of the design separately. In this setup, there are separate analog and digital parts. The analog part is composed of an Application Specific Integrated Circuit (ASIC), and it includes an implementation of our DPC. The digital part consists of a Field Programmable Gate Array (FPGA) board, where the neural network is going to be embedded, among other elements related to signal processing. In addition, we intend to install a radioactive source to obtain real stimuli instead of our current synthetic data.

Figure 4. Block diagram of the high precision testbench under development



CONCLUSION 

ANNs have proved to be good position estimators for PET, and an interesting alternative to traditional Anger logic. The benefits of using ANN-based position estimators include lower systematic errors, lower standard deviations of the systematic error, increased UFOV (up to 90% of the MA-PMT effective area, for normal incidence), fewer compression artifacts at the crystal borders and slightly better spatial resolution, especially at the borders.

Regarding the DPC circuit, it allowed a reduction of complexity both in terms of number of variables and in terms of hardware resources.



KEY TERMS

Anger Logic : A classic procedure to obtain the posi- 
tion of incidence of a photon on the scintillator crystal, 
which requires connecting the photomultiplier outputs 
to a resistive network to obtain only four outputs. With 
these signals, the position of the scintillation centroid 
is easily obtained using a simple formula. This method 
is acceptable in the central area of the crystal but it 
introduces a considerable error near its borders. 

Depth Of Interaction (DOI): Depth inside the 
scintillator crystal where a photon interacts and produces 
a light distribution. Its 2D coordinates coincide with 
those of the incidence point for normal incidence but 
they differ slightly for oblique incidence. Therefore, its 
determination is vital for oblique incidence cases. 



Discretized Positioning Circuit (DPC): An ana- 
log resistive network that receives a large amount of 
currents and "codes" them into a reduced number of 
them, introducing a minimum delay. These new cur- 
rents are linear combinations of those generated by 
the photodetectors. 

Gamma Camera: A camera that detects gamma 
rays (often called Anger camera). 

Multi-Layer Perceptron (MLP): A kind of feed- 
forward neural network which has at least one hidden 
layer of neurons. 

Neural Network: A network of many simple proc- 
essors ("units" or "neurons") that imitates a biological 
neural network. The units are connected by unidirec- 
tional communication channels, which carry numeric 
data. Neural networks can be trained to find nonlinear 
relationships in data, and are used in applications such 
as robotics, speech recognition, signal processing or 
medical diagnosis. 

Photomultiplier Tube (PMT): A part of the PET detector that receives the electromagnetic energy from the scintillator crystal and transforms that energy into electric pulses. This conversion is done in two stages: first the photons are absorbed, producing free electrons, and then a cascade amplification takes place.

Positron Emission Tomography (PET): PET is a nuclear imaging technique based on the administration of radioactive substances (radiotracers), whose molecules contain a radioactive isotope (radionuclide), to a patient under study, with the aim of tracing some chemical or physiological process that takes place in the body, typically for the diagnosis of heart diseases, cancer, etc. The images obtained in a PET system are 2D sections of the concentration distribution of a radiotracer inside the body. By joining these sections, a medical 3D image can be obtained.

Scintillator Crystal: When a particle interacts inside a scintillator crystal, it deposits energy. The scintillator crystal re-emits part of that energy as photons in the visible spectrum. To allow this light to be measured from the outside, the crystal must also be transparent to that light. This is done by doping the crystal so that permitted states are created in the forbidden band of the material.

Useful Field of View (UFOV): Area of the scintillator crystal surface on which the incidence of gamma rays produces reasonable estimations of the position of incidence.



INTRODUCTION

This paper presents the preliminary studies for the creation of a new tool to assist in medical diagnosis. The tool will help in the analysis of 2D-PAGE images. Its purpose is to build a 2D-PAGE master image of an ideal patient (who may be healthy or ill) that facilitates and speeds up future diagnoses. The creation of this master image has motivated the development of a tool to align gel images, which makes it easier to match the proteins in the ideal image with those of a new image. Because the image registration process is quite complex, we use Intel's OpenCV library, which provides functions to calculate optical flow and translation vectors.

This library introduces into the project a set of parameters that are unfamiliar to the physician. To solve this, an automatic selection of values for this set of parameters is necessary. This task is carried out with the Evolutionary Computation technique called Particle Swarm Optimization (Kennedy, J. & Eberhart, R., 1995).



BACKGROUND 

Throughout 20th century medicine, the number of medical images kept growing. X-ray images, magnetic resonance images, 2D gel images and angiographies are some examples. The major difficulty for the physician is to integrate all this information in order to offer a diagnosis.

Since computers started being used for analyzing and processing images at the end of the 20th century, one of the most important application fields of computer image processing has been the treatment of medical images. It is here that Evolutionary Computation and Neural Network techniques are necessary, because they facilitate certain adjustment processes that would otherwise be extraordinarily complex or laborious. Among the techniques most commonly used for processing biomedical images are Artificial Neural Networks, Genetic Algorithms, Particle Swarm Optimization, Splines and Region Growing.

As examples, we can highlight the use of Artificial Neural Networks for the analysis of radiological images; Genetic Algorithms (Holland, J.H., 1975) in the 3D reconstruction of anthropologic models (Santamaria, J., Cordon, O., Damas, S., Aleman, I., Botella, M., 2006) and in the integration of the information obtained by means of different methods such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) (Rouet, J. M., Jacq, J. J., Roux, C., 2000); Particle Swarm Optimization for the alignment of 2D and 3D biomedical images (Wachowiak, M. P., Smolikova, R., Zheng, Y., Zurada, J. M., Elmaghraby, A. S., 2004); and Splines for 2D-PAGE registration (Seow, N., Sowmya, A., Sun, C., 2005).

In our case, the technique used is Particle Swarm Optimization, applied to improve the analysis of 2D-PAGE images (Seoane, J. A., Mesejo, P., Ruiz-Romero, C., Dorado, J., Pazos, A., Blanco, F. J., 2007).



PSO APPLIED TO THE OPTIMIZATION
OF OPENCV PARAMETERS FOR
2D-PAGE ANALYSIS

The aim of this investigation is to help physicians identify certain characteristics in 2D-PAGE images (Ruiz-Romero, C., Lopez-Armada, M. J., Blanco, F. J., 2005). To do so, the image registration process consists of aligning the master image, in which every protein has been labelled, with an image whose points of interest have been marked by the physician for study. This registration process makes it easier for the physician to study the presence or absence of a certain kind of protein and its concentration.

2D-PAGE 

This work uses images known as 2D-PAGE (polyacrylamide gel electrophoresis). The process to obtain these images uses electrophoresis, a well-known analytical technique for separating macromolecules (DNA or proteins). This separation is driven by the mobility of electrically charged macromolecules when a voltage difference is applied. The method immobilizes the molecules under study in a gelatinous material, in this case polyacrylamide. This process was well described by (Bueno Garcia, G. 2005) as "A differential voltage is applied to the gel with the biological samples inside during a concrete period of time. Each molecule will migrate through the gel pores with a different speed, which is dependent of the electrical charge and the mass of each molecule."

On the one hand, the proteins in the resulting gel are separated along the X-axis according to their isoelectric point, that is, the pH at which an amphoteric substance carries no net electrical charge. On the other hand, along the Y-axis they are sorted by their molecular mass.

The resulting image helps us detect the presence or absence of a certain protein, or even a higher or lower protein concentration. This information helps us determine the existence or absence of some illness or characteristic.

OpenCV 

OpenCV is an open source Computer Vision library developed in C++. It is optimized for real-time problems and is platform independent. It is especially oriented to image manipulation and processing, and to the analysis of motion within images. Further information about this library can be found in (Agam, G., 2006).

To solve the problem, we used one of the optical flow functions available in the OpenCV library, specifically the function known as CalcOpticalFlowBM. This function divides two images into blocks and searches the second image for each block of the first one. After this search, the function returns a set of motion vectors corresponding to the displacements of the blocks of the image. For more information about this topic see (Department of Electrical Engineering, Nara National College of Technology's Web Page, 2006) and (Intel's Web Page).

This function is very useful for our problem because we need an image alignment tool to create the proteomic diagnostic image. To align two images, we need a function that compares them and reports the displacement between them. The function looks for similarities using the statistical correlation between sets of pixels of the two images. Locating a protein in the new image then only requires the motion vector and the block containing that spot.

The function CalcOpticalFlowBM has a set of parameters; the ones to be optimized are:

blockSize: the size of the blocks into which the image is divided for comparison.

maxRange: the maximum size of the neighborhood around a block that is explored to find that block in the second image.

These parameters are optimized with an artificial intelligence technique because, otherwise, they would have to be tuned manually by the user.
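To make the role of the two parameters concrete, here is a minimal NumPy sketch of exhaustive block matching. It is not the OpenCV implementation and does not call the library: the function name is hypothetical, and the sum of absolute differences is assumed as the matching cost for simplicity.

```python
import numpy as np

def block_matching(prev, curr, block_size=8, max_range=4):
    """For each block of `prev`, find its displacement (dy, dx) in `curr`."""
    h, w = prev.shape
    rows, cols = h // block_size, w // block_size
    flow = np.zeros((rows, cols, 2), dtype=int)
    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * block_size, c * block_size
            block = prev[y0:y0 + block_size, x0:x0 + block_size].astype(float)
            best = None
            # Explore every shift inside the max_range neighborhood.
            for dy in range(-max_range, max_range + 1):
                for dx in range(-max_range, max_range + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block_size > h or x1 + block_size > w:
                        continue  # candidate block falls outside the image
                    cand = curr[y1:y1 + block_size, x1:x1 + block_size].astype(float)
                    cost = np.abs(block - cand).sum()
                    if best is None or cost < best:
                        best = cost
                        flow[r, c] = (dy, dx)
    return flow

rng = np.random.default_rng(0)
img = rng.random((32, 32))
shifted = np.roll(img, (2, 3), axis=(0, 1))   # move content 2 rows down, 3 columns right
flow = block_matching(img, shifted, block_size=8, max_range=4)
```

A block that is too small matches ambiguously, while a search range that is too short misses large displacements; this trade-off is what makes the two parameters worth optimizing.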

Particle Swarm Optimization 

The Evolutionary Computation technique used in this work is Particle Swarm Optimization (PSO). Inspired by the social behavior of swarms in nature, it was developed by Kennedy and Eberhart (Kennedy, J. & Eberhart, R., 1995). In a PSO algorithm a swarm of particles explores the search space. Each particle represents a possible solution to the optimization problem. The position of each particle is influenced by the best position visited by the particle itself (its own experience) and by the best position of its neighborhood (its social experience). When the particle's neighborhood is the whole swarm, the best position in the neighborhood is the global best position, and the resulting algorithm is called gBest PSO. When a smaller neighborhood is used, the algorithm is called lBest PSO. The fitness of each particle (how far it is from the optimum) is calculated using a function that depends on the specific optimization problem. Each particle i in the swarm is represented by:

$x_i$: the particle's current position.
$v_i$: the particle's current velocity.
$y_i$: the particle's personal best position.

The personal best position of particle i is the best position visited so far by that particle. If f is the objective function, then the personal best position at time t is updated as (Eq. 1):

$$y_i(t+1) = \begin{cases} y_i(t), & \text{if } f(x_i(t+1)) \geq f(y_i(t)) \\ x_i(t+1), & \text{if } f(x_i(t+1)) < f(y_i(t)) \end{cases} \qquad (1)$$



If the global best position is denoted by the vector $\hat{y}$, then:

$$\hat{y}(t) \in \{y_0(t), y_1(t), \ldots, y_s(t)\} \;\big|\; f(\hat{y}(t)) = \min\{f(y_0(t)), f(y_1(t)), \ldots, f(y_s(t))\} \qquad (2)$$

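where s denotes the swarm size. In lBest, with neighborhoods $N_j$, the best particle of each neighborhood is denoted by $\hat{y}_j$ and is defined by:

$$\hat{y}_j(t+1) \in \{N_j \mid f(\hat{y}_j(t+1)) = \min\{f(y_i(t))\},\ \forall y_i \in N_j\} \qquad (3)$$

where

$$N_j = \{y_{j-l}(t), y_{j-l+1}(t), \ldots, y_{j-1}(t), y_j(t), y_{j+1}(t), \ldots, y_{j+l}(t)\} \qquad (4)$$

and l is the number of index neighbors taken on each side.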

The neighborhoods are usually defined by the particle's index, but can also be defined by topological relations. It is easy to see that gBest is only a particular case of lBest, where the neighborhood is the whole swarm. Notice that lBest produces more diversity in the solutions, but it also has a higher computational cost than gBest.

In each iteration of the PSO algorithm, the velocity $v_i$ is updated for each dimension $j \in 1, \ldots, N_d$, where $N_d$ is the dimension of the problem. Thus, $v_{i,j}$ represents the jth element of the velocity vector of the ith particle. Bearing this in mind, the velocity of particle i is updated by the following equation:

$$v_{i,j}(t+1) = w\, v_{i,j}(t) + c_1 r_{1,j}(t)\,\big(y_{i,j}(t) - x_{i,j}(t)\big) + c_2 r_{2,j}(t)\,\big(\hat{y}_j(t) - x_{i,j}(t)\big) \qquad (5)$$

The most important terms are:

The learning rates (weights) $c_1$ and $c_2$, which determine the relative influence of the cognitive and social learning components.

$r_{1,j}, r_{2,j} \sim U(0, 1)$, which introduce randomness into the algorithm.

The inertia weight w, which controls the influence of the previous velocity. High values of this term favor global exploration, whereas low values favor local exploitation.

The cognitive component, $y_i(t) - x_i(t)$, which represents the particle's own experience of where the best solution lies.

The social component, $\hat{y}(t) - x_i(t)$, which represents the swarm's knowledge of where the best solution lies.

The position $x_i$ of particle i is updated using Equation 6:

$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (6)$$

The PSO algorithm repeatedly applies the update equations described above until a maximum number of iterations is exceeded or the velocity updates are close to zero. The fitness value of the function gives us the quality of the solution.

Specifically, the PSO algorithm used to optimize the OpenCV parameters is a complete model: the values of $c_1$ and $c_2$ are both non-zero, so the social and cognitive components are both taken into account. In the tests, these values were fixed at 2, as recommended in (Kennedy, J. and Mendes, R., 2006). The inertia weight was set between 0.8 and 1.2 (Eberhart, R. and Shi, Y., 2000).

For more information about PSO, the following works can be consulted: (Omran, M.G., Engelbrecht, A.P. and Salman A., 2005) and (Kennedy, J., Eberhart, R. C., and Shi, Y., 2001).
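The gBest update rules can be put together in a compact, standard-library-only sketch. It is an illustration rather than the authors' implementation: the function name, swarm size, iteration count and the quadratic test function are assumptions, and the global best is refreshed as soon as a particle improves on it.

```python
import random

def gbest_pso(f, bounds, n_particles=20, n_iter=100, w=0.8, c1=2.0, c2=2.0, seed=0):
    """Minimise f over the box `bounds` with a gBest particle swarm."""
    rnd = random.Random(seed)
    dim = len(bounds)
    x = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    y = [p[:] for p in x]                      # personal best positions
    fy = [f(p) for p in x]                     # personal best fitness values
    g = min(range(n_particles), key=fy.__getitem__)
    y_hat, f_hat = y[g][:], fy[g]              # global best position and fitness
    history = [f_hat]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                v[i][j] = (w * v[i][j]
                           + c1 * r1 * (y[i][j] - x[i][j])     # cognitive term
                           + c2 * r2 * (y_hat[j] - x[i][j]))   # social term
                x[i][j] += v[i][j]
            fx = f(x[i])
            if fx < fy[i]:                     # personal best update
                y[i], fy[i] = x[i][:], fx
                if fx < f_hat:                 # global best update
                    y_hat, f_hat = x[i][:], fx
        history.append(f_hat)
    return y_hat, f_hat, history

def sphere(p):
    return sum(q * q for q in p)

best, f_best, history = gbest_pso(sphere, [(-5.0, 5.0)] * 2)
```

For the problem treated here, a particle would presumably hold the pair (blockSize, maxRange), rounded to integers before each evaluation, and f would measure the intensity difference between matched blocks.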

Combining all Elements 

The PSO algorithm is used, in this case, to establish the block size and the search area in the images. To do so, we try to minimize the pixel intensity differences between every pair of blocks considered to be equal.

When the PSO has finished, we have the optimal parameter configuration for the OpenCV function.

Some examples of this application are shown in Figure 1.

In the left image of the figure we can see the new gel to be analyzed, in which we are interested in the three proteins marked in red. After applying the block matching technique with the OpenCV parameters optimized by the PSO system, we obtain the correct position of the proteins on the master image, as can be seen in the central image.



FUTURE TRENDS 

Once the protein has been correctly located by means of the techniques described above, the main future aim is to integrate the intensity of the protein into the master image. A possible way of integrating the new proteins into the master would be to use the alpha channel (transparency of the image), setting the alpha level to one for the fragment of the image to be integrated and to zero for the target image, so that this fragment is merged into the target image. To make the integration as smooth as possible at the edges of the integrated spot, a gradient will be applied by means of a low-pass filter. A low-pass filter preserves the low frequencies, smoothing the image and its noise and reducing the variability of the image. A median filter could also be used for this smoothing, since it is less sensitive to outliers. The gradient could also be built manually, choosing a window size that includes the point of interest, so that the gradient between the margin of the window and the limit of the point is proportional to the distance to the spot.
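The alpha-channel idea can be sketched as follows. This is only one possible reading of the proposal: the function names are hypothetical, and a separable box filter is assumed as the low-pass filter that feathers the edges of the spot mask.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box filter, a simple low-pass filter, with edge padding."""
    if radius == 0:
        return img.astype(float)
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(img.astype(float), radius, mode="edge")
    smooth = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(smooth, 1, padded)    # filter along x
    return np.apply_along_axis(smooth, 0, rows)      # then along y

def integrate_spot(master, new_gel, spot_mask, radius=1):
    """Blend the masked spot of `new_gel` into `master` with feathered edges."""
    alpha = box_blur(spot_mask, radius)              # 1 inside the spot, 0 far away
    return alpha * new_gel + (1.0 - alpha) * master

master = np.zeros((9, 9))
new_gel = np.ones((9, 9))
mask = np.zeros((9, 9))
mask[3:6, 3:6] = 1.0                                 # 3 x 3 spot to integrate
blended = integrate_spot(master, new_gel, mask)
```

Inside the spot the result follows the new gel, far from it the master is untouched, and the pixels around the border take intermediate values, which is the smooth transition sought.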

The final aim would be, by means of some technology 
of artificial intelligence (Artificial Neural Networks or 
Expert Systems), to advise the doctor in the diagnosis, 
for which, obviously, the previous tool should have 
been calibrated and adjusted correctly. 



Figure 1. (Left) New gel; (center) master; (right) new gel with the intermediate step



CONCLUSION

The first conclusion we can draw from this work is that Evolutionary Computation techniques can be used to assist in medical decision making. It has been shown that these techniques support the identification of certain points of interest. This is useful for physicians, since they do not need to know how the computational tool works, in particular the parameters that control the library function that performs the alignment.

Besides, these techniques do not yield a single adjustment but a set of them, which allows the user to choose the best result from their point of view.
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Wallace, A. R. (1858) On the Tendency of Varieties to 
Depart Indefinitely From the Original Type 

KEY TERMS 

Amphoteric Substance: A substance that can react as either an acid or a base.

Area of the Search Space: Set of specific ranges 
or values of the input variables that constitute a subset 
of the search space. 

Artificial Neural Networks: System composed of 
many simple processing elements operating in parallel 
whose function is determined by network structure, 
connection strengths, and the processing performed 
at computing elements or nodes. 

Electrophoresis: Separation of molecules (proteins 
or nucleic acids) in an electric field as a function of their 
molecular weight and/or their electric charge. 

Evolutionary Computation: Generic term used to indicate any population-based metaheuristic optimization algorithm that uses mechanisms inspired by biological evolution (Darwin, C., 1859) (Wallace, A. R., 1858), such as reproduction, mutation and recombination.



Genetic Algorithm: An algorithm for optimizing a 
property based on an evolutionary mechanism that uses 
replication, deletion, and mutation processes carried out 
over many generations. (Goldberg, D.E., 1989) (Fogel, 
L.J., Owens, A.J. & Walsh, M.A. 1966) 

Particle: Each of the elements that explore the search 
space in a Particle Swarm Optimization algorithm. 

Particle Swarm Optimization: Evolutionary Computation technique that bases its functioning on natural swarm behaviour, such as that of birds. This algorithm uses a swarm of particles to explore the search space.

Polyacrylamide: Acrylate polymer formed from 
acrylamide subunits that is readily cross-linked. 

Protein: A molecule composed of a long chain of amino acids. Proteins are the principal constituents of cellular material.

Search Space: Set of all possible situations that the problem we want to solve could ever be in.



INTRODUCTION

According to the World Health Organization (WHO), the directing and coordinating authority for health within the United Nations system (http://www.who.int/cancer/en/), from a total of 58 million deaths in 2005, cancer accounts for 7.6 million (or 13%) of all deaths worldwide. This places cancer as one of the leading causes of death in the world, with lung cancer (the main cancer leading to mortality) accounting for 1.3 million deaths per year. Thus the importance of understanding the mechanisms of lung cancer is clear. One approach is through the rapid quantification of the gene expression levels of samples of healthy and diseased lung tissue. This new field blending the knowledge from biologists, computer scientists and mathematicians is known as Bioinformatics and is yielding large quantities of data of a very high dimensional nature that need to be understood.



BACKGROUND 

The increasing complexity of the data analysis proce- 
dures makes it more difficult for the user (not necessar- 
ily a mathematician or data mining expert), to extract 
useful information out of the results generated by the 
various techniques. This makes graphical representation 
directly appealing; for which Virtual Reality (VR) is a 
suitable paradigm. Virtual Reality is flexible; it allows 
the construction of different virtual worlds representing 
the same underlying information, but with a different 
look and feel. VR allows immersion, that is, the user 
can navigate inside the data, interact with the objects in 
the world. VR creates a living experience. The user is 
not merely a passive observer but an actor in the world. 
VR is broad and deep. The user may see the VR world 
as a whole, and/or concentrate the focus of attention on 



specific details of the world. Of no less importance is 
the fact that in order to interact with a Virtual World, 
no mathematical knowledge is required, and the user 
only needs minimal computer skills. A virtual reality 
technique for visual data mining on heterogeneous, 
imprecise and incomplete information systems was 
introduced in (Valdes, J.J., 2002) (Valdes, J.J., 2003) 
(see also http://www.hybridstrategies.com). 

The purpose of this article is to explore the construction of high quality VR spaces for visual data mining (as opposed to classical data mining (Fayyad, U., Piatesky-Shapiro, G., & Smyth, P., 1996)) using a multi-objective optimization technique applied to the understanding of a publicly available lung cancer gene expression data set. This approach provides both a solution for the previously discussed problem, and the possibility of obtaining a set of spaces in which the different objectives are expressed in different degrees, with the proviso that no other spaces could improve any of the considered criteria individually (if spaces are constructed using the solutions along the Pareto front). This strategy represents a conceptual improvement in comparison with spaces computed from the solutions obtained by single-objective optimization algorithms in which the objective function is a weighted composition involving different criteria.
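The notion of a Pareto front can be illustrated with a small sketch. The candidate values below are invented purely for illustration; each pair stands for two objectives to be minimised (for instance a structure-preservation error and a classification error).

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates (minimisation)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Invented example values, not results from the article.
solutions = [(0.10, 0.30), (0.20, 0.20), (0.30, 0.10), (0.25, 0.25), (0.40, 0.40)]
front = pareto_front(solutions)
```

Every point on the front is a different trade-off between the two objectives; a weighted single-objective formulation would return only one of them, determined by the chosen weights.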



THE MULTI-OBJECTIVE APPROACH: 
A HYBRID PERSPECTIVE 

In order to establish a formulation of the problem based 
on multi-objective optimization, a set of objective 
functions has to be specified, representing the corre- 
sponding criteria that must be simultaneously satisfied 
by the solution. The minimization of a measure of 
similarity information loss between the original and the 
transformed spaces and a classification error measure 
over the objects in the new space can be used in a first
approximation. Clearly, more requirements can be 
imposed on the solution by adding the corresponding 
objective functions. Following a principle of parsimony 
this paper will consider the use of only two criteria, 
namely, Sammon's error (Sammon, J.W., 1969) for the 
unsupervised case and mean cross-validated classifica- 
tion error with a k-nearest neighbour pattern recognizer 
for the supervised case. 

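The proximity (or similarity) of an object to another
object may be defined by a distance (or similarity)
calculated over the independent variables and can be
defined by using a variety of measures. In the present
case a normalized Euclidean distance is chosen:

$$ d_{ij} = \sqrt{\frac{1}{p} \sum_{k=1}^{p} \left( v_{k(i)} - v_{k(j)} \right)^{2}} \qquad (1) $$

where $p$ is the number of attributes and $v_{k(i)}$ is the
value of attribute $k$ for object $i$.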



Structure Preservation: An 
Unsupervised Perspective 

Examples of error measures frequently used for structure 
preservation (Kruskal, J., 1964) (Sammon, J.W., 1969) 
(Borg, I., & Lingoes, J., 1987) are: 
(4)
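
Of the structure-preservation measures cited above, Sammon's error is the one used later in this article. The following is a minimal Python sketch, assuming the standard definition E = (1 / sum_{i<j} delta_ij) * sum_{i<j} (delta_ij - zeta_ij)^2 / delta_ij, where delta_ij are the distances in the original space and zeta_ij the distances in the transformed space. The function names, and the helper that averages squared differences over the number of attributes, are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np


def normalized_euclidean(X):
    """Pairwise Euclidean distances, with squared differences averaged over
    the p attributes (this normalization is an assumption of the sketch)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2) / X.shape[1])


def sammon_error(delta, zeta):
    """Sammon's error between two symmetric distance matrices.

    delta: distances in the original space (off-diagonal entries must be > 0)
    zeta:  distances in the transformed (e.g. 3D) space
    """
    delta = np.asarray(delta, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    upper = np.triu_indices_from(delta, k=1)  # each pair i < j exactly once
    d, z = delta[upper], zeta[upper]
    return float(((d - z) ** 2 / d).sum() / d.sum())


# Three objects: only the pair (0, 2) is distorted by the mapping.
delta = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
zeta = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
print(sammon_error(delta, zeta))  # 0.125
```

A value of zero means that all pairwise distances are reproduced exactly; larger values indicate a stronger distortion of the similarity structure.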



For heterogeneous data involving mixtures of nominal
and ratio variables, the Gower similarity measure
(Gower, J.C., 1973) has proven to be suitable. The
similarity between objects $i$ and $j$ is given by

$$ S_{ij} = \frac{\sum_{k=1}^{p} s_{ijk}}{\sum_{k=1}^{p} w_{ijk}} \qquad (5) $$

where $p$ is the number of attributes and the weight of
the attribute ($w_{ijk}$) is set equal to 0 or 1 depending on
whether the comparison is considered valid for attribute $k$.
If $v_{k(i)}$ and $v_{k(j)}$ are the values of attribute $k$ for
objects $i$ and $j$ respectively, an invalid comparison
occurs when at least one of them is missing. In this
situation $w_{ijk}$ is set to 0.

For quantitative attributes (like the ones of the data sets
used in the paper), the scores $s_{ijk}$ are assigned as

$$ s_{ijk} = 1 - \frac{\left| v_{k(i)} - v_{k(j)} \right|}{R_k} $$

where $R_k$ is the range of attribute $k$. For nominal
attributes

$$ s_{ijk} = \begin{cases} 1 & \text{if } v_{k(i)} = v_{k(j)} \\ 0 & \text{otherwise} \end{cases} $$

This measure can be easily extended for ordinal,
interval, and other kinds of variables. Also, weighting
schemes can be incorporated for considering differential
importance of the descriptor variables.
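
The Gower computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented by `None`, the function name and argument layout are hypothetical, and raising an error when no attribute can be compared is a choice of this sketch rather than something the article specifies.

```python
def gower_similarity(a, b, ranges, nominal):
    """Gower similarity between two objects with possibly missing values.

    a, b:     attribute values of the two objects (None marks a missing value)
    ranges:   range R_k of each quantitative attribute (unused for nominal ones)
    nominal:  True where attribute k is nominal, False where it is quantitative
    """
    score_sum = 0.0
    weight_sum = 0
    for va, vb, r, is_nominal in zip(a, b, ranges, nominal):
        if va is None or vb is None:
            continue  # invalid comparison: w_ijk = 0
        weight_sum += 1  # valid comparison: w_ijk = 1
        if is_nominal:
            score_sum += 1.0 if va == vb else 0.0
        else:
            score_sum += 1.0 - abs(va - vb) / r
    if weight_sum == 0:
        raise ValueError("no attribute can be compared for this pair")
    return score_sum / weight_sum


obj_i = [1.0, "x", None]
obj_j = [3.0, "x", 5.0]
print(gower_similarity(obj_i, obj_j, ranges=[4.0, None, 10.0],
                       nominal=[False, True, False]))  # 0.75
```

In the example the third attribute is skipped because one value is missing, so the similarity is (0.5 + 1.0) / 2.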

Multi-Objective Optimization Using 
Genetic Algorithms 

An enhancement to the traditional evolutionary
algorithm (Bäck, T., Fogel, D.B., & Michalewicz, Z.,
1997) is to allow an individual to have more than one
measure of fitness within a population. One way in
which such an enhancement may be applied is through
the use of, for example, a weighted sum of more than
one fitness value (Burke, E.K., & Kendall, G., 2005).
Multi-objective optimization, however, offers another
possible way of enabling such an enhancement. In
the latter case, the evolutionary algorithm faces the
problem of selecting individuals for inclusion in the
next population, because a set of individuals contained
in one population exhibits a Pareto Front (Pareto, V.,
1896) of best current individuals, rather than a single
best individual. Most multi-objective algorithms
(Burke, E.K., & Kendall, G., 2005) use the concept of
dominance to address this issue.

A solution $x^{(1)}$ is said to dominate (Burke, E.K., &
Kendall, G., 2005) a solution $x^{(2)}$ for a set of $m$
objective functions $\langle f_1(x), f_2(x), \ldots, f_m(x) \rangle$ if

1. $x^{(1)}$ is not worse than $x^{(2)}$ over all objectives.
   For example, $f_i(x^{(1)}) \le f_i(x^{(2)})$ if $f_i(x)$ is a
   minimization objective.

2. $x^{(1)}$ is strictly better than $x^{(2)}$ in at least one
   objective. For example, $f_6(x^{(1)}) > f_6(x^{(2)})$ if
   $f_6(x)$ is a maximization objective.
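
The two dominance conditions translate directly into code. The sketch below assumes that every objective is to be minimized, as is the case for the two error measures used in this article; the function names and the sample values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    not_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return not_worse and strictly_better


def non_dominated(points):
    """Indices of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


# Hypothetical (classification error, Sammon error) pairs
solutions = [(0.0, 0.9), (0.5, 0.1), (0.2, 0.3), (0.3, 0.5)]
print(non_dominated(solutions))  # [0, 1, 2]
```

The last solution is dominated by (0.2, 0.3), which is better on both objectives; the remaining three form the Pareto front approximation of this small set.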



One particular algorithm for multi-objective opti- 
mization is the elitist non-dominated sorting genetic 
algorithm (NSGA-II) (Deb, K., Pratap, A., Agarwal, 
S., & Meyarivan, T., 2000), (Deb, K., Agarwal, S., 
Pratap, A., & Meyarivan, T., 2000), (Deb, K., Agarwal, 
S., & Meyarivan, T., 2002), (Burke, E.K., & Kendall, 
G., 2005). It has the features that it i) uses elitism, ii)
uses an explicit diversity preserving mechanism, and
iii) emphasizes the non-dominated solutions.

Original Study 

Gene expressions were compared in (Spira, A., Beane, 
J., Pinto-Plata, V., Kadar, A., Liu, G., Shah, V., Celli, 
B., & Brody, J.S., 2004) for severely emphysematous 
lung tissue (from smokers at lung volume reduction 
surgery) and normal or mildly emphysematous lung tis- 
sue (from smokers undergoing resection of pulmonary 
nodules). The original database contained 30 samples 
(18 severe emphysema, 12 mild or no emphysema), 
with 22,283 attributes. Genes with large detection P- 
values were filtered out, leading to a data set with 9,336 
genes that were used for subsequent analysis. Nine 
classification algorithms were used to identify a group 
of genes whose expression in the lung distinguished 
severe emphysema from mild or no emphysema. First,
model selection was performed for every algorithm
by leave-one-out cross-validation, and the gene list 
corresponding to the best model was saved. The genes 
reported by at least four classification algorithms (102 
genes) were chosen for further analysis. With these 
genes, a two-dimensional hierarchical clustering using 
Pearson's correlation was performed that distinguished 
between severe emphysema and mild or no emphysema. 
Other genes were also identified that may be causally 
involved in the pathogenesis of the emphysema. Data
was from: http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=737.

Experimental Settings 

Each sample in this study is a vector in a high dimen- 
sional space, and therefore, direct inspection of the 
structure of this data, and of the relationship between 
the descriptor variables (the genes) and the type of 
sample (normal or cancer), is impossible. Moreover, 
within the collection of genes there is a mixture of 
potentially relevant genes with others which are irrelevant,
noisy, etc. The need for simultaneously finding
a visual representation (3D) respecting (as much as 
possible) the set of object interrelationships as defined 
by the original attributes, and the construction of a new 
feature space effectively differentiating the two classes 
of objects present, makes this problem suitable for a 
multi-objective optimization approach. 




Table 1. Experimental settings for computing the Pareto-optimal solution approximations by the multi-objective
genetic algorithm (PGAPack (Levine, D., 1996) extended by NSGA-II).

The collection of parameters describing the application
of the NSGA-II algorithm is shown in Table 1.
A modest population size and number of generations 
were used, with a relatively high mutation probability in 
order to enable richer genetic diversity. Randomization 
of the set of data objects was applied in order to reduce 
the bias in the composition of the cross-validated folds 
by providing a more even class distribution between 
successive training and test subsets. The number of 
folds was set in consideration of the sample size. 
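
The supervised objective, a mean cross-validated k-nearest-neighbour error computed over randomized folds, can be sketched as follows. This is a minimal illustration, not the authors' code: the values of `k`, `n_folds` and `seed` are placeholders (the actual settings belong to Table 1), class labels are assumed to be non-negative integers, and plain Euclidean distance is used.

```python
import numpy as np


def knn_cv_error(X, y, k=3, n_folds=5, seed=0):
    """Mean k-NN classification error over shuffled cross-validation folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # randomize the object order
    folds = np.array_split(order, n_folds)   # near-equal test folds
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Distances from every test object to every training object
        d = np.linalg.norm(X[test_idx][:, None, :] - X[train_idx][None, :, :],
                           axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]
        votes = y[train_idx][nearest]
        pred = np.array([np.bincount(v).argmax() for v in votes])
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))


# Two well separated synthetic classes give a zero error.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.random((10, 2)), rng.random((10, 2)) + 10.0])
y_demo = np.array([0] * 10 + [1] * 10)
print(knn_cv_error(X_demo, y_demo))  # 0.0
```

Shuffling before splitting is what keeps a fold from being filled with objects of a single class when the data file is ordered by class.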

Results

The set of non-dominated solutions obtained by the
NSGA-II algorithm is shown in the scatter plot of
Fig-1(a), where the horizontal axis is the mean
cross-validated knn error and the vertical axis the Sammon
error. The approximate location of the Pareto front is
defined by the convex polygon joining the solutions
provided by chromosomes 2, 1, 10, etc. Chromosome
2 defines a space with a perfect resolution of the
supervised problem in terms of the "no or mild emphysema"
and "severe emphysema" classes (knn error = 0), but
at the cost of a severe distortion of the space. Chromosome
1, in contrast, approximates a pure unsupervised
solution (with low Sammon error). Its classification
error is large, indicating that a few non-linear features
preserving the similarity structure lack classification
power. This may be due to the large amount of attribute
noise, redundancy, and irrelevancy within the set of
22,283 original genes.

Clearly, it is impossible to represent virtual reality
spaces on a static medium. However, a composition of
snapshots of the VR spaces using the solutions along
the Pareto front approximation is shown in Fig-1(b-d).
Different mappings (even with important differences
from the point of view of the mapping error) lead to
similar 3D visual representations, which indicates good
solution reproducibility. The similarities are associated
with the main distributions of the clouds of points, which
are preserved, while there might be local discrepancies
with respect to the placement of some objects.

A solution satisfying classification error as much
as possible (actually with 0-error) is shown in Fig-1(b)
where both classes are separated into 2 main clouds
of points and a distinct point, Object 6, positioned
separately from the clouds. It can be seen that Object
6 is positioned relatively differently in the spaces that
comprise the best Sammon error Fig-1(d) and trade-offs
between the classification error objective and Sammon
error objective Fig-1(c). This is why, visually, the latter
space represents a compromise solution between the
two goals and why it is a trade-off between the two
objective functions. It should be remembered that the
class information is not used at all for computing the
spaces. Chromosome 10, according to Fig-1(a) and
Fig-1(c), can be considered to be the best multi-objective
compromise solution in which both error criteria are
simultaneously as low as possible. It shows reasonable
class discrimination without a large distortion of the
similarity structure, which is a very meaningful result.



FUTURE TRENDS 

Visualization of data is of potential interest for various
research communities, and the authors have applied
various visualization approaches to other medical
data, such as data on scleroderma skin disease, breast
cancer, Alzheimer's disease, and leukemia. The authors
do not, however, restrict their work to medical data:
they have also carried out preliminary investigations of
data from, for example, the fields of hydrochemistry
and geophysical prospecting.



CONCLUSION 

A multi-objective optimization approach was introduced
for the problem of computing virtual reality spaces in
the context of visual data mining and knowledge
discovery applied to relational structures (e.g. databases).
The multi-objective procedure was based on NSGA-II
using two objective functions representative of
unsupervised and supervised criteria (mean cross-validated
knn error as a measure of misclassification, and
Sammon error as a measure of similarity structure loss).
This methodology was applied to the analysis of high 
dimensional genomic data collected in the framework 
of Lung cancer research. A Pareto front approximation 
was recognizable from within the solutions provided 
by the final population. Selected solutions from along 
that approximation were used for the construction of 
a sequence of visualizations showing the progression 
from spaces with complete class separation and poor 
similarity preservation to spaces with reversed
characteristics. A solution with a reasonable compromise
between the two criteria was identified and clearly
contained properties of both extreme solution spaces.
These research results, although preliminary, showed
large potential and further investigation is required.

Figure 1. Set of 100 multi-objective solutions. Those along the Pareto front approximation progressively span the
extremes between minimum classification error and minimum dissimilarity loss. 3 solutions were selected and
snapshots of VR spaces computed. Geometries: "light grey spheres" = no or mild emphysema samples, "dark
grey spheres encased within a convex hull" = severe emphysema samples. Behavior = static.



ACKNOWLEDGMENT 

The authors would like to thank Robert Orchard from 
the Integrated Reasoning Group (National Research 
Council Canada, Institute for Information Technol- 
ogy) for his constructive criticism of the first draft of 
this paper. 



REFERENCES 

Bäck, T., Fogel, D.B., & Michalewicz, Z. (1997).
Handbook of Evolutionary Computation. Institute of Physics
Publishing and Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional
similarity structure analysis. Springer-Verlag.

Burke, E.K., & Kendall, G. (2005). Search
Methodologies: Introductory Tutorials in Optimization and
Decision Support Techniques. Springer Science and
Business Media, Incorporated.

Deb, K., Agarwal, S., & Meyarivan, T. (2002). A fast
and elitist multi-objective genetic algorithm: NSGA-II.
IEEE Transactions on Evolutionary Computation, 6
(2), 181-197.

Deb, K., Agarwal, S., Pratap, A., & Meyarivan, T.
(2000). A fast elitist non-dominated sorting genetic
algorithm for multi-objective optimization: NSGA-II.
Proceedings of the Parallel Problem Solving from
Nature VI Conference, 849-858.

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T.
(2000). A fast and elitist multi-objective genetic
algorithm: NSGA-II. Technical Report 2000001, Kanpur
Genetic Algorithms Laboratory, Indian Institute of
Technology Kanpur.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996).
From data mining to knowledge discovery. In U. Fayyad
et al., editors, Advances in Knowledge Discovery and Data
Mining, AAAI Press, 1-34.

Gower, J.C. (1973). A general coefficient of similarity
and some of its properties. Biometrics, 1(27):857-871.

Kruskal, J. (1964). Multidimensional scaling by
optimizing goodness of fit to a nonmetric hypothesis.
Psychometrika, 29:1-27.

Levine, D. (1996). Users Guide to the PGAPack
Parallel Genetic Algorithm Library. Argonne National
Laboratory, Argonne, IL.

Pareto, V. (1896). Cours d'Economie Politique, volume
I and II. F. Rouge, Lausanne.

Sammon, J.W. (1969). A non-linear mapping for data
structure analysis. IEEE Transactions on Computers,
C18:401-408.

Spira, A., Beane, J., Pinto-Plata, V., Kadar, A., Liu, G.,
Shah, V., Celli, B., & Brody, J.S. (2004). Gene Expression
Profiling of Human Lung Tissue from Smokers with
Severe Emphysema. American Journal of Respiratory
Cell and Molecular Biology 31, 601-610.

Valdes, J.J. (2002). Virtual reality representation of
relational systems and decision rules. In P. Hajek,
editor, Theory and Application of Relational Structures
as Knowledge Instruments, Prague, Meeting of the
COST Action 274.

Valdes, J.J. (2003). Virtual reality representation of
information systems and decision rules. In Lecture
Notes in Artificial Intelligence, LNAI 2639,
Springer-Verlag, 615-618.



KEY TERMS 

Cancer: A term for diseases in which abnormal 
cells divide without control and can invade other tis- 
sues. Cancer cells can spread to other parts of the body 
through the blood and lymph systems. Cancer is not 
just one disease but many diseases. There are more
than 100 different types of cancer.
http://www.cancer.gov/cancertopics/what-is-cancer

Evolutionary Algorithms: A subset of evolutionary
computation, which generally only involves techniques
inspired by biological evolution such as reproduction,
mutation, recombination, natural selection and survival
of the fittest. Candidate solutions to an optimization
problem play the role of individuals in a population, and
the fitness function determines the environment within
which the solutions "live". Evolution of the population
then takes place after the repeated application of the
above operators.
http://en.wikipedia.org/wiki/Evolutionary_Computation

Gene: 1. A unit of DNA that carries information 
for the biosynthesis of a specific product in the cell. 
2. Ultimate unit by which inheritable characteristics 
are transmitted to succeeding generations in all living 
organisms. Genes are contained by, and arranged along 
the length of, the chromosome. The gene is composed 
of deoxyribonucleic acid (DNA). Each chromosome 
of a species has a definite number and arrangement of 
genes, which govern both the structure and metabolic 
functions of the cells and thus of the entire organism. 
Genes provide information for the synthesis of enzymes 
and other proteins and specify when these substances 
are to be made. Alteration of either gene number or 
arrangement can result in mutation (a change in the
inheritable traits).
http://www.amfar.org/cgi-bin/iowa/bridge.html?page=G

Hybrid Space: A constructed space that attempts 
to preserve more than one property (possibly in con- 
flict) of the original space. For example, preserving 
distances between objects and the class structure of 
the original space. 



Multi-objective Algorithm: An optimization algo- 
rithm that attempts to find the best solutions across all 
measures of solution acceptability. That is, the Pareto 
Front is sought, even under the situation that it may 
not be theoretically known. 

Unsupervised Algorithm: The true class that 
an object belongs to is not known to the algorithm; 
hence the algorithm is not supervised by a "teacher". 
For example, clustering algorithms are unsupervised 
because each cluster is generated based on the data 
itself. Although the true class may also be known, it 
was not used. 

Virtual Reality: (often called VR for short) An
attempt to provide more natural, human interfaces
to software. It can be as simple as a pseudo 3D
interface or as elaborate as an isolated room in which
the computer can control the user's senses of vision,
hearing, and even smell and touch.
http://www.saugus.net/Computer/Terms/



NOTE 

Copyright is held by Her Majesty the Queen in
Right of Canada.
INTRODUCTION

The explosive growth in decision-support systems over 
the past 30 years has yielded numerous "intelligent" 
systems that have often produced less-than-stellar 
results (Michalewicz Z. et al., 2005). The increasing 
trend in developing intelligent systems based on neural 
networks is attributed to their capability of learning 
nonlinear problems offline with selective training, 
which can lead to sufficiently accurate online response. 
Artificial neural networks have been used to solve many 
problems obtaining outstanding results in various ap- 
plication areas such as power systems. Power systems 
applications can benefit from such intelligent systems; 
particularly for voltage stabilization, where voltage 
instability in power distribution systems could lead to 
voltage collapse and thus power blackouts. 

This article presents an intelligent system which de- 
tects voltage instability and classifies voltage output of 
an assumed power distribution system (PDS) as: stable, 
unstable or overload. The novelty of our work is the 
use of voltage output images as the input patterns to the 
neural network for training and generalizing purposes, 
thus providing a faster instability detection system that 
simulates a trained operator controlling and monitoring 
the 3-phase voltage output of the simulated PDS. 



BACKGROUND 

Artificial Neural Networks have been used to solve 
many problems obtaining outstanding results in various 
applications such as classification, clustering, pattern 
recognition and forecasting among many other applica- 
tions corresponding to different areas. 



Power system stability is the property of a power 
system which enables it to remain in a state of equilib- 
rium under normal operating conditions and to regain 
an acceptable state of equilibrium after a disturbance. 
Beyond a certain level, the decrease of power system 
stability margins can lead to unacceptable operating 
conditions and/or to frequent power system collapses 
(Sjostrom M. et al., 1999) (Ernst D. et al., 2004). In 
2003 and within less than two months, a number of 
blackouts happened around the world, affecting mil- 
lions of people. These blackouts include (Novosel D. 
et al., 2004): 

- The 14th of August blackout in Northeast United
  States and Canada, which is considered one of the
  worst blackouts in the history of these countries,
  affecting approximately 50 million people.
- The 28th of August blackout in London, which
  affected commuters during the rush hour.
- The 23rd of September blackout in Sweden and
  Denmark, which affected approximately 5 million
  people.
- The 28th of September blackout in Italy, which
  is considered the worst blackout in Europe ever,
  affecting approximately 57 million people.

In recent years voltage instability has been one
of the major reasons for blackouts, and it was the root
cause of the 14 August blackout. Voltage stability is
threatened when a disturbance increases the reactive
power demand beyond the sustainable capacity of the
available reactive power resources. Although progress
in the areas of communication and digital technology
has increased the amount of information available at
the efficient supervisory control and data acquisition
(SCADA) systems, during events that cause
outages an operator may be overwhelmed by the
excessive number of simultaneously operating alarms,
which increases the time required for identifying the
main outage cause and then starting the restoration
process (De Souza A.C.Z. et al., 1997) (Lukomski R. &
Wilkosz K., 2003). Additionally, factors such as stress
and human error can affect the operator's performance;
thus, there is a need for an additional tool to support the
existing real-time decision-making process.
This tool can be in the form of an intelligent voltage
instability detector.

The implementation of neural networks for stabilizing
power systems in general has been recently suggested
(Wenxin L. et al., 2003) (Cardoso G. et al., 2004)
(Keyhani A. et al., 2005) (Alcantara F.J. & Salmeron P.,
2005) (Mishra, 2006). Research on different approaches 
to the assessment and improvement of voltage stabi- 
lization in particular has proposed different solutions 
to voltage instability using neural networks (Bansilal 
et al., 2003) (Kamalasadan S., 2006) (Lin H.C., 2007). 
However, none of the existing intelligent system solu- 
tions to detecting voltage instability in power distribu- 
tion systems addresses the possibility of providing an 
artificial intelligent detector that simulates a human 
operator whose task is to detect voltage instability via 
monitoring the voltage output. 

This article suggests a novel method for detecting 
voltage instability in power distribution systems. The 
proposed system uses 3-phase voltage output images as 
its database for training and generalizing a supervised 
neural network based on the back propagation learn- 
ing algorithm. The intelligent system comprises two 
phases: the image processing phase, where voltage out- 
put images are pre-processed and meaningful features 
are obtained as the input patterns for the next phase 
which is the neural network implementation. Here, the 
supervised neural network learns to associate the volt- 
age output patterns with three possible classifications; 
namely, Stable, Unstable or Overload. 

The main objective of the proposed intelligent sys- 
tem is to provide earlier detection of voltage instability 
thus aiding a human operator. The intelligent system 
can be operated concurrently with SCADA systems 
thus enhancing the stability of the power distribution 
system. Upon the detection of voltage instability by 
the intelligent system, further measures can be taken 
to quickly sustain stability or clean voltage drop of the 
power distribution system. 



THE INTELLIGENT DETECTION 
SYSTEM 

The intelligent voltage instability detection system 
comprises two phases. Firstly, the image processing 
phase; where the PDS voltage output graph images 
are processed and feature vectors are extracted to be 
used for training and/or testing the neural network. 
Secondly, the neural network implementation phase, 
where the extracted features from the first phase are 
used as input vectors to a neural network. Our neural 
network is based on the back propagation learning 
algorithm due to its implementation simplicity, and 
the availability of sufficient database for training this 
supervised learner. 

Voltage Output Image Processing 

Training and generalizing a neural network using images
requires a sufficient number of images and meaningful
input patterns. Our database contains voltage output
images that correspond to a MATLAB-simulated power
system. Our concern is with the transient stability of
one distribution power substation whose voltage read- 
ings are taken as outputs of the circuit after simulation 
for 20 seconds, which is considered sufficient time to 
assure the simulation of the three output cases; in par- 
ticular the overload case. These outputs are graphs of 
the sinusoidal waves of voltage during the 20 seconds 
of simulation. For every second on the graph there 
are 50 full waves, which make them concentrated and 
appear like a block. 

The intelligent voltage instability detection system 
has three possible output classifications (Stable, Un- 
stable or Overload). The image database has to account 
for the three cases. For each case there are three voltage 
output graphs representing three voltage phases (a, b, 
c). A total number of 54 cases (18 stable, 18 unstable 
and 18 overload) are simulated, thus resulting in 162 
voltage output graph images which form our database. 
Figure 1 shows examples of the image database repre- 
senting the voltage output cases (stable, unstable and 
overload). 

The objective of the image processing phase is the 
extraction of meaningful patterns which form the input 
to the neural network within the intelligent system. The 
extracted patterns should distinctly represent the dif- 
ferent voltage output cases, while keeping their size to 
a minimum, in order to reduce the computational cost. 

Figure 1. Voltage output image examples (panels: Stable, Unstable, Overload)



Figure 2 shows an example of finding pixel positions 
(row numbers) at which the highest voltage value is 
recorded. The patterns are extracted and saved as a 
feature vector using the following procedure: 

- The three output voltage graphs (representing
  3-phase voltage) for every case are saved as digital
  images with a size of (540x800) pixels.
- Every image is converted to grey and then resized
  to (202x400) pixels.
- In every image and for every column starting
  from column 2 to column 201 and from first row
  to last row, the value of the pixel where the first
  grey level discontinuity occurs is found and the
  number of this row is saved in a vector.
- This saved value represents the highest voltage
  value at that column in that image.
- The process of recording the row numbers where
  the highest voltage value occurs is repeated for
  the 200 columns, thus yielding a feature vector
  with 200 values for each voltage output graph.
- As a result, each case is represented by a pattern
  or feature vector with 600 values (200 values for
  each graph, 3 graphs representing 3-phase
  voltages for each case).
- The number of patterns is equal to the number of
  cases (54 patterns).
- The 600 values within each pattern are normalized
  to values from "0" to "1" using division by
  400, which is the highest number of rows.
- The normalized patterns are then used as inputs
  to the neural network classifier for training or
  generalization.
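
The extraction procedure above can be sketched in Python with NumPy. Several details are assumptions of this sketch rather than statements of the article: a "grey level discontinuity" is taken to be the first pixel in a column whose value differs from the pixel above it, row numbers are counted from 1, a column without any discontinuity falls back to the last row, and the function names are hypothetical. The image is assumed to be already converted to grey and resized to 400 rows by 202 columns.

```python
import numpy as np


def extract_features(grey, first_col=1, n_cols=200):
    """Normalized row positions of the first grey level change per column.

    grey: 2D array (rows x columns) holding one resized grey image.
    Columns 2..201 of the text correspond to 0-based indices 1..200.
    """
    grey = np.asarray(grey)
    n_rows = grey.shape[0]
    rows = []
    for c in range(first_col, first_col + n_cols):
        column = grey[:, c]
        changes = np.nonzero(column[1:] != column[:-1])[0]
        # changes[0] is the 0-based index of the pixel just above the change,
        # so the 1-based row of the first differing pixel is changes[0] + 2.
        rows.append(changes[0] + 2 if changes.size else n_rows)
    return np.array(rows, dtype=float) / n_rows  # values in (0, 1]


def case_pattern(phase_images):
    """Concatenate the feature vectors of the three phase images of a case."""
    return np.concatenate([extract_features(img) for img in phase_images])


# Synthetic image: white background, dark waveform block from row 101 down.
demo = np.full((400, 202), 255, dtype=np.uint8)
demo[100:, :] = 0
pattern = case_pattern([demo, demo, demo])
print(pattern.shape, pattern[0])  # (600,) 0.2525
```

Storing only one row index per column is what reduces each 400 x 202 image to 200 numbers, and each three-phase case to a 600-value input pattern.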

Neural Network Topology 

The second phase in our intelligent detection system is
the implementation of the neural network which uses
the patterns that were extracted from the voltage output
image database. A total of 54 patterns (one per case,
extracted from the 162 images), each with 600
normalized values, are available for this implementation.
Training the neural network uses 30 cases (10 of
each: stable, unstable and overload); thus, 30 patterns
extracted from 90 images are used for training the
neural network. Testing or generalizing the trained
neural network uses the remaining 24 patterns,
extracted from 72 images, that represent the other 24
cases (8 of each: stable, unstable and overload).

The neural network consists of an input layer with
600 neurons receiving the normalized values in each
pattern, one hidden layer with 28 neurons, which
assures meaningful training while keeping the time cost
to a minimum, and an output layer with 3 neurons
representing the voltage output classification of stable,
unstable or overload. During the learning phase, the
learning coefficient and the momentum rate were
adjusted during various experiments in order to achieve
the required minimum error value of 0.002, which was
considered sufficient for this application. Figure 3
shows the topology of this neural network and the
image pre-processing phase.

Figure 2. Example of finding pixel positions at highest voltage values for a voltage unstable case. a- Resized
grey image, b- Pixel positions of grey level discontinuities (horizontal axis: Number of Columns)

Figure 3. The intelligent voltage instability detection system

Table 1. Neural network final training parameters

Input nodes                600
Hidden nodes               28
Output nodes               3
Learning coefficient       0.001
Momentum rate              0.33
Error                      0.002
Iterations                 12165
Training time (seconds)    963*
Run time (seconds)         0.02*

*using a 1.7 GHz PC with 256 MB of RAM, Windows XP OS and Matlab Programming Language

Table 2. Intelligent voltage instability detection results

                  Training         Testing
Stable Case       10/10 (100%)     7/8 (87.5%)
Unstable Case     10/10 (100%)     8/8 (100%)
Overload Case     10/10 (100%)     8/8 (100%)
All Cases         30/30 (100%)     23/24 (95.83%)
Total             53/54 (98.1%)
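
For readers who want to experiment with the 600-28-3 topology, the following NumPy sketch trains a network of that shape with back propagation and momentum. It is a generic textbook formulation, not the authors' MATLAB program: the sigmoid activations, the weight initialization, the mean squared error and the batch update are all assumptions, and the data used in the example are random placeholders rather than voltage patterns. The default learning coefficient, momentum rate, target error and iteration limit are the values reported in Table 1.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_network(n_in=600, n_hidden=28, n_out=3, seed=0):
    """Small random weights and zero biases (initialization is an assumption)."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0.0, 0.01, (n_in, n_hidden)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(0.0, 0.01, (n_hidden, n_out)),
            "b2": np.zeros(n_out)}


def forward(net, X):
    """One forward pass; returns hidden and output activations."""
    H = sigmoid(X @ net["W1"] + net["b1"])
    Y = sigmoid(H @ net["W2"] + net["b2"])
    return H, Y


def train(net, X, T, lr=0.001, momentum=0.33, target_error=0.002,
          max_iter=12165):
    """Batch back propagation with momentum.

    Returns the mean squared error measured at the last forward pass.
    """
    velocity = {k: np.zeros_like(v) for k, v in net.items()}
    n = X.shape[0]
    error = np.inf
    for _ in range(max_iter):
        H, Y = forward(net, X)
        error = float(np.mean((Y - T) ** 2))
        if error <= target_error:
            break
        # Error signals (gradient of the squared error up to a constant factor)
        dY = (Y - T) * Y * (1.0 - Y)
        dH = (dY @ net["W2"].T) * H * (1.0 - H)
        grads = {"W1": X.T @ dH / n, "b1": dH.mean(axis=0),
                 "W2": H.T @ dY / n, "b2": dY.mean(axis=0)}
        for k in net:
            velocity[k] = momentum * velocity[k] - lr * grads[k]
            net[k] += velocity[k]
    return error


# Placeholder data: 30 random "patterns" with one-hot class targets.
rng = np.random.default_rng(1)
X_demo = rng.random((30, 600))
T_demo = np.eye(3)[np.arange(30) % 3]
net = init_network()
print(train(net, X_demo, T_demo, lr=0.1, max_iter=200))
```

After training, classifying a new pattern takes a single call to `forward`, with the predicted class given by the index of the largest output.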

Implementation Results

The neural network learnt and converged after 12165
iterations and within 16 minutes (963 seconds), whereas
the running time for the generalized neural network after
training and using one forward pass was 0.02 seconds.
Table 1 lists the final parameters of the successfully
trained neural network. Voltage instability detection
results using the training image set (90 images
representing 30 cases) yielded 100% recognition as would
be expected. The intelligent system implementation
using the testing image set (72 images representing 24
cases that were not previously exposed to the neural
network) yielded correct voltage output classification
of 23 cases, thus achieving a 95.83% correct detection
rate. Combining testing and training image sets, an
overall recognition rate of 98.1% has been achieved.
Table 2 shows the intelligent voltage instability
detection results in detail.



FUTURE TRENDS

The proposed system provides earlier detection of
voltage instability. Upon the detection of the instability by
the intelligent system, further measures can be taken
to quickly sustain stability or clean voltage drop of the
power distribution system. Future work will include
the development of an intelligent voltage stabilizer that
reads the classification output of our proposed
intelligent detection system, and performs the necessary
measures needed to stabilize the voltage output when
an unstable or overload case is detected.



CONCLUSION 

A fast and efficient intelligent system for detecting 
voltage instability in power distribution systems has 
been developed and presented within this article. Our 
hypothesis suggested that voltage output images of 
an assumed power system could be used to train a 
supervised neural network to classify the status of the 
power system voltage output. 

The neural network within the intelligent system 
learnt within 963.4 seconds, whereas the running time 
for the generalized neural network using one forward 
pass was 0.02 seconds. The reduction of training and 
generalization time was achieved by reducing the 
number of input patterns through processing the 
voltage output images, and adopting a unique method of 
extracting the input patterns using pixel positions. Here, 
row numbers, at which grey level discontinuities occur, 
are found for each column in the voltage output graph 
and recorded for use as input patterns for the neural 
network implementation. 
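
The pixel-position extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the article does not give the grey-level threshold or say how columns without a discontinuity are encoded, so the `threshold` parameter, the `-1` marker and the function name `discontinuity_rows` are hypothetical.

```python
import numpy as np


def discontinuity_rows(image, threshold=128):
    """For each column, return the first row at which the grey level
    crosses `threshold` between vertically adjacent pixels (-1 if none)."""
    dark = np.asarray(image) < threshold        # True where the pixel belongs to the trace
    change = dark[1:, :] != dark[:-1, :]        # True where two adjacent rows differ
    rows = change.argmax(axis=0) + 1            # row just below the first change
    rows[~change.any(axis=0)] = -1              # assumed marker: no discontinuity in column
    return rows


# Tiny synthetic "graph": white background with one dark pixel per column.
demo = np.full((5, 3), 255)
demo[2, 0] = 0
demo[4, 1] = 0
print(discontinuity_rows(demo))
```

One value per image column keeps the input vector short, which is consistent with the reduction in training time reported above.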

Our intelligent voltage instability detection system 
correctly recognized all training patterns, as would be 
expected. Successful results were also obtained when 
using the testing patterns that were not exposed to the 
neural network before, yielding 95.83% correct 
detection. Table 2 showed the detection results in detail, 
where the only incorrect classification of testing 
patterns occurred with a stable case that was classified as 
an overload case. However, this single incorrect detection 
is not considered critically dangerous as it would be 
if, say, an unstable case was mistakenly classified as 
stable. 

Finally, this article has proposed a different approach 
to detecting voltage instability in PDS by simulating a 
human operator's monitoring of voltage output graphs. 
Experimental results suggest that our method performs 
well and provides a fast and efficient system for voltage 
instability detection. 



KEY TERMS 

Artificial Neural Networks (ANN): A network 
of many simple processors ("units" or "neurons") that 
imitates a biological neural network. The units are 
connected by unidirectional communication channels, 
which carry numeric data. Neural networks can be 
trained to find nonlinear relationships in data, and are 
used in various applications such as robotics, speech 
recognition, signal processing, medical diagnosis, or 
power systems. 

Back Propagation Algorithm: Learning algorithm 
of ANNs, based on minimizing the error obtained from 
the comparison between the outputs that the network 
gives after the application of a set of network inputs 
and the outputs it should give (the desired outputs). 

Blackout: A power outage (complete collapse), a 
large-scale disruption in electric power supply. 

Iterations: The number of epochs or repetitions of 
presenting a neural network with training input/output 
data. 

Learning Coefficient: A numerical value that 
defines the learning capability of a neural network 
during training. 

Momentum Rate: A numerical value that defines the 
learning speed of a neural network during training. 

Pixel: A pixel (short for picture element, using the 
common abbreviation "pix" for "picture") is a single 
point in a graphic image. 



Power Distribution System (PDS): Systems that 
comprise those parts of an electric power system be- 
tween the sub-transmission system and the consumers' 
service switches. It includes distribution substations; 
primary distribution feeders; distribution transform- 
ers; secondary circuits, including the services to the 
consumer; and appropriate protective and control 
devices. 

SCADA: A system that performs Supervisory 
Control and Data Acquisition, independent of its size 
or geographical distribution. 

Voltage Instability: Voltage instability analysis is 
concerned with assessing the inability of the power 
system to maintain acceptable voltages at all system 
buses under normal conditions and after being 
subjected to disturbances (Kundur P. et al., 2004). A 
major factor contributing to voltage instability is 
the voltage drop that occurs when active and reactive 
power flow through the inductive reactance of the 
transmission network. Voltage instability can be 
caused when a disturbance increases the reactive 
power demand beyond the sustainable capacity of 
the available reactive power resources. 



INTRODUCTION 

Several types of structures are used in Coastal Engineer- 
ing with the aim of preventing shoreline erosion, such 
as groynes, detached breakwaters, submerged breakwa- 
ters, etc. Submerged breakwaters have the advantage 
of their minimal visual impact, which has made them 
ever more popular (Chang & Liou, 2007). 

When the incoming waves impinge on a submerged 
breakwater, a process of energy transformation occurs. 
Many laboratory and numerical studies have been carried 
out in order to investigate this process (Kobayashi 
& Wurjanto, 1989) (Losada, Losada & Martin, 1995) 
(Losada, Silva & Losada, 1996) (Liu, Lin, Hsu, Chang, 
Losada, Vidal & Sakakiyama, 2000). The energy of 
the incident wave is transformed as follows: (i) one 
part of this energy is transmitted above the crest of the 
structure and — in the case of permeable submerged 
breakwaters — through its interior; (ii) another part is 
dissipated by wave breaking and by friction with the 
structure during the transmission process and finally, 
(iii) the remaining energy is reflected seaward. 

The reflection level is related to the scour in 
front of the structure. Therefore, a good knowledge 
of the reflection process may be helpful in order to 
avoid or at least mitigate the possible problems in the 
structure foundations. However, due to the complexity 
of the problem, the influence of all the relevant param- 
eters (the structure slope and submergence, the water 
depth, the wave period and height, etc.) is not entirely 
understood yet and new approaches are needed. 

In this work, an Artificial Neural Network (ANN) 
has been applied to a series of results obtained from 
a previous study of Taveira-Pinto (2001), in which 
several physical models were tested. Once trained and 
validated, the ANN has been used to estimate the wave 
reflection coefficient. 



BACKGROUND 

ANNs have proved to be a very powerful and versa- 
tile Artificial Intelligence technique (Orchad, 1993) 
(Haykin, 1999). In fact, they have been successfully 
applied to a great number of areas, including system 
identification and control, pattern recognition, data 
processing, time series prediction, modelling, etc 
(Rabunal, Dorado, Pazos, Pereira & Rivero, 2004) 
(Rabunal & Dorado, 2005). 

In Civil Engineering, ANNs have been used most 
notably in Hydrology (Govindaraju & Rao, 2000) 
(Maier & Dandy, 2000) (Dawson & Wilby, 2001) 
(Cigizoglu, 2004). With regard to Ocean Engineering, 
ANNs have been applied to breakwater stability 
(Mase, Sakamoto & Sakai, 1995) (Medina, Garrido, 
Gomez-Martin & Vidal, 2003) (Kim & Park, 2005) 
(Yagci, Mercan, Cigizoglu & Kabdasli, 2005), wave 
forecasting (Tsai, Lin, & Shen, 2002) and tide 
forecasting (Lee & Jeng, 2002). 



ESTIMATION OF THE REFLECTION 
COEFFICIENT AT SUBMERGED 
BREAKWATERS 

ANN Model 

An Artificial Neural Network (Lippmann, 1987) 
(Haykin, 1999) is an information-processing system 
consisting of an interconnected group of many simple 
process elements. These elements, also called neural 
units or neurons, work together in a way similar to 
biological neurons in the brain. The input is presented 
to the input neurons and propagated through the 
whole network until eventually some kind of output 
is produced. 

In this work, a FeedForward Backpropagation 
network (FFBP) has been used. FFBP networks are 
composed of different layers of neurons linked by 
means of feedforward connections and trained by a 
back-propagation algorithm. Feedforward means that 
the output of a given neuron is used as an input to the 
neurons of the following layer, so there are no feedback 
loops. In this case, a network with two neuron layers, a 
logarithmic sigmoid hidden layer and a linear output 
layer, has been adopted. 
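
A forward pass through such a two-layer network can be sketched as below. The article does not report the number of hidden neurons or any weight values, so the layer size (5) and the random weights here are placeholders chosen only for illustration; the four inputs and single output match the inputs and output described later in the text.

```python
import numpy as np


def logsig(x):
    """Logarithmic sigmoid activation used in the hidden layer."""
    return 1.0 / (1.0 + np.exp(-x))


def ffbp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Propagate one input vector through the hidden and output layers."""
    hidden = logsig(w_hidden @ x + b_hidden)   # log-sigmoid hidden layer
    return w_out @ hidden + b_out              # linear output layer


# Placeholder sizes: 4 inputs, 5 hidden neurons (assumed), 1 output.
rng = np.random.default_rng(0)
w_h, b_h = rng.normal(size=(5, 4)), rng.normal(size=5)
w_o, b_o = rng.normal(size=(1, 5)), rng.normal(size=1)
print(ffbp_forward(np.array([0.1, 0.2, 0.8, 2.0]), w_h, b_h, w_o, b_o))
```

Because information only flows from one layer to the next, the whole pass is two matrix products and one activation.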

The adjustment of the network weights in order to 
reduce the error is carried out by means of the 
back-propagation algorithm (Freeman and Skapura, 1991; 
Johansson et al., 1992). The error, i.e., the difference 
between the network output and the target (the expected 
output), is propagated backwards through the network, 
up to the input layer, while the weights are adjusted. 
This process is repeated until either the error is lower 
than a threshold or a maximum number of iterations 
is reached. 

The ANN was trained by means of the Bayesian 
Regularisation method (MacKay, 1992), known to be 
effective in avoiding overfitting. 

Experimental Set Up 

The data used for training and testing the ANN were 
obtained in laboratory tests of submerged breakwaters 
(Taveira-Pinto, 2001), carried out in the unidirectional 
wave tank of the Hydraulics Laboratory of the Faculty 
of Engineering of the University of Porto. The wave 
tank is 24.5 m long and 4.8 m wide with a maximum 
water depth of 0.40 m at the test section. The wave 
generator is a piston-type paddle, capable of generating 
regular and irregular waves. At the opposite end, 
a wave-absorbing gravel "beach" with a slope ratio of 
1:20 was used in order to minimize the wave reflection 
level in the tank. 

Six different impermeable models, constructed with 
wooden panels, were tested with different geometries 
(Fig. 1) at a 1:100 scale. Model height (h) was equal 
to 0.20 m in all cases (20 m in prototype). Different 
slopes (1:n) from 1:1 to 1:5 and two different crest 
widths (B), 0.05 m and 0.10 m, were tested. 

Water surface displacements were measured using 
twin wire conductivity wave probes placed at different 
points in the wave tank. In order to evaluate the reflec- 
tion coefficient (R), three wave probes were located 
on a line parallel to the wave direction. The spectral 
analysis method (Gilbert & Thompson, 1978), based 
on the Kajima (1969) method, was used to separate the 
incident and the reflection components. 

A total of 275 tests were conducted with different 
water depths and irregular wave conditions. Water 
depths (d) from 0.20 m to 0.215 m, leading to 
freeboards (R_c) in the range 0 m to -0.015 m (negative for 
the breakwater crest below the still water level), were 
used during the tests. Irregular waves were generated 
conforming to the JONSWAP spectrum. Significant 
wave heights (H_s) from 2 cm to 8 cm and peak wave 
periods (T_p) from 0.8 s to 1.25 s were tested. 

In order to carry out the training and the testing 
process of the ANN, the data were randomly divided 
into a training data set (184 tests, or 67%) and a testing 
data set (91 tests, or 33%). 

Figure 1. General layout of the testing models 
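
A random division of the 275 tests into 184 training and 91 testing cases could be reproduced along these lines. The article does not describe the random procedure or any seed, so the shuffling approach, the `seed` argument and the function name `split_tests` are assumptions made for the sketch.

```python
import random


def split_tests(n_tests=275, n_train=184, seed=0):
    """Shuffle test indices and cut them into training and testing subsets."""
    indices = list(range(n_tests))
    random.Random(seed).shuffle(indices)       # seeded for repeatability (assumed)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_tests()
print(len(train_idx), len(test_idx))
```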

Both the geometrical parameters of the model and 
the wave spectrum parameters were introduced as inputs 
to the ANN by means of the following dimensionless 
numbers: 



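i. relative freeboard 

ii. k_p B (relative crest width) 

iii. k_p d (relative water depth) 

iv. n (slope) 

where k_p is the peak wavenumber obtained from the 
following expression (T_p is the peak wave period, d the 
water depth and g the acceleration due to gravity): 

$$\frac{4\pi^2}{T_p^2} = g\,k_p \tanh(k_p d)$$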

The output of the ANN is the reflection coefficient 
R defined as the ratio between the reflected significant 
wave height and the incident significant wave height. 

Training and Testing Process 

The MSE obtained in the training and the testing of 
the ANN was 4.3 x 10^-5 and 5.2 x 10^-4 respectively. 
The small value of the testing error indicates the ANN's 
ability to generalize the knowledge acquired from the 
training data. 



After the training process, the equation of the best 
linear fit to the data (Fig. 2) was y = 0.9967x + 0.0011, 
very close to the would-be perfect y = x. The value of 
the correlation coefficient R^2 = 0.9973 was also very 
good. 

As was to be expected, the results of the testing 
process are slightly worse than those corresponding 
to the training process (Fig. 3). Nevertheless, both the 
coefficients of the regression equation (y = 1.0364x 
- 0.0284) and the correlation coefficient (R^2 = 0.9853) 
are very good. 
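
A regression check of this kind can be sketched as follows. The article does not state which variable is on each axis or how R^2 was computed, so this sketch assumes the measured reflection coefficients are x, the ANN estimates are y, and R^2 is the squared Pearson correlation; the numbers in the example are made up purely to exercise the function.

```python
import numpy as np


def regression_summary(measured, estimated):
    """Return slope and intercept of the best linear fit, and R^2."""
    slope, intercept = np.polyfit(measured, estimated, 1)
    r = np.corrcoef(measured, estimated)[0, 1]  # Pearson correlation coefficient
    return slope, intercept, r ** 2


# Made-up values for illustration only, not data from the study.
measured = np.array([0.10, 0.20, 0.30, 0.40])
estimated = np.array([0.11, 0.19, 0.31, 0.41])
print(regression_summary(measured, estimated))
```

A slope near 1 and an intercept near 0 correspond to the "would-be perfect" line y = x mentioned above.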

ANN Application 

Once trained and validated, the ANN was applied to 
analyze the influence of each input on the reflection 
process. In Fig. 4, the relative water depth (k_p d) is on the 
abscissa, and the reflection coefficient on the ordinate. 
Each curve corresponds to a constant freeboard value. 
The other two inputs were kept constant, the relative 
crest width at 0.2, and the slope at 1:2. 

The reflection coefficient decreases as the relative 
water depth (k_p d) increases. This means that, for a given 
water depth, the shorter the wave period, the lower the 
reflection coefficient, in accordance with experience, 
which states that long waves are related to higher 
reflection coefficients than short waves. 

As for the relative freeboard, a smaller value (meaning 
more water over the breakwater crest) leads to a smaller 
reflection coefficient. In effect, the transmission process 
becomes more efficient with higher submergence levels, 
leaving less energy for reflection. 

Figure 2. Regression analysis with the training data 

Figure 3. Regression analysis with the testing data 

Figure 4. Variation of the reflection coefficient with the relative freeboard 

The second parameter analyzed was the breakwater 
slope (Fig. 5). The influence of this input on the 
reflection coefficient can be easily explained. The limits of 
the slope value can be linked with a vertical wall (n → 0) 
and a horizontal beach (n → ∞). In the first case, the 
reflection is perfect and the reflection coefficient would 
be equal to 1. In the second, the reflection coefficient 
goes down to zero as the beach slope tends towards 
the horizontal. These trends are clearly shown on the 
graph. 

Figure 5. Variation of the reflection coefficient with the breakwater slope (1:n) 



FUTURE TRENDS 

ANNs still have a long way to go in Ocean Engineering 
applications. Both the stochastic nature of the wave 
action and the complexity of the energy transformation 
processes occurring when waves impinge on a 
structure lead to very intricate problems, which lend 
themselves very well to ANNs. Hence it is to be hoped 
that the number of applications will keep increasing 
in the coming years. 



CONCLUSION 

The estimation of the reflection coefficient at a 
submerged breakwater under the action of irregular 
waves is a very difficult task due to the great number 
of parameters involved: geometry and nature of the 
structure, water depth, significant wave height, wave 
period, etc. In this work, the behaviour of submerged 
breakwaters under the action of irregular waves was 
analyzed by means of a Feed-Forward Backpropagation 
network (FFBP), trained and tested on the basis 
of laboratory tests. The ANN model was shown to fit 
the results of the physical model tests very closely. The 
reflection coefficient diminished as the relative water 
depth increased. The effect of the model geometry is as 
follows. A decrease in the relative freeboard, meaning a 
higher water level over the structure crest, brings about 
less reflection. As for the seaward slope, the reflection 
coefficient decreases as the slope becomes gentler 
(larger n). The curves drawn using the ANN model not 
only help interpret these trends, but are also a useful 
tool for the design engineer. 



KEY TERMS 

Artificial Neural Networks: Interconnected set 
of many simple processing units, commonly called 
neurons, that use a mathematical model representing 
an input/output relation. 

Back-Propagation Algorithm: Supervised learn- 
ing technique used by ANNs that iteratively modifies 
the weights of the connections of the network so the 
error given by the network after the comparison of the 
outputs with the desired one decreases. 

JONSWAP Spectrum: Wave spectrum typical of 
growing deep water waves developed from field 
experiments and measurements of waves and wave spectra 
in the Joint North Sea Wave Project. 

Peak Period: The wave period determined by the 
inverse of the frequency at which the wave energy 
spectrum reaches its maximum. 

Reflection: The process by which the energy of the 
incoming waves is returned seaward. 

Significant Wave Height: In wave record analysis, 
the average height of the highest one-third of a selected 
number of waves. 

Submerged Breakwater: Coastal protection 
structure crowned at, or below, the still water level. 



INTRODUCTION 

The issue of rewarding partially correct answers has 
been addressed by many authors (Guzman, E. & Conejo, 
R., 2004; Gardner-Medwin, A.R., 1995; Huffman, D., 
Goldberg, E., & Michlin, M., 2003). Intelligent systems 
have been designed to assign scores related to the 
importance of missing or incorrect parts of an answer. 
Such systems are meant to facilitate the process of 
knowledge assessment. While trying to be efficient in 
evaluating students' responses, these systems operate 
with the answers to a single question addressing learning 
a new term, understanding a new concept or mastering 
a new skill. However, experimental practice shows that 
asking several questions about the same item results 
in inconsistent and/or incomplete feedback, i.e. some 
of the answers are correct while others are partially 
correct or even incorrect. 

A large number of computer-based systems, and 
thus automated assessment systems, lack the ability to 
reason with inconsistent information. Such a situation 
occurs when, for example, a student answers two questions 
about one item and one of the answers is correct and 
the other one is incorrect or missing. Reasoning by 
applying classical logic cannot solve the problem because 
the presence of contradiction leads to trivialization, 
i.e. anything follows from 'correct and incorrect' and 
thus all inconsistencies are treated as equally bad 
(Priest, 2001). 

In this paper we discuss how to assess students' 
understanding of new terms and concepts, shortly after 
they have been introduced in a subject. Application of 
many-valued logic allows the system to give 
meaningful responses in the presence of inconsistencies. 
Decision-making rules that an intelligent agent applies 
for assessing students' understanding of new terms 
and concepts are presented. Such rules distinguish 
between students' hesitation in the process of giving 
an answer and lack of knowledge. We propose use of 
the generalized Lukasiewicz's logic in a Web-based 
assessment system as a way of resolving problems with 
inconsistent and/or incomplete input. 



BACKGROUND 

A brief overview of a six-valued logic, which is a 
generalized Kleene's logic (Kleene, S., 1952), was 
first presented by Moussavi, M. & Garcia, N., 
1989. Fitting, 1991 developed this logic further by 
assigning probability estimates to formulas instead of 
non-classical truth values. 

The six-valued logic distinguishes two types of 
unknown knowledge values - permanently or eternally 
unknown value and a value representing current lack 
of knowledge about a state (Garcia, O.N. & Moussavi, 
M., 1990). 

Two kinds of negation, weak and strong negation 
are discussed in Wagner, G., 1994. Weak negation or 
negation-as-failure refers to cases when it cannot be 
proved that a sentence is true. Strong negation or con- 
structable falsity is used when the falsity of a sentence 
is directly established. 

The semantic characterization of a four-valued 
logic for expressing practical deductive processes is 
presented by Belnap N.J., 1977. In Gurfinkel, A. & 
Chechik, M. 2005, it is shown that additional reasoning 
power can be obtained without sacrificing performance, 
by building a prototype software model-checker using 
Belnap's logic. 

Bi-dimensional systems representing and reasoning 
with temporal and uncertainty information have 
also appeared in Felix, R, Fraga, S., Marin, R., & 
Barro, S., 1999, and Mulsliner, D.J., Durfee, E.H., 
Shin, K.G., 1993. 

A level-based instruction model is proposed by Park, 
C, & Kim, M., 2003. A model for student knowledge 
diagnosis through adaptive testing was developed by 
Guzman, E. & Conejo, R., 2004. An approach for 
integrating intelligent agents, user models, and 
automatic content categorization in a virtual environment 
is presented by Santos, C.T., & Osorio, F.S., 2004. 

The Questionmark system at the University of Leeds 
applies multiple response questions where a set of 
options is presented following a question stem and 
the student can select any number and combination of 
those options. They are significantly more complex than 
multiple choice questions where the student can select 
only one among the suggested options. If a student 
marks some of the correct options (but not all) and/or 
some of the incorrect options, his/her response can be 
correct, incorrect, partly correct or partly incorrect. The 
final outcome is correct or incorrect because the system 
is based on Boolean logic (Goodstein, R. L., 2007). 



MAIN FOCUS OF THE CHAPTER 

The test consists of two questions. According to the 
result of a test, understanding of a term or concept is 
achieved if a student gives a correct answer to ques- 
tions about that term or concept. Such tests are placed 
after a new term or concept has been introduced in the 
theoretical part of a tutoring system. Questions in such 
tests should provide information about the student's 
knowledge and about the subtler qualities of 
discrimination, judgement, and reasoning necessary in 
scientific reasoning. They should also evaluate the 
student's judgement as to whether cause and effect 
relationships exist, and the student's comprehension of 
a described situation. 

Understanding of a Term 

For evaluating understanding of a single term we pro- 
pose a test where the choices can result in a correct 
answer, incorrect answer or unanswered question. 

Two correct answers imply understanding of that 
particular term. The process of questioning is 
terminated. 

One correct answer and one unanswered question 
imply some doubt about the student's understanding 
of that particular term. The system first provides 
additional explanations and then suggests 
to the student to answer one new question taken 
from the database. 

One correct answer and one incorrect answer 
imply doubt about the student's understanding 
of that particular term. The system first provides 
additional explanations and then suggests to the 
student to answer two questions where one new 
question is taken from the database and the other 
question is taken from the first trial and has 
received an incorrect answer. 

Two unanswered questions imply uncertainty 
about the student's understanding of that 
particular term. The system first provides additional 
explanations and then suggests two new questions 
taken from the database. 

One incorrect answer and one unanswered question 
imply doubt about the student's understanding 
of that particular term. The system first provides 
additional explanations and then suggests to the 
student to answer the same questions. 

Two incorrect answers imply lack of understanding 
of that particular term. The system first provides 
additional explanations and then suggests 
to the student to answer the same questions plus 
one new question taken from the database. 
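
The six rules for a single term can be written as a lookup table keyed by the unordered pair of answers. This is only a sketch of the rules as stated in the text, not the authors' implementation; the names `Answer`, `FollowUp`, `RULES` and `assess_term` are hypothetical, and the verdict strings simply repeat the wording of the rules.

```python
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNANSWERED = "unanswered"


@dataclass(frozen=True)
class FollowUp:
    verdict: str      # what the pair of answers implies
    explain: bool     # whether additional explanations are shown first
    repeat: int       # first-round questions asked again
    new: int          # new questions taken from the database


C, I, U = Answer.CORRECT, Answer.INCORRECT, Answer.UNANSWERED

# Keys are pairs sorted by value, so the order of the two answers is irrelevant.
RULES = {
    (C, C): FollowUp("understanding", False, 0, 0),
    (C, U): FollowUp("some doubt", True, 0, 1),
    (C, I): FollowUp("doubt", True, 1, 1),
    (U, U): FollowUp("uncertainty", True, 0, 2),
    (I, U): FollowUp("doubt", True, 2, 0),
    (I, I): FollowUp("lack of understanding", True, 2, 1),
}


def assess_term(first, second):
    """Return the follow-up action for the two answers of a term test."""
    key = tuple(sorted((first, second), key=lambda a: a.value))
    return RULES[key]


print(assess_term(Answer.UNANSWERED, Answer.CORRECT))
```

Three possible responses to two questions give exactly six unordered pairs, which is why the table has six entries.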

If the second set of responses contains an incor- 
rect answer and/or unanswered questions the system 
advises the student to work more with the originally 
provided learning materials and terminates the auto- 
mated questioning process. We believe that several 
rounds of questioning would make the learning process 
time consuming for the student and thus disturb the 
learning flow. 

However, the student can start a new assessment 
of his/her understanding of that particular term at any 
time he/she wants. 

Understanding of a Concept 

For evaluating understanding of a concept we propose a 
test with two questions where the choices can result in 
correct answer, partially correct answer, wrong answer 
or unanswered question. 

Two correct answers imply understanding of the 
concept. The process of questioning is terminated. 

One correct answer and one partially correct 
answer imply doubt about the student's understanding 
of the concept. The system first provides 
additional explanations and then suggests to the 
student to answer the same question that has 
received a partially correct answer. 

One correct answer and one unanswered question 
imply doubt about the student's understanding of 
the concept. The system first provides additional 
explanations and then suggests to the student to 
answer one new question taken from the database. 

One correct answer and one wrong answer imply 
some doubt about the student's understanding of 
the concept. The system first provides additional 
explanations and examples, and then suggests to 
the student to answer again the question that 
has previously received a wrong answer and one 
new question taken from the database. 

Two partially correct answers imply doubt about 
the student's understanding of the concept. The 
system first provides additional explanations, 
selected theory and examples, and then suggests 
to the student to answer the same questions. 

One partially correct answer and one unanswered 
question imply doubt about the student's understanding 
of the concept. The system first provides 
additional explanations and then suggests to the 
student to answer two new questions taken from 
the database. 

One partially correct answer and one wrong answer 
imply doubt about the student's understanding of 
the concept. The system first provides additional 
explanations, selected theory and examples, and 
then suggests to the student to answer two new 
questions taken from the database. 

One wrong answer and one unanswered question 
imply doubt about the student's understanding of 
the concept. The system first provides additional 
explanations, selected theory and examples, and 
then suggests to the student to answer the question 
that has previously received a wrong answer 
and a new question taken from the database. 



Tests with a Larger Number of Questions 

A test with three questions where the possible 
responses are correct answer, incorrect answer and a 
partially correct answer would require fifteen-valued 
logic. The generalized Lukasiewicz's logic provides 
a solution for a test with any number of questions and 
answer options. 
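
For reference, the truth values and standard connectives of an n-valued Lukasiewicz logic can be sketched as below. The connectives shown (negation 1 - x, implication min(1, 1 - x + y), and min/max for weak conjunction and disjunction) are the usual textbook definitions; how the assessment system maps combinations of answers onto these truth values is not specified in the text, so no such mapping is attempted here.

```python
from fractions import Fraction


def truth_values(n):
    """The n evenly spaced truth values 0, 1/(n-1), ..., 1."""
    return [Fraction(i, n - 1) for i in range(n)]


def neg(x):
    return 1 - x


def implies(x, y):
    return min(Fraction(1), 1 - x + y)


def conj(x, y):
    return min(x, y)      # weak (lattice) conjunction


def disj(x, y):
    return max(x, y)      # weak (lattice) disjunction


half = Fraction(1, 2)
print(truth_values(3), implies(half, Fraction(0)), implies(half, half))
```

Exact fractions are used instead of floats so that the evenly spaced values compare equal without rounding issues.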

System Architecture 

The system implementation uses the so-called 
LAMP Web server infrastructure and deployment 
paradigm. It is a combination of free software tools 
of an Apache Web server, a database server and a 
scripting programming platform on a Linux operating 
environment. 

Behind this traditional three-tier Web deployment 
is a service support sub-system. A communication 
framework based on XML-RPC is used to connect 
the Web application middleware and the intelligent 
assessment/diagnostic system together. The separation 
of these two units made it possible to modularly 
design and implement the system as loosely coupled 
independent sub-systems. 
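
To make the XML-RPC link concrete, the sketch below encodes and decodes one call with Python's standard `xmlrpc.client` module, without opening a network connection. The chapter does not name the scripting language, the remote method or the message fields, so Python, the method name `agent.report_event` and the keys of `event` are all hypothetical.

```python
import xmlrpc.client

# Hypothetical event that the middleware might report to an agent.
event = {
    "user": "student42",
    "term": "pixel",
    "answers": ["correct", "unanswered"],
}

# Encode the call as the XML body that would be sent over HTTP.
payload = xmlrpc.client.dumps((event,), methodname="agent.report_event")

# Decode it again, as the receiving side would.
params, method = xmlrpc.client.loads(payload)
print(method, params[0])
```

Because both sides only agree on the XML message format, the middleware and the agents can be implemented and deployed independently, which matches the loose coupling described above.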

The dynamic page publisher compiles a page to be 
presented to the user from a template file in relation to 
the user response, current state variables and activities 
history. A template file contains the static declarations 
of a document. The variables in a particular template 
file are given values by the dynamic page publisher 
module during the production of an HTML document. 
The resulting HTML document is sent back to the 
user Web browser. This module also acts as a handler 
when a user requests a page or sends a form back to 
the Web server. 

The users stack profiler keeps track of user activity 
history in a stack-like data structure in the database. 
Each event, for example the response/result of a test or 
a change of learning flow after following a hint given 
by the system, is stored in the database. This module 
provides the percept to the intelligent modules of the 
software agents' sub-system. The users stack profiler 
communicates directly with the agents by sending 
messages over the XML-RPC communication channel. By 
using some common data stored in the database, the 
users stack profiler indirectly affects the behaviour of 
the user's agents and vice versa. 

The application middleware and the software 
agents run independently of each other. As such, they 
can be situated on different servers. The middleware 
implements the Web side of the system while the software 
agents implement the decision side of the user's learning 
process. Given a certain response to a particular test 
at a particular user state, what is the best action to take 
to increase the probability that the user will learn a 
particular unit of knowledge? This decision is made 
by the intelligent diagnostics agent. 

The intelligent assessment agent performs an early 
diagnosis of the absorption of knowledge. A response 
given by a particular student to a test will give the 
system an indication about the state of learning of a 
particular term. This agent helps to implement a part 
of an intelligent tutoring system, which differs from the 
intelligent diagnostics agent. The intelligent assessment 
agent facilitates students' early absorption/assimilation 
of new terms. 



FUTURE TRENDS 

The proposed assessment system is based on many- 
valued logic. Further research is needed to investigate 
which of the available non-classical logics can provide 
more accurate assessments according to what is the 
subject of that assessment - knowledge of a term, un- 
derstanding of a concept, level of mastered skill, etc. 
Another important area for future work involves 
recommendation of hints, explanations, examples, 
and theory, tailored to each student's responses and 
needs. 



CONCLUSION 

This paper is devoted to assessing students' 
understanding of new terms and concepts. The presented 
framework provides flexibility in the choice of logic 
and can serve as an effective exploration tool for 
reasoning about many combinations of input coming from 
different sources. 



KEY TERMS 

Belnap's Logic: It has four truth values: T, F, Both, 
and None. The meaning of these values can be described 
as follows: an atomic sentence is stated to be true only 
(T), an atomic sentence is stated to be false only (F), an 
atomic sentence is stated to be both true and false, for 
instance, by different sources, or at different points in 
time (Both), and an atomic sentence's status is unknown, 
that is, neither true nor false (None). 



Kleene's Logic: Kleene's logic has three truth val- 
ues, truth, unknown and false, where unknown indicates 
a state of partial vagueness. These truth values represent 
the states of a world that does not change. 

LAMP Web Server: It is a combination of free 
software tools of an Apache Web server, a database 
server and a scripting programming platform on a 
Linux operating environment. 

Lukasiewicz's Three-Valued Logic: Lukasiewicz's
three-valued logic has a third value, 1/2, attached
to propositions referring to future contingencies. The 
third truth value can be construed as 'intermediate' or 
'neutral' or 'indeterminate'. 

Lukasiewicz's Generalized Logic: A generalization
obtained by inserting evenly spaced division points in
the interval between 0 and 1.

Six-Valued Logic: The six-valued logic, obtained as an
extension of Kleene's logic, has six truth values: true,
false, unknown, unknown_t (an intermediate level of
truth between unknown and true), unknown_f (an
intermediate level of truth between unknown and false),
and contradiction.

XML-RPC: It is remote procedure calling using 
HTTP as the transport and XML as the encoding. 
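
The truth values behind two of the logics defined above can be
sketched in a few lines of Java. This is only an illustration: it
assumes Kleene's strong connectives (conjunction as minimum,
disjunction as maximum), which the definitions above do not spell
out, and the names Kleene, LukasiewiczValues and ManyValuedDemo are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Kleene's three-valued logic; constants are declared in increasing order of truth. */
enum Kleene {
    FALSE, UNKNOWN, TRUE;

    /** Conjunction (assumed strong Kleene): the minimum of the two truth values. */
    Kleene and(Kleene other) {
        return ordinal() <= other.ordinal() ? this : other;
    }

    /** Disjunction (assumed strong Kleene): the maximum of the two truth values. */
    Kleene or(Kleene other) {
        return ordinal() >= other.ordinal() ? this : other;
    }

    /** Negation swaps TRUE and FALSE and leaves UNKNOWN unchanged. */
    Kleene not() {
        return values()[2 - ordinal()];
    }
}

/** Truth values of Lukasiewicz's generalized n-valued logic. */
final class LukasiewiczValues {
    private LukasiewiczValues() { }

    /** Returns the n evenly spaced truth values 0, 1/(n-1), ..., 1 (n >= 2). */
    static List<Double> truthValues(int n) {
        if (n < 2) {
            throw new IllegalArgumentException("n must be at least 2");
        }
        List<Double> values = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            values.add((double) k / (n - 1));
        }
        return values;
    }
}

class ManyValuedDemo {
    public static void main(String[] args) {
        System.out.println(Kleene.TRUE.and(Kleene.UNKNOWN));   // UNKNOWN
        System.out.println(Kleene.FALSE.or(Kleene.UNKNOWN));   // UNKNOWN
        System.out.println(Kleene.UNKNOWN.not());              // UNKNOWN
        System.out.println(LukasiewiczValues.truthValues(3));  // [0.0, 0.5, 1.0]
    }
}
```

With n = 3 the generalized logic reduces to the three-valued case,
whose middle value is 1/2.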



INTRODUCTION

Nowadays, Information Systems (IS) are designed to
control the execution of individual tasks, which allows
coordinating, monitoring, and supporting the logistical
aspects of a business process. In other words, the IS
has to manage the flow of work through the organization.

Workflow management is a critical issue for enterprise
competitiveness. Many companies have realized that the
business processes (BP) within their organizations, and
between the companies and their partners, have not been
clearly described, and that there are not enough
techniques and methods to automate these processes.

The Workflow Management Coalition (WFMC) 
states that workflow (WF) is concerned with the auto- 
mation of procedures where documents, information, 
or tasks are passed to the participants according to a 
defined set of rules to achieve, or contribute to, an 
overall business goal (WfMC, 1999). Another defini- 
tion of WF can be found in (Rusinkiewicz & Seth, 
1994) where workflows are activities involving the 
coordinated execution of multiple tasks performed by 
different processing entities (persons or machines). A 
task or process involves a piece of work and a process 
entity which executes the work. 

Workflow Management (WFM) is a fast evolving 
technology which is increasingly being exploited 
by businesses in a variety of industries. Its primary 
characteristic is the automation of processes involving 
combinations of human and machine-based activities 
(Aalst & Hee, 2002), (Aalst, 1998). 

A Workflow Management System (WFMS) provides 
procedural automation of a business process by man- 
agement of the sequence of work activities and the 
invocation of appropriate human and/or IT resources 
associated with the various activity steps. Although
the most prevalent use of WFMS is within the office
environment in staff-intensive operations such as
insurance, banking, legal and general administration,
it is also applicable to some classes of industrial and
manufacturing applications (WfMC, 1995). A WFMS needs
to integrate other technologies, such as agent
technology, which provides flexible, distributed, and
intelligent solutions for business process management.
This work presents a methodology for mobile agent-based
WFMS development. The proposed methodology consists of
a modular and gradual specification of the system, in
which a mobile agent guides the process through the
organizational units and executes different tasks.
Several mobile agents evolve through the system,
concurrently executing their assigned tasks.



BACKGROUND 
Workflow Management 

The notion of agent in (Yuhong, Zakaria & Weiming, 2001)
is used as "a computer system situated in some
environment, which is capable of autonomous action
in this environment in order to meet its design
objectives" (different notions can be found in
(Wooldridge, 2002) and (Nwana, 1996)). These works also
highlight the benefits of applying agent technology to
business process management, among them: a distributed
system architecture; the inherent autonomy of software
agents, since an agent can start a WF on an event
trigger; and agent reactivity, since an agent is able
to generate alternative execution paths. An
intelligent agent is capable of autonomous operation
and flexible behavior in order to meet its design goals
and also has the properties of reactivity, pro-activity,
and social ability (Wooldridge, 2001).

In other works both concepts are integrated. In
(Repetto, Paolucci & Boccalette, 2003), a methodology
for the design of agent-based WF was presented; it
consists of three steps. In the first step the authors
model the BP with UML Activity diagrams by identifying
all the necessary resources and activities. In the
second step, all the activities identifying roles in
parallel paths are grouped. Finally, they define an
agent for each group.

Several researchers have adopted agent technology to
improve WF applications. In (Marin & Brena, 2005), an
architecture for high-level agent-based WF is proposed.
In this architecture the WF execution and the process
flow control are broken down into small execution units
handled by intelligent agents, so that a WF process is
controlled in a decentralized way.

A collaborative approach for workflow systems is
presented in (Savarimuthu & Purvis, 2004), where agents
collaborate by forming social networks (societies). In
(Savarimuthu, Purvis & Fleurke, 2004), agents are
embedded in a system that can monitor and control the
overall functioning of a workflow process in an
agent-based WF system.

In (Minhong, Huaiquing & Dongming, 2005), agent
technology is used for WF monitoring, where various
intelligent agents work together to perform flexible
monitoring tasks in an autonomous and collaborative way.

Multi-Agent Systems 

Mobile agents are autonomous programs that can
travel from one computer to another under their own
control. They offer a robust and efficient framework
for developing distributed applications, including
mobile applications.

A stationary agent is executed only on the system 
where it began its execution. If it requires information 
from a different system or needs to interact with another 
agent, it uses a standard client-server communication 
(RMI, RPC, CORBA). 

A mobile agent (MA) is not bound to the system where
it starts its execution; rather, it can move through
the network nodes where it is allowed, possibly
modifying its execution environment. The MA carries
its current state and its code with it (strong
mobility). Furthermore, MAs may exhibit several
advantageous features due to mobility, for example:
a) interaction with a resource by migrating to the
resource location, which saves bandwidth and reduces
network latency (Cabri, Leonardi & Zambonelli, 1998);
b) interaction with users by migrating to the user
location, which gives faster answers to user requests.
In both cases the agent continues the interaction with
the resource or the user even during temporary network
connection failures.

Most distributed applications fit naturally into the
MA model because agents can migrate sequentially
through a computer network, send other agents to visit
computers in parallel, remain stationary and interact
with remote resources, etc.

There exist several organizations with the aim of 
establishing standards for agent software develop- 
ment and agent interoperability. One of them is FIPA 
(FIPA, 1997). 

JADE Development Tool 

JADE (Java Agent Development Framework) is a software
framework fully implemented in the Java language. It
simplifies the implementation of MA systems through a
middleware that complies with the FIPA specifications
(FIPA, 1997). The agent platform can be distributed
across machines (which do not necessarily share the
same OS), and the configuration can be managed through
a remote GUI (Bellifemine, Caire, Trucco & Rimassa,
2006).

The communication architecture offers flexible and
efficient message passing: JADE creates and manages a
private queue of incoming ACL messages for each agent.
The complete FIPA communication model has been
implemented and its components have been clearly
distinguished. JADE fully integrates interaction
protocols, ACL, ontologies, transport protocols, etc.
Most of the FIPA-defined protocols are available in
JADE.



MOBILE AGENT-BASED WFMS 
DEFINITION 

This work presents a methodology for the development
of mobile agent-based WFMS. The basic idea behind such
a system is that an MA guides the workflow process
through the different organizational units, in which
several tasks are executed according to the handled
case.

During the design phase the components are described
in a clear and compact way. The system is described as
a set of interconnected organizational units that have
a specific resource allocation. The agent behavior is
determined by two kinds of specifications: a) the
description of the general behavior and common
knowledge shared by all the agents, namely basic
operations and interaction protocols (collaboration
and resource competition); and b) particular
descriptions of a specific behavior, such as the task
plan and an accessibility roadmap, which describe the
assigned process and the permitted access to the
organizational units, respectively. This strategy is
illustrated in Figure 1.

Figure 1. Structuring the design of a WfMS

The implementation phase is supported by a software
development guideline that allows Java components
(built on the JADE middleware) to be defined from the
agent system specification produced in the design
phase. The resulting software is distributed over a
set of networked computers that manage MA migration.
Its modularity allows the system to be adapted to
specification changes without difficulty.

For the sake of readability the proposed method 
is illustrated through a case study dealing with claim 
processes in an insurance company. 

Case Study Description 

The problem can be defined as follows: "Define the
WF for the claim processes in an insurance company
in which a customer claims the insurance policy of a
personal property (real estate, car, life insurance).
The company must receive the claim, request personal
data from the customer (insurance policy number, etc.),
and validate the insurance validity, payments and
beneficiaries. It must do the adjustment of real
damages, validate the case, calculate the corresponding
assessment, make the necessary payments to the customer
if the claim is valid, or inform the customer if the
process has some invalid data."

Design Methodology 

The MA that guides the process through the company
must know in advance the allocation of organizational
units, the resource allocation, the execution plan,
the list of tasks to perform, and the protocols needed
for resource solicitation or competition. Additionally,
the MA must know the environment structure, which is
specified first.

Environment specification. The agent environment
is defined by a diagram that represents the general
structure of the company. It is necessary to identify
the departments in which some task is executed or
through which information flows, considering all the
possible cases. Each department or office is then
represented by a JADE agent container, and we assume
that each container runs on a different host of a
distributed network system. In this way each department
is represented by a host. A host can belong to
different platforms, but for simplicity we assume that
all hosts belong to the same platform.

Here we propose a strategy for the platform definition,
although a different distribution is possible. To
define the platform distribution it is important to:

1. Identify the departments involved in the process,
considering all the possible cases.
2. Identify the information flow and its direction.
3. Construct a block diagram with the obtained
information.

Figure 2. Claim insurance process a) Block diagram of the organizational units b) Environment Definition

In this diagram, each block represents a site or host 
of a distributed system (each host has a JADE agent 
container) and the arrows represent the direction of the 
flow or a possible agent migration. 

In our case study there are five departments:
reception, validation, assessment, adjustment and
payment. The resulting block diagram and a possible
platform configuration are shown in Figure 2.
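
The block diagram described above is a directed graph, so it can be
sketched as a small Java data structure. This is a minimal sketch,
not part of the original methodology: the class names are
hypothetical, and the arrows listed in claimProcess() are assumed for
illustration because the text does not enumerate them.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** The five departments of the case study; each one stands for a host with a container. */
enum Department { RECEPTION, VALIDATION, ASSESSMENT, ADJUSTMENT, PAYMENT }

/** Directed graph of the environment: an arrow means that an agent may migrate along it. */
final class Environment {
    private final Map<Department, Set<Department>> arrows = new EnumMap<>(Department.class);

    /** Adds an arrow (information flow or possible migration) between two departments. */
    void addFlow(Department from, Department to) {
        arrows.computeIfAbsent(from, d -> EnumSet.noneOf(Department.class)).add(to);
    }

    /** Returns true if the diagram contains an arrow from 'from' to 'to'. */
    boolean canMigrate(Department from, Department to) {
        return arrows.getOrDefault(from, EnumSet.noneOf(Department.class)).contains(to);
    }

    /** Assumed arrows for the insurance claim process (illustrative, not taken from Figure 2). */
    static Environment claimProcess() {
        Environment env = new Environment();
        env.addFlow(Department.RECEPTION, Department.VALIDATION);
        env.addFlow(Department.VALIDATION, Department.ADJUSTMENT);
        env.addFlow(Department.ADJUSTMENT, Department.ASSESSMENT);
        env.addFlow(Department.ASSESSMENT, Department.PAYMENT);
        env.addFlow(Department.VALIDATION, Department.RECEPTION); // invalid data is reported back
        return env;
    }
}

class EnvironmentDemo {
    public static void main(String[] args) {
        Environment env = Environment.claimProcess();
        System.out.println(env.canMigrate(Department.RECEPTION, Department.VALIDATION)); // true
        System.out.println(env.canMigrate(Department.PAYMENT, Department.RECEPTION));    // false
    }
}
```

Such a structure could also serve as the accessibility roadmap of an
agent, since it records which sites may be reached from each site.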

In JADE, containers are created as follows.

Main container: on the command line of the site that
will host this particular kind of container, we can
write:

C:\> java jade.Boot [-gui] -container-name Name_Host

where [] denotes an optional parameter. The remaining
containers are created with the line:

C:\> java jade.Boot -container-name Name_Host -container -host HostMainContainer

We can also use the JADE GUI to create containers
and agents.

Mobile Agent Definition. For the case study, the
states of the general agent behavior can be
represented by the Petri net shown in Figure 3. The MA
selects an execution plan according to the WF process
definition for the assigned case; the plan indicates
which sites the agent must visit in order to process
the case, and includes an access map for the sites and
information for resource reservation. The agent also
migrates from one site to another, collaborates with
other agents, competes for resource allocation, etc.

In JADE a mobile agent is created as a subclass of
the generic class Agent, and its service is registered
with the Directory Facilitator (DF) of JADE. We can use
the code in Exhibit A.

The particular behavior of this agent is given by the
WF process definition; in other words, it is defined by
the execution order of the tasks involved in processing
the case. The plan can be programmed with the
Behaviours available in JADE, because they represent
the tasks that an agent can perform. Any of the
behaviours included in JADE can be used, according to
the plan to perform.

The plan is obtained from the WF process definition;
it defines the execution sequence of the involved
tasks. When the plan includes alternatives, it is
sometimes necessary to construct a diagram (like a
flow diagram) that specifies the sequence; the diagram
must indicate in which department each task must be
performed.

Exhibit A.

import jade.core.Agent;
import jade.domain.DFService;
import jade.domain.FIPAException;
import jade.domain.FIPAAgentManagement.DFAgentDescription;
import jade.domain.FIPAAgentManagement.ServiceDescription;

public class MobileAgent extends Agent {
    // Placeholders for the service type and name; replace with real values.
    private static final String TYPE = "Type";
    private static final String NAME = "Name";

    protected void setup() {
        DFAgentDescription dfd = new DFAgentDescription();
        dfd.setName(getAID());
        ServiceDescription sd = new ServiceDescription();
        sd.setType(TYPE);
        sd.setName(NAME);
        dfd.addServices(sd);
        try {
            // Register the service with the Directory Facilitator.
            DFService.register(this, dfd);
        } catch (FIPAException fe) { fe.printStackTrace(); }
    }
}

Figure 3. Petri Net for the Agent general behavior

Following the case study, assume that the obtained 
diagram is that shown in figure 4; in this figure both 
the task name and the corresponding department are 
indicated. 

If the execution sequence of tasks follows the 
behavior of a Finite State Machine (FSM) we use the 
JADE FSMBehaviour. 

FSMBehaviour fsm = new FSMBehaviour(this) {
    public int onEnd() {
        System.out.println("FSM behaviour completed");
        myAgent.doDelete();
        return super.onEnd();
    }
};



If we use this particular behavior we have to register
the states, each of which represents a task to perform,
and the transitions, which represent the sequence or
execution order of the tasks. States are registered
with the method:

registerState(Function_Name_Task_i, state_name_i),

where the first parameter is the function (a JADE
Behaviour) that implements the corresponding task and
the second parameter is the name of the state, which
is used when registering the transitions.

Transitions are registered with the method:

registerDefaultTransition(state_name1, state_name2),

which indicates that after the complete execution of
the function represented by the state state_name1 the
function of the state state_name2 is performed.
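
The registration mechanism just described can be illustrated without
JADE on the classpath. The following plain-Java sketch only mimics
the idea of registering states and default transitions; it is not the
JADE FSMBehaviour, the class TaskStateMachine and the task names are
hypothetical, and treating the first registered state as the start
state is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal finite state machine that mimics state and default-transition registration. */
final class TaskStateMachine {
    private final Map<String, Runnable> states = new HashMap<>();
    private final Map<String, String> defaultTransitions = new HashMap<>();
    private String firstState;

    /** Registers the task to run in a state; the first registered state is the start state. */
    void registerState(Runnable task, String name) {
        if (firstState == null) {
            firstState = name;
        }
        states.put(name, task);
    }

    /** After the task of state 'from' completes, the task of state 'to' is performed. */
    void registerDefaultTransition(String from, String to) {
        defaultTransitions.put(from, to);
    }

    /** Runs the tasks from the start state until a state has no outgoing transition. */
    List<String> run() {
        List<String> visited = new ArrayList<>();
        String current = firstState;
        while (current != null) {
            if (visited.contains(current)) {
                // Default transitions alone would loop forever on a cycle.
                throw new IllegalStateException("Cycle at state: " + current);
            }
            Runnable task = states.get(current);
            if (task == null) {
                throw new IllegalStateException("Unregistered state: " + current);
            }
            task.run();
            visited.add(current);
            current = defaultTransitions.get(current);
        }
        return visited;
    }
}

class TaskStateMachineDemo {
    public static void main(String[] args) {
        TaskStateMachine fsm = new TaskStateMachine();
        // Hypothetical task names; real states would wrap the tasks of the case study.
        fsm.registerState(() -> System.out.println("receive claim"), "Receive");
        fsm.registerState(() -> System.out.println("validate policy"), "Validate");
        fsm.registerState(() -> System.out.println("pay customer"), "Pay");
        fsm.registerDefaultTransition("Receive", "Validate");
        fsm.registerDefaultTransition("Validate", "Pay");
        System.out.println(fsm.run()); // [Receive, Validate, Pay]
    }
}
```

In JADE itself the states are Behaviour objects rather than Runnable
instances, and the start and final states are declared explicitly.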

Some states and transitions for the case study could
be those shown in Figure 4. Other states must also be
added (MovValBehaviour, MovValoBehaviour,
MovCotizBehaviour and MovPayBehaviour) for the
migration of the agent when the next task must be
carried out in a department other than the current one.
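
The need for additional migration states can be made concrete with a
short sketch. Assuming that a plan is an ordered list of tasks, each
tagged with the department where it runs, a move state is needed
whenever two consecutive tasks belong to different departments. The
classes PlannedTask and MigrationPlanner, the "MoveTo" prefix and the
task names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** A task of the plan together with the department where it must be executed. */
final class PlannedTask {
    final String name;
    final String department;

    PlannedTask(String name, String department) {
        this.name = name;
        this.department = department;
    }
}

final class MigrationPlanner {
    private MigrationPlanner() { }

    /** Lists the state names of the plan, with a move state before each change of department. */
    static List<String> expand(String startDepartment, List<PlannedTask> plan) {
        List<String> states = new ArrayList<>();
        String current = startDepartment;
        for (PlannedTask task : plan) {
            if (!task.department.equals(current)) {
                states.add("MoveTo" + task.department); // migration state
                current = task.department;
            }
            states.add(task.name);
        }
        return states;
    }
}

class MigrationPlannerDemo {
    public static void main(String[] args) {
        List<PlannedTask> plan = Arrays.asList(
                new PlannedTask("ReceiveClaim", "Reception"),
                new PlannedTask("RequestData", "Reception"),
                new PlannedTask("ValidatePolicy", "Validation"),
                new PlannedTask("PayCustomer", "Payment"));
        System.out.println(MigrationPlanner.expand("Reception", plan));
        // [ReceiveClaim, RequestData, MoveToValidation, ValidatePolicy, MoveToPayment, PayCustomer]
    }
}
```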

Because the agent migrates, the appropriate content
language and mobility ontology must be registered:

// register the SL0 content language
getContentManager().registerLanguage(new SLCodec(), FIPANames.ContentLanguage.FIPA_SL0);
// register the mobility ontology
getContentManager().registerOntology(MobilityOntology.getInstance());

When the states of the FSMBehaviour are registered
we use only a name for the function to perform, so
it is necessary to add the Java statements of each
function for the selected behavior. Each of these
functions is added as a class that inherits from one
of the JADE Behaviours and contains the statements to
execute in the task; for the case study a definition
is included in Figure 4.

If the agent has to collaborate with another agent to 
perform a task, the JADE protocol FIPA-Request can 
be used (an example is shown in figure 5). 
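
The message exchange of a request interaction can be summarized in a
small plain-Java sketch. It models only the participant's replies in
the FIPA request interaction (refuse, or agree followed by inform or
failure) and does not use the JADE protocol classes; the class
RequestParticipant and the action names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Message types (performatives) used in a request interaction. */
enum Performative { REQUEST, AGREE, REFUSE, INFORM, FAILURE }

/** Participant side of a request interaction, reduced to the replies it sends. */
final class RequestParticipant {
    private final Predicate<String> willPerform; // decides whether the action is accepted
    private final Predicate<String> perform;     // executes the action, true on success

    RequestParticipant(Predicate<String> willPerform, Predicate<String> perform) {
        this.willPerform = willPerform;
        this.perform = perform;
    }

    /** Returns the replies sent to the initiator for a REQUEST to perform 'action'. */
    List<Performative> onRequest(String action) {
        List<Performative> replies = new ArrayList<>();
        if (!willPerform.test(action)) {
            replies.add(Performative.REFUSE);
            return replies;
        }
        replies.add(Performative.AGREE);
        replies.add(perform.test(action) ? Performative.INFORM : Performative.FAILURE);
        return replies;
    }
}

class RequestDemo {
    public static void main(String[] args) {
        // Hypothetical participant: only accepts "assess-damage" and always succeeds.
        RequestParticipant assessor =
                new RequestParticipant(a -> a.equals("assess-damage"), a -> true);
        System.out.println(assessor.onRequest("assess-damage")); // [AGREE, INFORM]
        System.out.println(assessor.onRequest("pay-customer"));  // [REFUSE]
    }
}
```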

Following the guidelines given above the general 
behavior of the MA and the environment where it 
evolves can be defined. In a similar way the behavior 
of stationary agents can be also established. 

The proposed methodology has been applied to
several case studies, leading to modular software
that has been executed on several networked (LAN)
personal computers. The tests were performed on sites
in which a JADE platform was defined. Nevertheless,
platforms with different configurations can be
integrated in the WFMS.



FUTURE TRENDS 

The proposed methodology for the development of
WFMS is a first step towards the automation of complex
business processes in large enterprises. However, in
companies where the organizational units are
distributed over several cities, the MA must travel
through the web; more sophisticated capabilities must
then be added to the agents and their environment,
namely security protocols and control of agent losses.

Furthermore, it would be advantageous to consider the
interaction with existing WFMS based on the standards
provided by the WFMC, in order to take advantage of
existing information and business strategies.



CONCLUSION 

Automation of business processes improves the
productivity of companies. This work proposed a
methodology for developing WFMS based on mobile agent
technology. The methodology allows modular definitions
of the environment and of the mobile agent behaviors.
The proposed implementation technique uses JADE and
thereby obtains all the advantages of Java. The mobile
agent can interact with other agents to collaborate,
negotiate or compete for resources. Due to the
modularity of the resulting software, it can be easily
modified when the WFMS specifications change.

Figure 4. Identification of states and transitions for the FSMBehaviour

Figure 5. Fragment of the FIPA-Request protocol



KEY TERMS

Activity: A description of a piece of work that forms 
one logical step within a process. An activity may be 
a manual activity, which does not support computer 
automation, or a workflow (automated) activity. 

Agent: An agent is a computer system situated in 
some environment, and that is capable of autonomous 
action in this environment in order to meet its design 
objectives. 

Business Process: A set of one or more linked 
procedures or activities which collectively realize a 
business objective or policy goal, normally within the 
context of an organizational structure defining func- 
tional roles and relationships. 

Case: The representation of a single enactment of a 
process, using its own process instance data, and which 
is (normally) capable of independent control and audit 
as it progresses towards completion or termination. 

Mobile Agent: A program that can migrate from one
computer to another within a heterogeneous network.
The program chooses when and where to migrate. It can
suspend its execution at an arbitrary point, move to
another computer and resume execution on the new
computer.

Multi-Agent System: A collection of software agents
that work in conjunction with each other. They may
cooperate, compete, or combine cooperation and
competition.

Process: A formalized view of a business process,
represented as a coordinated (parallel and/or serial)
set of process activities that are connected in order to
achieve a common goal.

Process Definition: The representation of a business 
process in a form which supports automated manipula- 
tion, such as modeling, or enactment by a workflow 
management system. 

Workflow: The automation of a business process, 
in whole or part, during which documents, information 
or tasks are passed from one participant to another for 
action, according to a set of procedural rules. 

Workflow Management Coalition: The WFMC is a
non-profit organization whose objective is to advance
the opportunities for the exploitation of workflow
technology through the development of common
terminology and standards.

Workflow Management System: A system that completely
defines, creates and manages the execution of workflows
through the use of software, running on one or more
workflow engines, which is able to interpret the
process definition, interact with workflow participants
and, where required, invoke the use of IT tools and
applications.
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cooling schedules, for simulated an- 
nealing 344-352 

cooperation with metaheuristics, ap- 
proaches for 480 

cooperation, definition 858 

cooperative multi-search metaheuris- 
tics 481 

cooperative multi-search metaheuris- 
tics, definition 486 

corporate memory 983 

corpus validation 541 

corpus, definition 1472 

corpus, production of 542 

correlation, definition 795 



correlation-based feature selection, 

definition 559 
correntropy, definition 909 
correspondence problem, definition 

1190 
cost function 1297 
cost function, definition 1101, 1 302 
cost function, definition 574 
countably minimitive ranking func- 
tion, definition 1355 
counting (exhaustive), definition 479 
crisp set, definition 973 
crossover 593 
crossover operator 650 
cross-over, definition 595 
crossover, definition 653, 758 
cross-validation (CV) 68 
cross-validation, definition 524 
cryptanalysis 179-185, 186-191 
cryptanalysis, automated 179-185 
CUDA, definition 1503 
cumulants, definition 1231, 1 272 
curse of dimensionality 65 
curse of dimensionality, definition 

666, 1087 
Cyc 335, 336 
CYK, definition 602 
cytoplasm, definition 382 

D 

data acquisition 1448 

data classification 796 

data mining 172 

data mining algorithm, definition 
1329 

data mining tasks, incorporating 
fuzzy logic 884 

data mining technique, definition 
1329 

data mining, definition 423, 486, 
787, 802, 891, 1135, 1211 

data partitioning 428 

data preprocessing module 555 

data routing 1534 

data visualization, definition 423 

data warehouse data modeling 425 

data warehousing design methodolo- 
gies 424 

data warehousing development 424 

database binary representation, defi- 
nition 802 

database repairs 1428 

data-centric approach (DCA), defini- 
tion 1329 



data-driven sub-word units 1469 

DEBBIE system 139 

decision attribute, definition 702 

decision fusion, definition 1335 

decision making in intelligent agents 
431 

decision support and analysis (DSA) 
992 

decision support systems (DSSs) 992 

decision support systems, personal- 
ized 1310-1315 

decision tree 545, 882, 886 

decision tree, definition 436, 702 

decision trees, and CRM 438 

decision trees, and data modelling 
437-442 

decision variable, definition 409 

deduction rules 778 

deep knowledge, definition 995 

default reasoning 330 

default reasoning, definition 1040 

defect detection 132, 211 

defect detection, real time 211 

defect tolerance 1557 

deformable finite element model 550 

defuzzification 970 

defuzzification, definition 695, 973, 
1128 

defuzzyfication, definition 727 

degree of disbelief, definition 1355 

degree of entrenchment, definition 
1355 

degree of truth, definition 727 

DEGREE system 139 

degrees of freedom problem, defini- 
tion 470 

delayed duplicate detection, defini- 
tion 505 

delta test, definition 666 

dependency grammar 449 

dependency parsing 449-455 

dependency treebanks 451 

dependency trees 450 

dependency, definition 1 1 84 

derived predicate, definition 1028 

DES state 677 

description logic (DL) 402, 497 

designed of experiments, definition 
1574 

detection, definition 962 

developmental robotics 464 

device, definition 479 

differential evolution 488 
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differential evolution with self-adap- 
tation 488 

differential evolution, definition 493 

differential evolution, related work 
to 489 

diffusion constant, definition 1465 

diffusion of innovation 53 

diffusion process, definition 1465 

digital circuits, evolved synthesis of 
609 

digital signal processing (DSP), defi- 
nition 795 

dilation, definition 388 

dimensional model, definition 430 

dimensionality reduction methods 
1045 

dimensionality reduction, definition 
638 

dimensions, definition 430 

diphone, definition 795 

direct on-chip implementation strat- 
egy 836 

DisCOP, definition 513 

discredibility detection 567 

discrete events systems, definition 
687 

discrete recurrent network for optimi- 
zation 1112 

DisCSP, definition 513 

disk-based search 501 

dispositional models 331 

dispositional models, definition 333 

dissimilarity data, definition 1251 

dissimilarity data, self-organizing 
map 1244 

dissimilarity SOM, definition 1251 

dissimilarity, definition 566 

distibutional equivalency, definition 
1035 

distributed configuration 401 

distributed constraint reasoning 507, 
509 

distributed representation of composi- 
tional structure 514 

distributed representation, definition 
519 

distributed representation, varieties 
of 515 

diverging mapping, definition 1350 

DNA computing 1174 

DNA micro-array, definition 1070 

DNA, definition 382 

document clustering (DC) 655 



document management, definition 
988 

domain independent planner, defini- 
tion 1028 

domain of variable, definition 409 

domain pruning, definition 409 

dominance, definition 1196 

driver support system (DSS) 554 

Drools 1404 

DSHP 828 

dual (lattice), definition 1242 

duality, definition 1110 

duplicate detection scope 503 

duplicate elimination scope, defini- 
tion 1554 

DyCoN, definition 1218 

DyCoNG, definition 1218 

dynamic adaptation module 856 

dynamic appraisal, a robot model 
1376-1382 

dynamic associative memory (DAM) 
248 

dynamic reconfigurable circuit 612, 
614 

dynamic scheduling systems, defini- 
tion 859 

dynamic scheduling, hybrid meta- 
heuristics based system 853 

dynamical recurrent neural networks, 
definition 1158 



E A multi-model selection 521 

EA multi-model selection for SVM 
520 

e-commerce, and intelligent software 
agents 940-944, 945-949 

edge detection 373 

edge detection, definition 375 

education in knowledge society 532 

efficiency metrics 510 

Eigenface, definition 371 

e-learning 339 

e-learning in new technologies 532 

e-learning, definition 535 

electric load forecasting 813 

electricity load prediction 1514 

electrocardiogram, definition 916 

electroencephalography (EEG), defi- 
nition 375 

element, definition 1101 

elitism 594 

elitism, definition 1150 

embedded agents 87 



embodiment, definition 470 

emergence, definition 360, 470, 1063 

energy function, definition 1120 

energy minimizing active models 
547 

energy parameters, definition 758 

e-note-taking 337 

ensembles, definition 1158 

entailment, definition 1 1 84 

entity-relationship data model, defini- 
tion 430 

environment, definition 1093 

epistemic logic, definition 1 094 

equivalence, definition 1374,1431 

erosion, definition 389 

error backpropagation, definition 580 

error function, definition 676 

error threshold, definition 479 

evolution process 604 

evolutionary algorithm (EA), defini- 
tion 525, 604, 608, 1017, 
1150, 1158, 1196 

evolutionary algorithms in discred- 
ibility detection 567 

evolutionary algorithms, definition 
580, 795, 1048, 1594 

evolutionary algorithms, multi-objec- 
tive 1145 

evolutionary approaches to variable 
selection 581 

evolutionary artificial neural net- 
works 576 

evolutionary artificial neural net- 
works, definition 580 

evolutionary computation (EC) 624, 
767, 875, 1583-1588 

evolutionary computation (EC) field 
744 

evolutionary computation (EC) tech- 
niques 647 

evolutionary computation, definition 
493, 580, 602, 624, 807, 859, 
872, 1144, 1309 

evolutionary computing 10 

evolutionary grammatical inference 
596 

evolutionary learning algorithms 
1014 

evolutionary mechanism, definition 
1302 

evolutionary optimisation 1042 

evolutionary optimization method 
522 
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evolutionary programming, definition 

1144 
evolutionary robotics 603, 604, 608, 

1101 
evolutionary techniques 588, 648, 

653 
evolutionary techniques, variable 

selection 581 
evolutionary time, definition 574 
evolvable hardware 614, 616 
evolved synthesis of digital circuits 

609 
exclusion error 1357 
executive information systems 1310 
expansion operator 600 
expansion, definition 1374 
expectation, definition 1 1 84 
expected value, definition 436 
expert combination 318 
expert knowledge, definition 336 
expert systems 1404 
expert systems, commonsense 334 
explanatory variables, definition 423 
exploitation, definition 1196 
exploration, definition 1196 
exploratory data analysis, definition 

787 
extended Kalman filter 1417 
extension aware techniques, defini- 
tion 500 
extensivitity, definition 1110 
external memory breadth-first search 

502 
external memory heuristic search 

503 
external memory search algorithms 

502 



Fl -measure, definition 545 
face detection, definition 1265 
face identification, definition 1265 
face recognition 248, 1455-1461 
face recognition, definition 371 
facial expression recognition 625 
facial expression recognition system, 

definition 630 
fact table, definition 430 
factor analysis, definition 901 
factored-state Markov decision pro- 
cess, definition 830 
factors, definition 1574 
fan-in, definition 479 



fast correlated-based filter (FCBF) 
method 634 

fault-tolerant, definition 479 

feature extraction 202, 1449 

feature extraction problem, definition 
1190 

feature extraction, definition 1454 

feature extraction, definition 371, 
638 

feature selection 66, 632 

feature selection, definition 371, 559, 
638 

feature selection, definition 916, 
1237 

feature space, definition 560 

feed forward off-chip implementation 
834 

feed- forward artificial neural network 
639-646 

feed- forward artificial neural net- 
work, definition 1011, 1017 

feed- forward artificial neural net- 
works 1004, 1012 

feedforward neural networks, defini- 
tion 1395 

field programmable array (FPGA), 
definition 560 

field programmable gate array 
(FPGA) 555 

filling factor, definition 1302 

filter method, definition 638 

finite automata, definition 602 

finite element method, definition 553 

FIPA, definition 923 

first-order method, definition 1011 

first-order predicate calculus 334 

FIS, definition 1242 

Fischertechnik system 1385 

fitness function 756, 1303

fitness function, definition 595 

fitness function, definition 758, 872, 
1309 

fitness landscape, definition 1150, 
1196 

fitness, definition 773, 1150, 1196 

fixed point theorem, definition 1224 

fixed search 584 

flexible manufacturing system (FMS) 
749 

flexible query, definition 931 

flocking, definition 540 

Floreano, Dario 605 

FOCUS 634 

fonator system 1439, 1445 



forward chaining, definition 995 
forward selection strategy, definition 

545 
forward selection, definition 882 
Foundation for Intelligent Physical 

Agents 1405 
Fourier transform (FT) 223 
FPGA implementation of ANNs 831 
FPGA, definition 839 
frames 329 

Fraunhofer AIS, Germany 1385 
frequency domain stability inequality 

of Popov, definition 1224 
frequent item-sets, discovery of 77 
frequent item-sets, generating 77 
FTS engine interface 657 
FTS engine, structure of 656 
full adder, definition 1480, 1561
full factorial design, definition 1574 
full-text search 656 
full-text search (FTS) engine, defini- 
tion 660 
full-text search engines 654 
functional data analysis, definition 

666 
functional dimension reduction 661 
functional equation, definition 676 
functional imaging 372 
functional network, definition 676 
functional networks 667 
fusing variables, definition 1517 
fusing variables, definition 462 
fuzzification 970 
fuzzification, definition 695, 718,

766, 973, 1128 
fuzzy algorithm, definition 718 
fuzzy approximation of DES state 

677 
fuzzy ART equations 1492 
fuzzy ART neural network stream 

processing 1491 
fuzzy ART training process 1492 
fuzzy ART, definition 1496 
Fuzzy ART, definition 1503 
fuzzy clustering algorithms 711 
fuzzy c-means (FCM) 711 
fuzzy compositional rule of inference 

(CRI), definition 733 
fuzzy control, definition 695 
fuzzy decision trees 696 
fuzzy inference system 1124 
fuzzy inference systems, definition 

718, 823 
fuzzy k-nearest neighbour (FKNN)

711 
fuzzy logic 31, 330, 710, 797, 884 
fuzzy logic estimator 719 
fuzzy logic in data mining tasks 884 
fuzzy logic systems, supervised learn- 
ing of 1510 
fuzzy logic, definition 462, 687, 695, 
727, 802, 816, 891, 973, 1128, 
1237, 1496, 1517 
fuzzy matching, definition 660 
fuzzy membership function, defini- 
tion 695 
fuzzy neural networks (FNN) 32 
fuzzy operator, definition 718 
fuzzy petri nets, definition 687 
fuzzy relational neural network 
(FRNN), definition 866 
fuzzy rule base, definition 462, 1517 
fuzzy rule interpolation 728 
fuzzy rule interpolation methods 729 
fuzzy rule interpolation, definition 

733 
fuzzy rule, definition 742, 1128 
fuzzy set theory, definition 780 
fuzzy set, definition 695, 708, 802, 

973 
fuzzy similarity calculation 800 
fuzzy similarity for data classification 

796 
fuzzy similarity representation 798 
fuzzy similarity representation, model 

for 798 
fuzzy SQL query, definition 932 
fuzzy SQL, definition 931
fuzzy system (FS), definition 766 
fuzzy system, definition 695, 742, 

1129, 1403 
Fuzzy systems 884 
fuzzy systems, multilayer optimiza- 
tion approach for 1121 
fuzzyfication, definition 727 



game-based learning 140 
games theory, definition 1023 
gas-fired cooktop burners 1568 
gas-fired cooktop burners, thermal 

design 1568 
gate (logic), definition 479 
Gaussian distribution, definition 

1087 
Gazetteer, definition 1567 
gene expression 236, 241 



gene expression data, cluster analysis 
289-296 

gene expression, definition 1211 

gene finding 237 

gene mapping 242 

gene regulation network use 744-747

gene regulatory network 237 

gene, definition 382, 1595 

generality 334 

generalization, definition 1523 

generalized cellular automaton, defi- 
nition 359 

generalized constraint language 111 

generalized constraints 111 

generalized cylinder (GC) 114 

generalized theory of uncertainty 
(GTU), definition 780 

generalized-regression (GR) NN, 
definition 1424 

generation, definition 1150, 1196 

generative topographic mapping 
(GTM), definition 795 

generic gate 614 

genes, definition 595 

genetic algorithm 592, 653, 742, 
766, 807, 916, 1101, 1302, 
1335, 1488 

genetic algorithm, with division into 
species 651 

genetic algorithms (GA), definition 
616 

genetic algorithms (GAs) 647, 
748-754, 1504 

genetic algorithms components, defi- 
nition 1517 

genetic algorithms components, defi- 
nition 462 

genetic algorithms, definition 758, 
859, 872, 1237 

genetic fuzzy rule generator architec- 
ture 457 

genetic fuzzy system 762 

genetic fuzzy system, definition 766 

genetic fuzzy systems, ports and 
coasts engineering 759 

genetic networks 242 

genetic operators 599 

genetic operators 762 

genetic operators, performance effect 
of 1504-1509 

genetic pool 650 

genetic programming (GP) 241, 527, 
598, 619 



genetic programming, definition 773, 

916 
genetic regulatory network model

745 
genome encoding 762 
genotype, definition 617, 624 
geodesic active contour model 549 
geodesic curve, definition 553 
geriatric residences, planning agent 

for 1316-1322 
GGGP systems, crossover operator 

769 
Gibbs sampler, definition 1465 
global classification 1359 
global constraint, definition 409 
global path planner, definition 1079
global path planners 1072 
global stability, definition 1224 
GNG, definition 1218 
gold-silver price, definition 1003 
GP 590 

GPGPU (general-purpose computa- 
tion on GPUs), definition 
1496, 1503 
GPU (graphics processing unit), defi- 
nition 1496, 1503 
gradient descent 1417 
grammar evaluation 600 
grammar genetic programming ap- 
proach 599 
grammar-guided genetic program- 
ming (GGGP) 767 
grammar-guided genetic program- 
ming, definition 773 
granular computing 774, 775 
granular computing, definition 780 
granulometries 1107 
graph edge, definition 1079
graph invariant, definition 709 
graph node, definition 1079
graph partition problem 1115 
graph subsumption, definition 1184
graph, definition 505 
graph, definition 709 
graphics processing units (GPUs) 

873-878 
greedy algorithm, definition 546 
greedy search, definition 883 
gross errors, definition 1395 
growing cell structures 782 
growing cell structures visualization 

781 
growing cell structures, definition 
787 
growing neural gas (GNG) 1364
growing neural gas, definition 1368 
GTM user modeling 788 

H 

HAM-PHAM 828 

hardware genetic algorithm (HGA) 

612 
harmony search (HS) 803 
harmony search model and applica- 
tion 803 
harmony search, definition 807 
Harris corner detector, definition 389 
HCI applications 625 
health care agent, nature-inspired 

1317 
Hebbian learning, definition 1368 
heterogeneous multi-core computing, 

definition 1503 
heuristic function, definition 506 
heuristic information, definition 1535 
heuristic knowledge, definition 995 
heuristic modelling 968 
heuristic, definition 486, 602, 1023, 

1465 
HEXQ 828 
HGA, definition 617 
Hidden Markov Model 1565 
Hidden Markov Model (HMM), defi- 
nition 630, 1567 
hierarchical fuzzy logic systems 456 
hierarchical fuzzy logic systems, 

definition 463, 1517 
hierarchical neuro-fuzzy systems 

808, 817 
hierarchical reinforcement learning 

825, 830 
hierarchical task decomposition, 

definition 830 
high level synthesis, definition 839 
holographic reduced representation 

(HRR), definition 519 
holonomous robot, definition 847 
homogeneous multi-core computing, 

definition 1503 
HOPS 840, 841
HOS 1226 

HOS, definition 1231, 1272 
Human Genome Project 236 
human-computer interaction (HCI), 

definition 631 
human-machine interaction 628 
hybrid algorithm 636 
hybrid dual camera systems 849 



hybrid dual camera vision system 
840 

hybrid intelligent system, definition 
866, 872 

hybrid intelligent systems 854 

hybrid intelligent systems, definition 
859 

hybrid meta-heuristics based schedul- 
ing system 855 

hybrid method, definition 638 

hybrid methods 636 

hybrid methods, definition 1079

hybrid navigation 1076 

hybrid omnidirectional pin-hole sen- 
sor (HOPS) 841 

hybrid perspective 1589 

hybrid scheduling module 856 

hybrid space, definition 1595 

hybrid spaces 1589 

hybrid two-population genetic algo- 
rithm 585, 649 

hybridization, definition 780 

hypergraph, definition 709 



ICA model 22-30 
idempotence, definition 1110 
IFIP framework 968 
IF-THEN rules, definition 727 
ill-posedness, definition 375 
image-video face recognition 1457
image analysis, definition 1309 
image analysis, particle swarm opti- 
mization 1303 
image integration 373 
image integration, definition 376 
image moments, definition 389 
image pre-processing 1448 
image pre-processing, definition 

1454 
image processing, definition 973 
image rectification 384 
image restoration, and Bayesian neu- 
ral networks 223-230 
image segmentation 373 
image transformation, definition 

1110 
image-based visual homing 1185, 

1190 
immersive technologies 536 
immune artificial system 238 
impact-echo 192-198, 199-205 
implementation, definition 336 
implicit hybridization 1194 



imprecise marking, definition 687 

inclusion error 1357 

increasingness, definition 1111 

incremental learning operator 600 

independence assumption 880 

independence subspaces, definition 
901 

independent component analysis 
(ICA) 270-274, 883, 1265 

independent subspaces 892 

indexer, definition 660 

individual, definition 493, 574 

induction, definition 703 

infant vision system 248 

inference network, definition 995 

inflective language, definition 1472 

inflective languages, statistical mod- 
elling of 1467 

information extraction (IE) 106, 
1567 

information pattern, definition 1218 

information potentials and forces, 
definition 909 

information processing, definition 
342 

information quality decay, definition 
423 

information retrieval, and AI 151-156

information retrieval, definition 423 

information theoretic learning (ITL) 
902 

information theoretic learning, defini- 
tion 909 

initial annealing temperature, defini- 
tion 574 

initial population 592 

input data selection 578 

insect behaviour 1537 

instance, definition 891 

institutional memory, definition 988 

intelligence, definition 939 

intelligent agent, definition 51, 431,
535, 932, 1431 

intelligent MAS 917 

intelligent query answering mecha- 
nism 924 

intelligent radar detectors 933, 934, 
935 

intelligent software agent, and e-com- 
merce 941-944 

intelligent software agent, definition 
955 
intelligent software agents 951

intelligent software agents, and e- 
commerce 945-949 

intelligent software agents, applica- 
tions in focus 950 

intelligent system 51 

intelligent systems 257 

intelligent traffic sign classifiers 956 

intelligent tutoring system (ITS), defi- 
nition 1257 

intelligent tutoring system, definition 
1184 

intelligent tutoring systems 138 

intelligent tutoring systems, NLP 
techniques in 1253 

intelligent voltage instability detec- 
tion system 1597 

Intellimetric 139 

intension aware techniques, defini- 
tion 500 

interactive configuration 401 

interactive systems, and uncertainty 
963-966 

interactive systems, managing uncer- 
tainties 1036 

interconnect challenges 1558 

internal robotics 1376 

international arbitrage, definition 
1003 

international bimetallism, definition 
1003 

international monetary system, defini- 
tion 1003 

interoperability, definition 1282 

intron, definition 773 

intuitionistic defuzzification 970 

intuitionistic fuzzification 970 

intuitionistic fuzzy components, 
modification of 970 

intuitionistic fuzzy image processing 
967 

intuitionistic fuzzy index, definition 
973 

intuitionistic fuzzy set, definition 973 

inverse perspective mapping 843 

inverse perspective mapping (IPM), 
definition 847 

inverted index, definition 660 

isomorphic (function), definition 1243

isomorphic graphs, definition 709 

iSTART 1182 

iteratively reweighted least squares 
(IRL), definition 1144 



Jacobians, definition 1087 

job arrival integration mechanism 

857 
job elimination mechanism 857 
joint belief distribution 436 
joint camera calibration 843 
JONSWAP spectrum, definition 1608

K 

Karhunen-Loeve, definition 553 
k-cross-validation, definition 560 
kernel density estimate, definition 

909 
kernel estimator, definition 1350 
kernel machine, definition 1523 
kernel methods, definition 1523 
kernel trick, definition 1523
kernel, definition 566 
kernelized fuzzy c-means (KFCM) 

712 
KFM, definition 1218 
KGP agents 88 
KGP model 88 
Khepera robot 821 
Kirchhoff's Laws 1357
k-lines 331 

K-nearest neighbour (KNN) 241 
K-NN, definition 566 
knowledge based robotics, definition 

608 
knowledge discovery in databases 

(KDD), definition 891 
knowledge discovery, definition 487 
knowledge engineering 106 
knowledge engineering, definition 

540 
knowledge extraction process 483 
knowledge extraction, definition 487, 

588, 939 
knowledge management systems, 

procedural development 

975-981 
knowledge management tool, defini- 
tion 988 
knowledge management, definition 

535, 988 
knowledge processing 926 
knowledge reduction, definition 1403 
knowledge society 532 
knowledge visualization, definition 

787 
knowledge, definition 988 



knowledge-based system, definition 

995 
knowledge-based system, types of 

990 
knowledge-based systems 989 
Kohonen maps 996 
Kolmogorov-Sinai entropy, definition

1272 
Kripke model, definition 1093
Kullback Leibler distance (KL-dis- 

tance), definition 1257 



labeled transition systems, definition 

1093 
Lagrange multiplier, definition 676 
LAMP 1612 

language model, definition 1472 
language recognition 179 
latent semantic analysis (LSA) 1254, 

1257 
latent semantic analysis, definition 

1184 
lattice gas automata, definition 360 
lattice theory, neural/fuzzy computing 

1238 
lattice, definition 1111, 1243
leaf node, definition 703 
learn to learn, definition 535 
learning (training) rule, definition 

1063 
learning algorithm, definition 676, 

1011, 1017 
learning algorithms for recurrent 

networks 1412 
learning design, definition 1282 
learning domain models 1026 
learning domain-specific planners

1025 
learning machine, definition 1466 
learning objects, definition 1282 
learning process 532 
learning rule optimization 577 
learning search control 1025 
learning stage, definition 866 
learning, active 1 
learning, game-based 140 
learning, mixed active 5 
learning, RETIN active 4 
learning, supervised 282 
learning-based planning 1025 
least mean square error reconstruc- 
tion (LMSER), definition 901 
least squares support vector machine,

definition 666 
least trimmed squares estimator 

(LTS) 1390 
least-squares algorithm, definition 

742 
LEGO Group 1385 
LEGO MINDSTORMS robots 1385 
lens distortion, definition 847 
lesson learned, definition 988 
levels, definition 1574 
Levenberg-Marquardt algorithm, 

definition 939 
leverage points, definition 1395 
lexical processing 106 
linear discriminant analysis (LDA) 

367 
linear equation, definition 676 
linguistic similarity techniques, defi- 
nition 500 
linguistic term, definition 703, 727 
linguistic variable, definition 703, 

780 
linguistic variables, definition 727 
linked concepts 172 
local classification 1358 
local maxima finding, definition 

1350 
local minima, definition 1079 
local navigation methods 1074 
local navigation methods, definition 

1079 
local search, definition 602, 1196 
logic of actions, definition 1094
logic of knowledge, definition 1094 
logic of time, definition 1094 
logic program, definition 403 
logic programming 991 
logistic models, definition 333 
logistic regression, definition 1144 
logistic representations 330 
look ahead, definition 409 
LSA cosine, definition 1258
LS-SVM 664 
LTS algorithm 1390 
LTS error function 1389 
Lyapunov exponents, definition 1272 
Lyapunov function, definition 1224 

M 

machine learning (ML) 71, 423, 554, 

560, 666, 816, 823, 1567 
macro-action, definition 1028 



macroevolutionary algorithm, defini- 
tion 608 
MAGE, definition 1070 
MAGE-ML, definition 1277 
MAGE-OM, definition 1277 
MAGE-stk, definition 1277 
magnetic resonance imaging (MRI), 

definition 376 
magnetoencephalography (MEG), 

definition 376 
majority (gate), definition 479 
Mamdani fuzzy rule-based system, 

definition 766 
Mamdani inference method 813 
Mamdani inference system, defini- 
tion 766 
Manhattan distance, definition 1079 
many-objective problem, definition 

1048 
many-valued logic 1610-1614 
mappings between ontologies 494 
marginal belief distribution, defini- 
tion 436 
marked graph, definition 687 
Markov Chain Monte Carlo 

(MCMC), definition 1488 
Markov decision process, definition 

830 
Markov decision processes (MDPs) 

825, 826 
Markov process, definition 1035 
Markov switching model, definition 

1003 
Martin, F. 1383 
MAS 918 

mass customization, definition 403 
mass spectrometry, definition 1342 
mass spectrometry, wavelet transfor- 
mation in 1338 
MaxCut problem, definition 1120
maximum entropy 1564 
maximum entropy model (MEM), 

definition 1567 
MaxQ 828 
MDS, definition 566 
medical image, definition 718 
medical images 1583 
Mel Feature Cepstral coefficients 

(MFCC), definition 795 
membership function, definition 703, 
709, 718, 727, 780, 802, 974, 
1129 
membrane computing 1174 
membrane systems 1174 



memory and learning 1056 
memory hierarchy, definition 506 
memory organization packets 

(MOPS) 331 
Mercer's kernel, definition 1523 
metacognition, definition 342 
metadata, definition 1282 
metaheuristic, definition 487 
meta-heuristics 854 
metaheuristics, definition 807, 859 
meta-heuristics, definition 
metric-drive design, definition 430 
Metropolis-Hastings algorithm, defi- 
nition 1489 
MGED, definition 1070 
MIAME, definition 1070, 1277
micro-array data integration 1065 
micro-array data sources 1065 
microarrays 65, 1277 
microarrays, ontologies for 1273 
microarrays, processing patterns for 

1273 
microstructure, definition 617 
middle-agents, definition 955 
minor component (MC), definition 

901 
minority-3 gate, definition 1480 
Minority-3 Gate, definition 1561 
mirror to camera positioning 842 
mismatch, definition 1481, 1561
mixed active learning 5 
mixture-of-expert (ME) models 318 
MLP learning, stochastic approxima- 
tion Monte Carlo 1482 
mobile ad-hoc network, definition 

595 
mobile agent (MA) 1616 
mobile code 1406 
mobile robots localization 1072, 

1080 
mobile robots mapping 1072, 1080 
mobile robots navigation 1072, 1080 
modal analysis, definition 553 
modal logics 1089 
model checking problem, definition 

1094 
model checking, definition 1554 
model checking, definition 506 
model evidence, definition 1489 
model functioning 379 
model selection, definition 525 
model-based reasoning, definition 

995 
modular neural networks 1096
modularity 1096 
modularity, implementing 1097
1097
modularization, definition 1101 
molecular syntax 1174 
Mondada, Francesco 605 
monotonicity, definition 333 
Monte Carlo simulations, definition 

1481, 1561 
Monte Carlo, definition 479 
Moore's gap 1503 
morphological filter, definition 1111 
morphological filtering, principles of 

1102 
morphological filters, basic 1106 
morphological operators, definition 

389 
morphological processing 106 
morphological pyramids 1108 
MOS transistors 1475 
MREM 1112 
multi agent systems 924 
multi-agent system 52 
multi-agent system (MAS), definition 

955 
multiagent system, definition 923, 

1094 
multi-agent systems (MASs) 952 
multi-agent systems, definition 859 
multiagent systems, modal logics for 

reasoning 1089 
multiarity relation, definition 709 
definition 525
multi-layer perceptrons (MLPs),

definition 580, 1424 
multi-layered concept models 1132 
multi-layered data model, definition 

1135 
multi-layered nature of human recog- 
nition 1132 
multi-layered schemas 1131 
multi-layered semantic data models 

1130 
multi-layered semantic models 1131 
1108
multilogistic regression 1137 
units 1136
multimodal problems 648 
multimodal problems, characteriza- 
tion of 647 
multimodal problems, definition 653 
multimodal problems, solutions with 

GA 647 
multi-modal system, definition 1265 
multimodality, definition 342 
multi-model optimization problem 

521 
multi-objective algorithm, definition 

1595 
multi-objective evolutionary algo- 
rithms 1145 
multi-objective optimization 1590 
multi-objective optimization, defini- 
tion 1150, 1158, 1196 
multiple dam scheduling 803
multiple dam scheduling, definition 

807 
multiple dam scheduling, harmony 

search for 803 
multiple layer perceptron (MLP), 

definition 1489 
multiple testing 66 
multiplexing (von Neumann), defini- 
tion 479 
multiplicative monotonic cooling 

345 
multi-scale transformation, definition 

1111 
mutation 593
mutation operator 651 
mutation, definition 595 
mutation, definition 653, 758 
mutual information projections, defi- 
nition 909 
879
naive physics 334 
Naïve-Bayes, definition 546
named entities 106 
1172
n-ary languages, definition 1172
1384

Nash equilibria, non-cooperative 
games 1018 

Nash equilibria, software agents 
1019 

Nash equilibrium, definition 1023 
1178

natural language processing 334 

natural language processing (NLP) 
795, 1562, 1567 

natural language processing, defini- 
tion 1184 
natural language understanding and
natural languages 1174
negotiation protocol, definition 1424

negotiation strategy, definition 1424 
algorithms 1191
network visualization 782

networks of evolutionary processors 
1175 

neural architecture 1417 

neural classifier, definition 1496 
1490, 1497

neural controller, definition 1101 

neural MREM model 1113 

neural network (NN) 554, 1424 

neural network based visual data min- 
ing 1205 

neural network framework (FNN) 32 
sis 1212
tiations 1524-1529
neural networks, multi-objective 

training 1152 
neural/fuzzy computing 1238 
neural-fuzzy systems (NFS) 32 
neurocomputing, and rich dynamics 
neuro-fuzzy system, definition 1403
neuro-fuzzy technology 31 
neuro-fuzzy, definition 923 
neuromorphic modeling 223 
neuron model 639 
neuron model, definition 1017 
neuron modeling 1057 
neuron, definition 1481, 1561 
neurons 1558 
new technologies feature 534
new technologies proposal 533 
new technologies, definition 535 
new technologies, e-learning in 532 
n-gram model, definition 1472
NKRL inference engine, definition 

1166 
NKRL inference rules, definition 

1166 
NKRL templates, definition 1165 
NLP techniques 1253 
NNs, learning techniques of 1435 
Nogood, definition 513
noise 99 
noisy text 99, 105
identification systems 1259
1023
tem, definition 1265
non-monotonic adaptive cooling 347
non-monotonic logic 331, 333, 1041 
1370
non-rigid objects 1363 
non-rigid objects, definition 1369 
note-taking techniques 337 
OAA approach 557
object representation, definition 1369 
object tracking, definition 1369 
objective fitness function 592 
objective function, definition 1151 
odometry, definition 1087 
off-chip training, definition 839 
off-line system, definition 1454 
off-line writer identification system 

1448 
Ohm's Laws 1357 
omnidirectional camera, definition 

852 
omnidirectional vision 849 
on-chip training, definition 839 
one against all (OAA) model 556 
one against all approach, definition 

560 
online learning, definition 1028 
online noisy documents 99 
on-line system, definition 1454 
ontologies for education 1278 
ontologies for learning design 1278 
ontology 1283-1289 
ontology alignment 1283-1289, 

1290-1295 
ontology alignment systems 1285, 

1290-1295 
ontology alignment techniques 

1290-1295 
ontology alignment, definition 500 
ontology integration, definition 500 
ontology language, definition 1282 
ontology mapping techniques 495 
ontology mapping, definition 500, 

1054 
ontology mediation, definition 500 
ontology merging, definition 500 
ontology of concepts, definition 1166 
ontology of events, definition 1166 
ontology, definition 923, 1054, 1071, 

1135, 1277, 1282 
OpenCV 1584 

operant conditioning, definition 1204 
optic flow, definition 1190 
optical character recognition (OCR) 

105, 231 
optical devices 131 
optimal 2-D linear prediction filter 

design 1434 
optimal coefficient linear filters, defi- 
nition 1438 
optimality, conditions of 1043

optimality, definition 513 

optimization problem, definition 487 

optimization, definition 807 

Oracle Text 657 
orthogonal least squares (OLS) algorithm, definition 1362

orthonormalization 663 

oscillations, definition 1225 

outlier, definition 1395 

out-of-vocabulary rate, definition
over-constrained equations, definition
1350 

over-fitting, definition 666 
parallel computing 873
parallel genetic algorithm 1298 
parallel hybridization 1193 
parallel metaheuristics, definition 

487 
parallel processing 428, 945 
parameter control 488 
parameter identification, definition 

742 
parameter tuning 488 
parameter variations, definition 

1481, 1561 
pareto front, definition 1048 
pareto optimal solution, definition 

1048 
pareto set, definition 1048 
particle swarm optimization (PSO) 

1584 
particle swarm optimization, defini- 
tion 1309 
particle swarm optimization, image 

analysis 1303, 1304 
pasta segmentation 1306 
path, definition 703 
pattern classification 813 
pattern database, definition 1554 
pattern packing, definition 1554 
pattern recognition 110, 423, 816, 

1358 
payoffs, definition 1023

PCA model 22-30 

peak period, definition 1609
pedagogical principles, definition
343 

perceptron, definition 1481, 1561 

perceptron-based adders, statistical 
simulations on 1474 

perfect recall synchronous environ- 
ment, definition 1094 

perfect recall synchronous system, 
definition 1094 

perplexity, definition 1472 

PerPot, definition 1218 

personalized decision support sys- 
tems 1310-1315 

perspective reprojections 843 

petri nets, definition 687 

pharmacokinetics (PK) 71 

phase portrait, definition 1225 

phenotype, definition 617, 624 

pheromone, definition 1535 

photo-multiplier tubes (PMT) 1576 

phylogenetic trees 242 

pin-hole camera, definition 847 

pitch 1441 

pitch, definition 795 

plain conditionalization 1352 

plate detection 1307 

plateau, definition 1028 

platform, definition 343 

pointwise ranking function, definition 
1355 

policy, definition 1028 

Politree partitioning, definition 823 

polyacrylamide gel electrophoresis 
(2D-PAGE) 158^1588 

polyespectra, definition 1272 

population based algorithm, defini- 
tion 1151 

population size, definition 574 

population, definition 574, 624 

population-based algorithm, defini- 
tion 1196 

ports and coasts engineering, genetic 
fuzzy systems 759 

poset, definition 1243 

positive valuation (function), defini- 
tion 1243 

positron emission tomography (PET) 
1576-1582 

possibility theory, definition 1041 



posterior belief, definition 1087 
power network 1358, 1360 
power quality evaluation 1226 
power quality, definition 1231 
power system model 1356, 1357 
power system stability 1596-1602 
power system state estimation, defini- 
tion 1362 
power system topology error, defini- 
tion 1362 
power system topology model, defini- 
tion 1362 
power system topology verification 

1356 
power system topology verification, 

definition 1362 
power system topology verification, 

RBF networks 1356 
precision farming, definition 1144 
precision, definition 546, 1054 
predicate calculus 329 
predicted belief, definition 1087 
prediction error, definition 1438 
predictive analysis, definition 423 
pre-processing module 856 
pre-processing procedures 880 
preprocessing, definition 962 
preservation of topology, definition 

1035 
principal component analysis (PCA) 

368, 883, 901 
principal component analysis, defini- 
tion 371, 1237 
Principle of Cognitive Scaffolding 

467 
Principle of Incremental Complexity 

466 
Principle of Information Self-Struc- 
turing 466 
Principle of Interactive Emergence 

467 
principle of irrelevance of syntax, 

definition 1374, 1432 
principles for developmental systems 

466 
privacy, definition 1329 
privacy-preserving data mining 1326 
privacy-preserving data mining 

(PPDM), motivation for 1324 
privacy-preserving estimation (PPE) 

1325 
proactive forward ant, definition 
1536 



probabilistic latent semantic analysis 

(PLSA), definition 1258 
probability density function, defini- 
tion 939 
problem instance, definition 487 
product unit neural networks 1137 
product unit neural networks, defini- 
tion 1144 
production rule, definition 995 
production rule-based system 396, 

403, 1335 
Programmable Brick 1383 
projection, definition 436 
propositional logic formula satisfi- 
ability, definition 403 
propositional models 328 
propositional models, definition 333 
prosody, definition 795 
protein secondary structure 1331 
protein structure prediction 1330 
protein, definition 382, 1335 
proteomic data, analysis of 1340 
protocols, definition 1258 
prototype classifiers 1339 
prototype classifiers, definition 1342 
prototype, definition 1252 
pruned search 584 
PSO for object detection 1304 
PSO for object segmentation 1304 
PTZ camera, definition 852 
quantization error, definition 1252
1302
query answering, definition 932 
query processing 927 
query refinement engine, definition 
query, definition 932
quiescence, definition 513 
radar, definition 939

radial basis function (RBF), definition 1424

radial basis function network, definition 1362
random deviate generation process, definition 1466
random network, definition 360 
random sampling, definition 1350 
randomized Hough transform (RHT) 1343, 1344
ranking engine, definition 660 
ranking function, definition 1355 
ranking scheme, definition 1048 
Rayleigh wave (R-wave) 199 
RBF networks 1356 
RBF networks, learning in 1013 
RBF NN, definition 1424 
RDF, definition 1135 
reactive forward ant, definition 1536 
real time applications, updates 1427 
real-time recurrent learning 1417 
real-time system, definition 923 
recall, definition 546, 1054 
recommender system 71 
recommender system, Web 71 
reconfigurable circuit, definition 617 
reconfigurable gates network 611 
reconstructed phase space, definition 1272
recurrent networks, architectures of 1411
recurrent neural network (RNN), definition 1225, 1411, 1417
recursive algorithm, definition 1087
recursive auto-associative memory 515
recursive auto-associative memory (RAAM), definition 519
redundancy (factor), definition 479 
reflection, definition 1609 
regeneration mechanisms 857 
regular lattice, definition 360 
regularization, definition 1158, 1523 
reinforcement learning hierarchical neuro-fuzzy models 818
reinforcement learning, definition 824, 830
relational product, definition 1554 
relaxation, definition 1063 
relevance learning, definition 1342 
reliability, definition 479 
RELIEF algorithm 634 
remote sensing, application to 1140 
remote sensing, definition 1144 
Renyi entropy, definition 909 
repair inconsistent database 1428 
replacement 594 



representation formalisms, definition 333
representation step, definition 1252 
response, definition 1574 
Rete algorithm 1404 
Rete-OO algorithm 1404 
RETIN active learning 4 
Reynolds Number (Re), definition 1575
RHT characteristics 1345 
RHT general form 1346 
RHT mechanisms 1345 
rival penalized competitive learning, definition 901
RL-HNFB architecture 818 
RL-HNFB learning algorithm 819 
RL-HNFP architecture 818 
RL-HNFP learning algorithm 819 
RMSE, definition 1438 
RNFS, architecture of 1397 
RNFS, supervised learning process of 1399
RNN training 1415 
roadmaps 1073 
robot system, collision-avoidance in 459
robotics 917 

robots, and competition 1385 
robots, and research groups 1384 
robots, in education 1383-1388 
robots, swarm 1537-1542 
robust estimator, definition 1395 
robust learning algorithm 1389, 1395 
robust LTS learning algorithm 1390 
robust statistics, definition 1395 
robustness, definition 1063 
root node, definition 703 
rough set theory, definition 780 
rough set, definition 1403 
rough set-based neuro-fuzzy system 1396, 1397, 1403
route discovery 1532 
route maintenance 1534 
RTL, definition 839 
rubble-mound breakwater, and AI 144-150
rule base decomposition 458 
rule base identification, issues in 460 
rule engines 1404-1410 
rule engines, and agent-based systems 1404-1410
rule induction, definition 1243 
rule reduction, definition 1403 
rule-based systems 990, 1404 



rule-driven devices 38 
(rule-driven) devices, adaptive 38
run time reconfiguration strategy 836 
SAMANN neural networks, definition 1211

SAMC algorithm 1483 

SAMIDI 1066 

SAMIDI approach 1066 

SAMIDI software architecture 1068 

SAMIDI, data integration 1064 

SAMIDI, micro-array information 1064

Sammon error, definition 1211 

Sarsa, definition 824 

SBS, definition 638 

scaffolding, definition 470 

scheduling problem 853 

scheduling, definition 859 

scripts 329 

search algorithm, definition 506 

search algorithms, definition 409 

search control knowledge, definition 1028

search space, definition 493, 525, 580, 588, 602, 624, 653, 1151, 1196, 1302, 1466

secondary population 650 

secondary structure, definition 1335 

second-order method, definition 1011 

second-order representations 433 

sectioner, definition 660 

security threats, and mobile code 1406

segmentation, definition 376, 423

segmentation, definition 718, 1309

selection criteria 594 

selective pressure, definition 1048 

self-adaptation 488 

self-adaptation, definition 493 

self-adaptive control parameters 490 

self-configuration 591 

self-explanation and reading strategy trainer (SERT), definition 1258

self-healing 591 

self-management 591 

self-optimization 591

self-organising neural networks, definition

self-organization principle, definition 1063

self-organization, definition 360

self-organized criticality, definition 360

self-organizing map (SOM), definition 1252

self-organizing map for dissimilarity data 1244

self-organizing map, definition 787

self-organizing maps 781

self-organizing maps, definition 795, 1035

self-organizing structures 377 

self-splitting clustering 291 

self-tuning regulator, definition 923 

semantic data model, definition 1135 

semantic information management methodology (SIMM) 1067

semantic information model methodology, definition 1071

semantic nets 329

semantic similarity techniques, definition 500

semantic structure, mapping ontologies 1049

semantic Web, definition 1054

semi-Markov decision process, definition 830

semiotic dynamics, definition 470 

sensor calibration 842 

sensor discredibility detection method 570

sensor discredibility, definition 574 

sensor node, definition 758 

sensors, definition 955 

sequence alignment 242 

sequence processing 1411, 1417 

sequential backward selection (SBS) 638

sequential forward selection (SFS) 638

serial hybridization 1193 

SET, definition 1561 

SFS, definition 638 

Shafer-Dempster's evidence theory, definition 1041

shallow knowledge, definition 995 

short message services 99 

signal models 934 

signal processing, definition 872 

signals characterization 1266 

signals characterization, nonlinear techniques 1266



signed formulae 1426, 1428 
significant wave height, definition 1609
similarity measure, definition 1054
similarity measures 797 
similarity, definition 802 
simulated annealing 344-352 
simulated annealing algorithm 344, 569, 1335
simulated annealing, definition 1489
simultaneous head and facial action tracking 625
single viewpoint constraint, definition 847, 852
six-valued logic 1610 
skeletons 111 

small-world network, definition 360
SMDPs 826 

snapshot image, definition 1190
social insect behaviour 1537
social robot, definition 631
soft computing, definition 780, 807, 872
soft-computing, definition 916 
solar radiation data forecasting results 1435
solar radiation data, 2-D representation of 1433
solar radiation forecasting model 1433
solar radiation, definition 1438 
SOM algorithm, definition 1003 
SOM batch algorithm, definition 1252
SOM, definition 1243 
SOM, labour market data 1029 
sonic crystal, definition 1302 
sonic crystals 1297 
spam filtering 561 

sparse fuzzy rule base, definition 733 
species, definition 653 
spectrometric data 662
spectroscopy, definition 588 
spectrum, definition 588 
speech recording 1439 
speech-based clinical diagnostic systems 1439-1446
spelling error correction 105 
Spohn conditionalization 1353 
SPSEC algorithm 1399 
stability, definition 1225 
Stable Herbrand Model, definition 403
standard genetic algorithm 568

star schema, definition 430 

state estimation, definition 687 

state machine, definition 687 

state-space generalization, definition 830

statistical disclosure control (SDC), definition 1329

statistical disclosure limitation (SDL), definition 1329

statistical learning, definition 376 

stemmer, definition 660 

stereo vision, definition 852 

stigmergy, definition 1536 

stochastic approximation algorithm, definition 1489

stochastic information gradient, definition 909

stochastic search, definition 1466

stochastic universal sampling (SUS) 1299

stream processing, definition 1496, 1503

stream programming model 874

string similarity techniques, definition 500

strong equivalence, definition 1432 

structural concrete field 526-531

structural concrete, and ANN 119-124

structural imaging 372

structural risk minimization (SRM) inductive principle, definition 1523

structure aware techniques, definition 500

structure identification, definition 742

structure preservation 1590 

structured information 106 

sublattice, definition 1243

subharmonics 1443 

submerged breakwater domain 761 

submerged breakwater, definition 1609

submerged breakwaters 1603

submerged breakwaters, wave reflection 1603

sub-swarm, definition 1309

sub-word unit, definition 1472

sub-word units 1468

supervised classification, definition 1265, 1454

supervised learning 282, 816, 1517 

supervised pseudo self-evolving cerebellar (SPSEC) 1399

support vector machine (SVM) 1-8, 525, 666, 916, 1518

support vector regression (SVR) model 411

surrogate fitness, definition 795 

survival analysis 390 

sustained sound 1439, 1440 

SVD, definition 566 

SVM classifiers 561 

SVM, definition 546, 566 

SVM, EA multi-model selection for 520

SVRCACO 410

swarm intelligence (SI) 10, 238, 1309, 1531, 1536

swarm intelligence approach for ad-hoc networks 1530

swarm intelligence for visualisation 537

swarm robotics 1537-1542

symbol grounding problem 1543-1548

symbol grounding problem, and semiotics 1545

symbolic breadth-first search 1550 

symbolic computation 991 

symbolic pattern databases 1551 

symbolic search 1549 

symbolic search algorithms 1550 

synchronization, definition 1225 

syntactic analysis 106 

syntactic parsing 1184 

synthetic neuron implementations 1555

system engineering 918 

system engineering and robotics 917 

system modeling 51, 743 

system monitoring, definition 687 

system, definition 932 



Tabu search, definition 859 
Takagi-Sugeno inference method 813 
Takagi-Sugeno method 813 
Takagi-Sugeno-Kang fuzzy rule-based system, definition 766
temporal logic, definition 1094 
tensor products, definition 519 
test, definition 1218 



text categorization (TC) 655 

text mining 654 

text normalization, definition 795 

text summarization 656 

theory of endorsement, definition 1041

thermal equilibrium, definition 1466 

thesaurus, definition 660 

threshold based voting, definition 1350

time reduction 876

time-series prediction, definition 1158

time-series, definition 1158 

tokenisation, definition 1054 

tomography, definition 376 

topology deformations 1366 

topology errors (TEs) 1356 

topology preservation, definition 1252

topology preserving graph, definition 1369

total least square (TLS) fitting, definition 901

tracking, definition 553

trade-off constant of SVM, definition 525

traditional manner of study, definition 343

traffic control 410 

traffic light control 458 

traffic sign pre-processing 555 

traffic sign recognition 554 

traffic sign recognition, ensemble of ANN 554

traffic sign recognition, neural network system 555

training algorithm 1218, 1417

transaction identifier 173 

transfer function optimization 577 

transform domain system, definition 1265

transient, definition 1231

travelling salesman problem 1114, 1120

tree augmented Naive Bayes (TAN) 880

tree pruning 438 

tree selection 438 

tree splitting 438 

tree-structured graph, definition 1054

truth maintenance system, definition 1041

TS algorithms 996 



TSL color space, definition 1503 
TTS synthesis 788 
tumor prediction 304-311, 312-317
two-phase hybridization 1192 
type, definition 1218 
ubiquitous computing (UbiComp) 93

UCE, definition 566 

UCI repository, definition 883 

ultra low power neurons 1556 

UML, definition 1135 

unbalance index, definition 1362 

unbalance indices 1357 

uncertainty, and interactive systems 963-966

uncertainty, sources of 963

unconditioned response (UR), definition 1204

unconditioned stimulus (UCS), definition 1204

under-constrained equations, definition 1350

unifying information model (UIM) 1066

unifying information model (UIM), definition 1071

uniqueness, definition 676

unit selection synthesis, definition 795

unknown word, definition 1472

unsuccessful negotiation threads (UNTs) 1422

unsupervised learning, definition 463, 787, 795, 1595

update process 1426 

update theory 1371 

update, definition 1375, 1432 

updates, roadmap of 1370 



vague environment (VE), definition 733

vague language, use of 992 

value iteration, definition 506 

Value Principle 466 

variable scaling 664 

variable selection 458 

variable selection approaches 583 

variable selection, definition 588, 666

variable selection, evolutionary approaches to 581

variant SNR environments 719
variant SNR environments, fuzzy logic estimator for 719
VC dimension, definition 1523 
vector field histogram 1075 
vector symbolic architecture (VSA) 516, 519
version space theory 1 
VHDL, definition 839 
video image post-processing 385 
video-based face recognition 1455-1461
videometrics 389 
videometrics, definition 389 
virtual and immersive environments, definition 540
virtual reality, definition 1211, 1595
virtual theatres, definition 540 
virtual, definition 535 
visual data mining 1205 
visual homing, definition 1190 
visual servoing, definition 847 
visualization methods 783 
vocabulary, definition 1473 
voice quality 1439, 1443 
voltage instability detection, using neural networks 1596-1602
voltage output image processing 1597
von Neumann multiplexing 471 
Voronoi diagrams 12 
wave flume 384

wave flume experiments 383 

wave reflection at submerged breakwaters 1603

wavelet analysis, definition 1342 